Written by | Zheng Jianhua
Updated by | Zhao Luyang

Tensors and ops are the most fundamental components of a neural network model: ops are the nodes, and tensors are the edges connecting those nodes. However, constructing a tensor is not simply a matter of constructing an object; at the very least, the following issues must be considered:

  • support both node-local local tensors and distributed global tensors;
  • support both eager and lazy execution modes;
  • support different data types, including float, double, int, etc.;
  • support different devices.

1

Ways to create a tensor

As in PyTorch, there are two main ways to create a tensor in OneFlow: Tensor and tensor. Both ultimately create OneFlow's internal C++ Tensor object, which corresponds to the flow.Tensor type at the Python layer.

1.1 Tensor

The Python-level Tensor is introduced in tensor.py (github.com/Oneflow-Inc… ). It is a Tensor type object registered through the Python C API, defined and returned in MakeTensorType (github.com/Oneflow-Inc… ).

In MakeTensorType, the Tensor object is created mainly through PyTensorObject_init:

static int PyTensorObject_init(PyObject* self, PyObject* args, PyObject* kwargs) {
  HANDLE_ERRORS
  auto* temp = functional::_legacy_tensor_ctor(NULL, args, kwargs);
  if (PyErr_Occurred()) { throw py::error_already_set(); }
  auto* _self = (PyTensorObject*)self;
  _self->data = PyTensor_Unpack(temp);
  _self->data->set_pyobject(self);
  // reset temp data to prevent clearing the pyobject
  // when the temp is deallocated
  ((PyTensorObject*)temp)->data.reset();
  Py_XDECREF(temp);
  return 0;
  END_HANDLE_ERRORS_RET(-1)
}

The functional::_legacy_tensor_ctor function creates OneFlow's internal C++ Tensor object, oneflow::one::Tensor, and binds it as data to the Python Tensor type. In MakeTensorType, many C++ methods are also registered on Tensor via PyMethodDef (github.com/Oneflow-Inc… ), for example:

static PyMethodDef PyTensorObject_methods[] = {
    {"storage_offset", PyTensorObject_storage_offset, METH_NOARGS, NULL},
    {"stride", PyTensorObject_stride, METH_NOARGS, NULL},
    {"is_contiguous", PyTensorObject_is_contiguous, METH_NOARGS, NULL},
    {"contiguous", PyTensorObject_contiguous, METH_NOARGS, NULL},
    {"contiguous_", PyTensorObject_contiguous_, METH_NOARGS, NULL},
    {"pin_memory", PyTensorObject_pin_memory, METH_NOARGS, NULL},
    {"is_pinned", PyTensorObject_is_pinned, METH_NOARGS, NULL},
    {"requires_grad_", (PyCFunction)PyTensorObject_requires_grad_, METH_VARARGS | METH_KEYWORDS,
     NULL},
    {"retain_grad", PyTensorObject_retain_grad, METH_NOARGS, NULL},
    {"detach", PyTensorObject_detach, METH_NOARGS, NULL},
    {"clone", PyTensorObject_clone, METH_NOARGS, NULL},
    {"zero_", PyTensorObject_zero_, METH_NOARGS, NULL},
    {"register_hook", PyTensorObject_register_hook, METH_O, NULL},
    {"_register_post_grad_accumulation_hook", PyTensorObject__register_post_grad_accumulation_hook,
     METH_O, NULL},
    {"global_id", PyTensorObject_global_id, METH_NOARGS, NULL},
    {"check_meta_consistency", PyTensorObject_check_meta_consistency, METH_NOARGS, NULL},
    {"to_numpy", PyTensorObject_to_numpy, METH_NOARGS, NULL},
    {"type", (PyCFunction)PyTensorObject_type, METH_VARARGS | METH_KEYWORDS, NULL},
    // ...
};

In addition, at the Python layer, RegisterMethods (github.com/Oneflow-Inc… ) registers some Tensor methods and properties implemented in Python (such as tensor.numpy); during OneFlow package initialization, RegisterMethod4Class (github.com/Oneflow-Inc… ) completes the registration of these Python methods and properties. The call flow of RegisterMethod4Class is as follows:

[Figure: call flow of RegisterMethod4Class]

Compared with the Python implementations, the C++-implemented methods/properties of Tensor generally offer better performance.

1.2 The tensor function

Tensor is a type, whereas tensor is a function. The flow.tensor function is defined in
oneflow/api/python/functional/tensor_api.yaml:

- name: "tensor"
  signature: [
      "Tensor (PyObject* data, *, DataType dtype=None, Device device=None,
      Bool requires_grad=False, Bool pin_memory=False) => TensorWithData",
      "Tensor (PyObject* data, *, DataType dtype=None, Placement placement,
      SbpList sbp, Bool requires_grad=False) => GlobalTensorWithData",
    ]
  bind_python: True

Its C++ implementation is located in
tensor_api.yaml.pybind.cpp, a file generated automatically at build time.

From the function signatures, we can see that flow.tensor() has two overloads:

  • TensorWithData
  • GlobalTensorWithData

They are used to construct local tensors and global tensors, respectively. Like Tensor above, flow.tensor also returns OneFlow's internal oneflow::one::Tensor object (bound to a Python Tensor object).
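The way the two signatures are selected can be illustrated with a minimal Python sketch (hypothetical names; the real dispatch is generated from the YAML above): if placement and sbp are supplied, the global-tensor overload is chosen, otherwise the local-tensor one.

```python
# Hypothetical sketch of signature-based overload selection for flow.tensor.
def tensor(data, *, dtype=None, device=None, placement=None, sbp=None,
           requires_grad=False, pin_memory=False):
    if placement is not None and sbp is not None:
        # corresponds to the GlobalTensorWithData overload
        return ("GlobalTensorWithData", data, placement, sbp)
    # corresponds to the TensorWithData overload
    return ("TensorWithData", data, device)

kind, *_ = tensor([1, 2, 3])
print(kind)  # TensorWithData
```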

1.3 Two ways to construct a tensor manually

As in PyTorch, the two commonly used ways to create a tensor in OneFlow are:

  • flow.Tensor
  • flow.tensor

Examples:

import oneflow
import numpy as np
oneflow.tensor([[1., -1.], [1., -1.]])
# tensor([[ 1., -1.],
#         [ 1., -1.]], dtype=oneflow.float32)
oneflow.tensor(np.array([[1, 2, 3], [4, 5, 6]]))
# tensor([[ 1, 2, 3],
#         [ 4, 5, 6]], dtype=oneflow.int64)
oneflow.Tensor([[1,2,3],[4,5,6]])

In most cases (eager mode, similar to PyTorch), you can create an ordinary tensor (a local tensor) by specifying parameters such as device, dtype, and shape.

In a few cases (such as OneFlow's own eager global and lazy modes), when a global tensor is needed, you can create one directly by specifying sbp and placement, or convert an ordinary tensor into a global tensor via tensor.to_global. See:

  • oneflow.tensor

oneflow.readthedocs.io/en/master/g…

  • global tensor

docs.oneflow.org/master/para…

2

OneFlow's tensor type system

The internal C++ Tensor object described above is actually defined in
oneflow/core/framework/tensor.h and is an abstract Tensor type.

[Figure: the Tensor class hierarchy]

Here, LocalTensor is the ordinary single-device-view Tensor (similar to PyTorch's Tensor), while GlobalTensor is OneFlow's own global-view Tensor (typically used in eager global mode or lazy mode). Tensor uses the Bridge pattern: each Tensor subclass holds a TensorImpl field internally, which is responsible for the actual implementation of the abstract Tensor:

[Figure: Tensor and TensorImpl (Bridge pattern)]
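The Bridge pattern described above can be sketched in a few lines of Python (hypothetical classes; the real types live in tensor.h and tensor_impl.h): the user-facing Tensor only forwards to an interchangeable impl object.

```python
# Minimal sketch of the Bridge pattern: Tensor (abstraction) delegates to
# TensorImpl (implementation), so impls can vary independently of the API.
class TensorImpl:
    def shape(self):
        raise NotImplementedError

class EagerLocalTensorImpl(TensorImpl):
    def __init__(self, shape):
        self._shape = shape
    def shape(self):
        return self._shape

class LocalTensor:
    """Abstraction side: every call is forwarded to the impl."""
    def __init__(self, impl):
        self._impl = impl
    def shape(self):
        return self._impl.shape()

t = LocalTensor(EagerLocalTensorImpl((2, 3)))
print(t.shape())  # (2, 3)
```

Swapping in a different impl (say, a lazy one) would leave the LocalTensor interface untouched, which is the point of the pattern.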

3

Constructing a local tensor

Taking flow.tensor([[1,2,3],[4,5,6]]) as an example, let's look at the tensor construction process. The main flow is as follows:

[Figure: main flow of constructing a local tensor]

In this example, since the tensor is created with the flow.tensor method (and it is an ordinary local tensor), the TensorWithData overload defined in oneflow/api/python/functional/tensor_api.yaml is used. Its implementation is the TensorWithDataFunctor located in oneflow/api/python/functional/tensor_api.cpp:

class TensorWithDataFunctor {
 public:
  Maybe<Tensor> operator()(PyObject* data, const Optional<Symbol<DType>>& dtype,
                           const Optional<Symbol<Device>>& device, const bool requires_grad,
                           const bool pin_memory) const {
    ...
    if (PyTensor_Check(data)) {
      // Throw warnings like pytorch.
      auto ret = PyErr_WarnEx(
          PyExc_UserWarning,
          "To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() "
          "or sourceTensor.clone().detach().requires_grad_(True), rather than "
          "oneflow.tensor(sourceTensor).",
          1);
      if (ret != 0) { return Error::RuntimeError(); }
      const auto& other = PyTensor_Unpack(data);
      return MakeTensorFromOtherTensor(other, dtype, device, requires_grad, pin_memory);
    } else {
      // Make tensor from python sequence or numpy array.
      return MakeLocalTensorFromData(data, dtype, device, requires_grad, pin_memory);
    }
  }
};

Since the data passed in here is a Python list object, MakeLocalTensorFromData is ultimately called, and most of the tensor-creation logic lives in this function. It makes extensive use of Python and NumPy APIs to check the PyObject's data type and obtain the Shape (github.com/Oneflow-Inc… ) and DataType (github.com/Oneflow-Inc… ); if the user does not specify a device, it defaults to the CPU device (github.com/Oneflow-Inc… ).

It then mainly calls EmptyFunctor (github.com/Oneflow-Inc… ) and SwitchCopyLocalTensorFromUntypedArray (github.com/Oneflow-Inc… ). The former allocates memory for the tensor; the latter copies the data. Both steps are carried out through virtual machine instructions: EmptyFunctor goes through the ordinary OpCall instruction, while CopyLocalTensorFromUntypedArray goes through the AccessBlobByCallback/SyncAccessBlobByCallback instruction, depending on whether a synchronous copy is required.

Why go through virtual machine instructions? Whether it is memory allocation or data copying, the operations differ across devices such as CPU and CUDA. As we saw earlier when discussing Op/Kernel, in OneFlow the execution of static- and dynamic-graph tasks, op/kernel execution in eager mode, allocation and release of memory and device memory, and devices and streams are all managed uniformly by the virtual machine.
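As a toy illustration of this uniform-management idea (hypothetical names; OneFlow's real VM is far more involved), the same "allocate" request can be routed to a device-specific handler so that callers never hard-code CPU vs CUDA behavior:

```python
# Sketch: a VM-style dispatch table routes device-agnostic requests to
# device-specific handlers.
def cpu_allocate(nbytes):
    # host allocation: a plain byte buffer stands in for malloc'd memory
    return ("cpu", bytearray(nbytes))

def cuda_allocate(nbytes):
    # stand-in: a real implementation would call the CUDA allocator
    return ("cuda", nbytes)

ALLOCATORS = {"cpu": cpu_allocate, "cuda": cuda_allocate}

def vm_allocate(device, nbytes):
    return ALLOCATORS[device](nbytes)

dev, buf = vm_allocate("cpu", 8)
print(dev, len(buf))  # cpu 8
```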

3.1 Allocating memory: EmptyFunctor

Operations such as matmul and relu (when inplace=false) also create output tensors during execution. When we discussed relu earlier, we focused on the computation logic of the op and kernel and skipped the tensor-related parts.

Here, we only need to construct an empty tensor object first, with no other computation, so this is an Empty operation. The kernel corresponding to the Empty op, EmptyKernel (github.com/Oneflow-Inc… ), has no real computation logic; it merely creates an empty tensor from the shape, dtype, and device information. The actual data is copied from memory into this empty tensor later, which completes the whole tensor-creation process.

Like other functors, EmptyFunctor is eventually dispatched to the corresponding interpreter for execution. Since this is a local tensor in eager mode, EmptyFunctor ends up in the eager local interpreter and is handled by the NaiveInterpret (github.com/Oneflow-Inc… ) method. The flow is as follows:

  1. First construct the EagerLocalTensorImpl (github.com/Oneflow-Inc… );

  2. Then initialize the EagerBlobObject (github.com/Oneflow-Inc… ) and TensorStorage (github.com/Oneflow-Inc… ); at this point the tensor's main fields are essentially in place;

  3. Finally, build the OpCall instruction and submit it to the virtual machine via PhysicalRun (github.com/Oneflow-Inc… ), waiting for the VM to schedule and execute it.

The instruction policy corresponding to OpCall eventually lands in oneflow/core/vm/op_call_instruction_policy.cpp: its Prepare method performs the actual memory allocation for TensorStorage via AllocateOutputBlobsMemory, and its Compute method launches the actual kernel execution (of the empty op here).
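The two-phase Prepare/Compute split above can be sketched as follows (a hypothetical Python analogue, not OneFlow's API): allocation happens in prepare, and compute assumes storage is already in place.

```python
# Sketch of a two-phase instruction: Prepare allocates output storage,
# Compute runs the kernel against it.
class OpCallInstruction:
    def __init__(self, shape, elem_size=4):      # e.g. 4 bytes for float32
        self.shape = shape
        self.elem_size = elem_size
        self.storage = None

    def prepare(self):
        # stand-in for AllocateOutputBlobsMemory
        n = 1
        for d in self.shape:
            n *= d
        self.storage = bytearray(n * self.elem_size)

    def compute(self):
        # the empty op's kernel has nothing to compute; it only relies on
        # Prepare having allocated the storage
        assert self.storage is not None

inst = OpCallInstruction((2, 3))
inst.prepare()
inst.compute()
print(len(inst.storage))  # 24
```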

3.2 Copying data: SwitchCopyLocalTensorFromUntypedArray

SwitchCopyLocalTensorFromUntypedArray is actually the function name produced by expanding the MAKE_SWITCH_ENTRY (github.com/Oneflow-Inc… ) macro. The expanded code is shown below; it ultimately calls CopyLocalTensorFromUntypedArray (github.com/Oneflow-Inc… ).

template<typename... Args>
static Maybe<void> SwitchCopyLocalTensorFromUntypedArray(
    const std::tuple<DataType>& switch_tuple, Args&& ... args) {
  static const std::map<std::tuple<DataType>, std::function<Maybe<void>(Args && ...)>>
      case_handlers {
          {SwitchCase(DataType::kFloat),
           [](Args&&... args) {
             return CopyLocalTensorFromUntypedArray<float>(std::forward<Args>(args)...);
           }},
           // ...
      };
  return case_handlers.at(switch_tuple)(std::forward<Args>(args)...);
};
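The dtype-keyed dispatch table above has a natural Python analogue (hypothetical names, shown only to make the mechanism concrete): a map from dtype to a typed copy function, looked up at call time.

```python
# Sketch of switch-by-dtype dispatch: each dtype maps to a copy routine
# specialized for that element type.
def copy_as(py_type):
    def copy(dst, src):
        dst[:] = [py_type(x) for x in src]
    return copy

CASE_HANDLERS = {
    "float": copy_as(float),   # analogue of CopyLocalTensorFromUntypedArray<float>
    "int": copy_as(int),
}

def switch_copy(dtype, dst, src):
    return CASE_HANDLERS[dtype](dst, src)

dst = [0, 0, 0]
switch_copy("float", dst, [1, 2, 3])
print(dst)  # [1.0, 2.0, 3.0]
```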

The CopyLocalTensorFromUntypedArray method is as follows:

template<typename T>
Maybe<void> CopyLocalTensorFromUntypedArray(const std::shared_ptr<Tensor>& tensor,
                                            PyObject* array) {
  return CopyBetweenLocalTensorAndNumpy<T>(tensor, array, CopyFromNumpyArray, "mut",
                                           /*block_host_until_done=*/false);
}

Internally, it calls the CopyBetweenLocalTensorAndNumpy method.

CopyBetweenLocalTensorAndNumpy

As the name suggests, this method is mainly used to copy data between NumPy arrays and tensors. Its third argument, CopyFromNumpyArray, is actually a callback function, which copies memory between the array and the tensor (blob) via SyncAutoMemcpy:

void CopyFromNumpyArray(ep::Stream* stream,
                        const std::shared_ptr<vm::EagerBlobObject>& eager_blob_object,
                        const NumPyArrayPtr& array_ptr) {
  SyncAutoMemcpy(stream, eager_blob_object->mut_dptr(), array_ptr.data(),
                 eager_blob_object->ByteSizeOfBlobBody(), eager_blob_object->mem_case(),
                 memory::MakeHostMemCase());
}

Continuing with the CopyBetweenLocalTensorAndNumpy (github.com/Oneflow-Inc… ) method, the key part is:

   JUST(PhysicalRun([&](InstructionsBuilder* builder) -> Maybe<void> {
      return builder->AccessBlobByCallback(
          tensor,
          [array_ptr, Copy](ep::Stream* stream,
                            const std::shared_ptr<vm::EagerBlobObject>& eager_blob_object) {
            Copy(stream, eager_blob_object, array_ptr);
          },
          modifier);
    }));

An AccessBlobByCallback instruction is built through the InstructionsBuilder. Its parameters are the empty tensor created earlier by EmptyFunctor, the callback function pointer and its arguments, and the modifier (the string "mut", meaning the blob may be mutated).
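The callback-style instruction can be sketched like this (a hypothetical Python analogue of the mechanism, not OneFlow's API): instead of copying immediately, we enqueue (blob, callback, modifier) and let a scheduler invoke the callback when the instruction actually executes.

```python
# Sketch of deferred, callback-based blob access: the copy runs only when
# the "VM" processes the enqueued instruction.
instruction_list = []

def access_blob_by_callback(blob, callback, modifier):
    instruction_list.append((blob, callback, modifier))

def run_vm():
    for blob, callback, modifier in instruction_list:
        callback(blob)          # the callback performs the actual copy
    instruction_list.clear()

blob = bytearray(4)

def copy_from_array(b):
    b[:] = b"\x01\x02\x03\x04"  # stand-in for CopyFromNumpyArray

access_blob_by_callback(blob, copy_from_array, "mut")
assert blob == bytearray(4)     # nothing copied yet: execution is deferred
run_vm()
print(list(blob))  # [1, 2, 3, 4]
```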

AccessBlobByCallback

Similar to OpCall, when InstructionsBuilder calls AccessBlobByCallback, it constructs the corresponding VM instruction policy, AccessBlobArgCbInstructionPolicy, and dispatches it to the VM to wait to be scheduled and actually executed:

template<typename T>
Maybe<void> InstructionsBuilder::AccessBlobByCallback(
    const T tensor,
    const std::function<void(ep::Stream*, const std::shared_ptr<vm::EagerBlobObject>&)>& callback,
    const std::string& modifier) {
  const std::shared_ptr<vm::EagerBlobObject>& eager_blob_object = JUST(tensor->eager_blob_object());
  Symbol<Device> device = JUST(GetDevice(tensor));
  ...
  Symbol<Stream> stream = JUST(GetDefaultStreamByDevice(device));
  JUST(SoftSyncStream({eager_blob_object}, stream));
  auto instruction = intrusive::make_shared<vm::Instruction>(
      // Never replace `stream` with producer_stream or last_used_stream.
      JUST(Singleton<VirtualMachine>::Get()->GetVmStream(stream)),
      std::make_shared<vm::AccessBlobArgCbInstructionPolicy>(eager_blob_object, callback,
                                                             modifier));
  instruction_list_->EmplaceBack(std::move(instruction));
  return Maybe<void>::Ok();
}

When this AccessBlobArgCbInstructionPolicy instruction actually executes, its Compute (
github.com/Oneflow-Inc…
) method invokes the callback to copy the data between the tensor's blob and the NumPy ndarray. At this point the copy is finished, and the creation of the flow.tensor is fully complete.

(Published with authorization. Original article: segmentfault.com/a/119000004… )

References

  • OneFlow source code: github.com/Oneflow-Inc…
  • OneFlow source code analysis: Op, Kernel, and the Interpreter
  • OneFlow source code analysis: operator instruction execution in the virtual machine

Welcome to download and try the latest version, OneFlow v0.8.0:
github.com/Oneflow-Inc…