TensorFlow new Op CUDA kernel memory management

There is no direct public guideline for this issue. I usually just let TensorFlow allocate the memory through the OpKernelContext:

template<typename Device, typename Dtype>
class MyOp : public OpKernel
{
public:
  explicit MyOp(OpKernelConstruction *ctx) :
      OpKernel(ctx)
  {
    // ...
  }

  void Compute(OpKernelContext *ctx) override
  {
    Tensor tmp_var;            // allocate_temp fills in a Tensor object
    Tensor* output = nullptr;  // allocate_output hands back a pointer to the output tensor

    TensorShape some_shape, some_shape2;

    // temporarily use this space
    OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
    // allocate memory for output tensor
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, some_shape2, &output));
  }
};

  1. Whatever needs memory should be allocated through the TensorFlow context and not by custom cudaMalloc or new type[num] calls.
  2. The context provides the appropriate Allocator.
  3. See below.

Consider, for the sake of simplicity, just adding two matrices. TensorFlow operations usually have the following structure:

  • An Op description using REGISTER_OP, which is responsible for shape checking and for setting the output shape.
  • An OpKernel, which is responsible for allocating memory, getting pointers to the inputs, and general setup (see above).
  • A functor for the implementation itself, called like this:

Tensor* output = nullptr;
Tensor tmp_var;
OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
// allocate_temp takes a data type (not an index) and a Tensor object to fill in
OP_REQUIRES_OK(ctx, ctx->allocate_temp(DT_FLOAT, some_shape, &tmp_var));
// the functor does not need to care about memory allocation, as everything is already set up at this point
::tensorflow::functor::MyFunctor<Device, Dtype>()(ctx, inputA, inputB, &tmp_var, output);

You are then left with implementing the functor for each device:

    // gpu version
    template <typename Dtype>
    struct MyFunctor<GPUDevice, Dtype> {
      void operator ()(::tensorflow::OpKernelContext* ctx,...);
    };

    // cpu version
    template <typename Dtype>
    struct MyFunctor<CPUDevice, Dtype> {
      void operator ()(::tensorflow::OpKernelContext* ctx,...);
    };
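
To make the split between kernel and functor more concrete, here is a minimal sketch of the same pattern for the matrix-addition case that compiles without TensorFlow. Everything in it is a stand-in and an assumption for illustration: CPUDevice and GPUDevice are empty tag structs, AddFunctor and RunAdd are hypothetical names, and a std::vector plays the role of the buffer that ctx->allocate_output would provide in a real Op.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in device tags. In TensorFlow these are typically typedefs for
// Eigen::ThreadPoolDevice and Eigen::GpuDevice.
struct CPUDevice {};
struct GPUDevice {};

// Primary template: only declared, each device supplies a specialization.
template <typename Device, typename Dtype>
struct AddFunctor;

// CPU specialization: works on raw pointers and allocates nothing itself.
template <typename Dtype>
struct AddFunctor<CPUDevice, Dtype> {
  void operator()(const Dtype* a, const Dtype* b, Dtype* out,
                  std::size_t n) const {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
  }
};

// Hypothetical stand-in for the OpKernel: it owns the allocation and then
// delegates the actual computation to the functor for the chosen device.
template <typename Device, typename Dtype>
std::vector<Dtype> RunAdd(const std::vector<Dtype>& a,
                          const std::vector<Dtype>& b) {
  assert(a.size() == b.size());
  std::vector<Dtype> out(a.size());  // role of ctx->allocate_output(...)
  AddFunctor<Device, Dtype>()(a.data(), b.data(), out.data(), out.size());
  return out;
}
```

Calling RunAdd<CPUDevice, float>(a, b) on two equally sized vectors returns their element-wise sum. A GPU specialization would follow the same shape, but would launch a CUDA kernel on device pointers instead of looping on the host; it is left out here so the sketch stays compilable with a plain C++ compiler.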

Edit:

  • allocate_persistent: use this if you need your data between Op invocations, e.g. for one-time index structures.
  • allocate_temp: temporary memory that is not retained once the Compute method returns.

But I highly recommend reading the comments on these methods in the source code and then deciding depending on your use case.
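
The lifetime difference between the two can be sketched without TensorFlow. In the hypothetical LookupKernel below (all names are assumptions for illustration), a member vector stands in for memory obtained via allocate_persistent: it is built on the first call and reused afterwards. A local vector stands in for allocate_temp: it exists only for the duration of one Compute call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an OpKernel; no TensorFlow types are used.
class LookupKernel {
 public:
  // Called once per invocation, like OpKernel::Compute().
  std::vector<float> Compute(const std::vector<float>& input) {
    const std::size_t n = input.size();

    // Persistent data: a one-time index structure (here simply a reversed
    // index), built on the first call and kept for later calls.
    if (index_.empty()) {
      index_.resize(n);
      for (std::size_t i = 0; i < n; ++i) index_[i] = n - 1 - i;
      ++index_builds_;
    }
    assert(index_.size() == n);

    // Temporary data: scratch space that is released when Compute returns.
    std::vector<float> scratch(n);
    for (std::size_t i = 0; i < n; ++i) scratch[i] = 2.0f * input[i];

    // Output: doubled input, read through the persistent index.
    std::vector<float> output(n);
    for (std::size_t i = 0; i < n; ++i) output[i] = scratch[index_[i]];
    return output;
  }

  int index_builds() const { return index_builds_; }

 private:
  std::vector<std::size_t> index_;  // survives between Compute calls
  int index_builds_ = 0;            // counts how often the index was built
};
```

Calling Compute twice on the same LookupKernel builds the index only once, while the scratch buffer is created and destroyed on every call. In a real Op both buffers would come from the OpKernelContext so that they live on the correct device.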


The best practice is to use the OpKernelContext::allocate_persistent() method to allocate memory, in the form of a tensorflow::Tensor, that outlives a single call to OpKernel::Compute(). It uses the appropriate Allocator* for the device, so if the kernel runs on a GPU device, it will allocate GPU memory for that particular device, and if it runs on a CPU device it will allocate CPU memory.
