Wednesday, July 29, 2009

Zero Copy (Mapped Memory): Directly Access Host Memory from The Device

I came across Zero Copy, which was just introduced in CUDA 2.2, while I was looking into the concept of threads and blocks.

Quoted from NVIDIA's Optimizing CUDA:


  • Access host memory directly from device code
  • Transfers implicitly performed as needed by device code
  • All set-up is done on host using mapped memory

What should be considered when using Zero Copy
  • Zero copy will always be a win for integrated devices that utilize CPU memory (check this using the integrated field in cudaDeviceProp; see the sketch after this list)
  • Zero copy will be faster if data is only read/written from/to global memory once: copy input to GPU, one kernel run, copy output to CPU
  • Potentially easier and faster alternative to using cudaMemcpyAsync
  • Current devices use pointers that are 32-bit so there is a limit of 4GB per context
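Just to make the property check mentioned above concrete, here is a minimal sketch of how I would query the integrated and canMapHostMemory fields (device 0 is picked only for illustration, and error checking is left out):

  #include <stdio.h>
  #include <cuda_runtime.h>

  int main()
  {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);   // device 0, chosen for illustration

      if (prop.integrated)
          printf("Integrated device: zero copy should always be a win.\n");
      if (!prop.canMapHostMemory)
          printf("This device cannot map host memory; stick with cudaMemcpy.\n");

      return 0;
  }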



Quoted from section 3.2.5.3 Mapped Memory in the CUDA Programming Guide 2.2:

On some devices, a block of page-locked host memory can also be mapped into the device’s address space by passing flag cudaHostAllocMapped to cudaHostAlloc(). Such a block has therefore two addresses: one in host memory and one in device memory. The host memory pointer is returned by cudaHostAlloc() and the device memory pointer can be retrieved using cudaHostGetDevicePointer() and used to access the block from within a kernel.
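To see the whole flow in one place, here is a rough sketch of how I read the API described above. The scale kernel and the buffer size are made up for illustration, and error checking is omitted; note that mapping has to be enabled up front, which the guide explains further below:

  #include <cuda_runtime.h>

  // Hypothetical kernel: reads and writes the mapped host buffer directly.
  __global__ void scale(float *data, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= 2.0f;
  }

  int main()
  {
      const int n = 1 << 20;

      // Mapping must be enabled before any other CUDA operation (see below).
      cudaSetDeviceFlags(cudaDeviceMapHost);

      // One allocation, two addresses: a host pointer and a device alias.
      float *h_data, *d_data;
      cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);
      cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

      for (int i = 0; i < n; ++i) h_data[i] = (float)i;

      // No cudaMemcpy: the kernel fetches the data over the bus as it touches it.
      scale<<<(n + 255) / 256, 256>>>(d_data, n);
      cudaThreadSynchronize();              // make the results visible on the host

      cudaFreeHost(h_data);
      return 0;
  }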

Accessing host memory directly from within a kernel has several advantages:
  • There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel;
  • There is no need to use streams (see Section 3.2.6.1) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel execution.

Since mapped page-locked memory is shared between host and device however, the application must synchronize memory accesses using streams or events (see Section 3.2.6) to avoid any potential read-after-write, write-after-read, or write-after-write hazards.
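For example, something along these lines (a sketch that reuses the scale kernel and the h_data/d_data buffer from the sketch above) keeps the host from reading the mapped buffer before the kernel has finished writing it:

  cudaEvent_t done;
  cudaEventCreate(&done);

  scale<<<(n + 255) / 256, 256>>>(d_data, n);   // kernel writes the mapped buffer
  cudaEventRecord(done, 0);                     // mark the end of the kernel
  cudaEventSynchronize(done);                   // wait: avoids a read-after-write hazard

  printf("first element: %f\n", h_data[0]);     // now safe to read on the host
  cudaEventDestroy(done);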

A block of page-locked host memory can be allocated as both mapped and portable (see Section 3.2.5.1), in which case each host thread that needs to map the block to its device address space must call cudaHostGetDevicePointer() to retrieve a device pointer, as device pointers will generally differ from one host thread to the other.

To be able to retrieve the device pointer to any mapped page-locked memory within a given host thread, page-locked memory mapping must be enabled by calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any other CUDA operation is performed by the thread. Otherwise, cudaHostGetDevicePointer() will return an error.

cudaHostGetDevicePointer() also returns an error if the device does not support mapped page-locked host memory.

Applications may query whether a device supports mapped page-locked host memory or not by calling cudaGetDeviceProperties() and checking the canMapHostMemory property.


