Wednesday, July 29, 2009

Threads, Blocks and Grids

From NVIDIA's Optimizing CUDA presentation: an “experiment” mode discovers and saves the optimal launch configuration, rather than relying on heuristics alone.
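
A minimal sketch of what such an experiment could look like, assuming a throwaway kernel (myKernel and the candidate block sizes below are my own illustrations, not from the presentation): each candidate block size is timed with CUDA events and the fastest is kept.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel, just to give the timing loop something to launch.
    __global__ void myKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        int candidates[] = {64, 128, 192, 256, 384, 512};
        float bestMs = 1e30f;
        int bestThreads = 0;

        for (int c = 0; c < 6; ++c) {
            int threads = candidates[c];
            int blocks = (n + threads - 1) / threads;

            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);

            // Time one launch of the kernel at this block size.
            cudaEventRecord(start);
            myKernel<<<blocks, threads>>>(d_data, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%3d threads/block: %.3f ms\n", threads, ms);
            if (ms < bestMs) { bestMs = ms; bestThreads = threads; }

            cudaEventDestroy(start);
            cudaEventDestroy(stop);
        }

        printf("best configuration: %d threads per block\n", bestThreads);
        cudaFree(d_data);
        return 0;
    }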


Blocks per Grid Heuristics
  • The number of blocks should be greater than the number of multiprocessors, so every multiprocessor has at least one block to execute.
  • The number of blocks divided by the number of multiprocessors should be greater than 2 (see the sketch after this list)
    • Multiple blocks can run concurrently on a multiprocessor
    • Blocks that aren’t waiting at a __syncthreads() keep the hardware busy
    • Subject to resource availability: registers and shared memory
  • More than 100 blocks will scale to future devices
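
As a rough sketch of the first two heuristics, the multiprocessor count can be queried at runtime and the grid sized from it. This assumes device 0; the factor of 3 and the floor of 128 blocks are illustrative choices, not values from the presentation:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0

        // Heuristic: more than 2 blocks per multiprocessor, and more than
        // 100 blocks overall so the grid scales to future devices.
        int blocksPerGrid = 3 * prop.multiProcessorCount;
        if (blocksPerGrid <= 100) blocksPerGrid = 128;

        printf("multiprocessors: %d, blocks per grid: %d\n",
               prop.multiProcessorCount, blocksPerGrid);
        return 0;
    }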


Optimizing threads per block
  • Choose threads per block as a multiple of the warp size to avoid wasting computation on under-populated warps and to facilitate coalescing (see the sketch after this list)
  • Run as many warps as possible per multiprocessor to hide latency
  • A multiprocessor can run up to 8 blocks at a time
  • Heuristics
    • Minimum: 64 threads per block (only worthwhile if multiple blocks run concurrently)
    • 192 or 256 threads is usually a better choice (usually still enough registers to compile and launch successfully)
    • However, this all depends on your computation, so experiment!
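
A minimal sketch of the warp-multiple heuristic above; roundUpToWarp and the requested size of 200 are my own illustrations:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Round a requested thread count up to the next multiple of the warp
    // size so that no warp in the block is under-populated.
    static int roundUpToWarp(int threads, int warpSize) {
        return ((threads + warpSize - 1) / warpSize) * warpSize;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int requested = 200;   // arbitrary example; not a warp multiple
        int threadsPerBlock = roundUpToWarp(requested, prop.warpSize);
        printf("warp size %d: rounded %d up to %d threads per block\n",
               prop.warpSize, requested, threadsPerBlock);
        return 0;
    }

With a warp size of 32 this rounds 200 up to 224, so every warp in each block is fully populated.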
