Threads, Blocks and Grids
From NVIDIA's Optimizing CUDA presentation: an "experiment" mode that sweeps launch configurations can discover and save the optimal configuration.
Blocks per Grid Heuristics
- The number of blocks should be greater than the number of multiprocessors, so every multiprocessor has at least one block to execute.
- The ratio of blocks to multiprocessors should be greater than 2 (see the sketch after this list):
  - Multiple blocks can run concurrently on a multiprocessor.
  - Blocks that aren't waiting at a __syncthreads() keep the hardware busy.
  - This is subject to resource availability: registers and shared memory.
- More than 100 blocks will scale to future devices.
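As a rough illustration of these block-count heuristics, here is a minimal sketch that queries the device's multiprocessor count and checks the blocks-per-multiprocessor ratio before launching. The kernel myKernel and the problem size n are hypothetical placeholders, not part of the original presentation.

```cuda
// Minimal sketch: check the blocks / multiprocessors > 2 heuristic.
// myKernel and n are hypothetical placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int n = 1 << 20;
    const int threadsPerBlock = 256;   // multiple of warp size
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    // Heuristic check: want blocks / multiprocessors > 2
    float blocksPerSM = (float)blocksPerGrid / prop.multiProcessorCount;
    printf("%d blocks over %d multiprocessors = %.1f blocks/SM\n",
           blocksPerGrid, prop.multiProcessorCount, blocksPerSM);

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```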
Optimizing threads per block
- Choose threads per block as a multiple of the warp size (32) to avoid wasting computation on under-populated warps and to facilitate coalescing.
- Run as many warps as possible per multiprocessor to hide latency.
- A multiprocessor can run up to 8 blocks at a time.
- Heuristics:
  - Minimum: 64 threads per block, and only if there are multiple concurrent blocks per multiprocessor.
  - 192 or 256 threads per block is usually a better choice (there are usually still enough registers to compile and invoke successfully).
- However, this all depends on your computation, so experiment! A timing sketch follows below.
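In the spirit of the "experiment" advice, here is a minimal sketch that times the same hypothetical kernel at several warp-multiple block sizes using CUDA events; the candidate sizes mirror the heuristics above.

```cuda
// Minimal sketch: sweep a few warp-multiple block sizes and time each.
// myKernel and n are hypothetical placeholders from the previous sketch.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    const int candidates[] = {64, 128, 192, 256, 512};   // all multiples of 32
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int tpb : candidates) {
        int blocks = (n + tpb - 1) / tpb;
        cudaEventRecord(start);
        myKernel<<<blocks, tpb>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block: %.3f ms\n", tpb, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

On a real kernel you would also warm up the device and average several runs per configuration before trusting the numbers.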