Wednesday, July 29, 2009

Threads, Blocks and Grids

From NVIDIA's Optimizing CUDA slides
“Experiment” mode discovers and saves optimal configuration


Blocks per Grid Heuristics
  • Number of blocks should be greater than the number of multiprocessors, so all multiprocessors have at least one block to execute.
  • Number of blocks divided by number of multiprocessors should be greater than 2
    • Multiple blocks can run concurrently on a multiprocessor
    • Blocks that aren’t waiting at a __syncthreads() keep the hardware busy
    • Subject to resource availability – registers, shared memory
  • A number of blocks greater than 100 will scale to future devices
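
A rough sketch of how these heuristics might look in code (this is my own sketch, not from the slides; the 256-thread block size and the problem size N are assumptions):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Query how many multiprocessors device 0 has.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        const int threadsPerBlock = 256;   // assumed block size
        const int N = 1 << 20;             // assumed number of elements

        // Enough blocks to cover all N elements...
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

        // ...and, per the heuristic, at least about two blocks per multiprocessor.
        int minBlocks = 2 * prop.multiProcessorCount;
        if (blocks < minBlocks)
            blocks = minBlocks;

        printf("%d multiprocessors -> %d blocks of %d threads\n",
               prop.multiProcessorCount, blocks, threadsPerBlock);
        return 0;
    }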


Optimizing threads per block
  • Choose threads per block as a multiple of warp size to avoid wasting computation on under-populated warps and to facilitate coalescing
  • Run as many warps as possible per multiprocessor (hide latency)
  • Multiprocessor can run up to 8 blocks at a time
  • Heuristics
    • Minimum: 64 threads per block (only if there are multiple concurrent blocks)
    • 192 or 256 threads per block is a better choice (usually still enough registers to compile and invoke successfully)
    • However, this all depends on your computation, so experiment!
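
For example, a minimal kernel and launch along these lines (the kernel, the data, and the sizes here are made up for illustration): one thread handles one element, the block size is a multiple of the 32-thread warp size, and the grid covers all N elements.

    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (i < n)
            data[i] *= factor;
    }

    // On the host, assuming d_data already points to N floats in device memory:
    //     const int N = 1 << 20;
    //     dim3 block(256);                         // multiple of the 32-thread warp size
    //     dim3 grid((N + block.x - 1) / block.x);  // enough blocks to cover N elements
    //     scale<<<grid, block>>>(d_data, 2.0f, N);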

Zero Copy (Mapped Memory): Directly Access Host Memory from The Device

I came across this Zero Copy feature, which was just introduced in CUDA 2.2, while I was looking into the concept of threads and blocks.

Quoted from NVIDIA Optimizing CUDA


  • Access host memory directly from device code
  • Transfers implicitly performed as needed by device code
  • All set-up is done on host using mapped memory

What should be considered when using Zero Copy
  • Zero copy will always be a win for integrated devices that utilize CPU memory (check this using the integrated field in cudaDeviceProp)
  • Zero copy will be faster if data is only read from or written to global memory once: copy input to GPU, run one kernel, copy output to CPU
  • Potentially an easier and faster alternative to using cudaMemcpyAsync
  • Current devices use 32-bit pointers, so there is a limit of 4 GB per context
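
To see how the pieces fit together, here is a minimal zero-copy sketch (my own, not from the slides; the kernel, the sizes, and the missing error checking are only for illustration): mapping is enabled first, the buffer is allocated with cudaHostAllocMapped, and the kernel accesses it through the device pointer returned by cudaHostGetDevicePointer().

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void increment(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1.0f;   // reads and writes the mapped host memory directly
    }

    int main()
    {
        const int N = 1024;

        // Mapping must be enabled before any other CUDA operation in this thread.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        // One allocation, two addresses: the host pointer here, the device pointer below.
        float *h_data = 0;
        cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocMapped);
        for (int i = 0; i < N; ++i)
            h_data[i] = (float)i;

        float *d_data = 0;
        cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

        increment<<<(N + 255) / 256, 256>>>(d_data, N);
        cudaThreadSynchronize();   // make sure the kernel is done before the host reads

        printf("h_data[0] = %f (expected 1.0)\n", h_data[0]);
        cudaFreeHost(h_data);
        return 0;
    }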



Quoted from Section 3.2.5.3 (Mapped Memory) of the CUDA Programming Guide 2.2

On some devices, a block of page-locked host memory can also be mapped into the device’s address space by passing flag cudaHostAllocMapped to cudaHostAlloc(). Such a block has therefore two addresses: one in host memory and one in device memory. The host memory pointer is returned by cudaHostAlloc() and the device memory pointer can be retrieved using cudaHostGetDevicePointer() and used to access the block from within a kernel.

Accessing host memory directly from within a kernel has several advantages:
  • There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel;
  • There is no need to use streams (see Section 3.2.6.1) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel execution.

Since mapped page-locked memory is shared between host and device however, the application must synchronize memory accesses using streams or events (see Section 3.2.6) to avoid any potential read-after-write, write-after-read, or write-after-write hazards.

A block of page-locked host memory can be allocated as both mapped and portable (see Section 3.2.5.1), in which case each host thread that needs to map the block to its device address space must call cudaHostGetDevicePointer() to retrieve a device pointer, as device pointers will generally differ from one host thread to the other.

To be able to retrieve the device pointer to any mapped page-locked memory within a given host thread, page-locked memory mapping must be enabled by calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any other CUDA operations are performed by the thread. Otherwise, cudaHostGetDevicePointer() will return an error.

cudaHostGetDevicePointer() also returns an error if the device does not support mapped page-locked host memory.

Applications may query whether a device supports mapped page-locked host memory or not by calling cudaGetDeviceProperties() and checking the canMapHostMemory property.
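
A small self-contained check along the lines of the last two paragraphs might look like this (device 0 is assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        if (prop.canMapHostMemory)
            printf("Device 0 can map page-locked host memory.\n");
        else
            printf("Device 0 cannot map host memory; cudaHostGetDevicePointer() will fail.\n");

        if (prop.integrated)
            printf("Integrated device: zero copy should always be a win.\n");
        return 0;
    }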



CUDA Warps

Quoted from Section 4.1 of CUDA Programming Guide 2.2.1: A Set of SIMT Multiprocessors with On-Chip Shared Memory


The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity as illustrated in Figure 4-1. The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

A multiprocessor consists of eight Scalar Processor (SP) cores, two special function units for transcendentals, a multithreaded instruction unit, and on-chip shared memory. The multiprocessor creates, manages, and executes concurrent threads in hardware with zero scheduling overhead. It implements the __syncthreads() barrier synchronization intrinsic with a single instruction. Fast barrier synchronization together with lightweight thread creation and zero-overhead thread scheduling efficiently support very fine-grained parallelism, allowing, for example, a low granularity decomposition of problems by assigning one thread to each data element (such as a pixel in an image, a voxel in a volume, a cell in a grid-based computation).

To manage hundreds of threads running several different programs, the multiprocessor employs a new architecture we call SIMT (single-instruction, multiple-thread). The multiprocessor maps each thread to one scalar processor core, and each scalar thread executes independently with its own instruction address and register state. The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. (This term originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp.) Individual threads composing a SIMT warp start together at the same program address but are otherwise free to branch and execute independently.

When a multiprocessor is given one or more thread blocks to execute, it splits them into warps that get scheduled by the SIMT unit. The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Section 2.2 describes how thread IDs relate to thread indices in the block.

Every instruction issue time, the SIMT unit selects a warp that is ready to execute and issues the next instruction to the active threads of the warp. A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.
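
A small kernel to illustrate this (my own example, not from the guide): the first branch depends on the lane within a warp, so every warp serializes both paths; the second depends only on the warp index, so no warp diverges.

    __global__ void divergence_demo(int *out)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;   // position within the warp
        int warp = threadIdx.x / 32;   // warp index within the block

        if (lane < 16)                 // diverges inside every warp: both paths run serially
            out[tid] = lane;
        else
            out[tid] = -lane;

        if (warp == 0)                 // uniform within each warp: no divergence
            out[tid] += 1000;
    }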

SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.



I did not quite understand all of it, but I was trying to read more.


Monday, July 20, 2009

Out to Swim (Again)

I intended to go swimming this afternoon... Really intended!

Around midday I made up my mind to go swimming. I really meant it.

I left my house at nearly four and was glad that the sunlight was not as strong as on the other days. I am not sure how long I had been swimming, perhaps about half an hour. Then the rain came. Though it was an indoor swimming pool, the staff called everyone out of the pool. Do not ask me why. I do not know.

I left the house at almost four o'clock. Luckily the sun was not too strong today. When I got there I swam; I am not sure for how long, maybe about half an hour. Then the rain poured down. The pool is actually inside a building, but the staff still called everyone out of the water. Don't ask me why. I have no idea.

At least I got some exercise after all. I walked to the pool, swam, and walked back. Hehe

Oh well, at least I got some exercise: I walked to the pool, swam, and walked back. Hehe


Wednesday, July 15, 2009

Summer Class

It has been three days since the class began. I have class every day this quarter, and there are only six weeks until the end of the quarter. The weather is so hot that I do not want to step outside. :(

These days I have been addicted to the comics that Pao (my senior) introduced me to. It is bad. I should be focusing on studying and my research.


Wednesday, July 08, 2009

At McDonald's

Today, I had lunch at McDonald's. It was the first meal that I ate out since I got back to Ruston, mainly because I wanted somewhere to hide from the sun. I had an appointment with the landlady of my new apartment at 10 AM, so I left my house around 9:10, walked, and got there around 9:40. It was so hot because of the sunlight. After talking with the landlady I walked back. She actually offered me a ride, but I wanted to stop by somewhere first (which I realized afterward was closed on Wednesdays :( ).

Today I had lunch at McDonald's, the first meal I have eaten out since getting back to Ruston, all because I wanted somewhere to hide from the sun. I walked to the new apartment starting at ten past nine because I had an appointment with the landlady at ten, and got there at nine forty. It was brutally hot and the sun was fierce. After talking with the landlady I walked back. She actually offered me a ride, but I wanted to stop somewhere to buy a few things first (which I found out afterward is closed on Wednesdays :( so annoying!).

I stopped by County Market around 10:30 and grabbed some bread, instant noodles, etc. Then I crossed the street, walked into McDonald's, placed an order, paid for it, and waited. I waited in front of the counter for a couple of minutes, but no one got me my meal. :(

So I stopped by County Market around half past ten, bought some bread, instant noodles, and a few other things, then crossed the street to McDonald's, ordered, paid, and stood waiting. I waited in front of the counter for ages and still nobody brought me my hamburger.

However, at last there was a black girl who realized that I was waiting for something. She asked me, and I told her I was waiting for my burger and fries. She got me my cheeseburger, and another black girl told me to wait for the fries; they were not done yet. After a couple of minutes, as the fries were getting done, the first girl asked me again because she saw me still waiting there.

Finally one of the girls noticed that I was waiting for something and came over to ask. I told her I was waiting for my burger and fries. She handed me the cheeseburger, and the other girl told me to wait for the fries, which were still frying. A few minutes later, when the fries were done, the first girl came to ask again what I was still waiting for.

"My fries..." I told her.

"เฟรนช์ฟรายหนูล่ะ..."

I finally got my fries, with a bunch of ketchup packets. :)

I got my fries at last, along with a handful of ketchup packets. :)


Monday, July 06, 2009

Let's Jump into the Water!

I just got back from the swimming pool. It was good to get some exercise, but I was so tired. Tech's swimming pool is OK, not bad but not too good. At least it is bigger than the ones in my village back in Thailand.

I just got back from swimming. It was good to get some exercise, but I was really tired. The pool here is OK, not bad but not great. At least it is bigger than the pool in my village back in Thailand.

My sister had told me to go swimming sometimes. She said that, for me, it is better than other sports since I have varicose veins (a blood circulation problem) in my legs. They used to cause me pain when I played badminton or other sports. And I also have an allergy. Let us see if I can get rid of it!

My sister once told me to go swimming sometimes, because I have varicose veins in my legs that hurt when I play other sports, and I also have an allergy. Let's see whether it goes away.


Small Party at Gan's Apartment

There was a small party (only Thais) at Gan's apartment last night. There was nothing but hamburgers and salad. Aig and Mon had prepared the hamburgers for us. They were really nice, weren't they? :) There was some alcoholic drink as well. Mon and I did not touch it, for sure.

It was also my first time grilling hamburgers. (Actually Aid, Mon, and Gan did it.) They sprayed some oil and put salt, black pepper, and thyme on the beef patties. I had no idea there would be thyme on the hamburgers. Because it had rained all day, it was hard to light the stove, but Gan managed it at last. The hamburgers were good; that might be because we made them ourselves. :)


Wednesday, July 01, 2009

It's SUMMER!

I have been back for three days, and every day the temperature has been pushing a hundred degrees. I could almost scream.

Right now it is very hot in Ruston. The air conditioner runs all day and all night, so the electricity bill is going to balloon. And it will probably stay this hot for at least three more months. I might as well melt.