It says that the profiler can:
- Execute a CUDA program with profiling enabled and view the profiler output as a table for each GPU method.
- Display the summary profiler table.
- Display various kinds of plots: GPU Time Height plot, GPU Time Width plot, etc.
- Analyze the profiler output: incoherent stores, incoherent loads, warp serializations
- Compare profiler output for multiple program runs of the same program or for different programs. Each program run is referred to as a session.
- Save profiling data for multiple sessions. A group of sessions is referred to as a project.
- Import/export CUDA Profiler data in CSV format
As far as I know, the profiler is available for the Mac but not for Linux. Unfortunately, my Mac does not have an NVIDIA card, and I do not use CUDA on my Mac either; I use it on Linux. =(
Thursday, October 08, 2009
NVIDIA's Fermi
Since Moayad was at the GPU Technology Conference last week, he has updated us with some information about the new CUDA architecture, called Fermi:
Fermi
- 512 CUDA cores
- 8x faster double-precision performance
- on-chip shared memory
- GigaThread engine with up to 16 kernels running in parallel
- ECC support (ability to deal with soft errors)
For more info, see the white paper
Monday, September 21, 2009
New GeForce GTX 295
Dr. Box has ordered a new card for the research that I am working on. I have not installed it yet and will probably need help from Himanshu. These are its specifications:
NVIDIA GeForce GTX 295
480 processor cores (240 per GPU)
Memory: 1792 MB GDDR3 (896 MB per GPU)
Memory Bandwidth: 223.8 GB/sec
Friday, September 18, 2009
Landscape in LaTeX & TeX to Word
Found on the TeX Blog.
To change the whole document use
\documentclass[landscape, 12pt]{report}
or
\usepackage[landscape]{geometry}
To change only single pages
\usepackage{lscape}
with
\begin{landscape}
...
\end{landscape}
TeX to Word
latex2rtf
Converters from LaTeX to PC Textprocessors
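Putting those fragments together, a minimal complete document (the body text is just filler of my own) that keeps most pages in portrait but rotates one page to landscape could look like this:

\documentclass[12pt]{report}
% lscape provides the landscape environment for single pages
\usepackage{lscape}

\begin{document}

This page stays in the normal portrait orientation.

\begin{landscape}
This page is rotated to landscape, which is handy for wide tables or figures.
\end{landscape}

Back to portrait again.

\end{document}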
Wednesday, July 29, 2009
Threads, Blocks and Grids
From NVIDIA Optimizing CUDA
“Experiment” mode discovers and saves optimal configuration
Blocks per Grid Heuristics
- Number of blocks should be more than number of multiprocessors, so all multiprocessors have at least one block to execute.
- Number of blocks divided by number of multiprocessors should be more than 2
- Multiple blocks can run concurrently in a multiprocessor
- Blocks that aren’t waiting at a __syncthreads() keep the hardware busy
- Subject to resource availability: registers, shared memory
- More than 100 blocks will scale well to future devices
Optimizing threads per block
- Choose threads per block as a multiple of warp size to avoid wasting computation on under-populated warps and to facilitate coalescing
- Run as many warps as possible per multiprocessor (hide latency)
- Multiprocessor can run up to 8 blocks at a time
- Heuristics
- Minimum: 64 threads per block (Only if multiple concurrent blocks)
- 192 or 256 threads a better choice (Usually still enough regs to compile and invoke successfully)
- However, this all depends on your computation, so experiment!
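To make these heuristics concrete, here is a small sketch of my own (the kernel and its names are hypothetical, not from the NVIDIA slides): a one-dimensional kernel launched with 256 threads per block, with the grid size derived from the problem size so every element is covered.

// Hypothetical element-wise kernel, used only to illustrate the launch configuration.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard the last, possibly under-populated block
        data[i] *= factor;
}

// Host-side launch: 256 threads per block (a multiple of the 32-thread warp size),
// and enough blocks to cover all n elements and keep the multiprocessors busy.
void launch_scale(float *d_data, float factor, int n)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, factor, n);
}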
Zero Copy (Mapped Memory): Directly Access Host Memory from The Device
I came across Zero Copy, which has just been introduced in CUDA 2.2, while I was looking into the concept of threads and blocks.
Quoted from NVIDIA Optimizing CUDA
- Access host memory directly from device code
- Transfers implicitly performed as needed by device code
- All set-up is done on host using mapped memory
What should be considered when using Zero Copy
- Zero copy will always be a win for integrated devices that utilize CPU memory (check this using the integrated field in cudaDeviceProp)
- Zero copy will be faster if data is only read/written from/to global memory once: copy input to GPU, run the kernel once, copy output to CPU
- Potentially easier and faster alternative to using cudaMemcpyAsync
- Current devices use 32-bit pointers, so there is a limit of 4 GB per context
Quoted from section 3.2.5.3 Mapped Memory in CUDA Programming Guide 2.2
On some devices, a block of page-locked host memory can also be mapped into the device’s address space by passing flag cudaHostAllocMapped to cudaHostAlloc(). Such a block has therefore two addresses: one in host memory and one in device memory. The host memory pointer is returned by cudaHostAlloc() and the device memory pointer can be retrieved using cudaHostGetDevicePointer() and used to access the block from within a kernel.
Accessing host memory directly from within a kernel has several advantages:
- There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel;
- There is no need to use streams (see Section 3.2.6.1) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel execution.
Since mapped page-locked memory is shared between host and device however, the application must synchronize memory accesses using streams or events (see Section 3.2.6) to avoid any potential read-after-write, write-after-read, or write-after-write hazards.
A block of page-locked host memory can be allocated as both mapped and portable (see Section 3.2.5.1), in which case each host thread that needs to map the block to its device address space must call cudaHostGetDevicePointer() to retrieve a device pointer, as device pointers will generally differ from one host thread to the other.
To be able to retrieve the device pointer to any mapped page-locked memory within a given host thread, page-locked memory mapping must be enabled by calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any other CUDA operations are performed by the thread. Otherwise, cudaHostGetDevicePointer() will return an error.
cudaHostGetDevicePointer() also returns an error if the device does not support mapped page-locked host memory.
Applications may query whether a device supports mapped page-locked host memory or not by calling cudaGetDeviceProperties() and checking the canMapHostMemory property.
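Based on the two quotes above, a minimal sketch of the zero-copy setup might look like the following (the kernel, its name, and the launch configuration are my own assumptions, not code from the guide):

// Hypothetical kernel that reads and writes mapped host memory directly.
__global__ void increment(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

void zero_copy_example(int n)
{
    float *h_data, *d_data;

    // Must be called before any other CUDA operation in this host thread.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate page-locked host memory mapped into the device address space.
    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);

    // Retrieve the device pointer for the same block of memory.
    cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

    for (int i = 0; i < n; ++i)
        h_data[i] = (float)i;

    // The kernel accesses host memory directly; no explicit cudaMemcpy is needed.
    increment<<<(n + 255) / 256, 256>>>(d_data, n);

    // Synchronize before the host reads the results back.
    cudaThreadSynchronize();

    cudaFreeHost(h_data);
}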
CUDA Warps
Quoted from Section 4.1 of CUDA Programming Guide 2.2.1: A Set of SIMT Multiprocessors with On-Chip Shared Memory
I have not quite understood this yet, but I am trying to read more.
The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity as illustrated in Figure 4-1. The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
A multiprocessor consists of eight Scalar Processor (SP) cores, two special function units for transcendentals, a multithreaded instruction unit, and on-chip shared memory. The multiprocessor creates, manages, and executes concurrent threads in hardware with zero scheduling overhead. It implements the __syncthreads() barrier synchronization intrinsic with a single instruction. Fast barrier synchronization together with lightweight thread creation and zero-overhead thread scheduling efficiently support very fine-grained parallelism, allowing, for example, a low granularity decomposition of problems by assigning one thread to each data element (such as a pixel in an image, a voxel in a volume, a cell in a grid-based computation).
To manage hundreds of threads running several different programs, the multiprocessor employs a new architecture we call SIMT (single-instruction, multiple-thread). The multiprocessor maps each thread to one scalar processor core, and each scalar thread executes independently with its own instruction address and register state. The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. (This term originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp.) Individual threads composing a SIMT warp start together at the same program address but are otherwise free to branch and execute independently.
When a multiprocessor is given one or more thread blocks to execute, it splits them into warps that get scheduled by the SIMT unit. The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Section 2.2 describes how thread IDs relate to thread indices in the block.
Every instruction issue time, the SIMT unit selects a warp that is ready to execute and issues the next instruction to the active threads of the warp. A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.
SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.
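To illustrate the divergence point with a toy example of my own (not from the guide): in the first kernel below, even and odd threads of the same warp take different branches, so the warp serializes both paths; in the second, the branch condition is uniform within each 32-thread warp, so no divergence occurs.

// Divergent: threads in the same warp take different paths, so both are serialized.
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}

// Divergence-free: the condition is constant within each warp of 32 threads,
// so every warp follows a single execution path.
__global__ void uniform(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}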
Tuesday, April 14, 2009
There must be something wrong with my 8800 GT
As I needed to record the execution time of matrix multiplication in order to show the performance of streaming in my paper, I went back to the GeForce 8800 GT machine and ran the application again. The results looked like this:
GeForce 8800 GT
MatDim: 16
Total with non stream: 0.11
Total with 8 streams: 47001.85
How could it take that long?! So I ran the same application on a Tesla C870, and the results looked like this:
Tesla C870
MatDim: 16
Total with non stream: 0.05
Total with 8 streams: 0.16
There must be something wrong with my 8800 GT, but what is it? :(
Wednesday, March 04, 2009
CUDA Timers vs Events
I just found the answer to the question of why the times I got from the CUDA timer were odd.
http://forums.nvidia.com/lofiversion/index.php?t72550.html
If you want to time the kernel accurately, use CUDA events. For example, look at the simpleStreams sample in the SDK to see how to use events for timing. Event API is described in the Programming Guide. Note that events are recorded on the GPU, so you'll be timing only GPU execution. The nice benefit is that clock resolution is the period of the GPU shader clock - you should get reliable timings even from a single kernel launch.
If you want to time operations including CPU involvement (like driver overhead), you should use your favorite CPU timer. Just make sure you understand the timer resolution. Also, as seibert pointed out, make sure to call cudaThreadSynchronize() before starting and then again before stopping the timer.
Do not ever use blocking CUDA calls (like memcopies) to achieve synchronicity - that will change your timings terribly.
Paulius
I should not have used the timer; I should have used events instead... T_T
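Following Paulius's advice, a minimal event-based timing sketch (the kernel and its launch configuration are placeholders of my own) would look like this:

// Dummy kernel, only here so the timing example is self-contained.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

float time_kernel(float *d_data, int n)
{
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                       // recorded on the GPU, default stream
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);                      // wait for the stop event
    cudaEventElapsedTime(&elapsed_ms, start, stop);  // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsed_ms;
}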
Sunday, January 25, 2009
Concept of Streams in CUDA
Regarding the model in the previous post, I have a gut feeling that it might not be correct. There must be something else going on, so I am looking into the concept of streams again.
Excerpt from page 41, section 4.5.1.5 of Programming Guide: Asynchronous Concurrent Execution.
Applications manage concurrency through streams. A stream is a sequence of operations that execute in order. Different streams, on the other hand, may execute their operations out of order with respect to one another or concurrently.
Any kernel launch, memory set, or memory copy function without a stream parameter or with a zero stream parameter begins only after all preceding operations are done, including operations that are part of streams, and no subsequent operation may begin until it is done.
I still cannot see what is wrong with it. T_T
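A small sketch of the pattern the excerpt describes (buffer names, chunk sizes, and the kernel are my own, not the simpleStreams source): each stream asynchronously copies its own chunk and then runs the kernel on it, so copies in one stream can overlap kernel execution in another on devices with compute capability 1.1 or later.

#include <stdlib.h>

// Hypothetical kernel operating on one chunk of the data.
__global__ void process(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

void run_streams(int *h_data, int *d_data, int n, int nstreams)
{
    // h_data must be page-locked (e.g. allocated with cudaMallocHost)
    // for the asynchronous copies to actually overlap.
    int chunk = n / nstreams;
    cudaStream_t *streams = (cudaStream_t *)malloc(nstreams * sizeof(cudaStream_t));

    for (int i = 0; i < nstreams; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < nstreams; ++i) {
        int offset = i * chunk;
        // Copy this stream's chunk to the device asynchronously...
        cudaMemcpyAsync(d_data + offset, h_data + offset,
                        chunk * sizeof(int), cudaMemcpyHostToDevice, streams[i]);
        // ...process it in the same stream (operations within a stream stay in order)...
        process<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_data + offset, chunk);
        // ...and copy the result back, still in the same stream.
        cudaMemcpyAsync(h_data + offset, d_data + offset,
                        chunk * sizeof(int), cudaMemcpyDeviceToHost, streams[i]);
    }

    cudaThreadSynchronize();  // wait for all streams to finish

    for (int i = 0; i < nstreams; ++i)
        cudaStreamDestroy(streams[i]);
    free(streams);
}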
Friday, January 23, 2009
Memory Copy Time Models
In today's meeting, Aig and James helped me solve the problem of the completion time in simpleStreams being much higher than the expected time. We also came up with some equations to model the completion time. There should be a diagram to explain this, but I have not drawn it yet.
The equations to calculate the total time of memory copy are:
T_S = n(t + l), for synchronous memory copy
T_A = (n/s)t + (n/s + s - 1)l, for asynchronous memory copy
where
n = number of data items
s = number of streams
l = latency (overhead) per copy
t = copy time per item
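As a quick sanity check of the model (with made-up values for n, s, t, and l rather than measured ones), a small helper like this can compare the two formulas; note that with s = 1 the asynchronous formula reduces to the synchronous one.

#include <stdio.h>

/* Synchronous copies: every item pays the full latency, one after another. */
static double t_sync(double n, double t, double l) {
    return n * (t + l);
}

/* Asynchronous copies split across s streams: the copy time is divided among
   the streams and the per-copy latencies partially overlap. */
static double t_async(double n, double s, double t, double l) {
    return (n / s) * t + ((n / s) + s - 1.0) * l;
}

int main(void) {
    double n = 16.0;   /* number of data items */
    double t = 1.0;    /* copy time per item   */
    double l = 0.25;   /* latency per copy     */
    for (int s = 1; s <= 8; s *= 2)
        printf("s = %d: T_S = %.2f, T_A = %.2f\n",
               s, t_sync(n, t, l), t_async(n, (double)s, t, l));
    return 0;
}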
Another idea that James raised in the meeting is data scheduling: keeping the GPU occupied while the data are being copied from the CPU.
Thanks to Aig and James for your help. ;)
Thursday, January 22, 2009
Various Sizes of Matrix Multiplication
I am trying to come up with a model, so I have to observe the behavior of some other applications. Now it is the turn of matrix multiplication (without streams). Both matrices had the same dimensions, and the tests ran on the 8800 GT.
BLOCK_SIZE 1 (Matrix size of 5 x 5 elements)
copy time of A = 0.027000 ms
copy time of B = 0.023000 ms
kernel time = 0.037000 ms
time of result copy = 0.026000 ms
BLOCK_SIZE 2 (Matrix size of 10 x 10 elements)
copy time of A = 0.028000 ms
copy time of B = 0.023000 ms
kernel time = 0.059000 ms
time of result copy = 0.031000 ms
BLOCK_SIZE 4 (Matrix size of 20 x 20 elements)
copy time of A = 0.034000 ms
copy time of B = 0.029000 ms
kernel time = 0.059000 ms
time of result copy = 0.028000 ms
BLOCK_SIZE 8 (Matrix size of 40 x 40 elements)
copy time of A = 0.039000 ms
copy time of B = 0.029000 ms
kernel time = 0.060000 ms
time of result copy = 0.044000 ms
BLOCK_SIZE 16 (Matrix size of 80 x 80 elements)
copy time of A = 0.054000 ms
copy time of B = 0.046000 ms
kernel time = 0.056000 ms
time of result copy = 0.093000 ms
BLOCK_SIZE 32 (Matrix size of 160 x 160 elements)
copy time of A = 0.138000 ms
copy time of B = 0.119000 ms
kernel time = 0.111000 ms
time of result copy = 0.247000 ms
Test FAILED (GPU returns 0.00)
#####
Next...
The multiplier (of BLOCK_SIZE) should be varied to get more results for various matrix dimensions.
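For reference, the computation being timed is square matrix multiplication; a naive, untiled sketch of such a kernel (my own, not the SDK sample, which I believe uses a shared-memory tiled version) looks like this:

// Naive square matrix multiplication: each thread computes one element of C.
__global__ void matMulNaive(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}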
Wednesday, January 14, 2009
Varying the Size of Block in Grid
It just came to mind that the block size in the previous tests was fixed at 512, one-dimensional. But what would the results be if it were changed?
By fixing the number of streams to 8 and the amount of data to 16M, the results are:
512x1
memcopy: 41.35
kernel: 39.56
non-streamed: 80.28 (80.92 expected)
8 streams: 42.80 (44.73 expected)
-------------------------------
256x1
memcopy: 41.36
kernel: 0.13
non-streamed: 40.75 (41.49 expected)
8 streams: 42.80 (5.30 expected)
-------------------------------
128x1
memcopy: 41.36
kernel: 0.13
non-streamed: 40.74 (41.49 expected)
8 streams: 42.81 (5.30 expected)
-------------------------------
64x1
memcopy: 41.36
kernel: 0.12
non-streamed: 40.75 (41.48 expected)
8 streams: 42.82 (5.29 expected)
-------------------------------
and the test failed for 1024x1 block size. Confused!!
Friday, January 02, 2009
simpleStreams :: Graph of the First Results
As I mentioned, I have been trying to get results from running simpleStreams with varying parameters (on a GeForce 8800 GT). Here is the graph.

It seems that running with 8 streams could be the best solution for this graphics card, especially for data sets of 512K and above. That might be because of a limitation of the card. :)
Thursday, December 25, 2008
OpenCL
What I am doing with simpleStreams is varying the parameters (nstreams and data set), collecting the results, and presenting them as a graph.
OpenCL is going to be very popular; I should learn it.
OpenCL - The open standard for parallel programming of heterogeneous systems
It supports both ATI Stream and CUDA.
Sunday, December 21, 2008
Driver API vs Runtime API
I just found another matrix multiplication sample that was implemented using the driver API. I do not know how different the driver API is from the runtime API, but most of the sample code is written using the runtime API.
Thursday, December 11, 2008
OpenCL supported on top of CUDA
I got this link from Dr. Box:
NVIDIA pioneering OpenCL support on top of CUDA
"NVIDIA, Apple's new MacBook chipset partner, is working hard to provide seamless support for OpenCL, the cross platform API Apple developed for Snow Leopard (10.6) to create a vendor neutral, open specification for parallel programming across any compliant GPU."
"Every GPU (including GeForce, Tesla and Quadro lines) from the GeForce 8 series onwards will support OpenCL. This gives OpenCL developers an installed base of more than 100 million GPUs."
"OpenCL, which has also been adopted as the latest GPGPU strategy by AMD, is a multi-vendor standard and so the expectation is that if a vendor has an OpenCL compliant implementation, code written in OpenCL should run seamlessly across their architectures. CUDA was designed to natively support all parallel computing interfaces and will seamlessly run OpenCL and future standards as they arise."
"Both C for CUDA and OpenCL share very similar constructs for defining data parallelism, so if the developers wish, porting that code to OpenCL after its full release, will be easy."
What's OpenCL?
Sunday, December 07, 2008
Vary the Number of Data in simpleStreams
Still working on simpleStreams.
I tried varying the size of the data set (on the 8800 GT).
number of data: 4 * 1024
memcopy: 0.07
kernel: 0.05
non-streamed: 0.06 (0.12 expected)
4 streams: 0.22 (0.07 expected with compute capability 1.1 or later)
-------------------------------
number of data: 16 * 1024
memcopy: 0.09
kernel: 0.09
non-streamed: 0.11 (0.18 expected)
4 streams: 0.25 (0.11 expected with compute capability 1.1 or later)
-------------------------------
number of data: 64 * 1024
memcopy: 0.21
kernel: 0.20
non-streamed: 0.35 (0.41 expected)
4 streams: 0.37 (0.25 expected with compute capability 1.1 or later)
-------------------------------
number of data: 128 * 1024
memcopy: 0.38
kernel: 0.35
non-streamed: 0.66 (0.73 expected)
4 streams: 0.56 (0.45 expected with compute capability 1.1 or later)
-------------------------------
number of data: 256 * 1024
memcopy: 0.71
kernel: 0.66
non-streamed: 1.29 (1.37 expected)
4 streams: 0.97 (0.84 expected with compute capability 1.1 or later)
-------------------------------
number of data: 512 * 1024
memcopy: 1.37
kernel: 1.28
non-streamed: 2.55 (2.65 expected)
4 streams: 1.79 (1.62 expected with compute capability 1.1 or later)
-------------------------------
number of data: 1 * 1024 * 1024
memcopy: 2.67
kernel: 2.51
non-streamed: 5.05 (5.18 expected)
4 streams: 3.42 (3.18 expected with compute capability 1.1 or later)
-------------------------------
number of data: 4 * 1024 * 1024
memcopy: 10.50
kernel: 9.92
non-streamed: 20.13 (20.41 expected)
4 streams: 11.00 (12.54 expected with compute capability 1.1 or later)
-------------------------------
number of data: 16 * 1024 * 1024
memcopy: 41.55
kernel: 39.53
non-streamed: 80.32 (81.07 expected)
4 streams: 43.12 (49.91 expected with compute capability 1.1 or later)
-------------------------------
number of data: 32 * 1024 * 1024
memcopy: 83.06
kernel: 0.13
non-streamed: 81.53 (83.18 expected)
4 streams: 86.24 (20.89 expected with compute capability 1.1 or later)
-------------------------------
number of data: 64 * 1024 * 1024
memcopy: 165.80
kernel: 0.12
non-streamed: 163.01 (165.92 expected)
4 streams: 172.94 (41.57 expected with compute capability 1.1 or later)
-------------------------------
Interesting!
1. Why did the kernel abnormally take less time when the data set was over 16M?
2. It could not take advantage of streams when the data set was less than 128K or more than 16M.
3. When the data set was outside the presented range, the test failed, as in the results below:
number of data: 1 * 1024
memcopy: 0.05
kernel: 0.05
non-streamed: 0.04 (0.10 expected)
4 streams: 0.24 (0.06 expected with compute capability 1.1 or later)
-------------------------------
a[0] = 0; c = 50
Test FAILED
number of data: 128 * 1024 * 1024
memcopy: 0.02
kernel: 0.03
non-streamed: 0.03 (0.05 expected)
4 streams: 0.11 (0.03 expected with compute capability 1.1 or later)
-------------------------------
a[0] = -1; c = 50
Test FAILED