Saturday, November 08, 2008

Parameter Variation

I varied the number of streams on an 8800 (the data set holds 16*1024*1024 integers). The results were as follows:

------------------------------------------------------------------------------------
memcopy: 41.54
kernel: 39.56
non-streamed: 80.29 (81.09 expected)
1 streams: 82.42 (81.09 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 41.55
kernel: 39.56
non-streamed: 80.29 (81.11 expected)
2 streams: 44.14 (60.33 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 41.54
kernel: 39.55
non-streamed: 80.29 (81.09 expected)
4 streams: 43.12 (49.93 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 41.55
kernel: 39.56
non-streamed: 80.32 (81.11 expected)
8 streams: 42.92 (44.75 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 41.55
kernel: 39.54
non-streamed: 80.28 (81.09 expected)
16 streams: 45.32 (42.14 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 41.54
kernel: 39.56
non-streamed: 80.28 (81.10 expected)
32 streams: 56.81 (40.85 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 41.54
kernel: 39.56
non-streamed: 80.29 (81.10 expected)
64 streams: 71.08 (40.21 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------

Well... with too many streams the execution time climbs again, from 42.92 at 8 streams to 56.81 at 32 and 71.08 at 64. The optimal value on the 8800 is 8 streams, with 4 streams close behind at 43.12.
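The "expected" column is not explained in the output itself. Checking the printed numbers, it appears to follow kernel + memcopy / nstreams (for example 39.56 + 41.55/2 ≈ 60.33), i.e. the ideal case where all but one stream's share of the copy overlaps with kernel execution. The sketch below is a minimal model under two assumptions that are not stated in the post: that formula, and an even split of the 16*1024*1024 integers across streams. The helper names are hypothetical, written only for this illustration.

```python
# Minimal model of the 8800 runs above.
# Assumption 1: "expected" = kernel + memcopy / nstreams (inferred from the printed values).
# Assumption 2: the data set is split evenly across streams.
N_INTS = 16 * 1024 * 1024


def expected_streamed(kernel, memcopy, nstreams):
    """Ideal time if everything except one stream's share of the copy overlaps with kernels."""
    return kernel + memcopy / nstreams


def ints_per_stream(nstreams):
    """Integers each stream handles under an even split."""
    if N_INTS % nstreams:
        raise ValueError("stream count must divide the data set evenly")
    return N_INTS // nstreams


# 8800 results from the output above: nstreams -> (memcopy, kernel, measured streamed time)
runs_8800 = {
    1: (41.54, 39.56, 82.42),
    2: (41.55, 39.56, 44.14),
    4: (41.54, 39.55, 43.12),
    8: (41.55, 39.56, 42.92),
    16: (41.55, 39.54, 45.32),
    32: (41.54, 39.56, 56.81),
    64: (41.54, 39.56, 71.08),
}

for n, (memcopy, kernel, measured) in sorted(runs_8800.items()):
    model = expected_streamed(kernel, memcopy, n)
    print(f"{n:2d} streams: {ints_per_stream(n):8d} ints/stream, "
          f"measured {measured:6.2f}, model {model:6.2f}")

# Stream count with the lowest measured time.
best = min(runs_8800, key=lambda n: runs_8800[n][2])
print("fastest measured stream count:", best)
```

The model keeps falling as streams are added, while the measured times bottom out at 8 streams. The model has no term for per-stream cost, and each stream's chunk shrinks to 262144 integers at 64 streams, which is one plausible reason the two diverge at high stream counts.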

Let's try the same test on a Tesla:

------------------------------------------------------------------------------------
memcopy: 20.64
kernel: 50.06
non-streamed: 70.68 (70.70 expected)
1 streams: 70.73 (70.70 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 20.80
kernel: 50.07
non-streamed: 70.81 (70.87 expected)
2 streams: 71.02 (60.47 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 20.67
kernel: 50.07
non-streamed: 70.68 (70.74 expected)
4 streams: 70.96 (55.24 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 20.65
kernel: 50.07
non-streamed: 70.83 (70.72 expected)
8 streams: 71.17 (52.65 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 20.84
kernel: 50.08
non-streamed: 70.81 (70.92 expected)
16 streams: 71.26 (51.38 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 20.62
kernel: 50.07
non-streamed: 70.68 (70.68 expected)
32 streams: 71.29 (50.71 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------
memcopy: 20.61
kernel: 50.07
non-streamed: 70.69 (70.68 expected)
64 streams: 71.68 (50.39 expected with compute capability 1.1 or later)
------------------------------------------------------------------------------------

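On the Tesla there was essentially no difference: every stream count took about 71, no better than the non-streamed run (about 70.7), even though the expected values drop to around 50 to 60. The number of streams had no effect on the Tesla. One possible reason is hinted at by the "compute capability 1.1 or later" note in the output: devices of compute capability 1.0 cannot overlap memory copies with kernel execution (the deviceOverlap field of cudaDeviceProp reports whether a device can), so if this Tesla is such a device, extra streams have nothing to overlap.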
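One way to put a number on "no effect" is to subtract each measured streamed time from the serial time memcopy + kernel; a positive result is copy time that streaming managed to hide. The sketch below applies this to the Tesla numbers from the output above, with the 8800 at 8 streams for contrast. The hidden_copy helper is hypothetical, written only for this illustration.

```python
# Sketch: how much copy time did streaming hide on each run?
# hidden = (memcopy + kernel) - measured; > 0 means copies overlapped with kernels.


def hidden_copy(memcopy, kernel, measured):
    """Serial time (copy then kernel) minus measured streamed time."""
    return (memcopy + kernel) - measured


# Tesla results from the output above: nstreams -> (memcopy, kernel, measured streamed time)
tesla_runs = {
    1: (20.64, 50.06, 70.73),
    2: (20.80, 50.07, 71.02),
    4: (20.67, 50.07, 70.96),
    8: (20.65, 50.07, 71.17),
    16: (20.84, 50.08, 71.26),
    32: (20.62, 50.07, 71.29),
    64: (20.61, 50.07, 71.68),
}

for n, (memcopy, kernel, measured) in sorted(tesla_runs.items()):
    print(f"Tesla {n:2d} streams: {hidden_copy(memcopy, kernel, measured):+.2f} "
          f"of {memcopy:.2f} copy hidden")

# For contrast: the 8800 with 8 streams, from the first output block.
print(f"8800   8 streams: {hidden_copy(41.55, 39.56, 42.92):+.2f} of 41.55 copy hidden")
```

Every Tesla value comes out slightly below zero, so streaming hid no copy time at all, while the 8800 at 8 streams hides most of its copy.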
