Still working on simpleStreams.
This time I varied the size of the data set (on an 8800 GT). All times are in milliseconds, as printed by the sample.
number of data: 4 * 1024
memcopy: 0.07
kernel: 0.05
non-streamed: 0.06 (0.12 expected)
4 streams: 0.22 (0.07 expected with compute capability 1.1 or later)
-------------------------------
number of data: 16 * 1024
memcopy: 0.09
kernel: 0.09
non-streamed: 0.11 (0.18 expected)
4 streams: 0.25 (0.11 expected with compute capability 1.1 or later)
-------------------------------
number of data: 64 * 1024
memcopy: 0.21
kernel: 0.20
non-streamed: 0.35 (0.41 expected)
4 streams: 0.37 (0.25 expected with compute capability 1.1 or later)
-------------------------------
number of data: 128 * 1024
memcopy: 0.38
kernel: 0.35
non-streamed: 0.66 (0.73 expected)
4 streams: 0.56 (0.45 expected with compute capability 1.1 or later)
-------------------------------
number of data: 256 * 1024
memcopy: 0.71
kernel: 0.66
non-streamed: 1.29 (1.37 expected)
4 streams: 0.97 (0.84 expected with compute capability 1.1 or later)
-------------------------------
number of data: 512 * 1024
memcopy: 1.37
kernel: 1.28
non-streamed: 2.55 (2.65 expected)
4 streams: 1.79 (1.62 expected with compute capability 1.1 or later)
-------------------------------
number of data: 1 * 1024 * 1024
memcopy: 2.67
kernel: 2.51
non-streamed: 5.05 (5.18 expected)
4 streams: 3.42 (3.18 expected with compute capability 1.1 or later)
-------------------------------
number of data: 4 * 1024 * 1024
memcopy: 10.50
kernel: 9.92
non-streamed: 20.13 (20.41 expected)
4 streams: 11.00 (12.54 expected with compute capability 1.1 or later)
-------------------------------
number of data: 16 * 1024 * 1024
memcopy: 41.55
kernel: 39.53
non-streamed: 80.32 (81.07 expected)
4 streams: 43.12 (49.91 expected with compute capability 1.1 or later)
-------------------------------
number of data: 32 * 1024 * 1024
memcopy: 83.06
kernel: 0.13
non-streamed: 81.53 (83.18 expected)
4 streams: 86.24 (20.89 expected with compute capability 1.1 or later)
-------------------------------
number of data: 64 * 1024 * 1024
memcopy: 165.80
kernel: 0.12
non-streamed: 163.01 (165.92 expected)
4 streams: 172.94 (41.57 expected with compute capability 1.1 or later)
-------------------------------
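The "expected" figures in the output above appear to follow a simple model: the non-streamed estimate is the copy time plus the kernel time, and the streamed estimate is the kernel time plus one stream's share of the copy. That is inferred from the printed numbers, not quoted from the sample source, so treat the sketch below as an assumption. The names `Expected` and `expected_times` are made up for illustration.

```cpp
#include <cassert>

// Estimated timings, inferred from the printed output (an assumption,
// not taken from the simpleStreams source).
struct Expected {
    double non_streamed;  // copy and kernel run back to back
    double streamed;      // kernel plus one stream's share of the copy
};

// memcopy_ms and kernel_ms are the measured "memcopy" and "kernel" lines.
Expected expected_times(double memcopy_ms, double kernel_ms, int nstreams) {
    assert(nstreams > 0);
    Expected e;
    e.non_streamed = kernel_ms + memcopy_ms;
    e.streamed = kernel_ms + memcopy_ms / nstreams;
    return e;
}
```

For the 4 * 1024 row, `expected_times(0.07, 0.05, 4)` gives 0.12 and 0.0675, in line with the printed 0.12 and 0.07. It also shows why the streamed estimate drops to about 20.89 at 32M: the measured kernel time of 0.13 goes straight into the formula.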
Interesting!
1. Why did the kernel take abnormally little time when the data set was larger than 16M?
2. Streams gave no advantage when the data set was smaller than 128K or larger than 16M.
3. When the data set was outside the presented range, the test failed, as shown below:
number of data: 1 * 1024
memcopy: 0.05
kernel: 0.05
non-streamed: 0.04 (0.10 expected)
4 streams: 0.24 (0.06 expected with compute capability 1.1 or later)
-------------------------------
a[0] = 0; c = 50
Test FAILED
number of data: 128 * 1024 * 1024
memcopy: 0.02
kernel: 0.03
non-streamed: 0.03 (0.05 expected)
4 streams: 0.11 (0.03 expected with compute capability 1.1 or later)
-------------------------------
a[0] = -1; c = 50
Test FAILED
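One possible explanation for observations 1 and 3, offered as a guess rather than a confirmed diagnosis, is the kernel launch configuration. The sketch below assumes the sample launches 512 threads per block with one thread per element and computes the block count by integer division; none of that is stated in this post. The grid limit of 65535 blocks per dimension applies to compute capability 1.x devices such as the 8800 GT. The names `kThreadsPerBlock`, `blocks_per_launch` and `launch_is_valid` are hypothetical.

```cpp
#include <cassert>

// Assumed launch configuration (not stated in the post):
// 512 threads per block, one thread per element, 1-D grid.
const long long kThreadsPerBlock = 512;
// Maximum grid x-dimension on compute capability 1.x hardware.
const long long kMaxGridDimX = 65535;

// Blocks per kernel launch when n elements are split evenly over nstreams launches.
long long blocks_per_launch(long long n, int nstreams) {
    assert(nstreams > 0);
    return n / (nstreams * kThreadsPerBlock);  // integer division truncates toward zero
}

// A launch can only succeed with at least one block and no more than the grid limit.
bool launch_is_valid(long long n, int nstreams) {
    long long blocks = blocks_per_launch(n, nstreams);
    return blocks >= 1 && blocks <= kMaxGridDimX;
}
```

Under these assumptions, 32M and 64M elements fail the single-launch check (65536 and 131072 blocks) but pass with 4 streams, which would fit a non-streamed kernel that returns almost immediately while the streamed result still verifies. 1K elements with 4 streams gives 0 blocks, and 128M with 4 streams gives 65536 blocks, so both streamed launches would fail. Separately, 128M four-byte elements need 512 MB, which may not fit on the card, so the allocation could be failing there as well. Calling `cudaGetLastError()` after each launch should confirm or rule this out.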