
Commit 65f128f ("Readme")

1 parent 070afbf

5 files changed: 82 additions & 3 deletions

File tree

Project2-Stream-Compaction/README.md

Lines changed: 80 additions & 1 deletion
@@ -7,7 +7,7 @@ CUDA Stream Compaction
 * [LinkedIn](https://www.linkedin.com/in/jiangping-xu-365b19134/)
 * Tested on: Windows 10, i7-4700MQ @ 2.40GHz 8GB, GT 755M 6100MB (personal laptop)
 _________________________________________________________________________
-[Introduction](#Stream-Compaction) - [Performance Analysis](#performance-analysis) - [Questions](#questions)
+[Introduction](#Stream-Compaction) - [Performance Analysis](#performance-analysis) - [Questions](#questions) - [Output](#output)
 _________________________________________________________________________

 ## Introduction

 The goal of stream compaction is to remove unwanted elements from a given array. In this project I implemented several stream compaction algorithms on both the CPU and the GPU. The main steps are essentially the same across implementations. First, map the input array to a boolean array, where 1 (true) marks an element to keep and 0 (false) marks an element to remove. Then perform a prefix-sum (scan) over the boolean array to find the index of each kept element in the output array. Finally, scatter the elements using the boolean array and the index array to produce the final result.
@@ -21,5 +21,84 @@ All features are as follows:
 * A GPU scan using Thrust library.

## Performance Analysis

__Find The Optimal Block Size__

<p align="center">
  <img src="img/ScanTimeCostWithIncreasingBlockSize.png">
</p>

First, the naive and work-efficient scans were timed with different block sizes, using two input array lengths: a power of two (256) and a non-power of two (253). The graph above shows that the naive GPU scan's performance changes little with block size, while the work-efficient scan performs best with a block size of 512. The block size for both the naive and the work-efficient scan is therefore set to 512 in the following tests.
__Comparison__

<p align="center">
  <img src="img/ScanTimeCostWithIncreasingInputArrayLength.png">
</p>

The best-performing methods are the CPU scan and the Thrust scan. See the next section for a more detailed discussion.
## Questions

### Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation?

`CPU`: computation is the bottleneck; memory access is fast.
`GPU Naive`: in theory this should be faster than the CPU scan. Although the total number of additions, O(n log n), is much larger than in the simple CPU loop (n - 1), many threads run at the same time, so in the best case only O(log n) steps are needed. In practice, memory access speed becomes the bottleneck for the GPU scan: reading and writing global memory is very time consuming. Besides, the naive GPU implementation above needs further optimization, since warp divergence and bank conflicts also slow it down.
43+
44+
`Work Efficient`: this algorithm requires O(n) additions to scan an array. When the array length goes up, theoratically it beats the naive GPU scan. But it also suffers from the problems mentioned above.
45+
46+
`Thrust`: warp partition is less occured in this implememntation. I guess it may also use the share memory to speed up memory access.
## Output

blockSize = 512, ArraySize = 256 / 253
```
****************
** SCAN TESTS **
****************
[ 9 35 14 13 10 25 14 25 43 0 7 36 26 ... 38 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0008ms (std::chrono Measured)
[ 0 9 44 58 71 81 106 120 145 188 188 195 231 ... 6032 6070 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0008ms (std::chrono Measured)
[ 0 9 44 58 71 81 106 120 145 188 188 195 231 ... 5983 5998 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.05296ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.0496ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.092896ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.090368ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.093824ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.092896ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 1 2 1 2 3 0 3 1 2 3 2 2 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.0013ms (std::chrono Measured)
[ 1 1 2 1 2 3 3 1 2 3 2 2 2 ... 2 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.0011ms (std::chrono Measured)
[ 1 1 2 1 2 3 3 1 2 3 2 2 2 ... 1 2 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0049ms (std::chrono Measured)
[ 1 1 2 1 2 3 3 1 2 3 2 2 2 ... 2 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.222208ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.157696ms (CUDA Measured)
passed
```
(Two image files added, 111 KB and 156 KB: the block-size and array-length scan plots referenced above. Not rendered here.)

Project2-Stream-Compaction/stream_compaction/efficient.cu

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 #include "efficient.h"

 #define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
-#define blockSize 64
+#define blockSize 512

 namespace StreamCompaction {
 namespace Efficient {

Project2-Stream-Compaction/stream_compaction/naive.cu

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 #include "naive.h"

 #define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
-#define blockSize 64
+#define blockSize 512

 namespace StreamCompaction {
 namespace Naive {
