The goal of stream compaction is to remove unwanted elements from a given array. In this project I implemented several stream compaction algorithms on both the CPU and the GPU. The main steps are the same across implementations. First, map the input array to a boolean array where 1 (true) marks an element to keep and 0 (false) marks an element to remove. Then perform a prefix-sum scan on the boolean array to find the index of each kept element in the output array. Finally, scatter the elements using the boolean array and the index array to produce the final result.
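The three steps above can be sketched sequentially on the CPU (function and variable names here are illustrative, not the project's actual code; the example removes zeros):

```cpp
#include <cstddef>
#include <vector>

// Compact: keep the non-zero elements of `in` via map -> scan -> scatter.
std::vector<int> compact(const std::vector<int>& in) {
    size_t n = in.size();
    // 1. Map to booleans: 1 = keep, 0 = remove.
    std::vector<int> bools(n);
    for (size_t i = 0; i < n; ++i) bools[i] = (in[i] != 0) ? 1 : 0;
    // 2. Exclusive prefix-sum scan of the booleans gives each kept
    //    element's destination index in the output array.
    std::vector<int> indices(n, 0);
    for (size_t i = 1; i < n; ++i) indices[i] = indices[i - 1] + bools[i - 1];
    // 3. Scatter: each kept element goes to its scanned index.
    size_t outLen = n ? size_t(indices[n - 1] + bools[n - 1]) : 0;
    std::vector<int> out(outLen);
    for (size_t i = 0; i < n; ++i)
        if (bools[i]) out[indices[i]] = in[i];
    return out;
}
```

On the GPU, steps 1 and 3 are trivially parallel per-element kernels; only the scan in step 2 needs a parallel algorithm.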
First, the performance of the naive and work-efficient methods is tested with different block sizes. Two input array lengths are considered: a power of two (256) and a non-power of two (253). From the graph above we can see that the performance of the naive GPU scan changes little with block size, while the work-efficient scan achieves its best performance with a block size of 512. The block size for both the naive and the work-efficient scan is therefore set to 512 in the following tests.
We can see that the best-performing methods are the CPU scan and the Thrust scan. See the next section for a more detailed discussion.
## Questions
### Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation?
`CPU`: the computation is the bottleneck, while memory access is fast.
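For reference, the sequential CPU scan is a single pass with n - 1 additions (a sketch with illustrative names, not the project's actual code):

```cpp
#include <cstddef>
#include <vector>

// Sequential exclusive scan: one pass over the array, n - 1 additions.
std::vector<int> cpuScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    if (in.empty()) return out;
    out[0] = 0; // exclusive scan starts with the identity
    for (size_t i = 1; i < in.size(); ++i)
        out[i] = out[i - 1] + in[i - 1];
    return out;
}
```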
`GPU Naive`: in theory this should be faster than the CPU scan. Although the total number of additions, O(n log n), is much larger than in a simple CPU loop (n - 1 additions), many threads compute at the same time, so in the best case only O(log n) steps are needed. But memory access speed can become the bottleneck for the GPU scan: reading and writing global memory is really time-consuming. Besides, the naive GPU implementation above needs further optimization; warp divergence and bank conflicts also slow down the performance.
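A CPU simulation of the naive scan makes the O(n log n) work count concrete: there are log2(n) passes, and in pass with offset d every element i >= d performs one addition. On the GPU the inner loop is one kernel launch, so the depth is O(log n). (A sketch; names are illustrative.)

```cpp
#include <cstddef>
#include <vector>

// Naive scan, simulated sequentially: log2(n) passes over ping-pong
// buffers, each pass adding element (i - offset) into element i.
// Total additions: O(n log n); parallel depth on a GPU: O(log n).
std::vector<int> naiveScanInclusive(std::vector<int> a) {
    size_t n = a.size();
    for (size_t offset = 1; offset < n; offset <<= 1) {
        std::vector<int> b = a;                 // double buffer, as on the GPU
        for (size_t i = offset; i < n; ++i)     // one "kernel launch" per pass
            b[i] = a[i] + a[i - offset];
        a.swap(b);
    }
    return a; // inclusive scan; shift right and insert 0 for exclusive
}
```

The double buffer avoids the read-after-write hazard a single in-place array would have when threads run concurrently.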
`Work Efficient`: this algorithm requires only O(n) additions to scan an array, so as the array length grows it should theoretically beat the naive GPU scan. But it also suffers from the problems mentioned above.
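The work-efficient scan can be simulated on the CPU as an up-sweep (reduce) followed by a down-sweep over a balanced binary tree laid out in place; each phase performs n - 1 additions, giving O(n) total. (A sketch assuming n is a power of two, which is why the non-power-of-two case is padded in practice; names are illustrative.)

```cpp
#include <cstddef>
#include <vector>

// Work-efficient exclusive scan, simulated sequentially.
// Assumes a.size() is a power of two and at least 1.
std::vector<int> workEfficientScan(std::vector<int> a) {
    size_t n = a.size();
    // Up-sweep: build partial sums bottom-up (inner loop is parallel on GPU).
    for (size_t d = 1; d < n; d <<= 1)
        for (size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];
    // Down-sweep: zero the root, then traverse down, swapping and adding.
    a[n - 1] = 0;
    for (size_t d = n >> 1; d >= 1; d >>= 1)
        for (size_t i = 2 * d - 1; i < n; i += 2 * d) {
            int t = a[i - d];
            a[i - d] = a[i];
            a[i] += t;
        }
    return a; // exclusive scan of the original input
}
```

Note that in later passes of both sweeps only a fraction of the threads do useful work, which is one source of the warp divergence mentioned above.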
`Thrust`: warp divergence occurs less often in this implementation. I guess it may also use shared memory to speed up memory access.