diff --git a/README.md b/README.md
index 0e38ddb..a701eaa 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,283 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Megan Reddy
+  * [LinkedIn](https://www.linkedin.com/in/meganr25a949125/), [personal website](https://meganr28.github.io/)
+* Tested on: Windows 10, AMD Ryzen 9 5900HS with Radeon Graphics @ 3301 MHz 16GB, NVIDIA GeForce RTX 3060 Laptop GPU 6GB (Personal Computer)
+* Compute Capability: 8.6
 
-### (TODO: Your README)
+### Overview
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+**Stream compaction** is a common algorithm used in path tracing, collision detection, sparse matrix multiplication, and many other applications.
+The idea is to produce a new array containing only the elements that meet a given criterion, while preserving their order from the original array. Elements that do not meet the criterion are "discarded" and do not make
+it into the final array. An illustration of this process is provided below:
+
+![Stream Compaction](img/stream-compaction.jpg)
+
+*Image Credit: NVIDIA*
+
+#### Prefix-Sum
+
+The all-prefix-sums or **scan** operation computes the consecutive sums of the first 0, 1, ..., n elements of an array.
+
+```
+a0 = I (the identity, 0 for addition)
+a1 = a0
+a2 = a0 + a1
+...
+
+Input array:  3 1 7 0  4  1  6  3
+Output array: 0 3 4 11 11 15 16 22
+```
+There are two types of scan operations - the **exclusive scan** and the **inclusive scan**.
+
+##### Exclusive Scan
+
+In an exclusive scan, element j of the result does not include element j of the input.
+
+```
+3 1 7 0  4  1  6  3
+0 3 4 11 11 15 16 22
+```
+
+For this assignment, the term "scan" will refer to an exclusive scan.
+
+##### Inclusive Scan
+
+In an inclusive scan, element j of the result includes element j of the input.
+
+```
+3 1 7 0  4  1  6  3
+3 4 11 11 15 16 22 25
+```
+
+#### Applications to Stream Compaction
+
+In stream compaction, there are three main steps (a worked example follows the list):
+
+1. Computing a temporary array of 1s/0s signifying which elements meet the criterion
+2. Running an exclusive scan on the temporary array (this yields each kept element's index in the output array)
+3. Writing the elements that meet the criterion to the output array (scattering)
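+
+For example, compacting a five-element array with the criterion "element is nonzero" proceeds as follows:
+
+```
+Input:            3 0 1 0 2
+Step 1 (bools):   1 0 1 0 1
+Step 2 (scan):    0 1 1 2 2    <- exclusive scan of bools = output indices
+Step 3 (scatter): out[0] = 3, out[1] = 1, out[2] = 2
+Output:           3 1 2
+```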
+
+#### CPU vs. GPU Scan Implementations
+
+##### CPU - Serial Version
+
+The serial version of **scan** is a single for loop in which each output element is the sum of the previous output element and the previous input element (sketched below).
+CPU stream compaction passes over the input array once to create a temporary array of booleans. It then scans the boolean array
+to compute the array of indices at which to place the valid elements in the output array. The final pass uses the boolean array to check whether an element
+has passed, and if it has, retrieves its index from the scan array and writes the original input value to this index
+in the output array. The total runtime is O(n).
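+
+The scan loop itself is tiny; this mirrors the `scan` implementation in `cpu.cu` further down:
+
+```
+// Serial exclusive scan: odata[i] holds the sum of idata[0..i-1]
+odata[0] = 0;
+for (int i = 1; i < n; ++i) {
+    odata[i] = odata[i - 1] + idata[i - 1];
+}
+```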
+
+##### GPU - Parallel Version
+
+There are two ways to implement the parallel version of the scan algorithm. One is the **naive** method, which launches n threads
+to compute partial sums. A visualization is below (image credit: [NVIDIA](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda)):
+
+![Naive](img/figure-39-2.jpg)
+
+```
+for d = 1 to log2 n:
+    for all k in parallel:
+        if k >= 2^(d-1) then
+            out[k] = in[k - 2^(d-1)] + in[k]
+        else
+            out[k] = in[k]
+```
+
+However, this algorithm is not very efficient. It performs a total of O(n log2 n) adds, whereas the CPU implementation performs O(n) adds.
+We can improve on this with the **work-efficient** scan algorithm. It removes the extra factor of log2 n by building a balanced binary tree
+over the input data and sweeping this tree up and down to compute the prefix sum (image credit: [NVIDIA](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda)).
+
+![Upsweep](img/upsweep.jpg) \
+*An illustration of the upsweep (reduction) portion of the algorithm.*
+
+![Downsweep](img/downsweep.jpg) \
+*An illustration of the downsweep portion of the algorithm.*
+
+```
+// Upsweep
+for d = 0 to log2 n - 1:
+    for all k = 0 to n - 1 by 2^(d+1) in parallel:
+        x[k + 2^(d+1) - 1] += x[k + 2^d - 1]
+
+// Downsweep
+x[n - 1] = 0
+for d = log2 n - 1 down to 0:
+    for all k = 0 to n - 1 by 2^(d+1) in parallel:
+        t = x[k + 2^d - 1]
+        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
+        x[k + 2^(d+1) - 1] += t
+```
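+
+Tracing both sweeps on the eight-element input from above makes the algorithm concrete (after the upsweep, the root holds the total sum; zeroing it and sweeping down yields the exclusive scan):
+
+```
+Input:          3  1  7  0  4  1  6  3
+Upsweep d=0:    3  4  7  7  4  5  6  9
+Upsweep d=1:    3  4  7 11  4  5  6 14
+Upsweep d=2:    3  4  7 11  4  5  6 25    <- root = total sum
+Zero the root:  3  4  7 11  4  5  6  0
+Downsweep d=2:  3  4  7  0  4  5  6 11
+Downsweep d=1:  3  0  7  4  4 11  6 16
+Downsweep d=0:  0  3  4 11 11 15 16 22    <- exclusive scan
+```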
+
+A third way of implementing scan on the GPU is to use the **Thrust** library.
+Thrust comes with a built-in function called ```exclusive_scan()``` that can be called on device vectors.
+An example invocation would be ```thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin())```.
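+
+As a minimal end-to-end sketch (mirroring the `thrust.cu` implementation further down; the wrapper name and the use of `thrust::copy` for the copy-back are illustrative choices, not part of the assignment API):
+
+```
+#include <thrust/copy.h>
+#include <thrust/device_vector.h>
+#include <thrust/scan.h>
+
+void scanWithThrust(int n, int *odata, const int *idata) {
+    // Upload the raw host input into a device vector
+    thrust::device_vector<int> dev_in(idata, idata + n);
+    thrust::device_vector<int> dev_out(n);
+
+    // Exclusive prefix sum, computed on the GPU
+    thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
+
+    // Copy the result back to the host array
+    thrust::copy(dev_out.begin(), dev_out.end(), odata);
+}
+```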
+
+### Extra Credit Features
+
+#### Why is My GPU Approach So Slow?
+
+In the diagrams below, we can see that the original work-efficient scan algorithm causes warp divergence. Not all threads
+in a warp are active, yet the warp cannot be retired because some of its threads are still performing work. In the original upsweep/downsweep procedures,
+we halve/double the number of active threads at each level. These threads are not at consecutive indices, leaving the other threads in each warp idle.
+In the optimized version, we instead make threads at consecutive indices do the work so that we can retire unused warps early and limit divergence.
+
+| 4 divergent warps | 1 divergent warp |
+:-------------------------:|:-------------------------:
+![Divergent Warps](img/4-div-warp.PNG) | ![Coherent Warps](img/1-div-warp.PNG)
+
+In code, this is a simple change to the indexing. For example, in the upsweep kernel, we check that `index < stride` and, if so, make that
+thread do the work. We do have to recalculate the index offsets so that threads write to the correct places. The same positions are written to, but now different thread numbers do
+the work.
+
+```
+if (index < stride) {
+    out[offset_d * (2 * index + 2) - 1] += out[offset_d * (2 * index + 1) - 1];
+}
+
+// NOTE: offset_d = 2^d
+```
+
+The downsweep kernel is upgraded in the same way to work on consecutive thread indices; the indexing change is sketched below, and the full kernel is in `efficient.cu`. A performance comparison between the optimized and unoptimized versions
+can be found in the **Performance Analysis** section below.
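+
+The corresponding logic from `kernDownsweep` (here `d` is the number of active nodes at the current level and `stride` is the element offset between them):
+
+```
+if (index < d) {
+    // Old left child
+    int t = odata[stride * (2 * index + 1) - 1];
+    // Set left child to current node value
+    odata[stride * (2 * index + 1) - 1] = odata[stride * (2 * index + 2) - 1];
+    // Set right child = old left child + current node value
+    odata[stride * (2 * index + 2) - 1] += t;
+}
+```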
+
+### Performance Analysis
+
+All performance data was recorded in **Release Mode** in **Visual Studio 2019**. Additionally, the results for non-power-of-two
+arrays are plotted, since power-of-two arrays may not be encountered as often in practice.
+
+Before comparing the implementations, it was necessary to find the optimal block size for each GPU algorithm.
+An array size of **2^25 (33,554,432) elements** was used for these tests. A comparison of the runtimes is
+shown below.
+
+#### Runtime vs. Block Size (Scan)
+
+![Block Size vs. Runtime Scan](img/block-scan.png)
+
+#### Runtime vs. Block Size (Compaction)
+
+![Block Size vs. Runtime Compaction](img/block-optimize.png)
+
+The optimal block sizes were 128, 256, and 512 for the naive, unoptimized work-efficient, and optimized work-efficient implementations, respectively.
+It is worth noting that there is not much performance gain from increasing the block size past 256 for any of the
+implementations. In fact, it seems to slow them down a bit (notice how the curves on both
+the scan and compaction graphs slope upwards after 256). This is most likely because larger block sizes
+can leave many threads unused and lower occupancy. Since a GPU's goal is to hide the latency of memory
+accesses, it needs a block size that allows more blocks to be active at once.
+
+Now that the optimal block size is chosen for each implementation, we can compare how the CPU and GPU algorithms
+perform on arrays of different sizes. To determine the runtimes, a timer was started after initial memory
+allocation and stopped before deallocation.
+
+#### Runtime vs. Array Size (Scan)
+
+![Array Size vs. Runtime Scan](img/array-scan.png)
+
+The Thrust implementation performed the fastest out of all the implementations. Since this is a well-established and tested library,
+there are likely several optimizations that make it perform better than the other implementations. When examining the timeline
+in Nsight Systems 2022.3.4, we can see that Thrust uses `cudaMemcpyAsync` as well as `cudaStreamSynchronize` instead of `cudaDeviceSynchronize`.
+
+![Thrust Timeline](img/thrust-memcpy.PNG)
+
+The CPU implementation performed faster than the naive and work-efficient implementations for smaller array sizes (< 15 million elements). Since memory overhead
+was not factored into the timing, one possible cause is that for small array sizes the extra overhead of launching kernels made the GPU implementations slower.
+Additionally, warp divergence and high numbers of idle threads could have made the GPU implementations less efficient.
+For larger array sizes (> 20 million elements), we can start to see the benefits of a parallel implementation. The work-efficient implementation significantly outperforms
+the CPU and naive implementations at high element counts. Since the CPU processes each element serially, it is understandable why it becomes significantly slower with many elements.
+The naive version is slower because it adds an extra factor of log2 n to its runtime, which can become expensive at high element counts.
+
+The optimized version of the work-efficient algorithm performed better than the unoptimized version throughout. Since the indexing scheme
+during the upsweep and downsweep steps was changed to use consecutive increasing thread indices, there were fewer divergent branches between threads and warps could be retired early.
+
+![Unoptimized vs. Optimized Work-Efficient](img/we-optimize.png)
+
+#### Runtime vs. Array Size (Compaction)
+
+![Array Size vs. Runtime Compaction](img/array-compact.png)
+
+The CPU with Scan implementation performed the slowest. This is most likely because it makes essentially three passes over the data, each pass being of size n.
+The boolean pass checks each element against the criterion. Then, the scan pass examines each element of the boolean array to create an index array. Finally, the
+scatter pass checks whether each element met the criterion and, if so, adds it to the final output array. In total, the runtime is O(3n), or just O(n).
+
+The CPU without Scan implementation is faster because there is only one pass over the elements, but it is still slower than the GPU implementations. One possible cause is the
+branch that checks whether the value meets the criterion - `val != 0`. If the CPU predicts this branch incorrectly, it must discard the rest of the pipeline that comes after the condition and
+restart from the correct instruction.
+
+The optimized and unoptimized work-efficient implementations performed about the same, with the optimized version being slightly faster. Since each step of the compaction process is parallelized,
+they can handle large arrays much more efficiently than the CPU implementations.
+
+#### Test Program Output
+
+Here is the test program output for **2^25 (33,554,432) elements**, with block size 128 for the naive implementation and block size 512 for the optimized work-efficient implementation
+(the optimal block sizes).
+
+```
+****************
+** SCAN TESTS **
+****************
+    [  27  33   5  47  14  46  19   6  22  17  19  33  30 ...   3   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 18.5357ms    (std::chrono Measured)
+    [   0  27  60  65 112 126 172 191 197 219 236 255 288 ... 821750646 821750649 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 18.733ms    (std::chrono Measured)
+    [   0  27  60  65 112 126 172 191 197 219 236 255 288 ... 821750585 821750603 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 28.5788ms    (CUDA Measured)
+    [   0  27  60  65 112 126 172 191 197 219 236 255 288 ... 821750646 821750649 ]
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 28.5338ms    (CUDA Measured)
+    [   0  27  60  65 112 126 172 191 197 219 236 255 288 ... 821750585 821750603 ]
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 16.9636ms    (CUDA Measured)
+    [   0  27  60  65 112 126 172 191 197 219 236 255 288 ... 821750646 821750649 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 16.9779ms    (CUDA Measured)
+    [   0  27  60  65 112 126 172 191 197 219 236 255 288 ... 821750585 821750603 ]
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 1.17078ms    (CUDA Measured)
+    [   0  27  60  65 112 126 172 191 197 219 236 255 288 ... 821750646 821750649 ]
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 1.16835ms    (CUDA Measured)
+    [   0  27  60  65 112 126 172 191 197 219 236 255 288 ... 821750585 821750603 ]
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   1   2   3   3   2   0   0   1   1   3   2   0   3 ...   2   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 57.5399ms    (std::chrono Measured)
+    [   1   2   3   3   2   1   1   3   2   3   3   2   2 ...   1   2 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 56.823ms    (std::chrono Measured)
+    [   1   2   3   3   2   1   1   3   2   3   3   2   2 ...   3   1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 111.362ms    (std::chrono Measured)
+    [   1   2   3   3   2   1   1   3   2   3   3   2   2 ...   1   2 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 21.1722ms    (CUDA Measured)
+    [   1   2   3   3   2   1   1   3   2   3   3   2   2 ...   1   2 ]
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 20.9019ms    (CUDA Measured)
+    [   1   2   3   3   2   1   1   3   2   3   3   2   2 ...   3   1 ]
+    passed
+```
+
+### References
+
+* [GPU Gems 3 Chapter 39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda)
+* [UPenn CIS 565 Course Notes](https://cis565-fall-2022.github.io/)
diff --git a/img/1-div-warp.PNG b/img/1-div-warp.PNG
new file mode 100644
index 0000000..a115f7c
Binary files /dev/null and b/img/1-div-warp.PNG differ
diff --git a/img/4-div-warp.PNG b/img/4-div-warp.PNG
new file mode 100644
index 0000000..3548181
Binary files /dev/null and b/img/4-div-warp.PNG differ
diff --git a/img/array-compact.png b/img/array-compact.png
new file mode 100644
index 0000000..2e8c16a
Binary files /dev/null and b/img/array-compact.png differ
diff --git a/img/array-scan.png b/img/array-scan.png
new file mode 100644
index 0000000..6930e60
Binary files /dev/null and b/img/array-scan.png differ
diff --git a/img/block-optimize.png b/img/block-optimize.png
new file mode 100644
index 0000000..f8b0ef1
Binary files /dev/null and b/img/block-optimize.png differ
diff --git a/img/block-scan.png b/img/block-scan.png
new file mode 100644
index 0000000..82f27ba
Binary files /dev/null and b/img/block-scan.png differ
diff --git a/img/downsweep.jpg b/img/downsweep.jpg
new file mode 100644
index 0000000..88be7ff
Binary files /dev/null and b/img/downsweep.jpg differ
diff --git a/img/stream-compaction.jpg b/img/stream-compaction.jpg
new file mode 100644
index 0000000..d3438dc
Binary files /dev/null and b/img/stream-compaction.jpg differ
diff --git a/img/thrust-memcpy.PNG b/img/thrust-memcpy.PNG
new file mode 100644
index 0000000..705990e
Binary files /dev/null and b/img/thrust-memcpy.PNG differ
diff --git a/img/upsweep.jpg b/img/upsweep.jpg
new file mode 100644
index 0000000..22c4da0
Binary files /dev/null and b/img/upsweep.jpg differ
diff --git a/img/we-optimize.png b/img/we-optimize.png
new file mode 100644
index 0000000..fa2a5e2
Binary files /dev/null and b/img/we-optimize.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..f834b82 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 25; // feel free to change the size of array
 const int NPOT = SIZE - 3;  // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -31,6 +31,10 @@ int main(int argc, char* argv[]) {
     a[SIZE - 1] = 0;
     printArray(SIZE, a, true);
 
+    //int in[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
+    //int out[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
+    //StreamCompaction::Efficient::scan(8, out, in);
+
     // initialize b using StreamCompaction::CPU::scan you implement
     // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
     // At first all cases passed because b && c are all zeroes.
@@ -64,7 +68,7 @@ int main(int argc, char* argv[]) {
     printDesc("naive scan, non-power-of-two");
     StreamCompaction::Naive::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    //printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..5e0bc3e 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -23,7 +23,12 @@ namespace StreamCompaction {
          * which map to 0 will be removed, and elements which map to 1 will be kept.
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            bools[index] = (idata[index] != 0) ? 1 : 0;
         }
 
         /**
@@ -32,7 +37,16 @@ namespace StreamCompaction {
          */
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
-            // TODO
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            int meets_criteria = bools[index];
+            int final_idx = indices[index];
+            if (meets_criteria) {
+                odata[final_idx] = idata[index];
+            }
         }
 
     }
 }
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..067a544 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -19,7 +19,10 @@ namespace StreamCompaction {
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            odata[0] = 0;
+            for (int i = 1; i < n; ++i) {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
             timer().endCpuTimer();
         }
 
@@ -30,9 +33,15 @@ namespace StreamCompaction {
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int j = 0;
+            for (int i = 0; i < n; ++i) {
+                // If data meets criteria (not 0), add it to the output list
+                if (idata[i] != 0) {
+                    odata[j++] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return j;
         }
 
         /**
@@ -41,10 +50,35 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
+            int* temp = new int[n];
+            int* scan = new int[n];
+
             timer().startCpuTimer();
-            // TODO
+            // Make temporary array with 0s/1s to indicate if data meets criteria
+            for (int i = 0; i < n; ++i) {
+                temp[i] = (idata[i] != 0) ? 1 : 0;
+            }
+
+            // Run exclusive scan on temporary array
+            scan[0] = 0;
+            for (int j = 1; j < n; ++j) {
+                scan[j] = scan[j - 1] + temp[j - 1];
+            }
+
+            // Scatter
+            int elements = 0;
+            for (int k = 0; k < n; ++k) {
+                int meets_criteria = temp[k];
+                int index = scan[k];
+                if (meets_criteria) {
+                    odata[index] = idata[k];
+                    ++elements;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+
+            delete[] temp; delete[] scan;
+            return elements;
         }
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..b70a535 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,6 +3,11 @@
 #include "common.h"
 #include "efficient.h"
 
+/*! Block size used for CUDA kernel launch. */
+#define blockSize 512
+
+#define OPTIMIZED
+
 namespace StreamCompaction {
     namespace Efficient {
         using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +17,112 @@ namespace StreamCompaction {
             return timer;
         }
 
+        __global__ void kernUpsweep(int n, int stride, int offset_d_one, int offset_d, int* odata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+#ifdef OPTIMIZED
+            // One thread per active node; consecutive indices keep warps coherent
+            if (index < stride) {
+                odata[offset_d * (2 * index + 2) - 1] += odata[offset_d * (2 * index + 1) - 1];
+            }
+#else
+            // One thread per element; only every stride-th thread does work
+            if ((index % stride) == 0) {
+                odata[index + offset_d_one - 1] += odata[index + offset_d - 1];
+            }
+#endif
+        }
+
+        __global__ void kernDownsweep(int n, int d, int stride, int offset_d_one, int offset_d, int* odata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+#ifdef OPTIMIZED
+            // d = number of active nodes at this level; stride = element offset between them
+            if (index < d) {
+                // Old left child
+                int t = odata[stride * (2 * index + 1) - 1];
+                // Set left child to current node value
+                odata[stride * (2 * index + 1) - 1] = odata[stride * (2 * index + 2) - 1];
+                // Set right child = old left child + current node value
+                odata[stride * (2 * index + 2) - 1] += t;
+            }
+#else
+            if ((index % stride) == 0) {
+                // Old left child
+                int t = odata[index + offset_d - 1];
+                // Set left child to current node value
+                odata[index + offset_d - 1] = odata[index + offset_d_one - 1];
+                // Set right child = old left child + current node value
+                odata[index + offset_d_one - 1] += t;
+            }
+#endif
+        }
 
         /**
-         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         * Helper function for "scan".
          */
-        void scan(int n, int *odata, const int *idata) {
+        void prefixSum(int n, int *odata, const int *idata) {
+            int log2_n = ilog2ceil(n);
+            int N = 1 << log2_n;  // next highest power of 2
+            dim3 blocksPerGrid((N + blockSize - 1) / blockSize);
+
+            int* dev_scan_output;
+
+            // Allocate device memory (size 2^log2_n)
+            cudaMalloc((void**)&dev_scan_output, N * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc dev_scan_output failed!");
+
+            // Zero-fill so the padding past n acts as identity elements
+            cudaMemset(dev_scan_output, 0, N * sizeof(int));
+            checkCUDAErrorFn("cudaMemset dev_scan_output failed!");
+
+            // Copy data to the GPU
+            cudaMemcpy(dev_scan_output, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorFn("memcpy to GPU failed!");
+
             timer().startGpuTimer();
-            // TODO
+            // Perform upsweep (parallel reduction)
+#ifdef OPTIMIZED
+            int stride = N / 2;
+#else
+            int stride = 2;
+#endif
+            for (int d = 0; d < log2_n; ++d) {
+                int offset_d_one = 1 << (d + 1);
+                int offset_d = 1 << d;
+                kernUpsweep<<<blocksPerGrid, blockSize>>>(N, stride, offset_d_one, offset_d, dev_scan_output);
+#ifdef OPTIMIZED
+                stride /= 2;
+#else
+                stride *= 2;
+#endif
+            }
+
+            // Set "root" (last element in array) to 0
+            cudaMemset(dev_scan_output + N - 1, 0, sizeof(int));
+            checkCUDAErrorFn("cudaMemset dev_scan_output failed!");
+
+            // Perform downsweep (the optimized kernel ignores the offset arguments)
+#ifdef OPTIMIZED
+            stride = N / 2;
+            for (int d = 1; d < N; d *= 2) {
+                kernDownsweep<<<blocksPerGrid, blockSize>>>(N, d, stride, 0, 0, dev_scan_output);
+                stride /= 2;
+            }
+#else
+            stride /= 2;
+            for (int d = log2_n - 1; d >= 0; --d) {
+                int offset_d_one = 1 << (d + 1);
+                int offset_d = 1 << d;
+                kernDownsweep<<<blocksPerGrid, blockSize>>>(N, d, stride, offset_d_one, offset_d, dev_scan_output);
+                stride /= 2;
+            }
+#endif
             timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_scan_output, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAErrorFn("memcpy to CPU failed!");
+
+            // Cleanup memory
+            cudaFree(dev_scan_output);
+            checkCUDAErrorFn("cudaFree failed!");
+        }
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            prefixSum(n, odata, idata);
         }
 
         /**
@@ -31,10 +135,95 @@ namespace StreamCompaction {
          * @returns The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
+            int log2_n = ilog2ceil(n);
+            int N = 1 << log2_n;  // next highest power of 2
+            dim3 blocksPerGrid((N + blockSize - 1) / blockSize);
+
+            int elements = 0;
+            int* dev_bool_input;
+            int* dev_bools;
+            int* dev_scatter_input;
+            int* dev_scatter_output;
+
+            // Allocate device memory (size 2^log2_n)
+            cudaMalloc((void**)&dev_bool_input, N * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc dev_bool_input failed!");
+            cudaMalloc((void**)&dev_bools, N * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc dev_bools failed!");
+            cudaMalloc((void**)&dev_scatter_input, N * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc dev_scatter_input failed!");
+            cudaMalloc((void**)&dev_scatter_output, N * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc dev_scatter_output failed!");
+
+            // Zero the work arrays (the zeroed tail serves as padding for non-power-of-two input)
+            cudaMemset(dev_bool_input, 0, N * sizeof(int));
+            cudaMemset(dev_bools, 0, N * sizeof(int));
+            cudaMemset(dev_scatter_input, 0, N * sizeof(int));
+            checkCUDAErrorFn("cudaMemset failed!");
+
+            // Copy the n input elements to the GPU (only n, not N, are valid on the host)
+            cudaMemcpy(dev_bool_input, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorFn("memcpy to GPU failed!");
+
+            // Make temporary array with 0s/1s to indicate if data meets criteria
             timer().startGpuTimer();
-            // TODO
+            StreamCompaction::Common::kernMapToBoolean<<<blocksPerGrid, blockSize>>>(n, dev_bools, dev_bool_input);
+            cudaMemcpy(dev_scatter_input, dev_bools, n * sizeof(int), cudaMemcpyDeviceToDevice);
+            checkCUDAErrorFn("device-to-device memcpy failed!");
+
+            // Run exclusive scan on temporary array
+            // Upsweep
+#ifdef OPTIMIZED
+            int stride = N / 2;
+#else
+            int stride = 2;
+#endif
+            for (int d = 0; d < log2_n; ++d) {
+                int offset_d_one = 1 << (d + 1);
+                int offset_d = 1 << d;
+                kernUpsweep<<<blocksPerGrid, blockSize>>>(N, stride, offset_d_one, offset_d, dev_scatter_input);
+#ifdef OPTIMIZED
+                stride /= 2;
+#else
+                stride *= 2;
+#endif
+            }
+
+            // Get element count (after the upsweep, the root holds the number of kept elements)
+            cudaMemcpy(&elements, dev_scatter_input + N - 1, sizeof(int), cudaMemcpyDeviceToHost);
+
+            // Set "root" (last element in array) to 0
+            cudaMemset(dev_scatter_input + N - 1, 0, sizeof(int));
+            checkCUDAErrorFn("cudaMemset dev_scatter_input failed!");
+
+            // Downsweep (the optimized kernel ignores the offset arguments)
+#ifdef OPTIMIZED
+            stride = N / 2;
+            for (int d = 1; d < N; d *= 2) {
+                kernDownsweep<<<blocksPerGrid, blockSize>>>(N, d, stride, 0, 0, dev_scatter_input);
+                stride /= 2;
+            }
+#else
+            stride /= 2;
+            for (int d = log2_n - 1; d >= 0; --d) {
+                int offset_d_one = 1 << (d + 1);
+                int offset_d = 1 << d;
+                kernDownsweep<<<blocksPerGrid, blockSize>>>(N, d, stride, offset_d_one, offset_d, dev_scatter_input);
+                stride /= 2;
+            }
+#endif
+
+            // Scatter
+            StreamCompaction::Common::kernScatter<<<blocksPerGrid, blockSize>>>(N, dev_scatter_output, dev_bool_input, dev_bools, dev_scatter_input);
             timer().endGpuTimer();
-            return -1;
+
+            // Copy compacted data back to the CPU (only the first n slots of odata are valid)
+            cudaMemcpy(odata, dev_scatter_output, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            // Cleanup memory
+            cudaFree(dev_bool_input);
+            cudaFree(dev_bools);
+            cudaFree(dev_scatter_input);
+            cudaFree(dev_scatter_output);
+            checkCUDAErrorFn("cudaFree failed!");
+
+            return elements;
         }
     }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..9327b30 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -3,6 +3,9 @@
 #include "common.h"
 #include "naive.h"
 
+/*! Block size used for CUDA kernel launch. */
+#define blockSize 128
+
 namespace StreamCompaction {
     namespace Naive {
         using StreamCompaction::Common::PerformanceTimer;
@@ -11,15 +14,80 @@ namespace StreamCompaction {
             static PerformanceTimer timer;
             return timer;
         }
-        // TODO: __global__
+
+        /**
+         * Shifts the array one to the right, prepending the identity (0), so that
+         * an inclusive-style scan of the result yields an exclusive scan of the input.
+         */
+        __global__ void kernInclToExcl(int n, int* odata, const int* idata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            odata[index] = (index > 0) ? idata[index - 1] : 0;
+        }
+
+        /**
+         * Performs one thread's work for a naive parallel prefix-sum.
+         */
+        __global__ void kernNaiveScan(int n, int d, int offset, int* odata, const int* idata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            if (index >= offset) {
+                odata[index] = idata[index - offset] + idata[index];
+            }
+            else {
+                odata[index] = idata[index];
+            }
+        }
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
        void scan(int n, int *odata, const int *idata) {
+            // Determine launch configuration
+            dim3 blocksPerGrid((n + blockSize - 1) / blockSize);
+
+            // Allocate separate arrays to ping-pong results between iterations
+            int* dev_scan_input;
+            int* dev_scan_output;
+
+            // Allocate device memory
+            cudaMalloc((void**)&dev_scan_input, n * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc dev_scan_input failed!");
+            cudaMalloc((void**)&dev_scan_output, n * sizeof(int));
+            checkCUDAErrorFn("cudaMalloc dev_scan_output failed!");
+
+            // Copy data to the GPU
+            cudaMemcpy(dev_scan_input, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            cudaMemcpy(dev_scan_output, odata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            checkCUDAErrorFn("memcpy to GPU failed!");
+
             timer().startGpuTimer();
-            // TODO
+            // Shift the input right by one so the scan below produces an exclusive result
+            kernInclToExcl<<<blocksPerGrid, blockSize>>>(n, dev_scan_output, dev_scan_input);
+
+            // Perform the scan, swapping the input/output buffers each iteration
+            int log2_n = ilog2ceil(n);
+            int offset = 1;
+            for (int d = 1; d <= log2_n; ++d) {
+                kernNaiveScan<<<blocksPerGrid, blockSize>>>(n, d, offset, dev_scan_input, dev_scan_output);
+                std::swap(dev_scan_input, dev_scan_output);
+                offset *= 2;
+            }
             timer().endGpuTimer();
+
+            // Copy data back to the CPU
+            cudaMemcpy(odata, dev_scan_output, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            checkCUDAErrorFn("memcpy to CPU failed!");
+
+            // Cleanup memory
+            cudaFree(dev_scan_input);
+            cudaFree(dev_scan_output);
+            checkCUDAErrorFn("cudaFree failed!");
         }
     }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..547697c 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,20 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
             // TODO use `thrust::exclusive_scan`
             // example: for device_vectors dv_in and dv_out:
             // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+            // Create device_vectors from the raw host arrays
+            thrust::device_vector<int> dev_in(idata, idata + n);
+            thrust::device_vector<int> dev_out(odata, odata + n);
+
+            timer().startGpuTimer();
+            thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
             timer().endGpuTimer();
+
+            // Write final results back to the host
+            cudaMemcpy(odata, dev_out.data().get(), n * sizeof(int), cudaMemcpyDeviceToHost);
         }
     }
 }