diff --git a/INSTRUCTION.md b/INSTRUCTION.md index 779f144..ccb1ed1 100644 --- a/INSTRUCTION.md +++ b/INSTRUCTION.md @@ -29,8 +29,8 @@ on the implementation of scan and stream compaction. * The [slides on Parallel Algorithms](https://docs.google.com/presentation/d/1ETVONA7QDM-WqsEj4qVOGD6Kura5I6E9yqH-7krnwZ0/edit#slide=id.p126) for Scan, Stream Compaction, and Work-Efficient Parallel Scan. * GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html). - - This online version contains a few small errors (in superscripting, missing braces, bad indentation, etc.) - - We maintain a fix for this at [GPU Gem 3 Ch 39 Patch](https://github.com/CIS565-Fall-2017/Project2-Stream-Compaction/blob/master/INSTRUCTION.md#gpu-gem-3-ch-39-patch). If you find more errors in the chapter, welcome to open new pull requests to contribute. + - This online version contains a few small errors (in superscripting, missing braces, bad indentation, etc.) + - We maintain a fix for this at [GPU Gem 3 Ch 39 Patch](https://github.com/CIS565-Fall-2017/Project2-Stream-Compaction/blob/master/INSTRUCTION.md#gpu-gem-3-ch-39-patch). If you find more errors in the chapter, feel free to open a pull request to contribute. * If you are still unclear after reading the steps, take a look at the last chapter - [Algorithm Examples](https://github.com/CIS565-Fall-2017/Project2-Stream-Compaction/blob/master/INSTRUCTION.md#algorithm-examples). * [Recitation slides](https://docs.google.com/presentation/d/1daOnWHOjMp1sIqMdVsNnvEU1UYynKcEMARc_W6bGnqE/edit?usp=sharing) @@ -116,8 +116,9 @@ Most of the text in Part 2 applies. Since the work-efficient scan operates on a binary tree structure, it works best with arrays with power-of-two length. Make sure your implementation works on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory + - your intermediate array sizes will need to be rounded to the next power of -two. + two. 
### 3.2. Stream Compaction @@ -152,13 +153,12 @@ For thrust stream compaction, take a look at [thrust::remove_if](https://thrust. ## Part 5: Why is My GPU Approach So Slow? (Extra Credit) (+5) -If you implement your efficient scan version following the slides closely, there's a good chance -that you are getting an "efficient" gpu scan that is actually not that efficient -- it is slower than the cpu approach? +If you implement your efficient scan version following the slides closely, there's a good chance that your "efficient" GPU scan is actually not that efficient -- it may even be slower than the CPU approach. -Though it is totally acceptable for this assignment, -In addition to explain the reason of this phenomena, you are encouraged to try to upgrade your work-efficient gpu scan. +Though this is totally acceptable for this assignment, in addition to explaining the reason for this phenomenon, you are encouraged to try to upgrade your work-efficient GPU scan. Thinking about these may lead you to an aha moment: + - What is the occupancy at a deeper level in the upper/down sweep? Are most threads actually working? - Are you always launching the same number of blocks throughout each level of the upper/down sweep? - If some threads are being lazy, can we do an early termination on them? @@ -199,11 +199,13 @@ Always profile with Release mode builds and run without debugging. * Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU. + * (You shouldn't compare unoptimized implementations to each other!) * Compare all of these GPU Scan implementations (Naive, Work-Efficient, and Thrust) to the serial CPU version of Scan. Plot a graph of the comparison (with array size on the independent axis). + * We wrapped up both CPU and GPU timing functions as a performance timer class for you to conveniently measure the time cost. * We use `std::chrono` to provide CPU high-precision timing and CUDA event to measure the CUDA performance. 
* For CPU, put your CPU code between `timer().startCpuTimer()` and `timer().endCpuTimer()`. @@ -215,11 +217,13 @@ Always profile with Release mode builds and run without debugging. even looking at the code for the implementation. * Write a brief explanation of the phenomena you see here. + * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation? * Paste the output of the test program into a triple-backtick block in your README. + * If you add your own tests (e.g. for radix sort or to test additional corner cases), be sure to mention it explicitly. @@ -238,24 +242,24 @@ The template of the comment section of your pull request is attached below, you * [Repo Link](https://link-to-your-repo) * (Briefly) Mentions features that you've completed. Especially those bells and whistles you want to highlight - * Feature 0 - * Feature 1 - * ... + * Feature 0 + * Feature 1 + * ... * Feedback on the project itself, if any. ## GPU Gem 3 Ch 39 Patch * Example 1 -![](img/example-1.png) + ![](img/example-1.png) * Example 2 -![](img/example-2.jpg) + ![](img/example-2.jpg) * Figure-39-4 -![](img/figure-39-4.jpg) + ![](img/figure-39-4.jpg) * Figure-39-2. This image shows an naive inclusive scan. We should convert this to an exclusive one for compaction. 
-![](img/figure-39-2.jpg) + ![](img/figure-39-2.jpg) ## Algorithm Examples @@ -284,24 +288,24 @@ The template of the comment section of your pull request is attached below, you + output - [1 1 0 1 1 0 1] - scan - + take the output of last step as input - + input + + take the output of last step as input + + input + - [1 1 0 1 1 0 1] + + output + - [0 1 2 2 3 4 4] + - scatter + + preserve non-zero elements and compact them into a new array + + input: + + original array + - [1 5 0 1 2 0 3] + + mapped array - [1 1 0 1 1 0 1] - + output + + scanned array - [0 1 2 2 3 4 4] - - scatter - + preserve non-zero elements and compact them into a new array - + input: - + original array - - [1 5 0 1 2 0 3] - + mapped array - - [1 1 0 1 1 0 1] - + scanned array - - [0 1 2 2 3 4 4] - + output: - - [1 5 1 2 3] - + This can be done in parallel on GPU - + You can try multi-threading on CPU if you want (not required and not our focus) - + for each element input[i] in original array - - if it's non-zero (given by mapped array) - - then put it at output[index], where index = scanned[i] + + output: + - [1 5 1 2 3] + + This can be done in parallel on GPU + + You can try multi-threading on CPU if you want (not required and not our focus) + + for each element input[i] in original array + - if it's non-zero (given by mapped array) + - then put it at output[index], where index = scanned[i] diff --git a/README.md b/README.md index 0e38ddb..b1ee0c7 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,184 @@ +**University of Pennsylvania, CIS 565: GPU Programming and Architecture** + +- Alex Fu + - [LinkedIn](https://www.linkedin.com/in/alex-fu-b47b67238/) + - [Twitter](https://twitter.com/AlexFu8304) + - [Personal Website](https://thecger.com/) +- Tested on: Windows 10, i7-10750H @ 2.60GHz, 16GB, RTX 3060 6GB + CUDA Stream Compaction ====================== -**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** +## Features Implemented + +* CPU Scan and Compaction + 
+* Naive GPU Scan + +* Work-efficient GPU Scan + +* GPU Stream Compaction + +* Radix Sort (Extra Credit) + + * When input is: + + `{41, 17, 34, 0, 19, 24, 28, 8, 12, 14, 5, 45, 31, 27, 11, 41, 45, 42, 27, 36, 41, 4, 2, 3, 42, 32, 21, 16, 18, 45, 47, 26, 21, 38, 19, 12, 17, 49, 35, 44, 3, 11, 22, 33, 23, 14, 41, 11, 3, 18, 47, 44, 12, 7, 37, 9, 23, 41, 29, 28, 16, 35, 40, 0}`, + + the output is: + + `{0, 0, 2, 3, 3, 3, 4, 5, 7, 8, 9, 11, 11, 11, 12, 12, 12, 14, 14, 16, 16, 17, 17, 18, 18, 19, 19, 21, 21, 22, 23, 23, 24, 26, 27, 27, 28, 28, 29, 31, 32, 33, 34, 35, 35, 36, 37, 38, 40, 41, 41, 41, 41, 41, 42, 42, 44, 44, 45, 45, 45, 47, 47, 49}`. + + I also ran the comparison between my radix sort and `thrust::sort` (see [Example Output](#example-output) and [Performance Analysis](#radix-sort)). + +### Example Output + + + +`SIZE` is $2^{20} = 1.05 \times 10 ^ 6$. The test size for the non-power-of-two case is `SIZE - 3`. CUDA block size is 128. + +``` +**************** +** SCAN TESTS ** +**************** + [ 41 17 34 0 19 24 28 8 12 14 5 45 31 ... 20 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.5527ms (std::chrono Measured) + [ 0 41 58 92 92 111 135 163 171 183 197 202 247 ... 25683436 25683456 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 0.5991ms (std::chrono Measured) + [ 0 41 58 92 92 111 135 163 171 183 197 202 247 ... 
25683337 25683375 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 0.607232ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.884736ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.359712ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.351232ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.16576ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.390144ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 3 2 0 1 0 2 2 2 0 1 1 1 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 2.5172ms (std::chrono Measured) + [ 1 3 2 1 2 2 2 1 1 1 3 1 3 ... 2 3 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 2.2528ms (std::chrono Measured) + [ 1 3 2 1 2 2 2 1 1 1 3 1 3 ... 2 2 ] + passed +==== cpu compact with scan, power-of-two ==== + elapsed time: 4.2481ms (std::chrono Measured) + passed +==== cpu compact with scan, non-power-of-two ==== + elapsed time: 4.5622ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.42496ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.42096ms (CUDA Measured) + passed + +********************** +** RADIX SORT TESTS ** +********************** + [ 41 17 34 0 19 24 28 8 12 14 5 45 31 ... 20 0 ] +==== thrust sort, power-of-two ==== + elapsed time: 0.421664ms (CUDA Measured) + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ] +==== thrust sort, non-power-of-two ==== + elapsed time: 0.342016ms (CUDA Measured) + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 
49 49 ] +==== radix sort, power-of-two ==== + elapsed time: 20.2885ms (CUDA Measured) + passed +==== radix sort, non-power-of-two ==== + elapsed time: 23.1066ms (CUDA Measured) + passed +``` + +## Performance Analysis + +### Performance Impacted by Array Size + +CUDA block size is 128. + +#### Scan + +When the array size is small, the CPU implementation is faster than the GPU implementations, and the fluctuation in the GPU implementations' time cost is small. When the array size is larger than 2.62e5, both the thrust function and my work-efficient implementation outperform my CPU implementation. + +![Scan Time Impacted by Array Size Power of Two](img/Scan_Time_Impacted_by_Array_Size_Power_of_Two.png) + + + +#### Compaction + +The situation is the same as for scan; the turning point is 6.55e4. + +![Compaction Time Impacted by Array Size Power of Two](img/Compaction_Time_Impacted_by_Array_Size_Power_of_Two.png) + + + +#### Radix Sort + +My implementation of radix sort is much slower than the thrust function. + +![Sort Time Impacted by Array Size Power of Two](img/Sort_Time_Impacted_by_Array_Size_Power_of_Two.png) + + + +### Performance Impacted by CUDA Block Size + +`SIZE` is $2^{20} = 1.05 \times 10 ^ 6$. + +#### Scan + +![Scan Time Impacted by CUDA Block Size Power of Two](img/Scan_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png) + +#### Compaction + +![Compaction Time Impacted by CUDA Block Size Power of Two](img/Compaction_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png) + +#### Radix Sort + +![Radix Sort Time Impacted by CUDA Block Size Power of Two](img/Radix_Sort_Impacted_by_CUDA_Block_Size_Power_of_Two.png) + +## Why is My GPU Approach So Slow? + +The optimization I made to the work-efficient scan is to avoid warp partitioning by compressing the threads: + +![Threads Allocation of the Down Sweep Function](img/Threads_Allocation_of_the_Down_Sweep_Function.png) + +Due to time constraints, I haven't implemented the shared memory part. 
I guess this is where the thrust function surpasses mine. + +My radix sort (6-bit)'s time cost is about 10 times that of my work-efficient scan's. This matches my intuition, because radix sort repeats the scan in each pass. However, I noticed that the thrust sort function is not that much slower than its scan function. For instance, when the array size is 65536, the thrust scan costs 0.04ms while its sort costs 0.09ms. This makes me wonder whether there are more optimizations I can do on radix sort. + +One drawback of my radix sort I can recognize is that in order to compute `totalFalse`, I make two device-to-host memory copies: -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +```cpp +int totalFalse; +int lastNum; +cudaMemcpy(&totalFalse, devFalse + n - 1, sizeof(int), cudaMemcpyDeviceToHost); +cudaMemcpy(&lastNum, devInp + n - 1, sizeof(int), cudaMemcpyDeviceToHost); +if ((lastNum & (1 << bit)) == 0) totalFalse += 1; +``` -### (TODO: Your README) + I believe this can be optimized somehow. -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +## Some mistakes I made +* At first I put the for loop inside the kernel functions and used a `__syncthreads()` at the beginning of each iteration. However, since `__syncthreads()` is block-wise, my result went wrong when the array size exceeded my block size. Then I put the for loop outside the kernel functions. 
diff --git a/img/Compaction Time Impacted by Array Size Non Power of Two.png b/img/Compaction Time Impacted by Array Size Non Power of Two.png new file mode 100644 index 0000000..29e5b0c Binary files /dev/null and b/img/Compaction Time Impacted by Array Size Non Power of Two.png differ diff --git a/img/Compaction_Time_Impacted_by_Array_Size_Power_of_Two.png b/img/Compaction_Time_Impacted_by_Array_Size_Power_of_Two.png new file mode 100644 index 0000000..fe1bf9b Binary files /dev/null and b/img/Compaction_Time_Impacted_by_Array_Size_Power_of_Two.png differ diff --git a/img/Compaction_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png b/img/Compaction_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png new file mode 100644 index 0000000..5952995 Binary files /dev/null and b/img/Compaction_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png differ diff --git a/img/Radix_Sort_Impacted_by_CUDA_Block_Size_Power_of_Two.png b/img/Radix_Sort_Impacted_by_CUDA_Block_Size_Power_of_Two.png new file mode 100644 index 0000000..dc6fabf Binary files /dev/null and b/img/Radix_Sort_Impacted_by_CUDA_Block_Size_Power_of_Two.png differ diff --git a/img/Scan Time Impacted by Array Size Non Power of Two.png b/img/Scan Time Impacted by Array Size Non Power of Two.png new file mode 100644 index 0000000..e42f4b9 Binary files /dev/null and b/img/Scan Time Impacted by Array Size Non Power of Two.png differ diff --git a/img/Scan_Time_Impacted_by_Array_Size_Power_of_Two.png b/img/Scan_Time_Impacted_by_Array_Size_Power_of_Two.png new file mode 100644 index 0000000..1383c24 Binary files /dev/null and b/img/Scan_Time_Impacted_by_Array_Size_Power_of_Two.png differ diff --git a/img/Scan_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png b/img/Scan_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png new file mode 100644 index 0000000..cbefd56 Binary files /dev/null and b/img/Scan_Time_Impacted_by_CUDA_Block_Size_Power_of_Two.png differ diff --git a/img/Sort Time Impacted by Array Size Non Power of Two.png 
b/img/Sort Time Impacted by Array Size Non Power of Two.png new file mode 100644 index 0000000..b170795 Binary files /dev/null and b/img/Sort Time Impacted by Array Size Non Power of Two.png differ diff --git a/img/Sort_Time_Impacted_by_Array_Size_Power_of_Two.png b/img/Sort_Time_Impacted_by_Array_Size_Power_of_Two.png new file mode 100644 index 0000000..fa1e1ae Binary files /dev/null and b/img/Sort_Time_Impacted_by_Array_Size_Power_of_Two.png differ diff --git a/img/Threads_Allocation_of_the_Down_Sweep_Function.png b/img/Threads_Allocation_of_the_Down_Sweep_Function.png new file mode 100644 index 0000000..91f483d Binary files /dev/null and b/img/Threads_Allocation_of_the_Down_Sweep_Function.png differ diff --git a/img/performance_analysis.xlsx b/img/performance_analysis.xlsx new file mode 100644 index 0000000..9c08bed Binary files /dev/null and b/img/performance_analysis.xlsx differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..8b85166 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,11 +13,12 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 4194304; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; int *c = new int[SIZE]; +int *d = new int[SIZE]; int main(int argc, char* argv[]) { // Scan tests @@ -27,7 +28,7 @@ int main(int argc, char* argv[]) { printf("** SCAN TESTS **\n"); printf("****************\n"); - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + genArray(SIZE - 1, a, 50, 1); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; printArray(SIZE, a, true); @@ -67,10 +68,23 @@ int main(int argc, char* argv[]) { //printArray(SIZE, c, true); printCmpResult(NPOT, b, c); - zeroArray(SIZE, c); + //zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); 
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + + //for (int i = 0; i < 512; i++) { + // std::cout << a[i] << " "; + //} + //std::cout << std::endl; + //std::cout << std::endl; + // + //for (int i = 0; i < 512; i++) { + // std::cout << c[i] << " "; + //} + //std::cout << std::endl; + //std::cout << std::endl; + // //printArray(SIZE, c, true); printCmpResult(SIZE, b, c); @@ -102,7 +116,7 @@ int main(int argc, char* argv[]) { // Compaction tests - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case + genArray(SIZE - 1, a, 4, 1); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; printArray(SIZE, a, true); @@ -127,12 +141,19 @@ int main(int argc, char* argv[]) { printCmpLenResult(count, expectedNPOT, b, c); zeroArray(SIZE, c); - printDesc("cpu compact with scan"); + printDesc("cpu compact with scan, power-of-two"); count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); + //printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); + zeroArray(SIZE, c); + printDesc("cpu compact with scan, non-power-of-two"); + count = StreamCompaction::CPU::compactWithScan(NPOT, c, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + //printArray(count, c, true); + printCmpLenResult(count, expectedNPOT, b, c); + zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); @@ -147,8 +168,142 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); - system("pause"); // stop Win32 console from closing on exit - delete[] a; - delete[] b; - delete[] c; + + printf("\n"); + printf("**********************\n"); + printf("** RADIX SORT 
TESTS **\n"); + printf("**********************\n"); + + genArray(SIZE - 1, a, 50, 1); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + zeroArray(SIZE, b); + printDesc("thrust sort, power-of-two"); + StreamCompaction::Thrust::sort(SIZE, b, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(SIZE, b, true); + + zeroArray(SIZE, d); + printDesc("thrust sort, non-power-of-two"); + StreamCompaction::Thrust::sort(NPOT, d, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(NPOT, d, true); + + zeroArray(SIZE, c); + printDesc("radix sort, power-of-two"); + StreamCompaction::Efficient::radixSort(SIZE, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + //for (int i = 0; i < 64; i++) { + // std::cout << a[i] << " "; + //} + //std::cout << std::endl; + //for (int i = 0; i < 64; i++) { + // std::cout << c[i] << " "; + //} + //std::cout << std::endl; + + zeroArray(SIZE, c); + printDesc("radix sort, non-power-of-two"); + StreamCompaction::Efficient::radixSort(NPOT, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(NPOT, c, true); + printCmpResult(NPOT, d, c); + + //genArray(SIZE - 1, a, 50, 1); + //a[SIZE - 1] = 0; + //zeroArray(SIZE, c); + //StreamCompaction::CPU::scan(SIZE, c, a); + //std::cout << StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::Thrust::scan(SIZE, c, a); + //std::cout << StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::Naive::scan(SIZE, c, a); + //std::cout << 
StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::Efficient::scan(SIZE, c, a); + //std::cout << StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + + //std::cout << std::endl; + //std::cout << std::endl; + + //genArray(SIZE - 1, a, 50, 1); + //a[SIZE - 1] = 0; + //zeroArray(SIZE, c); + //StreamCompaction::Efficient::radixSort(SIZE, c, a); + //std::cout << StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::Thrust::sort(SIZE, c, a); + //std::cout << StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + + //std::cout << std::endl; + //std::cout << std::endl; + + //genArray(SIZE - 1, a, 4, 1); // Leave a 0 at the end to test that edge case + //a[SIZE - 1] = 0; + //zeroArray(SIZE, c); + //StreamCompaction::CPU::compactWithoutScan(SIZE, c, a); + //std::cout << StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::CPU::compactWithScan(SIZE, c, a); + //std::cout << StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::Efficient::compact(SIZE, c, a); + //std::cout << StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + + //std::cout << std::endl; + //std::cout << "Non power of two." 
<< std::endl; genArray(SIZE - 1, a, 50, 1); + //std::cout << std::endl; + + //genArray(SIZE - 1, a, 50, 1); + //a[SIZE - 1] = 0; + //zeroArray(SIZE, c); + //StreamCompaction::CPU::scan(NPOT, c, a); + //std::cout << StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::Thrust::scan(NPOT, c, a); + //std::cout << StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::Naive::scan(NPOT, c, a); + //std::cout << StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::Efficient::scan(NPOT, c, a); + //std::cout << StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + + //std::cout << std::endl; + //std::cout << std::endl; + + //genArray(SIZE - 1, a, 50, 1); + //a[SIZE - 1] = 0; + //zeroArray(SIZE, c); + //StreamCompaction::Efficient::radixSort(NPOT, c, a); + //std::cout << StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::Thrust::sort(NPOT, c, a); + //std::cout << StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + + //std::cout << std::endl; + //std::cout << std::endl; + + //genArray(SIZE - 1, a, 4, 1); // Leave a 0 at the end to test that edge case + //a[SIZE - 1] = 0; + //zeroArray(SIZE, c); + //StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); + //std::cout << StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::CPU::compactWithScan(NPOT, c, a); + //std::cout << StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation() << std::endl; + //zeroArray(SIZE, c); + //StreamCompaction::Efficient::compact(NPOT, c, a); + //std::cout << 
StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation() << std::endl; + + //system("pause"); // stop Win32 console from closing on exit + //delete[] a; + //delete[] b; + //delete[] c; + //delete[] d; } diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index 025e94a..0f5440e 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -57,6 +57,14 @@ void genArray(int n, int *a, int maxval) { } } +void genArray(int n, int *a, int maxval, int seed) { + srand(seed); + + for (int i = 0; i < n; i++) { + a[i] = rand() % maxval; + } +} + void printArray(int n, int *a, bool abridged = false) { printf(" [ "); for (int i = 0; i < n; i++) { diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..71ca973 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -1,4 +1,5 @@ #include "common.h" +#include void checkCUDAErrorFn(const char *msg, const char *file, int line) { cudaError_t err = cudaGetLastError(); @@ -23,16 +24,20 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= n) return; + bools[idx] = idata[idx] > 0 ? 1 : 0; } /** * Performs scatter on an array. That is, for each element in idata, * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. 
*/ - __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices) { - // TODO + __global__ void kernScatter( + int n, int *odata, const int *idata, /*const int *bools, */const int *indices) { + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= n || idata[idx] == 0) return; + odata[indices[idx]] = idata[idx]; } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index d2c1fed..367fbad 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -13,6 +13,8 @@ #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +#define BLOCK_SIZE 128 + /** * Check for CUDA errors; print and exit if there was a problem. */ @@ -34,8 +36,8 @@ namespace StreamCompaction { namespace Common { __global__ void kernMapToBoolean(int n, int *bools, const int *idata); - __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices); + __global__ void kernScatter( + int n, int *odata, const int *idata, /*const int *bools, */const int *indices); /** * This class is used for timing the performance diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..d8b3ce7 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,4 +1,6 @@ +#include #include +#include #include "cpu.h" #include "common.h" @@ -17,10 +19,14 @@ namespace StreamCompaction { * For performance analysis, this is supposed to be a simple for loop. * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. 
*/ - void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); + void scan(int n, int* odata, const int* idata, bool enableTimer) { + if(enableTimer) timer().startCpuTimer(); + int sum = 0; + for (int i = 0; i < n; i++) { + odata[i] = sum; + sum += idata[i]; + } + if (enableTimer) timer().endCpuTimer(); + } /** @@ -28,11 +34,15 @@ namespace StreamCompaction { * * @returns the number of elements remaining after compaction. */ - int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; + int compactWithoutScan(int n, int *odata, const int *idata, bool enableTimer) { + if (enableTimer) timer().startCpuTimer(); + int count = 0; + for (int i = 0; i < n; i++) { + if (idata[i] != 0) + odata[count++] = idata[i]; + } + if (enableTimer) timer().endCpuTimer(); + return count; + } /** @@ -40,11 +50,20 @@ namespace StreamCompaction { * * @returns the number of elements remaining after compaction. */ - int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; + int compactWithScan(int n, int *odata, const int *idata, bool enableTimer) { + if (enableTimer) timer().startCpuTimer(); + std::unique_ptr<int[]> booleans{ new int[n] }; + std::unique_ptr<int[]> scanResult{ new int[n] }; + for (int i = 0; i < n; i++) { + booleans[i] = idata[i] != 0 ? 1 : 0; + } + scan(n, scanResult.get(), booleans.get(), false); + for (int i = 0; i < n; i++) { + if (booleans[i] == 1) + odata[scanResult[i]] = idata[i]; + } + if (enableTimer) timer().endCpuTimer(); + return booleans[n - 1] > 0 ? 
scanResult[n - 1] + 1 : scanResult[n - 1]; } } } diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h index 873c047..a4b7fb2 100644 --- a/stream_compaction/cpu.h +++ b/stream_compaction/cpu.h @@ -6,10 +6,10 @@ namespace StreamCompaction { namespace CPU { StreamCompaction::Common::PerformanceTimer& timer(); - void scan(int n, int *odata, const int *idata); + void scan(int n, int *odata, const int *idata, bool enableTimer = true); - int compactWithoutScan(int n, int *odata, const int *idata); + int compactWithoutScan(int n, int *odata, const int *idata, bool enableTimer = true); - int compactWithScan(int n, int *odata, const int *idata); + int compactWithScan(int n, int *odata, const int *idata, bool enableTimer = true); } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..364ccc3 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -1,4 +1,8 @@ +#include +#include #include +#include +#include #include #include "common.h" #include "efficient.h" @@ -6,19 +10,66 @@ namespace StreamCompaction { namespace Efficient { using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { + PerformanceTimer& timer() { static PerformanceTimer timer; return timer; } - /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); + __global__ void kernUpSweep(int n, int num, int offset, int* inp) { + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= num) return; + int idxWrite = offset * 2 * (idx + 1) - 1; + inp[idxWrite] = inp[idxWrite] + inp[idxWrite - offset]; + } + + __global__ void kernDownSweep(int n, int num, int offset, int* inp) { + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= num) return; + int idxWrite = n - 1 - idx * offset * 2; + int tmp = inp[idxWrite]; + inp[idxWrite] += inp[idxWrite - offset]; + inp[idxWrite - offset] = tmp; + } + + void scan(int origN, int *odata, const int *idata, bool enableTimer) { + int* devInp; + int log2n = ilog2ceil(origN); + + int n = 1 << log2n; + cudaMalloc((void**)&devInp, n * sizeof(int)); + checkCUDAError("cudaMalloc devInp failed!"); + cudaMemset(devInp, 0, n * sizeof(int)); + cudaMemcpy(devInp, idata, origN * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy idata failed!"); + + if (enableTimer) timer().startGpuTimer(); + + // up sweep + int num = n / 2; + for (int d = 0; d < log2n; d++) { + int offset = 1 << d; + dim3 fullBlocksPerGrid((num + BLOCK_SIZE - 1) / BLOCK_SIZE); + kernUpSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, num, offset, devInp); + num /= 2; + } + cudaMemset(devInp + n - 1, 0, sizeof(int)); + + // down sweep + int offset = n / 2; + for (int d = 0; d < log2n; d++) { + int num = 1 << d; + dim3 fullBlocksPerGrid((num + BLOCK_SIZE - 1) / BLOCK_SIZE); + kernDownSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, num, offset, devInp); + offset /= 2; + } + + if (enableTimer) timer().endGpuTimer(); + + cudaMemcpy(odata, devInp, origN * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy odata failed!"); + cudaFree(devInp); + checkCUDAError("cudaFree devInp failed!"); + } /** @@ -30,11 +81,166 @@ namespace StreamCompaction { * @param idata The array of elements to compact. * @returns The number of elements remaining after compaction. 
         */
-        int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-            return -1;
+        int compact(int origN, int *odata, const int *idata, bool enableTimer) {
+            int* devInp;
+            int* devBools;
+            int* devOut;
+
+            int log2n = ilog2ceil(origN);
+            int n = 1 << log2n;
+
+            cudaMalloc((void**)&devInp, n * sizeof(int));
+            checkCUDAError("cudaMalloc devInp failed!");
+            cudaMalloc((void**)&devOut, n * sizeof(int));
+            checkCUDAError("cudaMalloc devOut failed!");
+            cudaMalloc((void**)&devBools, n * sizeof(int));
+            checkCUDAError("cudaMalloc devBools failed!");
+            // Zero-pad so kernMapToBoolean doesn't map garbage past origN.
+            cudaMemset(devInp, 0, n * sizeof(int));
+            cudaMemcpy(devInp, idata, origN * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy idata failed!");
+
+            if (enableTimer) timer().startGpuTimer();
+            dim3 fullBlocksPerGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE);
+            Common::kernMapToBoolean<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, devBools, devInp);
+
+            // up sweep
+            int num = n / 2;
+            for (int d = 0; d < log2n; d++) {
+                int offset = 1 << d;
+                dim3 fullBlocksPerGrid((num + BLOCK_SIZE - 1) / BLOCK_SIZE);
+                kernUpSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, num, offset, devBools);
+                num /= 2;
+            }
+            cudaMemset(devBools + n - 1, 0, sizeof(int));
+
+            // down sweep
+            int offset = n / 2;
+            for (int d = 0; d < log2n; d++) {
+                int num = 1 << d;
+                dim3 fullBlocksPerGrid((num + BLOCK_SIZE - 1) / BLOCK_SIZE);
+                kernDownSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, num, offset, devBools);
+                offset /= 2;
+            }
+
+            Common::kernScatter<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, devOut, devInp, devBools);
+
+            if (enableTimer) timer().endGpuTimer();
+
+            cudaMemcpy(odata, devOut, origN * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy odata failed!");
+            std::unique_ptr<int[]> indices{ new int[n] };
+            cudaMemcpy(indices.get(), devBools, origN * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy devBools failed!");
+            cudaFree(devInp);
+            checkCUDAError("cudaFree devInp failed!");
+            cudaFree(devOut);
+            checkCUDAError("cudaFree devOut failed!");
+            cudaFree(devBools);
+            checkCUDAError("cudaFree devBools failed!");
+            //for (int i = 0; i < 32; i++) {
+            //    std::cout << indices[i] << " ";
+            //}
+            //std::cout << std::endl;
+            return idata[origN - 1] != 0 ? indices[origN - 1] + 1 : indices[origN - 1];
+        }
+
+        // radix sort
+
+        __global__ void kernCheckBit(int n, int bit, int* inp, int* booleans, int* invertBooleans) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= n) return;
+            int boolean;
+            //if (inp[idx] == 0) boolean = 0; // +0 and -0
+            /*else*/ boolean = (inp[idx] & (1 << bit)) == 0 ? 0 : 1;
+            booleans[idx] = boolean;
+            invertBooleans[idx] = boolean == 0 ? 1 : 0;
+        }
+
+        __global__ void kernComputeIndices(int n, int totalFalse, int* scannedFalse, int* booleans, int* out) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= n) return;
+            if (booleans[idx] == 1)
+                out[idx] = idx - scannedFalse[idx] + totalFalse;
+            else
+                out[idx] = scannedFalse[idx];
+        }
+
+        __global__ void kernRadixSortScatter(int n, int* indices, int* inp, int* out) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= n) return;
+            out[indices[idx]] = inp[idx];
+        }
+
+        void radixSort(int n, int* out, const int* inp, bool enableTimer) {
+            int* devInp;
+            int* devTrue;
+            int* devFalse;
+            int* devIndices;
+            int log2n = ilog2ceil(n);
+            int nForScan = 1 << log2n;
+
+            cudaMalloc((void**)&devInp, n * sizeof(int));
+            checkCUDAError("cudaMalloc devInp failed!");
+            cudaMemcpy(devInp, inp, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy idata failed!");
+            cudaMalloc((void**)&devTrue, n * sizeof(int));
+            checkCUDAError("cudaMalloc devTrue failed!");
+            cudaMalloc((void**)&devIndices, n * sizeof(int));
+            checkCUDAError("cudaMalloc devIndices failed!");
+            cudaMalloc((void**)&devFalse, nForScan * sizeof(int)); // devFalse will be scanned
+            checkCUDAError("cudaMalloc devFalse failed!");
+
+            if (enableTimer) timer().startGpuTimer();
+
+            // Only the low 6 bits are sorted, i.e. keys must be < 64;
+            // raise the loop bound for wider keys.
+            for (int bit = 0; bit < 6; bit++) {
+                dim3 fullBlocksPerGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE);
+                kernCheckBit<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, bit, devInp, devTrue, devFalse);
+                // Clear the padding each pass; the previous down sweep left
+                // scan values there that would corrupt this pass's scan.
+                cudaMemset(devFalse + n, 0, (nForScan - n) * sizeof(int));
+
+                {// scan devFalse
+                    int num = nForScan / 2;
+                    for (int d = 0; d < log2n; d++) {
+                        int offset = 1 << d;
+                        dim3 fullBlocksPerGrid((num + BLOCK_SIZE - 1) / BLOCK_SIZE);
+                        kernUpSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(nForScan, num, offset, devFalse);
+                        num /= 2;
+                    }
+                    cudaMemset(devFalse + nForScan - 1, 0, sizeof(int));
+
+                    // down sweep
+                    int offset = nForScan / 2;
+                    for (int d = 0; d < log2n; d++) {
+                        int num = 1 << d;
+                        dim3 fullBlocksPerGrid((num + BLOCK_SIZE - 1) / BLOCK_SIZE);
+                        kernDownSweep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(nForScan, num, offset, devFalse);
+                        offset /= 2;
+                    }
+                }
+
+                int totalFalse;
+                int lastNum;
+                cudaMemcpy(&totalFalse, devFalse + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+                cudaMemcpy(&lastNum, devInp + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+                if ((lastNum & (1 << bit)) == 0) totalFalse += 1;
+
+                kernComputeIndices<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, totalFalse, devFalse, devTrue, devIndices);
+
+                kernRadixSortScatter<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, devIndices, devInp, devTrue); // temporarily store output into devTrue buffer
+                std::swap(devInp, devTrue);
+            }
+
+            if (enableTimer) timer().endGpuTimer();
+
+            cudaMemcpy(out, devInp, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy devInp failed!");
+            cudaFree(devInp);
+            checkCUDAError("cudaFree devInp failed!");
+            cudaFree(devTrue);
+            checkCUDAError("cudaFree devTrue failed!");
+            cudaFree(devFalse);
+            checkCUDAError("cudaFree devFalse failed!");
+            cudaFree(devIndices);
+            checkCUDAError("cudaFree devIndices failed!");
         }
     }
 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..7012903 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -6,8 +6,10 @@ namespace StreamCompaction {
     namespace Efficient {
        StreamCompaction::Common::PerformanceTimer& timer();
 
-        void scan(int n, int *odata, const int *idata);
+        void scan(int n, int *odata, const int *idata, bool enableTimer = true);
 
-        int compact(int n, int *odata, const int *idata);
+        int compact(int n, int *odata, const int *idata, bool enableTimer = true);
+
+        void radixSort(int n, int* out, const int* inp, bool enableTimer = true);
     }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..4ea8435 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -1,4 +1,6 @@
+#include <iostream>
 #include <cuda.h>
+#include <utility>
 #include <cuda_runtime.h>
 #include "common.h"
 #include "naive.h"
@@ -16,10 +18,66 @@ namespace StreamCompaction {
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+        __global__ void kernScanStep(int n, int offset, int* inp, int* out) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= n) return;
+            if (idx < (offset + 1)) out[idx] = inp[idx];
+            else {
+                int outValue = inp[idx - offset] + inp[idx];
+                out[idx] = outValue;
+            }
+        }
+
+        __global__ void kernScanFirstStep(int n, int* inp, int* out) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= n) return;
+            else if (idx == 0) out[0] = 0;
+            else if (idx == 1) out[1] = inp[0];
+            else out[idx] = inp[idx - 2] + inp[idx - 1];
+        }
+
+        void scan(int n, int *odata, const int *idata, bool enableTimer) {
+            int* devInp;
+            int* devOut;
+            cudaMalloc((void**)&devInp, n * sizeof(int));
+            checkCUDAError("cudaMalloc devInp failed!");
+            cudaMalloc((void**)&devOut, n * sizeof(int));
+            checkCUDAError("cudaMalloc devOut failed!");
+            cudaMemcpy(devInp, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy idata failed!");
+
+            if (enableTimer) timer().startGpuTimer();
+            dim3 fullBlocksPerGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE);
+            kernScanFirstStep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, devInp, devOut);
+            //cudaMemcpy(odata, devOut, n * sizeof(int), cudaMemcpyDeviceToHost);
+            //std::cout << "d0: ";
+            //for (int i = 0; i < 32; i++) {
+            //    std::cout << odata[i] << " ";
+            //}
+            //std::cout << std::endl;
+            std::swap(devInp, devOut);
+            for (int d = 1; d < ilog2ceil(n); d++) {
+                //launch n-1-2^d threads; offset = 2^d; idxOffset = 2^d+1;
+                int pow2d = 1 << d;
+                //fullBlocksPerGrid = dim3((n-1-pow2d + BLOCK_SIZE - 1) / BLOCK_SIZE);
+                kernScanStep<<<fullBlocksPerGrid, BLOCK_SIZE>>>(n, pow2d, devInp, devOut);
+                //cudaMemcpy(odata, devOut, n * sizeof(int), cudaMemcpyDeviceToHost);
+                //std::cout << "d" << d << ": ";
+                //for (int i = 0; i < 32; i++) {
+                //    std::cout << odata[i] << " ";
+                //}
+                //std::cout << std::endl;
+                std::swap(devInp, devOut);
+            }
+            if (enableTimer) timer().endGpuTimer();
+
+            cudaMemcpy(odata, devInp, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy odata failed!");
+            cudaFree(devInp);
+            checkCUDAError("cudaFree devInp failed!");
+            cudaFree(devOut);
+            checkCUDAError("cudaFree devOut failed!");
         }
     }
 }
diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h
index 37dcb06..a2f1eb5 100644
--- a/stream_compaction/naive.h
+++ b/stream_compaction/naive.h
@@ -6,6 +6,6 @@ namespace StreamCompaction {
     namespace Naive {
         StreamCompaction::Common::PerformanceTimer& timer();
 
-        void scan(int n, int *odata, const int *idata);
+        void scan(int n, int *odata, const int *idata, bool enableTimer = true);
     }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..bb1d01e 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -3,6 +3,7 @@
 #include <thrust/device_vector.h>
 #include <thrust/host_vector.h>
 #include <thrust/scan.h>
+#include <thrust/sort.h>
 #include "common.h"
 #include "thrust.h"
@@ -17,12 +18,42 @@ namespace StreamCompaction {
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
+        void scan(int n, int *odata, const int *idata, bool enableTimer) {
             // TODO use `thrust::exclusive_scan`
             // example: for device_vectors dv_in and dv_out:
             // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
-            timer().endGpuTimer();
+
+            //int data[6] = { 1, 0, 2, 2, 1, 3 };
+            //thrust::exclusive_scan(thrust::host, data, data + 6, data, 4); // in-place scan
+            //// data is now {4, 5, 5, 7, 9, 10}
+
+            thrust::device_vector<int> dInpVec(idata, idata + n);
+            thrust::device_vector<int> dOutVec(n);
+            thrust::host_vector<int> hOutVec(n);
+
+            if (enableTimer) timer().startGpuTimer();
+
+            thrust::exclusive_scan(dInpVec.begin(), dInpVec.end(), dOutVec.begin(), 0);
+
+            if (enableTimer) timer().endGpuTimer();
+
+            thrust::copy(dOutVec.begin(), dOutVec.end(), hOutVec.begin());
+            std::memcpy(odata, hOutVec.data(), n * sizeof(int));
+        }
+
+        void sort(int n, int* odata, const int* idata, bool enableTimer) {
+            thrust::device_vector<int> dInpVec(idata, idata + n);
+            thrust::host_vector<int> hOutVec(n);
+
+            if (enableTimer) timer().startGpuTimer();
+
+            thrust::sort(dInpVec.begin(), dInpVec.end());
+
+            if (enableTimer) timer().endGpuTimer();
+
+            thrust::copy(dInpVec.begin(), dInpVec.end(), hOutVec.begin());
+            std::memcpy(odata, hOutVec.data(), n * sizeof(int));
         }
     }
 }
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index fe98206..37c37f7 100644
--- a/stream_compaction/thrust.h
+++ b/stream_compaction/thrust.h
@@ -6,6 +6,7 @@ namespace StreamCompaction {
     namespace Thrust {
         StreamCompaction::Common::PerformanceTimer& timer();
 
-        void scan(int n, int *odata, const int *idata);
+        void scan(int n, int *odata, const int *idata, bool enableTimer = true);
+        void sort(int n, int *odata, const int *idata, bool enableTimer = true);
     }
 }