diff --git a/INSTRUCTION.md b/INSTRUCTION.md
index a68600b..efb8985 100644
--- a/INSTRUCTION.md
+++ b/INSTRUCTION.md
@@ -63,7 +63,7 @@ In the Boids flocking simulation, particles representing birds or fish
1. cohesion - boids move towards the perceived center of mass of their neighbors
2. separation - boids avoid getting too close to their neighbors
3. alignment - boids generally try to move with the same direction and speed as
-their neighbors
+ their neighbors
These three rules specify a boid's velocity change in a timestep.
At every timestep, a boid thus has to look at each of its neighboring boids
@@ -127,6 +127,7 @@ function rule3(Boid boid)
return perceived_velocity * rule3Scale
end
```
+
Based on [Conrad Parker's notes](http://www.vergenet.net/~conrad/boids/pseudocode.html) with slight adaptations.
For the purposes of an interesting simulation,
we will say that two boids only influence each other if they are
within a certain **neighborhood distance** of each other.
@@ -141,10 +142,13 @@ For an idea of how the simulation "should" look in 3D,
**Please Note** that our pseudocode, our 2D implementation, and our reference code (from which we derived the parameters that ship with the basecode) differ from Conrad Parker's notes in Rule 3 - our references do not subtract the boid's own velocity from the perceived velocity:
Our pseudocode:
+
```
return perceived_velocity * rule3Scale
```
+
Conrad Parker's notes:
+
```
RETURN (pvJ - bJ.velocity) / 8
```
@@ -159,12 +163,12 @@ However, since the purpose of this assignment is to introduce you to CUDA, we re
* `src/main.cpp`: Performs all of the CUDA/OpenGL setup and OpenGL
visualization.
+
* `src/kernel.cu`: CUDA device functions, state, kernels, and CPU functions
for kernel invocations. In place of a unit testing/sandbox framework, there
is space in here for individually running your kernels and getting the output
back from the GPU before running the actual simulation. PLEASE make use of
this in Part 2 to individually test your kernels.
-
1. Search the code for `TODO-1.2` and `LOOK-1.2`.
* `src/kernel.cu`: Use what you learned in the first lectures to
figure out how to resolve these X Part 1 TODOs.
@@ -205,7 +209,7 @@ because:
1. We don't have resizeable arrays on the GPU
2. Naively parallelizing the iteration may lead to race conditions, where two
-particles need to be written into the same bucket on the same clock cycle.
+ particles need to be written into the same bucket on the same clock cycle.
Instead, we will construct the uniform grid by sorting. If we label each boid
with an index representing its enclosing cell and then sort the list of
@@ -227,13 +231,14 @@ homework, we will use the value/key sort built into **Thrust**. See
`Boids::unitTest` in `kernel.cu` for an example of how to use this. A minimal
sketch of this sorting step follows the buffer list below.
Your uniform grid will probably look something like this in GPU memory:
+
- `dev_particleArrayIndices` - buffer containing a pointer for each boid to its
-data in dev_pos and dev_vel1 and dev_vel2
+ data in dev_pos and dev_vel1 and dev_vel2
- `dev_particleGridIndices` - buffer containing the grid index of each boid
- `dev_gridCellStartIndices` - buffer containing a pointer for each cell to the
-beginning of its data in `dev_particleArrayIndices`
+ beginning of its data in `dev_particleArrayIndices`
- `dev_gridCellEndIndices` - buffer containing a pointer for each cell to the
-end of its data in `dev_particleArrayIndices`.
+ end of its data in `dev_particleArrayIndices`.
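As a minimal sketch, assuming the buffer names above and the `kernComputeIndices` kernel stub that ships with the basecode (your actual setup in `kernel.cu` may differ), the label-and-sort step could look like:

```
// Label each boid with its grid cell index and with its own array index.
kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(
    numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
    dev_pos, dev_particleArrayIndices, dev_particleGridIndices);

// Wrap the raw device pointers for Thrust, then sort the array indices
// (values) by grid cell index (keys). Afterwards, boids that share a cell
// are contiguous in dev_particleArrayIndices.
// (thrust::device_ptr and thrust::sort_by_key are used the same way in Boids::unitTest.)
thrust::device_ptr<int> dev_thrust_keys(dev_particleGridIndices);
thrust::device_ptr<int> dev_thrust_values(dev_particleArrayIndices);
thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + numObjects, dev_thrust_values);
```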
Here the term `pointer` when used with buffers is largely interchangeable with
the term `index`, however, you will effectively be using array indices as
@@ -245,6 +250,7 @@
You can toggle between different timestep update modes using the defines in
`main.cpp`.
### 2.2 Play around some more
+
Compare your uniform grid velocity update to your naive velocity update.
In the typical case, the uniform grid version should be considerably faster.
Try to push the limits of how many boids you can simulate.
@@ -252,6 +258,7 @@ Try to push the limits of how many boids you can simulate.
Change the cell width of the uniform grid to be the neighborhood distance, instead of twice the neighborhood distance. Now, 27 neighboring cells will need to be checked for intersection. Does this increase or decrease the efficiency of the flocking?
### 2.3 Cutting out the middleman
+
Consider the uniform grid neighbor search outlined in 2.1: pointers to boids in
a single cell are contiguous in memory, but the boid data itself (velocities
and positions) is scattered all over the place. Try rearranging the boid data
@@ -266,6 +273,7 @@ See the TODOs for Part 2.3. This should involve a slightly modified copy of
your code from 2.1.
## Part 3: Performance Analysis
+
For this project, we will guide you through your performance analysis with some
basic questions. In the future, you will guide your own performance analysis -
but these simple questions will always be critical to answer. In general, we
@@ -277,9 +285,10 @@ metric, but adding your own `cudaTimer`s, etc., will allow you to do more
fine-grained benchmarking of various parts of your code (a minimal CUDA-event
timing sketch appears after the questions below).
REMEMBER:
+
* Do your performance testing in `Release` mode!
* Turn off Vertical Sync in Nvidia Control Panel:
-![Unlock FPS](images/UnlockFPS.png)
+ ![Unlock FPS](images/UnlockFPS.png)
* Performance should always be measured relative to some baseline when
possible. A GPU can make your program faster - but by how much?
* If a change impacts performance, show a comparison. Describe your changes.
@@ -289,6 +298,7 @@ REMEMBER:
### Questions
There are two ways to measure performance:
+
* Disable visualization so that the framerate reported will be for the
simulation only, and not be limited to 60 fps. This way, the framerate
reported in the window title will be useful.
@@ -304,27 +314,26 @@ hypotheses and insights.
**Answer these:**
* For each implementation, how does changing the number of boids affect
-performance? Why do you think this is?
+ performance? Why do you think this is?
* For each implementation, how does changing the block count and block size
-affect performance? Why do you think this is?
+ affect performance? Why do you think this is?
* For the coherent uniform grid: did you experience any performance improvements
-with the more coherent uniform grid? Was this the outcome you expected?
-Why or why not?
+ with the more coherent uniform grid? Was this the outcome you expected?
+ Why or why not?
* Did changing cell width and checking 27 vs 8 neighboring cells affect performance?
-Why or why not? Be careful: it is insufficient (and possibly incorrect) to say
-that 27-cell is slower simply because there are more cells to check!
+ Why or why not? Be careful: it is insufficient (and possibly incorrect) to say
+ that 27-cell is slower simply because there are more cells to check!
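For reference, one lightweight way to add your own timers is with CUDA events. This is only an illustrative sketch; the call being timed and the `DT` constant stand in for whatever your `main.cpp` actually invokes:

```
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
Boids::stepSimulationScatteredGrid(DT);   // whichever step you are measuring
cudaEventRecord(stop);

cudaEventSynchronize(stop);               // wait until the recorded work finishes
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);
std::cout << "step took " << milliseconds << " ms" << std::endl;

cudaEventDestroy(start);
cudaEventDestroy(stop);
```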
**NOTE: Nsight performance analysis tools *cannot* presently be used on the lab computers, as they require administrative access.** If you do not have access to a CUDA-capable computer, the lab computers still allow you to do timing measurements! However, the tools are very useful for performance debugging.
-
## Part 4: Write-up
1. Take a screenshot of the boids **and** use a gif tool like [licecap](http://www.cockos.com/licecap/) to record an animation of the boids with a fixed camera.
-Put this at the top of your README.md. Take a look at [How to make an attractive
-GitHub repo](https://github.com/pjcozzi/Articles/blob/master/CIS565/GitHubRepo/README.md).
+ Put this at the top of your README.md. Take a look at [How to make an attractive
+ GitHub repo](https://github.com/pjcozzi/Articles/blob/master/CIS565/GitHubRepo/README.md).
2. Add your performance analysis. Graphs to include:
- Framerate change with increasing # of boids for naive, scattered uniform grid, and coherent uniform grid (with and without visualization)
- Framerate change with increasing block size
@@ -340,26 +349,26 @@ The template of the comment section of your pull request is attached below, you
* [Repo Link](https://link-to-your-repo)
* (Briefly) Mentions features that you've completed. Especially those bells and whistles you want to highlight
- * Feature 0
- * Feature 1
- * ...
+ * Feature 0
+ * Feature 1
+ * ...
* Feedback on the project itself, if any.
- And you're done!
## Tips
+
- If your simulation crashes before launch, use
-`checkCUDAErrorWithLine("message")` after CUDA invocations
+ `checkCUDAErrorWithLine("message")` after CUDA invocations
- `ctrl + f5` in Visual Studio will launch the program but won't let the window
-close if the program crashes. This way you can see any `checkCUDAErrorWithLine`
-output.
+ close if the program crashes. This way you can see any `checkCUDAErrorWithLine`
+ output.
- For debugging purposes, you can transfer data to and from the GPU.
-See `Boids::unitTest` in `kernel.cu` for an example of how to use this.
+ See `Boids::unitTest` in `kernel.cu` for an example of how to use this.
- For high DPI displays like 4K monitors or the Macbook Pro with Retina Display, you might want to double the rendering resolution and point size. See `main.hpp`.
- Your README.md will be done in github markdown. You can find a [cheatsheet here](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf). There is
-also a [live preview plugin](https://atom.io/packages/markdown-preview) for the
-[atom text editor](https://atom.io/) from github. The same for [VS Code](https://www.visualstudio.com/en-us/products/code-vs.aspx)
+ also a [live preview plugin](https://atom.io/packages/markdown-preview) for the
+ [atom text editor](https://atom.io/) from github. The same for [VS Code](https://www.visualstudio.com/en-us/products/code-vs.aspx)
- If your framerate is capped at 60fps, [disable V-sync](http://support.enmasse.com/tera/enable-v-sync-to-fix-graphics-issues-screen-tearing)
## Optional Extra Credit
diff --git a/README.md b/README.md
index d63a6a1..16fc902 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,94 @@
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
-Project 1 - Flocking**
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Alex Fu
+ * [LinkedIn](https://www.linkedin.com/in/alex-fu-b47b67238/)
+ * [Twitter](https://twitter.com/AlexFu8304)
+ * [Personal Website](https://thecger.com/)
+* Tested on: Windows 10, i7-10750H @ 2.60GHz, 16GB, RTX 3060 6GB
-### (TODO: Your README)
+# CIS 565 - Project 1: Flocking
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+## Results
+
+### Brute force, 50,000 boids, 30+ fps
+
+![Simulation of 50,000 boids](./images/1.2-50000.gif)
+
+### Scattered grids, 50,000 boids, 144+ fps
+
+![Simulation of 50,000 boids](./images/2.1-50000.gif)
+
+### Scattered grids, 1,000,000 boids, 14+ fps
+
+![Simulation of 1,000,000 boids](./images/2.1-1000000.gif)
+
+### Coherent grids, 1,000,000 boids, 35+ fps
+
+![Simulation of 1,000,000 boids](./images/2.3-1000000.gif)
+
+## Analysis
+
+I use the average FPS over seconds 1-11 of the run to represent the performance of the application. I tested how performance is affected by the number of boids, the CUDA block size, the grid cell size, and the search volume.
+
+### Average FPS Impacted by the Number of Boids
+
+The CUDA block size is 128 in all cases.
+
+#### Without Visualization
+
+![Average FPS Impacted by the Number of Boids](./images/boids-num.png)
+
+#### With Visualization
+
+![Average FPS Impacted by the Number of Boids](./images/boids-num-vis.png)
+
+### Average FPS Impacted by the CUDA Block Size
+
+The number of boids is 500,000 for the scattered and coherent grids, and 20,000 for brute force.
+
+![Average FPS Impacted by the CUDA Block Size](./images/block-size.png)
+
+### Average FPS Impacted by the Grid Cell Count (Coherent Grids)
+
+The number of boids is 500,000 and the CUDA block size is 128.
+
+| Cell Width | Cell Count | Average FPS |
+|:----------:|:----------:|:-----------:|
+| 10 | 10648 | 157.8 |
+| 15 | 2744 | 59.0 |
+| 20 | 1728 | 23.0 |
+| 40 | 216 | 7.1 |
+| 80 | 64 | 1.7 |
+
+![Average FPS Impacted by the Grid Cell Count](./images/grid-cell-number.png)
+
+### Comparison Between Searching 8 and 27 Neighboring Cells
+
+Tested with the coherent grid. The CUDA block size is 128.
+
+![Searching 8 vs. 27 Neighboring Cells](./images/27vs8.png)
+
+## Answers to the Questions
+
+* For each implementation, how does changing the number of boids affect performance? Why do you think this is?
+
+  * Generally, the more boids there are, the slower the program runs. However, below about 20,000 boids the trend reverses for the uniform-grid versions. I suspect this is because the boids are so sparse that the search still visits nearly every cell; the memory traffic is then close to that of brute force, and with the extra `if...else...` branches the performance can end up worse than brute force.
+
+* For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
+
+  * I have not found a clear relationship between block size and performance, but a block size of 32 is consistently the worst for performance on my machine.
+
+* For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+
+  * Yes. With fewer grid cells - that is, with larger cells - the program has to check more boids, and performance moves closer to that of brute force.
+
+* Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!
+
+  * Checking 27 neighboring cells is actually significantly faster than checking 8 neighboring cells. I think this is because: 1. when checking 27 neighboring cells, the cell width can be half of what it is when checking 8, so the searched volume decreases to 0.421875 of the original volume; 2. when checking 27 neighboring cells, there are fewer `if...else...` branches (and it is more likely that all threads take the same branch); recalling how warps work, divergent branches hurt performance.
+  * Besides, checking 27 cells is much easier to code.
+
+## Feedback
+
+* There is an unsolved bug in the coherent grid when the number of boids is small. See [my post in Ed Discussion](https://edstem.org/us/courses/28083/discussion/1757498).
+
+* At first all my boids would disappear quickly. It took me a while to realize this was because some values were being divided by zero.
diff --git a/images/1.2-5000.png b/images/1.2-5000.png
new file mode 100644
index 0000000..5375c55
Binary files /dev/null and b/images/1.2-5000.png differ
diff --git a/images/1.2-50000.gif b/images/1.2-50000.gif
new file mode 100644
index 0000000..a902903
Binary files /dev/null and b/images/1.2-50000.gif differ
diff --git a/images/1.2-50000.png b/images/1.2-50000.png
new file mode 100644
index 0000000..e1dc74d
Binary files /dev/null and b/images/1.2-50000.png differ
diff --git a/images/2.1-1000000.gif b/images/2.1-1000000.gif
new file mode 100644
index 0000000..ca81ee2
Binary files /dev/null and b/images/2.1-1000000.gif differ
diff --git a/images/2.1-50000.gif b/images/2.1-50000.gif
new file mode 100644
index 0000000..9a9149a
Binary files /dev/null and b/images/2.1-50000.gif differ
diff --git a/images/2.3-1000000.gif b/images/2.3-1000000.gif
new file mode 100644
index 0000000..1be3613
Binary files /dev/null and b/images/2.3-1000000.gif differ
diff --git a/images/27vs8.png b/images/27vs8.png
new file mode 100644
index 0000000..3083fbe
Binary files /dev/null and b/images/27vs8.png differ
diff --git a/images/block-size.png b/images/block-size.png
new file mode 100644
index 0000000..659b355
Binary files /dev/null and b/images/block-size.png differ
diff --git a/images/boids-num-vis.png b/images/boids-num-vis.png
new file mode 100644
index 0000000..ca716ca
Binary files /dev/null and b/images/boids-num-vis.png differ
diff --git a/images/boids-num.png b/images/boids-num.png
new file mode 100644
index 0000000..2d905c8
Binary files /dev/null and b/images/boids-num.png differ
diff --git a/images/grid-cell-number.png b/images/grid-cell-number.png
new file mode 100644
index 0000000..e0cf03e
Binary files /dev/null and b/images/grid-cell-number.png differ
diff --git a/shaders/boid.vert.glsl b/shaders/boid.vert.glsl
index 57b0759..77da0e4 100644
--- a/shaders/boid.vert.glsl
+++ b/shaders/boid.vert.glsl
@@ -5,6 +5,6 @@ in vec4 Velocity;
out vec4 vFragColorVs;
void main() {
- vFragColorVs = Velocity;
+ vFragColorVs = normalize(Velocity);
gl_Position = Position;
}
diff --git "a/src/kernel - \345\211\257\346\234\254.cu" "b/src/kernel - \345\211\257\346\234\254.cu"
new file mode 100644
index 0000000..aa0818e
--- /dev/null
+++ "b/src/kernel - \345\211\257\346\234\254.cu"
@@ -0,0 +1,667 @@
+#define GLM_FORCE_CUDA
+#include
+#include
+#include
+#include
+#include
+#include
+#include "utilityCore.hpp"
+#include
"kernel.h" + +// LOOK-2.1 potentially useful for doing grid-based neighbor search +#ifndef imax +#define imax( a, b ) ( ((a) > (b)) ? (a) : (b) ) +#endif + +#ifndef imin +#define imin( a, b ) ( ((a) < (b)) ? (a) : (b) ) +#endif + +#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) + +/** +* Check for CUDA errors; print and exit if there was a problem. +*/ +void checkCUDAError(const char* msg, int line = -1) { + cudaError_t err = cudaGetLastError(); + if (cudaSuccess != err) { + if (line >= 0) { + fprintf(stderr, "Line %d: ", line); + } + fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err)); + exit(EXIT_FAILURE); + } +} + + +/***************** +* Configuration * +*****************/ + +/*! Block size used for CUDA kernel launch. */ +#define blockSize 128 + +// LOOK-1.2 Parameters for the boids algorithm. +// These worked well in our reference implementation. +#define rule1Distance 5.0f +#define rule2Distance 3.0f +#define rule3Distance 5.0f + +#define rule1Scale 0.01f +#define rule2Scale 0.1f +#define rule3Scale 0.1f + +#define maxSpeed 3.0f +#define epsilon 1e-4f + +/*! Size of the starting area in simulation space. */ +#define scene_scale 100.0f + +/*********************************************** +* Kernel state (pointers are device pointers) * +***********************************************/ + +int numObjects; +dim3 threadsPerBlock(blockSize); + +// LOOK-1.2 - These buffers are here to hold all your boid information. +// These get allocated for you in Boids::initSimulation. +// Consider why you would need two velocity buffers in a simulation where each +// boid cares about its neighbors' velocities. +// These are called ping-pong buffers. +glm::vec3* dev_pos; +glm::vec3* dev_vel1; +glm::vec3* dev_vel2; + +// LOOK-2.1 - these are NOT allocated for you. You'll have to set up the thrust +// pointers on your own too. + +// For efficient sorting and the uniform grid. These should always be parallel. +int* dev_boidArrayIndices; // What index in dev_pos and dev_velX represents this boid? +int* dev_boidGridIndices; // What grid cell is this boid in? +// needed for use with thrust +std::unique_ptr> dev_thrust_boidArrayIndices; +std::unique_ptr> dev_thrust_boidGridIndices; + +int* dev_gridCellStartIndices; // What part of dev_boidArrayIndices belongs +int* dev_gridCellEndIndices; // to this cell? + +// TODO-2.3 - consider what additional buffers you might need to reshuffle +// the position and velocity data to be coherent within cells. + +// LOOK-2.1 - Grid parameters based on simulation parameters. +// These are automatically computed for you in Boids::initSimulation +int gridCellCount; +int halfGridSideCount; +int gridSideCount; +float gridCellWidth; +float halfGridCellWidth; +float gridinverseCellWidth; +glm::vec3 gridMinimum; + +/****************** +* initSimulation * +******************/ + +__host__ __device__ unsigned int hash(unsigned int a) { + a = (a + 0x7ed55d16) + (a << 12); + a = (a ^ 0xc761c23c) ^ (a >> 19); + a = (a + 0x165667b1) + (a << 5); + a = (a + 0xd3a2646c) ^ (a << 9); + a = (a + 0xfd7046c5) + (a << 3); + a = (a ^ 0xb55a4f09) ^ (a >> 16); + return a; +} + +/** +* LOOK-1.2 - this is a typical helper function for a CUDA kernel. +* Function for generating a random vec3. 
+*/ +__host__ __device__ glm::vec3 generateRandomVec3(float seed, int index, float scale = 1.f) { + thrust::default_random_engine rng(hash((int)(index * seed))); + if (scale < 0) scale = -scale; + thrust::uniform_real_distribution unitDistrib(-scale, scale); + + return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng)); +} + +/** +* LOOK-1.2 - This is a basic CUDA kernel. +* CUDA kernel for generating boids with a specified mass randomly around the star. +*/ +__global__ void kernGenerateRandomArray(float seed, int N, glm::vec3* arr, float scale) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + arr[index] = generateRandomVec3(seed, index, scale); + } +} + +/** +* Initialize memory, update some globals +*/ +void Boids::initSimulation(int N) { + numObjects = N; + dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); + + // LOOK-1.2 - This is basic CUDA memory management and error checking. + // Don't forget to cudaFree in Boids::endSimulation. + cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_pos failed!"); + + cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); + + cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); + + // LOOK-1.2 - This is a typical CUDA kernel invocation. + kernGenerateRandomArray <<>> (1.f, numObjects, + dev_pos, scene_scale); + checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); + kernGenerateRandomArray <<>> (1.f, numObjects, + dev_vel1, maxSpeed); + checkCUDAErrorWithLine("kernGenerateRandomVelArray failed!"); + + // LOOK-2.1 computing grid params + halfGridCellWidth = std::max(std::max(rule1Distance, rule2Distance), rule3Distance); + gridCellWidth = 2.0f * halfGridCellWidth; + halfGridSideCount = (int)(scene_scale / gridCellWidth) + 1; + gridSideCount = 2 * halfGridSideCount; + + gridCellCount = gridSideCount * gridSideCount * gridSideCount; + gridinverseCellWidth = 1.0f / gridCellWidth; + float halfGridWidth = gridCellWidth * halfGridSideCount; + gridMinimum.x -= halfGridWidth; + gridMinimum.y -= halfGridWidth; + gridMinimum.z -= halfGridWidth; + + // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + + cudaMalloc((void**)&dev_boidArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_boidArrayIndices failed!"); + dev_thrust_boidArrayIndices = std::make_unique>(dev_boidArrayIndices); + + cudaMalloc((void**)&dev_boidGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_boidGridIndices failed!"); + dev_thrust_boidGridIndices = std::make_unique>(dev_boidGridIndices); + + cudaMalloc((void**)&dev_gridCellStartIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + cudaDeviceSynchronize(); +} + + +/****************** +* copyBoidsToVBO * +******************/ + +/** +* Copy the boid positions into the VBO so that they can be drawn by OpenGL. 
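+* (Each boid occupies four floats in the VBO: the position rescaled by -1/s_scale, with w = 1.)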
+*/ +__global__ void kernCopyPositionsToVBO(int N, glm::vec3* pos, float* vbo, float s_scale) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + float c_scale = -1.0f / s_scale; + + if (index < N) { + vbo[4 * index + 0] = pos[index].x * c_scale; + vbo[4 * index + 1] = pos[index].y * c_scale; + vbo[4 * index + 2] = pos[index].z * c_scale; + vbo[4 * index + 3] = 1.0f; + } +} + +__global__ void kernCopyVelocitiesToVBO(int N, glm::vec3* vel, float* vbo, float s_scale) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + if (index < N) { + vbo[4 * index + 0] = vel[index].x + 0.3f; + vbo[4 * index + 1] = vel[index].y + 0.3f; + vbo[4 * index + 2] = vel[index].z + 0.3f; + vbo[4 * index + 3] = 1.0f; + } +} + +/** +* Wrapper for call to the kernCopyboidsToVBO CUDA kernel. +*/ +void Boids::copyBoidsToVBO(float* vbodptr_positions, float* vbodptr_velocities) { + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + kernCopyPositionsToVBO <<>> (numObjects, dev_pos, vbodptr_positions, scene_scale); + kernCopyVelocitiesToVBO <<>> (numObjects, dev_vel1, vbodptr_velocities, scene_scale); + + checkCUDAErrorWithLine("copyBoidsToVBO failed!"); + + cudaDeviceSynchronize(); +} + + +/****************** +* stepSimulation * +******************/ + +/** +* LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce. +* __device__ code can be called from a __global__ context +* Compute the new velocity on the body with index `iSelf` due to the `N` boids +* in the `pos` and `vel` arrays. +*/ +__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3* pos, const glm::vec3* vel) { + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + // Rule 2: boids try to stay a distance d away from each other + // Rule 3: boids try to match the speed of surrounding boids + + // https://vergenet.net/~conrad/boids/pseudocode.html + float num1 = 0.f, /*num2 = 0.f, */num3 = 0.f; + glm::vec3 thisPos = pos[iSelf]; + glm::vec3 thisVel = vel[iSelf]; + glm::vec3 rule1Center, rule2Vel, rule3Vel; + for (int i = 0; i < N; i++) { + if (i == iSelf) continue; + glm::vec3 thatPos = pos[i]; + glm::vec3 thatVel = vel[i]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + glm::vec3 outVel; + if (!glm::epsilonEqual(num1, 0.f, epsilon)) outVel += (rule1Center / num1 - thisPos) * rule1Scale; + outVel += rule2Vel * rule2Scale; + if (!glm::epsilonEqual(num3, 0.f, epsilon)) outVel += (rule3Vel / num3/* - thisVel*/) * rule3Scale; + return outVel; +} + +/** +* TODO-1.2 implement basic flocking +* For each of the `N` bodies, update its position based on its current velocity. +*/ +__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3* pos, + glm::vec3* vel1, glm::vec3* vel2) { + // Compute a new velocity based on pos and vel1 + // Clamp the speed + // Record the new velocity into vel2. Question: why NOT vel1? 
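+ // (Answer: writing into vel1 here would race with other threads that are still reading
+ //  vel1 during this step; writing into vel2 and ping-ponging the buffers afterwards avoids that.)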
+ + int idx = threadIdx.x + (blockIdx.x * blockDim.x); + if (idx >= N) return; + glm::vec3 outSpeed = vel1[idx] + computeVelocityChange(N, idx, pos, vel1); + + //if (glm::length(outSpeed) > maxSpeed) + // vel2[idx] = glm::normalize(outSpeed) * maxSpeed; + //else vel2[idx] = outSpeed; + vel2[idx] = glm::clamp(outSpeed, -maxSpeed, maxSpeed); // faster but not accurate +} + +/** +* LOOK-1.2 Since this is pretty trivial, we implemented it for you. +* For each of the `N` bodies, update its position based on its current velocity. +*/ +__global__ void kernUpdatePos(int N, float dt, glm::vec3* pos, glm::vec3* vel) { + // Update position by velocity + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + glm::vec3 thisPos = pos[index]; + thisPos += vel[index] * dt; + + // Wrap the boids around so we don't lose them + thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; + thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; + thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z; + + thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x; + thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y; + thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z; + + pos[index] = thisPos; +} + +// LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index. +// LOOK-2.3 Looking at this method, what would be the most memory efficient +// order for iterating over neighboring grid cells? +// for(x) +// for(y) +// for(z)? Or some other order? +__device__ int gridIndex3Dto1D(int x, int y, int z, int sideCount) { + return x + y * sideCount + z * sideCount * sideCount; +} +__device__ glm::ivec3 gridIndex1Dto3D(int id, int sideCount) { + int x = id % sideCount; + int y = ((id - x) / sideCount) % sideCount; + int z = (id - x - y * sideCount) / (sideCount * sideCount); + return glm::ivec3(x, y, z); +} + +__global__ void kernComputeIndices(int N, int sideCount, + glm::vec3 gridMin, float inverseCellWidth, + glm::vec3* pos, int* boidArrayIndices, int* boidGridIndices) { + // TODO-2.1 + // - Label each boid with the index of its grid cell. + // - Set up a parallel array of integer indices as pointers to the actual + // boid data in pos and vel1/vel2 + + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= N) return; + boidArrayIndices[idx] = idx; + glm::vec3 vec = pos[idx] - gridMin; + int x = (int)(vec.x * inverseCellWidth); + int y = (int)(vec.y * inverseCellWidth); + int z = (int)(vec.z * inverseCellWidth); + boidGridIndices[idx] = gridIndex3Dto1D(x, y, z, sideCount); +} + +// LOOK-2.1 Consider how this could be useful for indicating that a cell +// does not enclose any boids +__global__ void kernResetIntBuffer(int N, int* intBuffer, int value) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + intBuffer[index] = value; + } +} + +__global__ void kernIdentifyCellStartEnd(int N, int* boidGridIndices, + int* gridCellStartIndices, int* gridCellEndIndices) { + // TODO-2.1 + // Identify the start point of each cell in the gridIndices array. + // This is basically a parallel unrolling of a loop that goes + // "this index doesn't match the one before it, must be a new cell!" 
+ + // must be called after the sort + + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= N) return; + if (idx == 0) { + gridCellStartIndices[boidGridIndices[0]] = 0; + } + else if (idx == N - 1) { + gridCellEndIndices[boidGridIndices[N - 1]] = N - 1; + } + else { + if (boidGridIndices[idx] != boidGridIndices[idx - 1]) + gridCellStartIndices[boidGridIndices[idx]] = idx; + if (boidGridIndices[idx] != boidGridIndices[idx + 1]) + gridCellEndIndices[boidGridIndices[idx]] = idx; + } +} + +__device__ glm::vec3 gridComputeVelocityChange( + int idx, int idxTrue, int* boidArrayIndices, int searchNum, int* searchSpace, int * gridCellStartIndices, int* gridCellEndIndices, const glm::vec3* pos, const glm::vec3* vel) { + float num1 = 0.f, /*num2 = 0.f, */num3 = 0.f; + glm::vec3 thisPos = pos[idxTrue]; + glm::vec3 thisVel = vel[idxTrue]; + glm::vec3 rule1Center, rule2Vel, rule3Vel; + for (int i = 0; i < searchNum; i++) { + int cellIdx = searchSpace[i]; + if (gridCellStartIndices[cellIdx] == -1) continue; //cell doesn't have boids + for (int ii = gridCellStartIndices[cellIdx]; ii <= gridCellEndIndices[cellIdx]; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[boidArrayIndices[ii]]; + glm::vec3 thatVel = vel[boidArrayIndices[ii]]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + glm::vec3 outVel; + if (!glm::epsilonEqual(num1, 0.f, epsilon)) outVel += (rule1Center / num1 - thisPos) * rule1Scale; + outVel += rule2Vel * rule2Scale; + if (!glm::epsilonEqual(num3, 0.f, epsilon)) outVel += (rule3Vel / num3/* - thisVel*/) * rule3Scale; + return outVel; +} + +__global__ void kernUpdateVelNeighborSearchScattered( + int N, int sideCount, int halfSideCount, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, float halfCellWidth, + int* gridCellStartIndices, int* gridCellEndIndices, + int* boidArrayIndices, int* boidGridIndices, + glm::vec3* pos, glm::vec3* vel1, glm::vec3* vel2) { + // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce + // the number of boids that need to be checked. + // - Identify the grid cell that this boid is in + // - Identify which cells may contain neighbors. This isn't always 8. + // - For each cell, read the start/end indices in the boid pointer array. + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. 
+ // - Clamp the speed change before putting the new speed in vel2 + + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= N) return; + int idxTrue = boidArrayIndices[idx]; + glm::vec3 thisPos = pos[idxTrue]; + glm::vec3 thisVel = vel1[idxTrue]; + int cellIdx = boidGridIndices[idx]; + glm::ivec3 cellCoord = gridIndex1Dto3D(cellIdx, sideCount); + glm::vec3 vec = thisPos - gridMin; + float x = vec.x - cellWidth * cellCoord.x; + float y = vec.y - cellWidth * cellCoord.y; + float z = vec.z - cellWidth * cellCoord.z; + + int searchX = 0; + int searchY = 0; + int searchZ = 0; + int searchSpace[8] = { -1 }; + int searchSpacePtr = 0; + + // add itself + searchSpace[searchSpacePtr++] = cellIdx; + + if (x < halfCellWidth && cellCoord.x > -halfSideCount) searchX = -1; + if (x >= halfCellWidth && cellCoord.x < (halfSideCount-1)) searchX = 1; + if (y < halfCellWidth && cellCoord.y > -halfSideCount) searchY = -1; + if (y >= halfCellWidth && cellCoord.y < (halfSideCount-1)) searchY = 1; + if (z < halfCellWidth && cellCoord.z > -halfSideCount) searchZ = -1; + if (z >= halfCellWidth && cellCoord.z < (halfSideCount-1)) searchZ = 1; + + // fill search space + if (searchX != 0) + searchSpace[searchSpacePtr++] = gridIndex3Dto1D(x + searchX, y, z, sideCount); + if (searchX != 0 && searchY != 0) + searchSpace[searchSpacePtr++] = gridIndex3Dto1D(x + searchX, y + searchY, z, sideCount); + if (searchX != 0 && searchY != 0 && searchZ != 0) + searchSpace[searchSpacePtr++] = gridIndex3Dto1D(x + searchX, y + searchY, z + searchZ, sideCount); + if (searchX != 0 && searchZ != 0) + searchSpace[searchSpacePtr++] = gridIndex3Dto1D(x + searchX, y, z + searchZ, sideCount); + if (searchY != 0) + searchSpace[searchSpacePtr++] = gridIndex3Dto1D(x, y + searchY, z, sideCount); + if (searchZ != 0 && searchY != 0) + searchSpace[searchSpacePtr++] = gridIndex3Dto1D(x, y + searchY, z + searchZ, sideCount); + if (searchZ != 0) + searchSpace[searchSpacePtr++] = gridIndex3Dto1D(x, y, z + searchZ, sideCount); + + glm::vec3 outSpeed = vel1[idx] + gridComputeVelocityChange(idx, idxTrue, boidArrayIndices, searchSpacePtr + 1, searchSpace, gridCellStartIndices, gridCellEndIndices, pos, vel1); + + //if (glm::length(outSpeed) > maxSpeed) + // vel2[idx] = glm::normalize(outSpeed) * maxSpeed; + //else vel2[idx] = outSpeed; + vel2[idx] = glm::clamp(outSpeed, -maxSpeed, maxSpeed); // faster but not accurate +} + +__global__ void kernUpdateVelNeighborSearchCoherent( + int N, int sideCount, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, + int* gridCellStartIndices, int* gridCellEndIndices, + glm::vec3* pos, glm::vec3* vel1, glm::vec3* vel2) { + // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, + // except with one less level of indirection. + // This should expect gridCellStartIndices and gridCellEndIndices to refer + // directly to pos and vel1. + // - Identify the grid cell that this boid is in + // - Identify which cells may contain neighbors. This isn't always 8. + // - For each cell, read the start/end indices in the boid pointer array. + // DIFFERENCE: For best results, consider what order the cells should be + // checked in to maximize the memory benefits of reordering the boids data. + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + // - Clamp the speed change before putting the new speed in vel2 +} + +/** +* Step the entire N-body simulation by `dt` seconds. 
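+* (The velocity buffers are ping-ponged after each update so the next step reads the new velocities.)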
+*/ +void Boids::stepSimulationNaive(float dt) { + // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. + // TODO-1.2 ping-pong the velocity buffers + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + kernUpdateVelocityBruteForce <<>> (numObjects, dev_pos, dev_vel1, dev_vel2); + kernUpdatePos <<>> (numObjects, dt, dev_pos, dev_vel2); + std::swap(dev_vel1, dev_vel2); +} + +void Boids::stepSimulationScatteredGrid(float dt) { + // TODO-2.1 + // Uniform Grid Neighbor search using Thrust sort. + // In Parallel: + // - label each boid with its array index as well as its grid index. + // Use 2x width grids. + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you + // are welcome to do a performance comparison. + // - Naively unroll the loop for finding the start and end indices of each + // cell's data pointers in the array of boid indices + // - Perform velocity updates using neighbor search + // - Update positions + // - Ping-pong buffers as needed + + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + kernComputeIndices << > > ( + numObjects, + gridSideCount, + gridMinimum, + gridinverseCellWidth, + dev_pos, + dev_boidArrayIndices, + dev_boidGridIndices); + + //int keys[6] = { 1, 4, 2, 8, 5, 7 }; + //char values[6] = { 'a', 'b', 'c', 'd', 'e', 'f' }; + //thrust::sort_by_key(thrust::host, keys, keys + N, values); + // keys is now { 1, 2, 4, 5, 7, 8} + // values is now {'a', 'c', 'b', 'e', 'f', 'd'} + thrust::sort_by_key(*dev_thrust_boidGridIndices, *dev_thrust_boidGridIndices + numObjects, *dev_thrust_boidArrayIndices); + + kernResetIntBuffer << > > (numObjects, dev_gridCellStartIndices, -1); + kernResetIntBuffer << > > (numObjects, dev_gridCellEndIndices, -1); + kernIdentifyCellStartEnd <<>> (numObjects, dev_boidGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + kernUpdateVelNeighborSearchScattered << > > ( + numObjects, gridSideCount, halfGridSideCount, gridMinimum, gridinverseCellWidth, gridCellWidth, halfGridCellWidth, dev_gridCellStartIndices, + dev_gridCellEndIndices, dev_boidArrayIndices, dev_boidGridIndices, dev_pos, dev_vel1, dev_vel2); + kernUpdatePos <<>> (numObjects, dt, dev_pos, dev_vel2); + std::swap(dev_vel1, dev_vel2); +} + +void Boids::stepSimulationCoherentGrid(float dt) { + // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid + // Uniform Grid Neighbor search using Thrust sort on cell-coherent data. + // In Parallel: + // - Label each boid with its array index as well as its grid index. + // Use 2x width grids + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you + // are welcome to do a performance comparison. + // - Naively unroll the loop for finding the start and end indices of each + // cell's data pointers in the array of boid indices + // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all + // the boid data in the simulation array. + // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED + // - Perform velocity updates using neighbor search + // - Update positions + // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. +} + +void Boids::endSimulation() { + cudaFree(dev_vel1); + cudaFree(dev_vel2); + cudaFree(dev_pos); + + // TODO-2.1 TODO-2.3 - Free any additional buffers here. + + cudaFree(dev_boidArrayIndices); + cudaFree(dev_boidGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); +} + +void Boids::unitTest() { + // LOOK-1.2 Feel free to write additional tests here. 
+ + // test unstable sort + int* dev_intKeys; + int* dev_intValues; + int N = 10; + + std::unique_ptrintKeys{ new int[N] }; + std::unique_ptrintValues{ new int[N] }; + + intKeys[0] = 0; intValues[0] = 0; + intKeys[1] = 1; intValues[1] = 1; + intKeys[2] = 0; intValues[2] = 2; + intKeys[3] = 3; intValues[3] = 3; + intKeys[4] = 0; intValues[4] = 4; + intKeys[5] = 2; intValues[5] = 5; + intKeys[6] = 2; intValues[6] = 6; + intKeys[7] = 0; intValues[7] = 7; + intKeys[8] = 5; intValues[8] = 8; + intKeys[9] = 6; intValues[9] = 9; + + cudaMalloc((void**)&dev_intKeys, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!"); + + cudaMalloc((void**)&dev_intValues, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!"); + + dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); + + std::cout << "before unstable sort: " << std::endl; + for (int i = 0; i < N; i++) { + std::cout << " key: " << intKeys[i]; + std::cout << " value: " << intValues[i] << std::endl; + } + + // How to copy data to the GPU + cudaMemcpy(dev_intKeys, intKeys.get(), sizeof(int) * N, cudaMemcpyHostToDevice); + cudaMemcpy(dev_intValues, intValues.get(), sizeof(int) * N, cudaMemcpyHostToDevice); + + // Wrap device vectors in thrust iterators for use with thrust. + thrust::device_ptr dev_thrust_keys(dev_intKeys); + thrust::device_ptr dev_thrust_values(dev_intValues); + // LOOK-2.1 Example for using thrust::sort_by_key + thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values); + + // How to copy data back to the CPU side from the GPU + cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost); + cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost); + checkCUDAErrorWithLine("memcpy back failed!"); + + std::cout << "after unstable sort: " << std::endl; + for (int i = 0; i < N; i++) { + std::cout << " key: " << intKeys[i]; + std::cout << " value: " << intValues[i] << std::endl; + } + + // cleanup + cudaFree(dev_intKeys); + cudaFree(dev_intValues); + checkCUDAErrorWithLine("cudaFree failed!"); + return; +} diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..7a5f9bf 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -1,8 +1,11 @@ #define GLM_FORCE_CUDA #include #include +#include #include #include +#include +#include #include "utilityCore.hpp" #include "kernel.h" @@ -20,15 +23,15 @@ /** * Check for CUDA errors; print and exit if there was a problem. */ -void checkCUDAError(const char *msg, int line = -1) { - cudaError_t err = cudaGetLastError(); - if (cudaSuccess != err) { - if (line >= 0) { - fprintf(stderr, "Line %d: ", line); - } - fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err)); - exit(EXIT_FAILURE); - } +void checkCUDAError(const char* msg, int line = -1) { + cudaError_t err = cudaGetLastError(); + if (cudaSuccess != err) { + if (line >= 0) { + fprintf(stderr, "Line %d: ", line); + } + fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err)); + exit(EXIT_FAILURE); + } } @@ -45,11 +48,15 @@ void checkCUDAError(const char *msg, int line = -1) { #define rule2Distance 3.0f #define rule3Distance 5.0f +#define CUSTOMIZE_CELL_WIDTH 0 +#define SEARCH_27 false + #define rule1Scale 0.01f #define rule2Scale 0.1f #define rule3Scale 0.1f -#define maxSpeed 1.0f +#define maxSpeed 4.0f +#define epsilon 1e-4f /*! Size of the starting area in simulation space. 
*/ #define scene_scale 100.0f @@ -66,22 +73,23 @@ dim3 threadsPerBlock(blockSize); // Consider why you would need two velocity buffers in a simulation where each // boid cares about its neighbors' velocities. // These are called ping-pong buffers. -glm::vec3 *dev_pos; -glm::vec3 *dev_vel1; -glm::vec3 *dev_vel2; +glm::vec3* dev_pos1; +glm::vec3* dev_pos2; //ping pong buffer for 2.3 +glm::vec3* dev_vel1; +glm::vec3* dev_vel2; // LOOK-2.1 - these are NOT allocated for you. You'll have to set up the thrust // pointers on your own too. // For efficient sorting and the uniform grid. These should always be parallel. -int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle? -int *dev_particleGridIndices; // What grid cell is this particle in? +int* dev_boidArrayIndices; // What index in dev_pos and dev_velX represents this boid? +int* dev_boidGridIndices; // What grid cell is this boid in? // needed for use with thrust -thrust::device_ptr dev_thrust_particleArrayIndices; -thrust::device_ptr dev_thrust_particleGridIndices; +std::unique_ptr> dev_thrust_boidArrayIndices; +std::unique_ptr> dev_thrust_boidGridIndices; -int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs -int *dev_gridCellEndIndices; // to this cell? +int* dev_gridCellStartIndices; // What part of dev_boidArrayIndices belongs +int* dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. @@ -89,9 +97,11 @@ int *dev_gridCellEndIndices; // to this cell? // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation int gridCellCount; +int halfGridSideCount; int gridSideCount; float gridCellWidth; -float gridInverseCellWidth; +float halfGridCellWidth; +float gridinverseCellWidth; glm::vec3 gridMinimum; /****************** @@ -99,77 +109,107 @@ glm::vec3 gridMinimum; ******************/ __host__ __device__ unsigned int hash(unsigned int a) { - a = (a + 0x7ed55d16) + (a << 12); - a = (a ^ 0xc761c23c) ^ (a >> 19); - a = (a + 0x165667b1) + (a << 5); - a = (a + 0xd3a2646c) ^ (a << 9); - a = (a + 0xfd7046c5) + (a << 3); - a = (a ^ 0xb55a4f09) ^ (a >> 16); - return a; + a = (a + 0x7ed55d16) + (a << 12); + a = (a ^ 0xc761c23c) ^ (a >> 19); + a = (a + 0x165667b1) + (a << 5); + a = (a + 0xd3a2646c) ^ (a << 9); + a = (a + 0xfd7046c5) + (a << 3); + a = (a ^ 0xb55a4f09) ^ (a >> 16); + return a; } /** * LOOK-1.2 - this is a typical helper function for a CUDA kernel. * Function for generating a random vec3. */ -__host__ __device__ glm::vec3 generateRandomVec3(float time, int index) { - thrust::default_random_engine rng(hash((int)(index * time))); - thrust::uniform_real_distribution unitDistrib(-1, 1); +__host__ __device__ glm::vec3 generateRandomVec3(int seed, int index, float scale = 1.f) { + thrust::default_random_engine rng(hash(index * seed)); + if (scale < 0) scale = -scale; + thrust::uniform_real_distribution unitDistrib(-scale, scale); - return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng)); + return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng)); } /** * LOOK-1.2 - This is a basic CUDA kernel. * CUDA kernel for generating boids with a specified mass randomly around the star. 
*/ -__global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) { - int index = (blockIdx.x * blockDim.x) + threadIdx.x; - if (index < N) { - glm::vec3 rand = generateRandomVec3(time, index); - arr[index].x = scale * rand.x; - arr[index].y = scale * rand.y; - arr[index].z = scale * rand.z; - } +__global__ void kernGenerateRandomArray(int seed, int N, glm::vec3* arr, float scale) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + arr[index] = generateRandomVec3(seed, index, scale); + } } /** * Initialize memory, update some globals */ void Boids::initSimulation(int N) { - numObjects = N; - dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); - - // LOOK-1.2 - This is basic CUDA memory management and error checking. - // Don't forget to cudaFree in Boids::endSimulation. - cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_pos failed!"); - - cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); - - cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); - - // LOOK-1.2 - This is a typical CUDA kernel invocation. - kernGenerateRandomPosArray<<>>(1, numObjects, - dev_pos, scene_scale); - checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); - - // LOOK-2.1 computing grid params - gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); - int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; - gridSideCount = 2 * halfSideCount; - - gridCellCount = gridSideCount * gridSideCount * gridSideCount; - gridInverseCellWidth = 1.0f / gridCellWidth; - float halfGridWidth = gridCellWidth * halfSideCount; - gridMinimum.x -= halfGridWidth; - gridMinimum.y -= halfGridWidth; - gridMinimum.z -= halfGridWidth; - - // TODO-2.1 TODO-2.3 - Allocate additional buffers here. - cudaDeviceSynchronize(); + numObjects = N; + dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); + + // LOOK-1.2 - This is basic CUDA memory management and error checking. + // Don't forget to cudaFree in Boids::endSimulation. + cudaMalloc((void**)&dev_pos1, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_pos failed!"); + + cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); + + cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); + + // LOOK-1.2 - This is a typical CUDA kernel invocation. 
+ kernGenerateRandomArray <<>> (/*time(NULL)*/1, numObjects, + dev_pos1, scene_scale); + checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); + kernGenerateRandomArray <<>> (/*time(NULL)*/1, numObjects, + dev_vel1, maxSpeed); + checkCUDAErrorWithLine("kernGenerateRandomVelArray failed!"); + + // LOOK-2.1 computing grid params + if (CUSTOMIZE_CELL_WIDTH == 0) { + halfGridCellWidth = std::max(std::max(rule1Distance, rule2Distance), rule3Distance); + gridCellWidth = 2.0f * halfGridCellWidth; + if (SEARCH_27) { + gridCellWidth /= 2.f; + halfGridCellWidth /= 2.f; + } + } + else { + halfGridCellWidth = CUSTOMIZE_CELL_WIDTH / 2.f; + gridCellWidth = CUSTOMIZE_CELL_WIDTH; + } + halfGridSideCount = (int)(scene_scale / gridCellWidth) + 1; + gridSideCount = 2 * halfGridSideCount; + + gridCellCount = gridSideCount * gridSideCount * gridSideCount; + gridinverseCellWidth = 1.0f / gridCellWidth; + float halfGridWidth = gridCellWidth * halfGridSideCount; + gridMinimum.x -= halfGridWidth; + gridMinimum.y -= halfGridWidth; + gridMinimum.z -= halfGridWidth; + + // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + + cudaMalloc((void**)&dev_boidArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_boidArrayIndices failed!"); + dev_thrust_boidArrayIndices = std::make_unique>(dev_boidArrayIndices); + + cudaMalloc((void**)&dev_boidGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_boidGridIndices failed!"); + dev_thrust_boidGridIndices = std::make_unique>(dev_boidGridIndices); + + cudaMalloc((void**)&dev_gridCellStartIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + cudaMalloc((void**)&dev_pos2, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_pos2 failed!"); + + cudaDeviceSynchronize(); } @@ -180,42 +220,42 @@ void Boids::initSimulation(int N) { /** * Copy the boid positions into the VBO so that they can be drawn by OpenGL. 
*/ -__global__ void kernCopyPositionsToVBO(int N, glm::vec3 *pos, float *vbo, float s_scale) { - int index = threadIdx.x + (blockIdx.x * blockDim.x); +__global__ void kernCopyPositionsToVBO(int N, glm::vec3* pos, float* vbo, float s_scale) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); - float c_scale = -1.0f / s_scale; + float c_scale = -1.0f / s_scale; - if (index < N) { - vbo[4 * index + 0] = pos[index].x * c_scale; - vbo[4 * index + 1] = pos[index].y * c_scale; - vbo[4 * index + 2] = pos[index].z * c_scale; - vbo[4 * index + 3] = 1.0f; - } + if (index < N) { + vbo[4 * index + 0] = pos[index].x * c_scale; + vbo[4 * index + 1] = pos[index].y * c_scale; + vbo[4 * index + 2] = pos[index].z * c_scale; + vbo[4 * index + 3] = 1.0f; + } } -__global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float s_scale) { - int index = threadIdx.x + (blockIdx.x * blockDim.x); +__global__ void kernCopyVelocitiesToVBO(int N, glm::vec3* vel, float* vbo, float s_scale) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); - if (index < N) { - vbo[4 * index + 0] = vel[index].x + 0.3f; - vbo[4 * index + 1] = vel[index].y + 0.3f; - vbo[4 * index + 2] = vel[index].z + 0.3f; - vbo[4 * index + 3] = 1.0f; - } + if (index < N) { + vbo[4 * index + 0] = vel[index].x + 0.3f; + vbo[4 * index + 1] = vel[index].y + 0.3f; + vbo[4 * index + 2] = vel[index].z + 0.3f; + vbo[4 * index + 3] = 1.0f; + } } /** * Wrapper for call to the kernCopyboidsToVBO CUDA kernel. */ -void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) { - dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); +void Boids::copyBoidsToVBO(float* vbodptr_positions, float* vbodptr_velocities) { + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); - kernCopyPositionsToVBO << > >(numObjects, dev_pos, vbodptr_positions, scene_scale); - kernCopyVelocitiesToVBO << > >(numObjects, dev_vel1, vbodptr_velocities, scene_scale); + kernCopyPositionsToVBO <<>> (numObjects, dev_pos1, vbodptr_positions, scene_scale); + kernCopyVelocitiesToVBO <<>> (numObjects, dev_vel1, vbodptr_velocities, scene_scale); - checkCUDAErrorWithLine("copyBoidsToVBO failed!"); + checkCUDAErrorWithLine("copyBoidsToVBO failed!"); - cudaDeviceSynchronize(); + cudaDeviceSynchronize(); } @@ -229,47 +269,83 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * Compute the new velocity on the body with index `iSelf` due to the `N` boids * in the `pos` and `vel` arrays. 
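* (Returns only the velocity change from rules 1-3; the caller adds it to the current velocity and clamps.)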
*/ -__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); +__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3* pos, const glm::vec3* vel) { + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + // Rule 2: boids try to stay a distance d away from each other + // Rule 3: boids try to match the speed of surrounding boids + + // https://vergenet.net/~conrad/boids/pseudocode.html + float num1 = 0.f, /*num2 = 0.f, */num3 = 0.f; + glm::vec3 thisPos = pos[iSelf]; + glm::vec3 thisVel = vel[iSelf]; + glm::vec3 rule1Center, rule2Vel, rule3Vel; + for (int i = 0; i < N; i++) { + if (i == iSelf) continue; + glm::vec3 thatPos = pos[i]; + glm::vec3 thatVel = vel[i]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + glm::vec3 outVel; + if (num1 > 0) outVel += (rule1Center / num1 - thisPos) * rule1Scale; + outVel += rule2Vel * rule2Scale; + if (num3 > 0) outVel += (rule3Vel / num3/* - thisVel*/) * rule3Scale; + return outVel; } /** * TODO-1.2 implement basic flocking * For each of the `N` bodies, update its position based on its current velocity. */ -__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, - glm::vec3 *vel1, glm::vec3 *vel2) { - // Compute a new velocity based on pos and vel1 - // Clamp the speed - // Record the new velocity into vel2. Question: why NOT vel1? +__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3* pos, + glm::vec3* vel1, glm::vec3* vel2) { + // Compute a new velocity based on pos and vel1 + // Clamp the speed + // Record the new velocity into vel2. Question: why NOT vel1? + + int idx = threadIdx.x + (blockIdx.x * blockDim.x); + if (idx >= N) return; + glm::vec3 outSpeed = vel1[idx] + computeVelocityChange(N, idx, pos, vel1); + + //if (glm::length(outSpeed) > maxSpeed) + // vel2[idx] = glm::normalize(outSpeed) * maxSpeed; + //else vel2[idx] = outSpeed; + vel2[idx] = glm::clamp(outSpeed, -maxSpeed, maxSpeed); // faster but not accurate } /** * LOOK-1.2 Since this is pretty trivial, we implemented it for you. * For each of the `N` bodies, update its position based on its current velocity. */ -__global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { - // Update position by velocity - int index = threadIdx.x + (blockIdx.x * blockDim.x); - if (index >= N) { - return; - } - glm::vec3 thisPos = pos[index]; - thisPos += vel[index] * dt; - - // Wrap the boids around so we don't lose them - thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; - thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; - thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z; - - thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x; - thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y; - thisPos.z = thisPos.z > scene_scale ? 
-scene_scale : thisPos.z; - - pos[index] = thisPos; +__global__ void kernUpdatePos(int N, float dt, glm::vec3* pos, glm::vec3* vel) { + // Update position by velocity + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + glm::vec3 thisPos = pos[index]; + thisPos += vel[index] * dt; + + // Wrap the boids around so we don't lose them + thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; + thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; + thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z; + + thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x; + thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y; + thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z; + + pos[index] = thisPos; } // LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index. @@ -278,180 +354,798 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { // for(x) // for(y) // for(z)? Or some other order? -__device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { - return x + y * gridResolution + z * gridResolution * gridResolution; +__device__ int gridIndex3Dto1D(int x, int y, int z, int sideCount) { + return x + y * sideCount + z * sideCount * sideCount; +} +__device__ glm::ivec3 gridIndex1Dto3D(int id, int sideCount) { + int x = id % sideCount; + int y = ((id - x) / sideCount) % sideCount; + int z = (id - x - y * sideCount) / (sideCount * sideCount); + return glm::ivec3(x, y, z); } -__global__ void kernComputeIndices(int N, int gridResolution, - glm::vec3 gridMin, float inverseCellWidth, - glm::vec3 *pos, int *indices, int *gridIndices) { - // TODO-2.1 - // - Label each boid with the index of its grid cell. - // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 +__global__ void kernComputeIndices(int N, int sideCount, + glm::vec3 gridMin, float inverseCellWidth, + glm::vec3* pos, int* boidArrayIndices, int* boidGridIndices) { + // TODO-2.1 + // - Label each boid with the index of its grid cell. + // - Set up a parallel array of integer indices as pointers to the actual + // boid data in pos and vel1/vel2 + + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= N) return; + boidArrayIndices[idx] = idx; + glm::vec3 vec = pos[idx] - gridMin; + int x = floor(vec.x * inverseCellWidth); + int y = floor(vec.y * inverseCellWidth); + int z = floor(vec.z * inverseCellWidth); + boidGridIndices[idx] = gridIndex3Dto1D(x, y, z, sideCount); } // LOOK-2.1 Consider how this could be useful for indicating that a cell // does not enclose any boids -__global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { - int index = (blockIdx.x * blockDim.x) + threadIdx.x; - if (index < N) { - intBuffer[index] = value; - } +__global__ void kernResetIntBuffer(int N, int* intBuffer, int value) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + intBuffer[index] = value; + } +} + +__global__ void kernIdentifyCellStartEnd(int N, int* boidGridIndices, + int* gridCellStartIndices, int* gridCellEndIndices) { + // TODO-2.1 + // Identify the start point of each cell in the gridIndices array. + // This is basically a parallel unrolling of a loop that goes + // "this index doesn't match the one before it, must be a new cell!" 
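+  //
+  // Illustrative example (editorial note, not part of the original basecode):
+  // with N = 6 and a sorted boidGridIndices array of [2, 2, 2, 5, 5, 7], this
+  // kernel produces gridCellStartIndices[2] = 0, gridCellEndIndices[2] = 2,
+  // gridCellStartIndices[5] = 3, gridCellEndIndices[5] = 4, and
+  // gridCellStartIndices[7] = gridCellEndIndices[7] = 5. Every other cell
+  // keeps the -1 written by kernResetIntBuffer, which marks it as empty.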
+ + // must be called after the sort + + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= N) return; + if (idx == 0) { + gridCellStartIndices[boidGridIndices[0]] = 0; + } + else if (boidGridIndices[idx] != boidGridIndices[idx - 1]) + gridCellStartIndices[boidGridIndices[idx]] = idx; + if (idx == N - 1) { + gridCellEndIndices[boidGridIndices[N - 1]] = N - 1; + } + else if (boidGridIndices[idx] != boidGridIndices[idx + 1]) + gridCellEndIndices[boidGridIndices[idx]] = idx; } -__global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, - int *gridCellStartIndices, int *gridCellEndIndices) { - // TODO-2.1 - // Identify the start point of each cell in the gridIndices array. - // This is basically a parallel unrolling of a loop that goes - // "this index doesn't match the one before it, must be a new cell!" +__device__ glm::vec3 gridComputeVelocityChangeScattered( + int idx, int idxTrue, int* boidArrayIndices, int searchCell, int sideCount, int sideCount2, int searchX, int searchY, int searchZ, + int* gridCellStartIndices, int* gridCellEndIndices, const glm::vec3* pos, const glm::vec3* vel) { + float num1 = 0.f, /*num2 = 0.f, */num3 = 0.f; + glm::vec3 thisPos = pos[idxTrue]; + glm::vec3 thisVel = vel[idxTrue]; + glm::vec3 rule1Center, rule2Vel, rule3Vel; + { + int startIdx = gridCellStartIndices[searchCell]; + int endIdx = gridCellEndIndices[searchCell]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[boidArrayIndices[ii]]; + glm::vec3 thatVel = vel[boidArrayIndices[ii]]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchX != 0) { + int startIdx = gridCellStartIndices[searchCell + searchX]; + int endIdx = gridCellEndIndices[searchCell + searchX]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[boidArrayIndices[ii]]; + glm::vec3 thatVel = vel[boidArrayIndices[ii]]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchX != 0 && searchY != 0) { + int startIdx = gridCellStartIndices[searchCell + searchX + searchY * sideCount]; + int endIdx = gridCellEndIndices[searchCell + searchX + searchY * sideCount]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[boidArrayIndices[ii]]; + glm::vec3 thatVel = vel[boidArrayIndices[ii]]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchX != 0 && searchY != 0 && searchZ != 0) { + int startIdx = gridCellStartIndices[searchCell + searchX + searchY * sideCount + searchZ * sideCount2]; + int endIdx = gridCellEndIndices[searchCell + searchX + searchY * 
sideCount + searchZ * sideCount2]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[boidArrayIndices[ii]]; + glm::vec3 thatVel = vel[boidArrayIndices[ii]]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchX != 0 && searchZ != 0) { + int startIdx = gridCellStartIndices[searchCell + searchX + searchZ * sideCount2]; + int endIdx = gridCellEndIndices[searchCell + searchX + searchZ * sideCount2]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[boidArrayIndices[ii]]; + glm::vec3 thatVel = vel[boidArrayIndices[ii]]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchY != 0) { + int startIdx = gridCellStartIndices[searchCell + searchY * sideCount]; + int endIdx = gridCellEndIndices[searchCell + searchY * sideCount]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[boidArrayIndices[ii]]; + glm::vec3 thatVel = vel[boidArrayIndices[ii]]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchZ != 0 && searchY != 0) { + int startIdx = gridCellStartIndices[searchCell + searchY * sideCount + searchZ * sideCount2]; + int endIdx = gridCellEndIndices[searchCell + searchY * sideCount + searchZ * sideCount2]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[boidArrayIndices[ii]]; + glm::vec3 thatVel = vel[boidArrayIndices[ii]]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchZ != 0) { + int startIdx = gridCellStartIndices[searchCell + searchZ * sideCount2]; + int endIdx = gridCellEndIndices[searchCell + searchZ * sideCount2]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[boidArrayIndices[ii]]; + glm::vec3 thatVel = vel[boidArrayIndices[ii]]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + glm::vec3 outVel; + if (num1 > 0) outVel += (rule1Center / num1 - thisPos) * rule1Scale; + outVel += rule2Vel * rule2Scale; 
+ if (num3 > 0) outVel += (rule3Vel / num3/* - thisVel*/) * rule3Scale; + return outVel; } __global__ void kernUpdateVelNeighborSearchScattered( - int N, int gridResolution, glm::vec3 gridMin, - float inverseCellWidth, float cellWidth, - int *gridCellStartIndices, int *gridCellEndIndices, - int *particleArrayIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce - // the number of boids that need to be checked. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 + int N, int sideCount, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, float halfCellWidth, + int* gridCellStartIndices, int* gridCellEndIndices, + int* boidArrayIndices, int* boidGridIndices, + glm::vec3* pos, glm::vec3* vel1, glm::vec3* vel2) { + // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce + // the number of boids that need to be checked. + // - Identify the grid cell that this boid is in + // - Identify which cells may contain neighbors. This isn't always 8. + // - For each cell, read the start/end indices in the boid pointer array. + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + // - Clamp the speed change before putting the new speed in vel2 + + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= N) return; + int idxTrue = boidArrayIndices[idx]; + glm::vec3 thisPos = pos[idxTrue]; + int cellIdx = boidGridIndices[idx]; + glm::ivec3 cellCoord = gridIndex1Dto3D(cellIdx, sideCount); + glm::vec3 vec = thisPos - gridMin; + float x = vec.x - cellWidth * cellCoord.x; + float y = vec.y - cellWidth * cellCoord.y; + float z = vec.z - cellWidth * cellCoord.z; + + int searchX = 0; + int searchY = 0; + int searchZ = 0; + + glm::vec3 outSpeed = vel1[idxTrue]; + + if (x < halfCellWidth && cellCoord.x > 0) searchX = -1; + if (x >= halfCellWidth && cellCoord.x < sideCount-1) searchX = 1; + if (y < halfCellWidth && cellCoord.y > 0) searchY = -1; + if (y >= halfCellWidth && cellCoord.y < sideCount-1) searchY = 1; + if (z < halfCellWidth && cellCoord.z > 0) searchZ = -1; + if (z >= halfCellWidth && cellCoord.z < sideCount-1) searchZ = 1; + + float sideCount2 = sideCount*sideCount; + outSpeed += gridComputeVelocityChangeScattered(idx, idxTrue, boidArrayIndices, cellIdx, sideCount, sideCount2, searchX, searchY, searchZ, gridCellStartIndices, gridCellEndIndices, pos, vel1); + + vel2[idxTrue] = glm::clamp(outSpeed, -maxSpeed, maxSpeed); } -__global__ void kernUpdateVelNeighborSearchCoherent( - int N, int gridResolution, glm::vec3 gridMin, - float inverseCellWidth, float cellWidth, - int *gridCellStartIndices, int *gridCellEndIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, - // except with one less level of indirection. - // This should expect gridCellStartIndices and gridCellEndIndices to refer - // directly to pos and vel1. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. 
- // - For each cell, read the start/end indices in the boid pointer array. - // DIFFERENCE: For best results, consider what order the cells should be - // checked in to maximize the memory benefits of reordering the boids data. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 +__device__ glm::vec3 gridComputeVelocityChangeCoherent8( + int idx, int searchCell, int sideCount, int sideCount2, int searchX, int searchY, int searchZ, + int* gridCellStartIndices, int* gridCellEndIndices, const glm::vec3* pos, const glm::vec3* vel) { + float num1 = 0.f, /*num2 = 0.f, */num3 = 0.f; + glm::vec3 thisPos = pos[idx]; + glm::vec3 thisVel = vel[idx]; + glm::vec3 rule1Center, rule2Vel, rule3Vel; + { + int startIdx = gridCellStartIndices[searchCell]; + int endIdx = gridCellEndIndices[searchCell]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[ii]; + glm::vec3 thatVel = vel[ii]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchX != 0) { + int startIdx = gridCellStartIndices[searchCell + searchX]; + int endIdx = gridCellEndIndices[searchCell + searchX]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[ii]; + glm::vec3 thatVel = vel[ii]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchX != 0 && searchY != 0) { + int startIdx = gridCellStartIndices[searchCell + searchX + searchY * sideCount]; + int endIdx = gridCellEndIndices[searchCell + searchX + searchY * sideCount]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[ii]; + glm::vec3 thatVel = vel[ii]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchX != 0 && searchY != 0 && searchZ != 0) { + int startIdx = gridCellStartIndices[searchCell +searchX + searchY * sideCount + searchZ * sideCount2]; + int endIdx = gridCellEndIndices[searchCell +searchX + searchY * sideCount + searchZ * sideCount2]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[ii]; + glm::vec3 thatVel = vel[ii]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchX != 0 && searchZ != 0) { + 
int startIdx = gridCellStartIndices[searchCell + searchX + searchZ * sideCount2]; + int endIdx = gridCellEndIndices[searchCell + searchX + searchZ * sideCount2]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[ii]; + glm::vec3 thatVel = vel[ii]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchY != 0) { + int startIdx = gridCellStartIndices[searchCell + searchY * sideCount]; + int endIdx = gridCellEndIndices[searchCell + searchY * sideCount]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[ii]; + glm::vec3 thatVel = vel[ii]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchZ != 0 && searchY != 0) { + int startIdx = gridCellStartIndices[searchCell +searchY * sideCount + searchZ * sideCount2]; + int endIdx = gridCellEndIndices[searchCell + searchY * sideCount + searchZ * sideCount2]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[ii]; + glm::vec3 thatVel = vel[ii]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + if (searchZ != 0) { + int startIdx = gridCellStartIndices[searchCell + searchZ * sideCount2]; + int endIdx = gridCellEndIndices[searchCell + searchZ * sideCount2]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[ii]; + glm::vec3 thatVel = vel[ii]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + + glm::vec3 outVel; + if (num1 > 0) outVel += (rule1Center / num1 - thisPos) * rule1Scale; + outVel += rule2Vel * rule2Scale; + if (num3 > 0) outVel += (rule3Vel / num3/* - thisVel*/) * rule3Scale; + return outVel; +} + + +__global__ void kernUpdateVelNeighborSearchCoherent8( + int N, int sideCount, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, float halfCellWidth, + int* gridCellStartIndices, int* gridCellEndIndices, int* boidGridIndices, + glm::vec3* pos, glm::vec3* vel1, glm::vec3* vel2) { + // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, + // except with one less level of indirection. + // This should expect gridCellStartIndices and gridCellEndIndices to refer + // directly to pos and vel1. + // - Identify the grid cell that this boid is in + // - Identify which cells may contain neighbors. 
This isn't always 8. + // - For each cell, read the start/end indices in the boid pointer array. + // DIFFERENCE: For best results, consider what order the cells should be + // checked in to maximize the memory benefits of reordering the boids data. + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + // - Clamp the speed change before putting the new speed in vel2 + + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= N) return; + glm::vec3 thisPos = pos[idx]; + int cellIdx = boidGridIndices[idx]; + glm::ivec3 cellCoord = gridIndex1Dto3D(cellIdx, sideCount); + glm::vec3 vec = thisPos - gridMin; + float x = vec.x - cellWidth * cellCoord.x; + float y = vec.y - cellWidth * cellCoord.y; + float z = vec.z - cellWidth * cellCoord.z; + + int searchX = 0; + int searchY = 0; + int searchZ = 0; + + glm::vec3 outSpeed = vel1[idx]; + + if (x < halfCellWidth && cellCoord.x > 0) searchX = -1; + if (x >= halfCellWidth && cellCoord.x < sideCount - 1) searchX = 1; + if (y < halfCellWidth && cellCoord.y > 0) searchY = -1; + if (y >= halfCellWidth && cellCoord.y < sideCount - 1) searchY = 1; + if (z < halfCellWidth && cellCoord.z > 0) searchZ = -1; + if (z >= halfCellWidth && cellCoord.z < sideCount - 1) searchZ = 1; + + outSpeed += gridComputeVelocityChangeCoherent8(idx, cellIdx, sideCount, sideCount * sideCount, searchX, searchY, searchZ, gridCellStartIndices, gridCellEndIndices, pos, vel1); + + vel2[idx] = glm::clamp(outSpeed, -maxSpeed, maxSpeed); +} + + +__device__ glm::vec3 gridComputeVelocityChangeCoherent27( + int idx, int searchCell, int maxCellIdx, int sideCount, int sideCount2, int* gridCellStartIndices, int* gridCellEndIndices, const glm::vec3* pos, const glm::vec3* vel) { + float num1 = 0.f, /*num2 = 0.f, */num3 = 0.f; + glm::vec3 thisPos = pos[idx]; + glm::vec3 thisVel = vel[idx]; + glm::vec3 rule1Center, rule2Vel, rule3Vel; + for (int dx = -1; dx <= 1; dx++) { + for (int dy = -1; dy <= 1; dy++) { + for (int dz = -1; dz <= 1; dz++) { + int cellIdx = searchCell + dx + dy * sideCount + dz * sideCount2; + if (cellIdx < 0 || cellIdx > maxCellIdx) return glm::vec3(); + int startIdx = gridCellStartIndices[cellIdx]; + int endIdx = gridCellEndIndices[cellIdx]; + if (startIdx == -1 || endIdx == -1) return glm::vec3(); //cell doesn't have boids + for (int ii = startIdx; ii <= endIdx; ii++) { + if (ii == idx) continue; + glm::vec3 thatPos = pos[ii]; + glm::vec3 thatVel = vel[ii]; + float dist = glm::distance(thatPos, thisPos); + if (dist < rule1Distance) { + num1 += 1.f; + rule1Center += thatPos; + } + if (dist < rule2Distance) { + rule2Vel -= thatPos - thisPos; + } + if (dist < rule3Distance) { + num3 += 1.f; + rule3Vel += thatVel; + } + } + } + } + } + + glm::vec3 outVel; + if (num1 > 0) outVel += (rule1Center / num1 - thisPos) * rule1Scale; + outVel += rule2Vel * rule2Scale; + if (num3 > 0) outVel += (rule3Vel / num3/* - thisVel*/) * rule3Scale; + return outVel; +} + +__global__ void kernUpdateVelNeighborSearchCoherent27( + int N, int sideCount, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, float halfCellWidth, + int* gridCellStartIndices, int* gridCellEndIndices, int* boidGridIndices, + glm::vec3* pos, glm::vec3* vel1, glm::vec3* vel2) { + // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, + // except with one less level of indirection. + // This should expect gridCellStartIndices and gridCellEndIndices to refer + // directly to pos and vel1. 
+ // - Identify the grid cell that this boid is in + // - Identify which cells may contain neighbors. This isn't always 8. + // - For each cell, read the start/end indices in the boid pointer array. + // DIFFERENCE: For best results, consider what order the cells should be + // checked in to maximize the memory benefits of reordering the boids data. + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + // - Clamp the speed change before putting the new speed in vel2 + + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= N) return; + glm::vec3 thisPos = pos[idx]; + int cellIdx = boidGridIndices[idx]; + + glm::vec3 outSpeed = vel1[idx]; + + float sideCount2 = sideCount * sideCount; + float maxCellIdx = sideCount2 * sideCount-1; + outSpeed += gridComputeVelocityChangeCoherent27(idx, cellIdx, maxCellIdx, sideCount, sideCount2, gridCellStartIndices, gridCellEndIndices, pos, vel1); + + vel2[idx] = glm::clamp(outSpeed, -maxSpeed, maxSpeed); +} + +__global__ void kernRearrangePosVel( + int N, int* boidArrayIndices, glm::vec3 *pos1, glm::vec3* pos2, glm::vec3* vel1, glm::vec3* vel2) { + int idx = (blockIdx.x * blockDim.x) + threadIdx.x; + if (idx >= N) return; + int idxTrue = boidArrayIndices[idx]; + pos2[idx] = pos1[idxTrue]; + vel2[idx] = vel1[idxTrue]; } /** * Step the entire N-body simulation by `dt` seconds. */ void Boids::stepSimulationNaive(float dt) { - // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. - // TODO-1.2 ping-pong the velocity buffers + // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. + // TODO-1.2 ping-pong the velocity buffers + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + kernUpdateVelocityBruteForce <<>> (numObjects, dev_pos1, dev_vel1, dev_vel2); + kernUpdatePos <<>> (numObjects, dt, dev_pos1, dev_vel2); + std::swap(dev_vel1, dev_vel2); } void Boids::stepSimulationScatteredGrid(float dt) { - // TODO-2.1 - // Uniform Grid Neighbor search using Thrust sort. - // In Parallel: - // - label each particle with its array index as well as its grid index. - // Use 2x width grids. - // - Unstable key sort using Thrust. A stable sort isn't necessary, but you - // are welcome to do a performance comparison. - // - Naively unroll the loop for finding the start and end indices of each - // cell's data pointers in the array of boid indices - // - Perform velocity updates using neighbor search - // - Update positions - // - Ping-pong buffers as needed + // TODO-2.1 + // Uniform Grid Neighbor search using Thrust sort. + // In Parallel: + // - label each boid with its array index as well as its grid index. + // Use 2x width grids. + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you + // are welcome to do a performance comparison. 
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  // - Perform velocity updates using neighbor search
+  // - Update positions
+  // - Ping-pong buffers as needed
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    gridSideCount,
+    gridMinimum,
+    gridinverseCellWidth,
+    dev_pos1,
+    dev_boidArrayIndices,
+    dev_boidGridIndices);
+
+  //int keys[6] = { 1, 4, 2, 8, 5, 7 };
+  //char values[6] = { 'a', 'b', 'c', 'd', 'e', 'f' };
+  //thrust::sort_by_key(thrust::host, keys, keys + 6, values);
+  // keys is now { 1, 2, 4, 5, 7, 8}
+  // values is now {'a', 'c', 'b', 'e', 'f', 'd'}
+  thrust::sort_by_key(*dev_thrust_boidGridIndices, *dev_thrust_boidGridIndices + numObjects, *dev_thrust_boidArrayIndices);
+  kernResetIntBuffer<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_gridCellEndIndices, -1);
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_boidGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects, gridSideCount, gridMinimum, gridinverseCellWidth, gridCellWidth, halfGridCellWidth, dev_gridCellStartIndices,
+    dev_gridCellEndIndices, dev_boidArrayIndices, dev_boidGridIndices, dev_pos1, dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos1, dev_vel2);
+  std::swap(dev_vel1, dev_vel2);
}

void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
+  // In Parallel:
+  // - Label each boid with its array index as well as its grid index.
+  //   Use 2x width grids
+  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+  //   are welcome to do a performance comparison.
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
+  //   the boid data in the simulation array.
+  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+  // - Perform velocity updates using neighbor search
+  // - Update positions
+  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
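+
+  // Editorial note on the ping-pong order used below: kernRearrangePosVel
+  // scatters the boid data into dev_pos2/dev_vel2 in cell-sorted order, so
+  // both the position and the velocity buffers are swapped immediately after
+  // it runs. From then on dev_pos1/dev_vel1 hold the coherent data, the
+  // neighbor-search kernel writes into dev_vel2, and a final swap makes
+  // dev_vel1 current again for the next frame.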
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects,
+    gridSideCount,
+    gridMinimum,
+    gridinverseCellWidth,
+    dev_pos1,
+    dev_boidArrayIndices,
+    dev_boidGridIndices);
+
+  thrust::sort_by_key(*dev_thrust_boidGridIndices, *dev_thrust_boidGridIndices + numObjects, *dev_thrust_boidArrayIndices);
+  kernRearrangePosVel<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_boidArrayIndices, dev_pos1, dev_pos2, dev_vel1, dev_vel2);
+  std::swap(dev_vel1, dev_vel2);
+  std::swap(dev_pos1, dev_pos2);
+
+  kernResetIntBuffer<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_gridCellEndIndices, -1);
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_boidGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+  if (!SEARCH_27)
+    kernUpdateVelNeighborSearchCoherent8<<<fullBlocksPerGrid, blockSize>>>(
+      numObjects, gridSideCount, gridMinimum, gridinverseCellWidth, gridCellWidth, halfGridCellWidth, dev_gridCellStartIndices,
+      dev_gridCellEndIndices, dev_boidGridIndices, dev_pos1, dev_vel1, dev_vel2);
+  else
+    kernUpdateVelNeighborSearchCoherent27<<<fullBlocksPerGrid, blockSize>>>(
+      numObjects, gridSideCount, gridMinimum, gridinverseCellWidth, gridCellWidth, halfGridCellWidth, dev_gridCellStartIndices,
+      dev_gridCellEndIndices, dev_boidGridIndices, dev_pos1, dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos1, dev_vel2);
+  std::swap(dev_vel1, dev_vel2);
}

void Boids::endSimulation() {
-  cudaFree(dev_vel1);
-  cudaFree(dev_vel2);
-  cudaFree(dev_pos);
+  cudaFree(dev_vel1);
+  cudaFree(dev_vel2);
+  cudaFree(dev_pos1);
+
+  // TODO-2.1 TODO-2.3 - Free any additional buffers here.

-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_boidArrayIndices);
+  cudaFree(dev_boidGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_pos2);
}

void Boids::unitTest() {
-  // LOOK-1.2 Feel free to write additional tests here.
-
-  // test unstable sort
-  int *dev_intKeys;
-  int *dev_intValues;
-  int N = 10;
-
-  std::unique_ptr<int[]>intKeys{ new int[N] };
-  std::unique_ptr<int[]>intValues{ new int[N] };
-
-  intKeys[0] = 0; intValues[0] = 0;
-  intKeys[1] = 1; intValues[1] = 1;
-  intKeys[2] = 0; intValues[2] = 2;
-  intKeys[3] = 3; intValues[3] = 3;
-  intKeys[4] = 0; intValues[4] = 4;
-  intKeys[5] = 2; intValues[5] = 5;
-  intKeys[6] = 2; intValues[6] = 6;
-  intKeys[7] = 0; intValues[7] = 7;
-  intKeys[8] = 5; intValues[8] = 8;
-  intKeys[9] = 6; intValues[9] = 9;
-
-  cudaMalloc((void**)&dev_intKeys, N * sizeof(int));
-  checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!");
-
-  cudaMalloc((void**)&dev_intValues, N * sizeof(int));
-  checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!");
-
-  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
-
-  std::cout << "before unstable sort: " << std::endl;
-  for (int i = 0; i < N; i++) {
-    std::cout << "  key: " << intKeys[i];
-    std::cout << " value: " << intValues[i] << std::endl;
-  }
-
-  // How to copy data to the GPU
-  cudaMemcpy(dev_intKeys, intKeys.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
-  cudaMemcpy(dev_intValues, intValues.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
-
-  // Wrap device vectors in thrust iterators for use with thrust.
-  thrust::device_ptr<int> dev_thrust_keys(dev_intKeys);
-  thrust::device_ptr<int> dev_thrust_values(dev_intValues);
-  // LOOK-2.1 Example for using thrust::sort_by_key
-  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
-
-  // How to copy data back to the CPU side from the GPU
-  cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost);
-  cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost);
-  checkCUDAErrorWithLine("memcpy back failed!");
-
-  std::cout << "after unstable sort: " << std::endl;
-  for (int i = 0; i < N; i++) {
-    std::cout << "  key: " << intKeys[i];
-    std::cout << " value: " << intValues[i] << std::endl;
-  }
-
-  // cleanup
-  cudaFree(dev_intKeys);
-  cudaFree(dev_intValues);
-  checkCUDAErrorWithLine("cudaFree failed!");
-  return;
+  // LOOK-1.2 Feel free to write additional tests here.
+
+  // test unstable sort
+  int* dev_intKeys;
+  int* dev_intValues;
+  int N = 10;
+
+  std::unique_ptr<int[]>intKeys{ new int[N] };
+  std::unique_ptr<int[]>intValues{ new int[N] };
+
+  intKeys[0] = 0; intValues[0] = 0;
+  intKeys[1] = 1; intValues[1] = 1;
+  intKeys[2] = 0; intValues[2] = 2;
+  intKeys[3] = 3; intValues[3] = 3;
+  intKeys[4] = 0; intValues[4] = 4;
+  intKeys[5] = 2; intValues[5] = 5;
+  intKeys[6] = 2; intValues[6] = 6;
+  intKeys[7] = 0; intValues[7] = 7;
+  intKeys[8] = 5; intValues[8] = 8;
+  intKeys[9] = 6; intValues[9] = 9;
+
+  cudaMalloc((void**)&dev_intKeys, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!");
+
+  cudaMalloc((void**)&dev_intValues, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!");
+
+  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
+
+  std::cout << "before unstable sort: " << std::endl;
+  for (int i = 0; i < N; i++) {
+    std::cout << "  key: " << intKeys[i];
+    std::cout << " value: " << intValues[i] << std::endl;
+  }
+
+  // How to copy data to the GPU
+  cudaMemcpy(dev_intKeys, intKeys.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
+  cudaMemcpy(dev_intValues, intValues.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
+
+  // Wrap device vectors in thrust iterators for use with thrust.
+  thrust::device_ptr<int> dev_thrust_keys(dev_intKeys);
+  thrust::device_ptr<int> dev_thrust_values(dev_intValues);
+  // LOOK-2.1 Example for using thrust::sort_by_key
+  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
+
+  // How to copy data back to the CPU side from the GPU
+  cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost);
+  cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost);
+  checkCUDAErrorWithLine("memcpy back failed!");
+
+  std::cout << "after unstable sort: " << std::endl;
+  for (int i = 0; i < N; i++) {
+    std::cout << "  key: " << intKeys[i];
+    std::cout << " value: " << intValues[i] << std::endl;
+  }
+
+  // cleanup
+  cudaFree(dev_intKeys);
+  cudaFree(dev_intValues);
+  checkCUDAErrorWithLine("cudaFree failed!");
+  return;
}
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..7badc15 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,27 +13,29 @@
// ================

// LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
-#define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define VISUALIZE 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1

// LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 2000000;
const float DT = 0.2f;

+unsigned int steps_in_run = 0;
+
/**
* C main function.
*/ int main(int argc, char* argv[]) { - projectName = "565 CUDA Intro: Boids"; - - if (init(argc, argv)) { - mainLoop(); - Boids::endSimulation(); - return 0; - } else { - return 1; - } + projectName = "565 CUDA Intro: Boids by Alex Fu"; + + if (init(argc, argv)) { + mainLoop(); + Boids::endSimulation(); + return 0; + } + else { + return 1; + } } //------------------------------- @@ -41,273 +43,302 @@ int main(int argc, char* argv[]) { //------------------------------- std::string deviceName; -GLFWwindow *window; +GLFWwindow* window; /** * Initialization of CUDA and GLFW. */ -bool init(int argc, char **argv) { - // Set window title to "Student Name: [SM 2.0] GPU Name" - cudaDeviceProp deviceProp; - int gpuDevice = 0; - int device_count = 0; - cudaGetDeviceCount(&device_count); - if (gpuDevice > device_count) { - std::cout - << "Error: GPU device number is greater than the number of devices!" - << " Perhaps a CUDA-capable GPU is not installed?" - << std::endl; - return false; - } - cudaGetDeviceProperties(&deviceProp, gpuDevice); - int major = deviceProp.major; - int minor = deviceProp.minor; - - std::ostringstream ss; - ss << projectName << " [SM " << major << "." << minor << " " << deviceProp.name << "]"; - deviceName = ss.str(); - - // Window setup stuff - glfwSetErrorCallback(errorCallback); - - if (!glfwInit()) { - std::cout - << "Error: Could not initialize GLFW!" - << " Perhaps OpenGL 3.3 isn't available?" - << std::endl; - return false; - } - - glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3); - glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3); - glfwWindowHint(GLFW_OPENGL_FORWARD_COMPAT, GL_TRUE); - glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE); - - window = glfwCreateWindow(width, height, deviceName.c_str(), NULL, NULL); - if (!window) { - glfwTerminate(); - return false; - } - glfwMakeContextCurrent(window); - glfwSetKeyCallback(window, keyCallback); - glfwSetCursorPosCallback(window, mousePositionCallback); - glfwSetMouseButtonCallback(window, mouseButtonCallback); - - glewExperimental = GL_TRUE; - if (glewInit() != GLEW_OK) { - return false; - } - - // Initialize drawing state - initVAO(); - - // Default to device ID 0. If you have more than one GPU and want to test a non-default one, - // change the device ID. - cudaGLSetGLDevice(0); - - cudaGLRegisterBufferObject(boidVBO_positions); - cudaGLRegisterBufferObject(boidVBO_velocities); - - // Initialize N-body simulation - Boids::initSimulation(N_FOR_VIS); - - updateCamera(); - - initShaders(program); - - glEnable(GL_DEPTH_TEST); - - return true; +bool init(int argc, char** argv) { + // Set window title to "Student Name: [SM 2.0] GPU Name" + cudaDeviceProp deviceProp; + int gpuDevice = 0; + int device_count = 0; + cudaGetDeviceCount(&device_count); + if (gpuDevice > device_count) { + std::cout + << "Error: GPU device number is greater than the number of devices!" + << " Perhaps a CUDA-capable GPU is not installed?" + << std::endl; + return false; + } + cudaGetDeviceProperties(&deviceProp, gpuDevice); + int major = deviceProp.major; + int minor = deviceProp.minor; + + std::ostringstream ss; + ss << "[" << N_FOR_VIS << " boids] [SM " << major << "." << minor << " " << deviceProp.name << "]"; + deviceName = ss.str(); + + // Window setup stuff + glfwSetErrorCallback(errorCallback); + + if (!glfwInit()) { + std::cout + << "Error: Could not initialize GLFW!" + << " Perhaps OpenGL 3.3 isn't available?" 
+ << std::endl; + return false; + } + + glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3); + glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3); + glfwWindowHint(GLFW_OPENGL_FORWARD_COMPAT, GL_TRUE); + glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE); + + window = glfwCreateWindow(width, height, deviceName.c_str(), NULL, NULL); + if (!window) { + glfwTerminate(); + return false; + } + glfwMakeContextCurrent(window); + glfwSetKeyCallback(window, keyCallback); + glfwSetCursorPosCallback(window, mousePositionCallback); + glfwSetMouseButtonCallback(window, mouseButtonCallback); + + glewExperimental = GL_TRUE; + if (glewInit() != GLEW_OK) { + return false; + } + + // Initialize drawing state + initVAO(); + + // Default to device ID 0. If you have more than one GPU and want to test a non-default one, + // change the device ID. + cudaGLSetGLDevice(0); + + cudaGLRegisterBufferObject(boidVBO_positions); + cudaGLRegisterBufferObject(boidVBO_velocities); + + // Initialize N-body simulation + Boids::initSimulation(N_FOR_VIS); + + updateCamera(); + + initShaders(program); + + glEnable(GL_DEPTH_TEST); + + return true; } void initVAO() { - std::unique_ptr bodies{ new GLfloat[4 * (N_FOR_VIS)] }; - std::unique_ptr bindices{ new GLuint[N_FOR_VIS] }; + std::unique_ptr bodies{ new GLfloat[4 * (N_FOR_VIS)] }; + std::unique_ptr bindices{ new GLuint[N_FOR_VIS] }; - glm::vec4 ul(-1.0, -1.0, 1.0, 1.0); - glm::vec4 lr(1.0, 1.0, 0.0, 0.0); + glm::vec4 ul(-1.0, -1.0, 1.0, 1.0); + glm::vec4 lr(1.0, 1.0, 0.0, 0.0); - for (int i = 0; i < N_FOR_VIS; i++) { - bodies[4 * i + 0] = 0.0f; - bodies[4 * i + 1] = 0.0f; - bodies[4 * i + 2] = 0.0f; - bodies[4 * i + 3] = 1.0f; - bindices[i] = i; - } + for (int i = 0; i < N_FOR_VIS; i++) { + bodies[4 * i + 0] = 0.0f; + bodies[4 * i + 1] = 0.0f; + bodies[4 * i + 2] = 0.0f; + bodies[4 * i + 3] = 1.0f; + bindices[i] = i; + } - glGenVertexArrays(1, &boidVAO); // Attach everything needed to draw a particle to this - glGenBuffers(1, &boidVBO_positions); - glGenBuffers(1, &boidVBO_velocities); - glGenBuffers(1, &boidIBO); + glGenVertexArrays(1, &boidVAO); // Attach everything needed to draw a particle to this + glGenBuffers(1, &boidVBO_positions); + glGenBuffers(1, &boidVBO_velocities); + glGenBuffers(1, &boidIBO); - glBindVertexArray(boidVAO); + glBindVertexArray(boidVAO); - // Bind the positions array to the boidVAO by way of the boidVBO_positions - glBindBuffer(GL_ARRAY_BUFFER, boidVBO_positions); // bind the buffer - glBufferData(GL_ARRAY_BUFFER, 4 * (N_FOR_VIS) * sizeof(GLfloat), bodies.get(), GL_DYNAMIC_DRAW); // transfer data + // Bind the positions array to the boidVAO by way of the boidVBO_positions + glBindBuffer(GL_ARRAY_BUFFER, boidVBO_positions); // bind the buffer + glBufferData(GL_ARRAY_BUFFER, 4 * (N_FOR_VIS) * sizeof(GLfloat), bodies.get(), GL_DYNAMIC_DRAW); // transfer data - glEnableVertexAttribArray(positionLocation); - glVertexAttribPointer((GLuint)positionLocation, 4, GL_FLOAT, GL_FALSE, 0, 0); + glEnableVertexAttribArray(positionLocation); + glVertexAttribPointer((GLuint)positionLocation, 4, GL_FLOAT, GL_FALSE, 0, 0); - // Bind the velocities array to the boidVAO by way of the boidVBO_velocities - glBindBuffer(GL_ARRAY_BUFFER, boidVBO_velocities); - glBufferData(GL_ARRAY_BUFFER, 4 * (N_FOR_VIS) * sizeof(GLfloat), bodies.get(), GL_DYNAMIC_DRAW); - glEnableVertexAttribArray(velocitiesLocation); - glVertexAttribPointer((GLuint)velocitiesLocation, 4, GL_FLOAT, GL_FALSE, 0, 0); + // Bind the velocities array to the boidVAO by way of the boidVBO_velocities + 
glBindBuffer(GL_ARRAY_BUFFER, boidVBO_velocities); + glBufferData(GL_ARRAY_BUFFER, 4 * (N_FOR_VIS) * sizeof(GLfloat), bodies.get(), GL_DYNAMIC_DRAW); + glEnableVertexAttribArray(velocitiesLocation); + glVertexAttribPointer((GLuint)velocitiesLocation, 4, GL_FLOAT, GL_FALSE, 0, 0); - glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, boidIBO); - glBufferData(GL_ELEMENT_ARRAY_BUFFER, (N_FOR_VIS) * sizeof(GLuint), bindices.get(), GL_STATIC_DRAW); + glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, boidIBO); + glBufferData(GL_ELEMENT_ARRAY_BUFFER, (N_FOR_VIS) * sizeof(GLuint), bindices.get(), GL_STATIC_DRAW); - glBindVertexArray(0); + glBindVertexArray(0); } -void initShaders(GLuint * program) { - GLint location; - - program[PROG_BOID] = glslUtility::createProgram( - "shaders/boid.vert.glsl", - "shaders/boid.geom.glsl", - "shaders/boid.frag.glsl", attributeLocations, 2); - glUseProgram(program[PROG_BOID]); - - if ((location = glGetUniformLocation(program[PROG_BOID], "u_projMatrix")) != -1) { - glUniformMatrix4fv(location, 1, GL_FALSE, &projection[0][0]); - } - if ((location = glGetUniformLocation(program[PROG_BOID], "u_cameraPos")) != -1) { - glUniform3fv(location, 1, &cameraPosition[0]); - } - } - - //==================================== - // Main loop - //==================================== - void runCUDA() { - // Map OpenGL buffer object for writing from CUDA on a single GPU - // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not - // use this buffer - - float4 *dptr = NULL; - float *dptrVertPositions = NULL; - float *dptrVertVelocities = NULL; - - cudaGLMapBufferObject((void**)&dptrVertPositions, boidVBO_positions); - cudaGLMapBufferObject((void**)&dptrVertVelocities, boidVBO_velocities); - - // execute the kernel - #if UNIFORM_GRID && COHERENT_GRID - Boids::stepSimulationCoherentGrid(DT); - #elif UNIFORM_GRID - Boids::stepSimulationScatteredGrid(DT); - #else - Boids::stepSimulationNaive(DT); - #endif - - #if VISUALIZE - Boids::copyBoidsToVBO(dptrVertPositions, dptrVertVelocities); - #endif - // unmap buffer object - cudaGLUnmapBufferObject(boidVBO_positions); - cudaGLUnmapBufferObject(boidVBO_velocities); - } - - void mainLoop() { - double fps = 0; - double timebase = 0; - int frame = 0; - - Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure - // your CUDA development setup is ready to go. 
- - while (!glfwWindowShouldClose(window)) { - glfwPollEvents(); - - frame++; - double time = glfwGetTime(); - - if (time - timebase > 1.0) { - fps = frame / (time - timebase); - timebase = time; - frame = 0; - } - - runCUDA(); - - std::ostringstream ss; - ss << "["; - ss.precision(1); - ss << std::fixed << fps; - ss << " fps] " << deviceName; - glfwSetWindowTitle(window, ss.str().c_str()); - - glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); - - #if VISUALIZE - glUseProgram(program[PROG_BOID]); - glBindVertexArray(boidVAO); - glPointSize((GLfloat)pointSize); - glDrawElements(GL_POINTS, N_FOR_VIS + 1, GL_UNSIGNED_INT, 0); - glPointSize(1.0f); - - glUseProgram(0); - glBindVertexArray(0); - - glfwSwapBuffers(window); - #endif - } - glfwDestroyWindow(window); - glfwTerminate(); - } - - - void errorCallback(int error, const char *description) { - fprintf(stderr, "error %d: %s\n", error, description); - } - - void keyCallback(GLFWwindow* window, int key, int scancode, int action, int mods) { - if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) { - glfwSetWindowShouldClose(window, GL_TRUE); - } - } - - void mouseButtonCallback(GLFWwindow* window, int button, int action, int mods) { - leftMousePressed = (button == GLFW_MOUSE_BUTTON_LEFT && action == GLFW_PRESS); - rightMousePressed = (button == GLFW_MOUSE_BUTTON_RIGHT && action == GLFW_PRESS); - } - - void mousePositionCallback(GLFWwindow* window, double xpos, double ypos) { - if (leftMousePressed) { - // compute new camera parameters - phi += (xpos - lastX) / width; - theta -= (ypos - lastY) / height; - theta = std::fmax(0.01f, std::fmin(theta, 3.14f)); - updateCamera(); - } - else if (rightMousePressed) { - zoom += (ypos - lastY) / height; - zoom = std::fmax(0.1f, std::fmin(zoom, 5.0f)); - updateCamera(); - } +void initShaders(GLuint* program) { + GLint location; + + program[PROG_BOID] = glslUtility::createProgram( + "shaders/boid.vert.glsl", + "shaders/boid.geom.glsl", + "shaders/boid.frag.glsl", attributeLocations, 2); + glUseProgram(program[PROG_BOID]); + + if ((location = glGetUniformLocation(program[PROG_BOID], "u_projMatrix")) != -1) { + glUniformMatrix4fv(location, 1, GL_FALSE, &projection[0][0]); + } + if ((location = glGetUniformLocation(program[PROG_BOID], "u_cameraPos")) != -1) { + glUniform3fv(location, 1, &cameraPosition[0]); + } +} + +//==================================== +// Main loop +//==================================== +void runCUDA(float deltaTime) { + // Map OpenGL buffer object for writing from CUDA on a single GPU + // No data is moved (Win & Linux). 
When mapped to CUDA, OpenGL should not + // use this buffer + + float4* dptr = NULL; + float* dptrVertPositions = NULL; + float* dptrVertVelocities = NULL; + + cudaGLMapBufferObject((void**)&dptrVertPositions, boidVBO_positions); + cudaGLMapBufferObject((void**)&dptrVertVelocities, boidVBO_velocities); + + // execute the kernel +#if UNIFORM_GRID && COHERENT_GRID + Boids::stepSimulationCoherentGrid(deltaTime); +#elif UNIFORM_GRID + Boids::stepSimulationScatteredGrid(deltaTime); +#else + Boids::stepSimulationNaive(deltaTime); +#endif + + steps_in_run++; + +#if VISUALIZE + Boids::copyBoidsToVBO(dptrVertPositions, dptrVertVelocities); +#endif + // unmap buffer object + cudaGLUnmapBufferObject(boidVBO_positions); + cudaGLUnmapBufferObject(boidVBO_velocities); +} + +void mainLoop() { + bool initFlag = true; + float lastTime = 0.f; + float fps = 0.f; + float refreshTimer = 0.f; + int frameCount = 0; + float timer = 0.f; + int frameCount2 = 0; + float _startTime = 1.f; + bool _startFlag = false; + bool _printFlag = false; + + + //Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure + // your CUDA development setup is ready to go. + + while (!glfwWindowShouldClose(window)) { + glfwPollEvents(); + + frameCount++; + float time = static_cast(glfwGetTime()); + float dt = time - lastTime; + refreshTimer += dt; + timer += dt; + if (refreshTimer > 0.5f) { + fps = frameCount / refreshTimer; + refreshTimer = 0.f; + frameCount = 0; + std::ostringstream ss; + ss << projectName; + ss << " ["; + ss.precision(1); + ss << std::fixed << fps; + ss << " fps] ["; + ss << steps_in_run; + ss << " steps] " << deviceName; + glfwSetWindowTitle(window, ss.str().c_str()); + //std::cout << timer << " " << fps << std::endl; + } + if (timer >= 1.f && timer < 11.f) { + if (!_startFlag) { + _startFlag = true; + _startTime = timer; + } + frameCount2++; + } + else if (timer >= 11.f && !_printFlag) { + _printFlag = true; + std::cout << "Average FPS from 1-11: " << (frameCount2 / (timer - _startTime)); + } + if (!initFlag) { + runCUDA(dt); + } + else initFlag = false; + glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); + +#if VISUALIZE + glUseProgram(program[PROG_BOID]); + glBindVertexArray(boidVAO); + glPointSize((GLfloat)pointSize); + glDrawElements(GL_POINTS, N_FOR_VIS + 1, GL_UNSIGNED_INT, 0); + glPointSize(1.0f); + + glUseProgram(0); + glBindVertexArray(0); + + glfwSwapBuffers(window); +#endif + + lastTime = time; + } + glfwDestroyWindow(window); + glfwTerminate(); +} + + +void errorCallback(int error, const char* description) { + fprintf(stderr, "error %d: %s\n", error, description); +} + +void keyCallback(GLFWwindow* window, int key, int scancode, int action, int mods) { + if (key == GLFW_KEY_ESCAPE && action == GLFW_PRESS) { + glfwSetWindowShouldClose(window, GL_TRUE); + } +} + +void mouseButtonCallback(GLFWwindow* window, int button, int action, int mods) { + leftMousePressed = (button == GLFW_MOUSE_BUTTON_LEFT && action == GLFW_PRESS); + rightMousePressed = (button == GLFW_MOUSE_BUTTON_RIGHT && action == GLFW_PRESS); +} + +void mousePositionCallback(GLFWwindow* window, double xpos, double ypos) { + if (leftMousePressed) { + // compute new camera parameters + phi += (xpos - lastX) / width; + theta -= (ypos - lastY) / height; + theta = std::fmax(0.01f, std::fmin(theta, 3.14f)); + updateCamera(); + } + else if (rightMousePressed) { + zoom += (ypos - lastY) / height; + zoom = std::fmax(0.1f, std::fmin(zoom, 5.0f)); + updateCamera(); + } lastX = xpos; lastY = ypos; - } +} - void updateCamera() { - 
cameraPosition.x = zoom * sin(phi) * sin(theta); - cameraPosition.z = zoom * cos(theta); - cameraPosition.y = zoom * cos(phi) * sin(theta); - cameraPosition += lookAt; +void updateCamera() { + cameraPosition.x = zoom * sin(phi) * sin(theta); + cameraPosition.z = zoom * cos(theta); + cameraPosition.y = zoom * cos(phi) * sin(theta); + cameraPosition += lookAt; - projection = glm::perspective(fovy, float(width) / float(height), zNear, zFar); - glm::mat4 view = glm::lookAt(cameraPosition, lookAt, glm::vec3(0, 0, 1)); - projection = projection * view; + projection = glm::perspective(fovy, float(width) / float(height), zNear, zFar); + glm::mat4 view = glm::lookAt(cameraPosition, lookAt, glm::vec3(0, 0, 1)); + projection = projection * view; - GLint location; + GLint location; - glUseProgram(program[PROG_BOID]); - if ((location = glGetUniformLocation(program[PROG_BOID], "u_projMatrix")) != -1) { - glUniformMatrix4fv(location, 1, GL_FALSE, &projection[0][0]); - } - } + glUseProgram(program[PROG_BOID]); + if ((location = glGetUniformLocation(program[PROG_BOID], "u_projMatrix")) != -1) { + glUniformMatrix4fv(location, 1, GL_FALSE, &projection[0][0]); + } +}
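
For the Part 3 measurements, the windowed FPS average printed from `mainLoop` can be cross-checked with GPU-side timing. Below is a minimal, illustrative sketch (not part of the patch above) of a `cudaEvent`-based timer wrapped around the simulation step: the helper name `measureAverageStepMs` is made up for this example, the choice of `Boids::stepSimulationCoherentGrid` is arbitrary, and it assumes `Boids::initSimulation` has already been called and that the `Boids::` step functions are declared as in `src/kernel.h`.

```
#include <cuda_runtime.h>
#include <cstdio>

#include "kernel.h"   // declares the Boids:: step functions

// Sketch: time `steps` simulation steps with CUDA events and return the
// average milliseconds per step. Event timing spans every kernel launched
// on the default stream between the two records.
static float measureAverageStepMs(int steps, float dt) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  for (int i = 0; i < steps; i++) {
    Boids::stepSimulationCoherentGrid(dt);   // or the scattered/naive variant
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);                // wait until all steps finish

  float elapsedMs = 0.0f;
  cudaEventElapsedTime(&elapsedMs, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return elapsedMs / steps;
}

// Example use, e.g. from main() after init and a few warm-up steps:
//   printf("average step time: %.3f ms\n", measureAverageStepMs(1000, DT));
```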