diff --git a/README.md b/README.md
index d63a6a1..1308cb7 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,41 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Eyad Almoamen
+  * [LinkedIn](https://www.linkedin.com/in/eyadalmoamen/), [personal website](https://eyadnabeel.com)
+* Tested on: Windows 11, i7-10750H CPU @ 2.60GHz 2.59 GHz 16GB, RTX 2070 Super Max-Q Design 6GB (Personal Computer)
 
-### (TODO: Your README)
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![](images/naivefull.gif)
+Naive Simulation
+
+![](images/uniformfull.gif)
+Uniform Grid Simulation
+
+![](images/coherentfull.gif)
+Coherent Grid Simulation
+
+![](images/novisualgraph.png)
+![](images/visualgraph.png)
+Performance as a function of # of Boids
+
+![](images/blocksizegraph.png)
+Performance as a function of Block Size
+
+**For each implementation, how does changing the number of boids affect performance? Why do you think this is?**
+
+Increasing the number of boids decreases performance in every implementation, though to varying degrees. In the naive simulation the dropoff is steep, because every boid checks every other boid, so the work per frame grows quadratically with the boid count. Performance also drops in the uniform and coherent grid implementations, but less sharply: each boid only checks the boids in nearby grid cells, and that number still grows as the space fills up.
+
+I did, however, note that at low boid counts the naive implementation tends to run a bit faster than the optimized implementations. I suspect that when the space is sparse, the overhead of building the grid (computing indices, sorting, and finding cell start and end indices) outweighs the savings from checking fewer boids.
+
+**For each implementation, how does changing the block count and block size affect performance? Why do you think this is?**
+
+Changing the block size and count didn't seem to affect performance all that much. I think this is due to the nature of the program: each thread runs essentially independently and the kernels don't use a large amount of memory, so there is little reason for the launch configuration to have a large effect.
+
+**For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?**
+
+Yes, at higher boid counts, the coherent uniform grid runs faster than the scattered uniform grid. I think this is due to the memory accesses we save by rearranging the position and velocity data up front: instead of looking up each neighbor's index in the index array for every grid cell a boid searches (which creates many redundant memory accesses), the kernel reads the boid data directly in cell order.
+
+**Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!**
+
+In my experience, searching 27 neighboring cells runs faster than searching 8. My suspicion is that boids outside the neighborhood distance are skipped either way, and that the computation needed to decide which 8 grid cells to search may outweigh the cost of traversing a few more grid cells.
\ No newline at end of file
diff --git a/images/blocksizegraph.png b/images/blocksizegraph.png
new file mode 100644
index 0000000..b9abfef
Binary files /dev/null and b/images/blocksizegraph.png differ
diff --git a/images/coherentfull.gif b/images/coherentfull.gif
new file mode 100644
index 0000000..3c3afbe
Binary files /dev/null and b/images/coherentfull.gif differ
diff --git a/images/coherentgif.gif b/images/coherentgif.gif
new file mode 100644
index 0000000..8a899b2
Binary files /dev/null and b/images/coherentgif.gif differ
diff --git a/images/naivefull.gif b/images/naivefull.gif
new file mode 100644
index 0000000..1c35b82
Binary files /dev/null and b/images/naivefull.gif differ
diff --git a/images/naivegif.gif b/images/naivegif.gif
new file mode 100644
index 0000000..c6d9241
Binary files /dev/null and b/images/naivegif.gif differ
diff --git a/images/novisualgraph.png b/images/novisualgraph.png
new file mode 100644
index 0000000..b09c780
Binary files /dev/null and b/images/novisualgraph.png differ
diff --git a/images/uniformfull.gif b/images/uniformfull.gif
new file mode 100644
index 0000000..278d5f1
Binary files /dev/null and b/images/uniformfull.gif differ
diff --git a/images/uniformgif.gif b/images/uniformgif.gif
new file mode 100644
index 0000000..a0e42cc
Binary files /dev/null and b/images/uniformgif.gif differ
diff --git a/images/visualgraph.png b/images/visualgraph.png
new file mode 100644
index 0000000..bd96f06
Binary files /dev/null and b/images/visualgraph.png differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..cc26f48 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -5,6 +5,10 @@
 #include <glm/glm.hpp>
 #include "utilityCore.hpp"
 #include "kernel.h"
+#include <device_launch_parameters.h>
+
+// Comment out to use the 8-cell neighbor search; when defined, the 27-cell search is used
+#define largegrid
 
 // LOOK-2.1 potentially useful for doing grid-based neighbor search
 #ifndef imax
@@ -86,6 +90,9 @@ int *dev_gridCellEndIndices;   // to this cell?
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
 
+glm::vec3* dev_posRearranged;
+glm::vec3* dev_velRearranged;
+
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
 int gridCellCount;
@@ -169,6 +176,27 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_posRearranged, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_posRearranged failed!");
+
+  cudaMalloc((void**)&dev_velRearranged, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_velRearranged failed!");
+
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+
   cudaDeviceSynchronize();
 }
 
@@ -231,9 +259,45 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 */
 __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
   // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+    glm::vec3 cmass(0.f, 0.f, 0.f);
+    int neighbors = 0;
+    for (int i = 0; i < N; i++) {
+        if (i != iSelf && glm::length(pos[i] - pos[iSelf]) < rule1Distance) {
+            cmass += pos[i];
+            neighbors++;
+        }
+    }
+    glm::vec3 rule1;
+    if (neighbors > 0) {
+        cmass /= neighbors;
+        rule1 = (cmass - pos[iSelf]) * rule1Scale;
+    }
+    else {
+        rule1 = glm::vec3(0.f, 0.f, 0.f);
+    }
+    
   // Rule 2: boids try to stay a distance d away from each other
+    glm::vec3 c(0.f, 0.f, 0.f);
+    for (int i = 0; i < N; i++) {
+        if (i != iSelf && glm::length(pos[i] - pos[iSelf]) < rule2Distance) {
+            c -= (pos[i] - pos[iSelf]);
+        }
+    }
+    glm::vec3 rule2 = c * rule2Scale;
   // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+    glm::vec3 cvel(0.f, 0.f, 0.f);
+    neighbors = 0;
+    for (int i = 0; i < N; i++) {
+        if (i != iSelf && glm::length(pos[i] - pos[iSelf]) < rule3Distance) {
+            cvel += vel[i];
+            neighbors++;
+        }
+    }
+    if (neighbors > 0) {
+        cvel /= neighbors;
+    }
+    glm::vec3 rule3 = cvel * rule3Scale;
+  return rule1 + rule2 + rule3 + vel[iSelf];
 }
 
 /**
@@ -245,6 +309,15 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   // Compute a new velocity based on pos and vel1
   // Clamp the speed
   // Record the new velocity into vel2. Question: why NOT vel1?
+    int idx = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (idx >= N) {
+        return;
+    }
+    glm::vec3 vel = computeVelocityChange(N, idx, pos, vel1);
+    if (glm::length(vel) > maxSpeed) {
+        vel = glm::normalize(vel) * maxSpeed;
+    }
+    vel2[idx] = vel;
 }
 
 /**
@@ -287,8 +360,19 @@ __global__ void kernComputeIndices(int N, int gridResolution,
   glm::vec3 *pos, int *indices, int *gridIndices) {
     // TODO-2.1
     // - Label each boid with the index of its grid cell.
+    int idx = threadIdx.x + (blockIdx.x * blockDim.x);
+
+    if (idx >= N) {
+        return;
+    }
+
+    glm::ivec3 gridCoords = ((pos[idx] - gridMin) * inverseCellWidth);
+
+    gridIndices[idx] = gridIndex3Dto1D(gridCoords.x, gridCoords.y, gridCoords.z, gridResolution);
+
     // - Set up a parallel array of integer indices as pointers to the actual
     //   boid data in pos and vel1/vel2
+    indices[idx] = idx;
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -306,6 +390,23 @@ __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+    int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    if (idx >= N) {
+        return;
+    } 
+
+    int gridIdx = particleGridIndices[idx];
+    int prevIdx = idx > 0 ? particleGridIndices[idx - 1] : -1;
+    int nextIdx = idx < N - 1 ? particleGridIndices[idx + 1] : -1;
+
+    if (gridIdx != prevIdx) {
+        gridCellStartIndices[gridIdx] = idx;
+    }
+
+    if (gridIdx != nextIdx) {
+        gridCellEndIndices[gridIdx] = idx;
+    }
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -322,6 +423,146 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+    int iSelf = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    if (iSelf >= N) {
+        return;
+    }
+        
+    // Kinetic Calculation Variables
+    int neighbors1 = 0, neighbors3 = 0;
+    glm::vec3 cmass(0.f, 0.f, 0.f);
+    glm::vec3 c(0.f, 0.f, 0.f);
+    glm::vec3 cvel(0.f, 0.f, 0.f);
+
+#ifdef largegrid
+
+    int iX = ((pos[iSelf].x - gridMin.x) * inverseCellWidth);
+    int iY = ((pos[iSelf].y - gridMin.y) * inverseCellWidth);
+    int iZ = ((pos[iSelf].z - gridMin.z) * inverseCellWidth);
+
+    for (int i = iX - 1; i < iX + 2; i++) {
+        for (int j = iY - 1; j < iY + 2; j++) {
+            for (int k = iZ - 1; k < iZ + 2; k++) {
+                if (i < 0 || i > gridResolution - 1) {
+                    continue;
+                }
+                else if (j < 0 || j > gridResolution - 1) {
+                    continue;
+                }
+                else if (k < 0 || k > gridResolution - 1) {
+                    continue;
+                }
+                int gridCellIdx = gridIndex3Dto1D(i, j, k, gridResolution);
+
+                if (gridCellStartIndices[gridCellIdx] == -1 || gridCellEndIndices[gridCellIdx] == -1) {
+                    continue;
+                }
+
+                for (int l = gridCellStartIndices[gridCellIdx]; 
+                    l <= gridCellEndIndices[gridCellIdx]; l++) {
+                    int particleIdx = particleArrayIndices[l];
+                    
+                    float dist = glm::length(pos[particleIdx] - pos[iSelf]);
+
+                    // Rule 1 Calculations
+                    if (particleIdx != iSelf && dist < rule1Distance) {
+                        cmass += pos[particleIdx];
+                        neighbors1++;
+                    }
+
+                    // Rule 2 Calculations
+                    if (particleIdx != iSelf && dist < rule2Distance) {
+                        c -= (pos[particleIdx] - pos[iSelf]);
+                    }
+
+                    // Rule 3 Calculations
+                    if (particleIdx != iSelf && dist < rule3Distance) {
+                        cvel += vel1[particleIdx];
+                        neighbors3++;
+                    }
+                }
+            }
+        }
+    }
+#else
+    float radius = imax(rule1Distance, imax(rule2Distance, rule3Distance));
+
+    int startX = ((pos[iSelf].x - radius - gridMin.x) * inverseCellWidth);
+    int startY = ((pos[iSelf].y - radius - gridMin.y) * inverseCellWidth);
+    int startZ = ((pos[iSelf].z - radius - gridMin.z) * inverseCellWidth);
+
+    int endX = ((pos[iSelf].x + radius - gridMin.x) * inverseCellWidth);
+    int endY = ((pos[iSelf].y + radius - gridMin.y) * inverseCellWidth);
+    int endZ = ((pos[iSelf].z + radius - gridMin.z) * inverseCellWidth);
+
+    for (int i = startX; i <= endX; i++) {
+        for (int j = startY; j <= endY; j++) {
+            for (int k = startZ; k <= endZ; k++) {
+                if (i < 0 || i > gridResolution - 1) {
+                    continue;
+                }
+                else if (j < 0 || j > gridResolution - 1) {
+                    continue;
+                }
+                else if (k < 0 || k > gridResolution - 1) {
+                    continue;
+                }
+                int gridCellIdx = gridIndex3Dto1D(i, j, k, gridResolution);
+
+                if (gridCellStartIndices[gridCellIdx] == -1 || gridCellEndIndices[gridCellIdx] == -1) {
+                    continue;
+                }
+
+                for (int l = gridCellStartIndices[gridCellIdx];
+                    l <= gridCellEndIndices[gridCellIdx]; l++) {
+                    int particleIdx = particleArrayIndices[l];
+
+                    float dist = glm::length(pos[particleIdx] - pos[iSelf]);
+
+                    // Rule 1 Calculations
+                    if (particleIdx != iSelf && dist < rule1Distance) {
+                        cmass += pos[particleIdx];
+                        neighbors1++;
+                    }
+
+                    // Rule 2 Calculations
+                    if (particleIdx != iSelf && dist < rule2Distance) {
+                        c -= (pos[particleIdx] - pos[iSelf]);
+                    }
+
+                    // Rule 3 Calculations
+                    if (particleIdx != iSelf && dist < rule3Distance) {
+                        cvel += vel1[particleIdx];
+                        neighbors3++;
+                    }
+                }
+            }
+        }
+    }
+#endif
+
+    glm::vec3 rule1;
+    if (neighbors1 > 0) {
+        cmass /= neighbors1;
+        rule1 = (cmass - pos[iSelf]) * rule1Scale;
+    }
+    else {
+        rule1 = glm::vec3(0.f, 0.f, 0.f);
+    }
+
+    glm::vec3 rule2 = c * rule2Scale;
+
+    if (neighbors3 > 0) {
+        cvel /= neighbors3;
+    }
+    glm::vec3 rule3 = cvel * rule3Scale;
+
+    glm::vec3 finalVel = rule1 + rule2 + rule3 + vel1[iSelf];
+    if (glm::length(finalVel) > maxSpeed) {
+        finalVel = glm::normalize(finalVel) * maxSpeed;
+    }
+    vel2[iSelf] = finalVel;    
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -341,6 +582,144 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   // - Access each boid in the cell and compute velocity change from
   //   the boids rules, if this boid is within the neighborhood distance.
   // - Clamp the speed change before putting the new speed in vel2
+
+    int iSelf = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    if (iSelf >= N) {
+        return;
+    }
+
+    // Kinetic Calculation Variables
+    int neighbors1 = 0, neighbors3 = 0;
+    glm::vec3 cmass(0.f, 0.f, 0.f);
+    glm::vec3 c(0.f, 0.f, 0.f);
+    glm::vec3 cvel(0.f, 0.f, 0.f);
+
+#ifdef largegrid
+    int iX = ((pos[iSelf].x - gridMin.x) * inverseCellWidth);
+    int iY = ((pos[iSelf].y - gridMin.y) * inverseCellWidth);
+    int iZ = ((pos[iSelf].z - gridMin.z) * inverseCellWidth);
+
+    for (int i = iX - 1; i < iX + 2; i++) {
+        for (int j = iY - 1; j < iY + 2; j++) {
+            for (int k = iZ - 1; k < iZ + 2; k++) {
+                if (i < 0 || i > gridResolution - 1) {
+                    continue;
+                }
+                else if (j < 0 || j > gridResolution - 1) {
+                    continue;
+                }
+                else if (k < 0 || k > gridResolution - 1) {
+                    continue;
+                }
+                int gridCellIdx = gridIndex3Dto1D(i, j, k, gridResolution);
+
+                if (gridCellStartIndices[gridCellIdx] == -1 || gridCellEndIndices[gridCellIdx] == -1) {
+                    continue;
+                }
+
+                for (int particleIdx = gridCellStartIndices[gridCellIdx];
+                    particleIdx <= gridCellEndIndices[gridCellIdx]; particleIdx++) {
+
+                    float dist = glm::length(pos[particleIdx] - pos[iSelf]);
+
+                    // Rule 1 Calculations
+                    if (particleIdx != iSelf && dist < rule1Distance) {
+                        cmass += pos[particleIdx];
+                        neighbors1++;
+                    }
+
+                    // Rule 2 Calculations
+                    if (particleIdx != iSelf && dist < rule2Distance) {
+                        c -= (pos[particleIdx] - pos[iSelf]);
+                    }
+
+                    // Rule 3 Calculations
+                    if (particleIdx != iSelf && dist < rule3Distance) {
+                        cvel += vel1[particleIdx];
+                        neighbors3++;
+                    }
+                }
+            }
+        }
+    }
+#else
+    float radius = imax(rule1Distance, imax(rule2Distance, rule3Distance));
+
+    int startX = ((pos[iSelf].x - radius - gridMin.x) * inverseCellWidth);
+    int startY = ((pos[iSelf].y - radius - gridMin.y) * inverseCellWidth);
+    int startZ = ((pos[iSelf].z - radius - gridMin.z) * inverseCellWidth);
+
+    int endX = ((pos[iSelf].x + radius - gridMin.x) * inverseCellWidth);
+    int endY = ((pos[iSelf].y + radius - gridMin.y) * inverseCellWidth);
+    int endZ = ((pos[iSelf].z + radius - gridMin.z) * inverseCellWidth);
+
+    for (int i = startX; i <= endX; i++) {
+        for (int j = startY; j <= endY; j++) {
+            for (int k = startZ; k <= endZ; k++) {
+                if (i < 0 || i > gridResolution - 1) {
+                    continue;
+                }
+                else if (j < 0 || j > gridResolution - 1) {
+                    continue;
+                }
+                else if (k < 0 || k > gridResolution - 1) {
+                    continue;
+                }
+                int gridCellIdx = gridIndex3Dto1D(i, j, k, gridResolution);
+
+                if (gridCellStartIndices[gridCellIdx] == -1 || gridCellEndIndices[gridCellIdx] == -1) {
+                    continue;
+                }
+
+                for (int particleIdx = gridCellStartIndices[gridCellIdx];
+                    particleIdx <= gridCellEndIndices[gridCellIdx]; particleIdx++) {
+
+                    float dist = glm::length(pos[particleIdx] - pos[iSelf]);
+
+                    // Rule 1 Calculations
+                    if (particleIdx != iSelf && dist < rule1Distance) {
+                        cmass += pos[particleIdx];
+                        neighbors1++;
+                    }
+
+                    // Rule 2 Calculations
+                    if (particleIdx != iSelf && dist < rule2Distance) {
+                        c -= (pos[particleIdx] - pos[iSelf]);
+                    }
+
+                    // Rule 3 Calculations
+                    if (particleIdx != iSelf && dist < rule3Distance) {
+                        cvel += vel1[particleIdx];
+                        neighbors3++;
+                    }
+                }
+            }
+        }
+    }
+#endif
+
+    glm::vec3 rule1;
+    if (neighbors1 > 0) {
+        cmass /= neighbors1;
+        rule1 = (cmass - pos[iSelf]) * rule1Scale;
+    }
+    else {
+        rule1 = glm::vec3(0.f, 0.f, 0.f);
+    }
+
+    glm::vec3 rule2 = c * rule2Scale;
+
+    if (neighbors3 > 0) {
+        cvel /= neighbors3;
+    }
+    glm::vec3 rule3 = cvel * rule3Scale;
+
+    glm::vec3 finalVel = rule1 + rule2 + rule3 + vel1[iSelf];
+    if (glm::length(finalVel) > maxSpeed) {
+        finalVel = glm::normalize(finalVel) * maxSpeed;
+    }
+    vel2[iSelf] = finalVel;
 }
 
 /**
@@ -348,7 +727,14 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
 */
 void Boids::stepSimulationNaive(float dt) {
   // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    kernUpdateVelocityBruteForce << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_pos, dev_vel1, dev_vel2);
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+
   // TODO-1.2 ping-pong the velocity buffers
+    glm::vec3* temp = dev_vel1;
+    dev_vel1 = dev_vel2;
+    dev_vel2 = temp;
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
@@ -357,31 +743,101 @@ void Boids::stepSimulationScatteredGrid(float dt) {
   // In Parallel:
   // - label each particle with its array index as well as its grid index.
   //   Use 2x width grids.
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    dim3 gridCellThreads((gridCellCount + blockSize - 1) / blockSize);
+
+    kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount,
+        gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
   // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
   //   are welcome to do a performance comparison.
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
   // - Naively unroll the loop for finding the start and end indices of each
   //   cell's data pointers in the array of boid indices
+    kernResetIntBuffer << <gridCellThreads, blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1);
+    kernResetIntBuffer << <gridCellThreads, blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);
+    kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
   // - Perform velocity updates using neighbor search
+    kernUpdateVelNeighborSearchScattered << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, 
+        gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, 
+        dev_pos, dev_vel1, dev_vel2);
+
   // - Update positions
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+
   // - Ping-pong buffers as needed
+    glm::vec3* temp = dev_vel1;
+    dev_vel1 = dev_vel2;
+    dev_vel2 = temp;
+}
+
+__global__ void kernSortBuffers(
+    int N, int* particleArrayIndices,
+    glm::vec3* pos, glm::vec3* vel,
+    glm::vec3* posRearranged, glm::vec3* velRearranged) {
+    int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    if (idx >= N) {
+        return;
+    }
+
+    posRearranged[idx] = pos[particleArrayIndices[idx]];
+    velRearranged[idx] = vel[particleArrayIndices[idx]];
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+    // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+    // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
+    // In Parallel:
+    // - Label each particle with its array index as well as its grid index.
+    //   Use 2x width grids
+    // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+    //   are welcome to do a performance comparison.
+    // - Naively unroll the loop for finding the start and end indices of each
+    //   cell's data pointers in the array of boid indices
+    // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
+    //   the particle data in the simulation array.
+    //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+    // - Perform velocity updates using neighbor search
+    // - Update positions
+    // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    dim3 gridCellThreads((gridCellCount + blockSize - 1) / blockSize);
+
+    kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount,
+        gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+    // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+    //   are welcome to do a performance comparison.
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    // - Naively unroll the loop for finding the start and end indices of each
+    //   cell's data pointers in the array of boid indices
+    kernResetIntBuffer << <gridCellThreads, blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1);
+    kernResetIntBuffer << <gridCellThreads, blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);
+    kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+
+    // - Reshuffle the position and velocity data so boids in the same cell are contiguous
+    kernSortBuffers << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_pos, dev_vel1, dev_posRearranged, dev_velRearranged);
+    // - Perform velocity updates using neighbor search
+    kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum,
+        gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices,
+        dev_posRearranged, dev_velRearranged, dev_vel2);
+
+    // - Update positions
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_posRearranged, dev_vel2);
+
+    // - Ping-pong buffers as needed
+    glm::vec3* temp = dev_vel1;
+    dev_vel1 = dev_vel2;
+    dev_vel2 = temp;
+
+    temp = dev_pos;
+    dev_pos = dev_posRearranged;
+    dev_posRearranged = temp;
 }
 
 void Boids::endSimulation() {
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..a6811a3 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -7,6 +7,7 @@
 */
 
 #include "main.hpp"
+#include <iostream>
 
 // ================
 // Configuration
@@ -14,11 +15,11 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
+#define UNIFORM_GRID 1
 #define COHERENT_GRID 0
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 10000;
 const float DT = 0.2f;
 
 /**
@@ -217,6 +218,9 @@ void initShaders(GLuint * program) {
     double timebase = 0;
     int frame = 0;
 
+    double totalfps = 0.0;
+    int totaltime = 0;
+
     Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
                        // your CUDA development setup is ready to go.
 
@@ -225,12 +229,15 @@ void initShaders(GLuint * program) {
 
       frame++;
       double time = glfwGetTime();
+      totaltime ++;
 
       if (time - timebase > 1.0) {
         fps = frame / (time - timebase);
         timebase = time;
         frame = 0;
       }
+      
+      totalfps += fps;
 
       runCUDA();
 
@@ -258,8 +265,10 @@ void initShaders(GLuint * program) {
     }
     glfwDestroyWindow(window);
     glfwTerminate();
-  }
+  
+    std::cout << "Average FPS: " << totalfps / totaltime << std::endl;
 
+  }
 
   void errorCallback(int error, const char *description) {
     fprintf(stderr, "error %d: %s\n", error, description);