A blog-style journey through systems programming performance tuning and clean code design.
This project began as a weekend experiment with The Ray Tracer Challenge book and grew into something far more ambitious: building a fast maintainable and extensible raytracer in modern C. The primary goals are:
This project is written in C++20 and has been extensively tested using the Google Test framework.
- Learn-by-building: From math to materials implement it all from scratch.
- Optimize deeply: Understand where performance goesnot just guess.
- Apply real systems techniques: memory locality vectorization threading API design.
This README doubles as a technical devlog documenting the projects evolution.
At its core raytracing is about tracing rays into a scene and computing intersections:
const auto a ray.direction_.dot(ray.direction_)
const auto b 2 ray.direction_.dot(sphereToRay)
const auto c sphereToRay.dot(sphereToRay) - radius radius
const auto discriminant b b - 4 a c
The early version followed a traditional object-oriented structure:
libraries/
Utility/ # Math primitives (Tuple Matrix Ray)
Geometry/ # Shapes: Sphere Plane Cube etc.
Material/ # Surface properties lighting
Pattern/ # Procedural textures
Scene/ # Camera world lights
Canvas/ # Image output
The focus was on correctness and readability. Each shape had its own intersect()
and normalAt()
methods and all scene logic flowed from that. Performance was... acceptable for small images.
A simple single-threaded render loop:
Canvas Camera::render(const World world)
Canvas image(width_ height_)
for (size_t y 0 y height_ y)
for (size_t x 0 x width_ x)
Ray ray rayForPixel(x y)
Color color world.colorAt(ray)
image.pixelWrite(color x y)
return image
This worked great... until scenes grew past 200x200 resolution. Render times ballooned and it was clear that naive design wouldnt scale.
Performance issues arent solved by guesswork. I used samply
Firefox Profiler for detailed traces.
g -O3 -g -marchnative -fno-omit-frame-pointer ...
samply record ./TestPrograms/ConcentricGlassSpheres profile.json
open https://profiler.firefox.com/
-
Excessive memory allocations
- Millions of
std::vectorIntersection
constructions per frame. - Many created populated and discarded without reuse.
- Millions of
-
Function call overhead
- Small math functions like
dot()
andmagnitude()
werent inlined. - Virtual functions slowed things down in hot loops.
- Compiler couldnt auto-vectorize due to abstraction.
- Small math functions like
The raytracers pixel loop is embarrassingly parallel. I added basic multithreading with the STL parallel execution policies:
std::vectorsize_t pixelIndices(width_ height_)
std::iota(pixelIndices.begin() pixelIndices.end() 0)
std::for_each(std::execution::par_unseq
pixelIndices.begin() pixelIndices.end()
[this image world](auto idx)
size_t x idx width_
size_t y idx / width_
std::vectorIntersection intersections
Ray ray rayForPixel(x y)
Color color world.colorAt(ray intersections)
image.pixelWrite(color x y)
)
Key decisions:
- Each thread uses its own local buffer (no shared vectors).
- Avoided heap contention and false sharing.
- No need for locks or mutexes.
This was a quick win and provided a visible speedup immediately.
This part of the project is still under construction but heres the current plan and progress.
Every intersection function used to return std::vectorIntersection
resulting in constant heap churn. For a 500x500 render thats hundreds of thousands of allocations.
Redesign all intersection APIs to use output parameters and vector reuse.
Also moving toward arena allocation per-thread for transient allocations.
- Refactor all shape classes to use output vectors.
- Add a thread-local arena allocator.
- Eliminate redundant sorting in grouped shapes.
- Benchmark post-refactor.
Generic algorithms like std::transform_reduce
are greatbut only when the compiler can inline and optimize them. For small fixed-size vectors (like Tuple4
) manual math is better.
Converted Tuple
operations to inline header-only methods:
inline double dot(const Tuple rhs) const noexcept
return x()rhs.x() y()rhs.y() z()rhs.z() w()rhs.w()
This enabled:
- Inlining
- Compiler auto-vectorization
- Better register allocation
- Fewer cache misses
Planned improvements:
- Add bounding volume hierarchies (BVH) for accelerated geometry traversal.
- Support triangle meshes and complex imported models.
- Add a per-thread arena allocator for temporary structures.
- Replace STL execution policies with a custom thread pool.
- Add support for environment maps and physically based materials.
- Explore real-time acceleration using the a GPU backend like CUDA or Metal.
This raytracer supports:
- Reflections and refractions
- Geometric primitives: spheres cubes planes cones cylinders
- Procedural patterns: stripes gradients checkers
- Scene graphs with grouped transforms
TODO: Add screenshots
# Render example
./TestPrograms/ConcentricGlassSpheres
- CMake 3.20
- C++20 compiler (tested with GCC Clang)
- External dependencies: Google Test, oneTBB
git clone https://github.com/yourusername/raytracer
cd raytracer
mkdir build cd build
cmake -DCMAKE_BUILD_TYPERelease ..
make -j(nproc)
Raytracer/
libraries/
Utility/
Geometry/
Material/
Pattern/
Scene/
Canvas/
TestPrograms/
test/
CMakeLists.txt
README.md
What started as a learning project became an opportunity to think like a performance engineer. Some takeaways:
- Abstractions are helpfulbut in hot paths you pay for what you use.
- Memory layout cache locality and reuse matter more than people think.
- Profiling reveals real problems. Always measure.
If youre interested in systems programming rendering or performance work raytracing is a surprisingly deep and rewarding project.