A massively parallel solver for the board game Pentago. Results at https://perfect-pentago.net.
Use bin/bazel which wraps bazelisk with sandbox-friendly cache directories:
bin/bazel build -c opt //pentago/...
bin/bazel test -c opt //pentago/...
Always use -c opt for tests — some are slow without optimizations.
On macOS, when running bin/bazel via the Bash tool, always use dangerouslyDisableSandbox: true. Claude Code's permissions.allow doesn't bypass the OS-level macOS sandbox (sandbox-exec), which blocks Bazel from writing to its output base and binding to localhost.
- `pentago/utility/` — general utilities (threads, arrays, memory, etc.)
- `pentago/base/` — core game logic (boards, moves, symmetries, scoring)
- `pentago/search/` — forward tree search
- `pentago/data/` — file I/O, compression, serialization
- `pentago/end/` — backward endgame solver (retrograde analysis)
- `pentago/mid/` — midgame solver (18+ stones)
- `pentago/high/` — high-level public API
- `pentago/mpi/` — MPI distributed computation
- `pentago/learn/` — ML (TensorFlow ops)
- `third_party/` — external dependency BUILD files and detection (.bzl)
- `web/server/` — Node.js backend (Google Cloud Functions)
- `web/client/` — Svelte + WebAssembly frontend
Always wait for explicit user approval before running git commit. Show the
user what changed and let them review. Never auto-commit after implementing,
even if all tests pass.
- C++20, `-Wall -Werror`
- Compiler-specific flags use `select()` on `@platforms//os:macos` (Clang) vs `//conditions:default` (GCC)
- Common copts in `pentago/pentago.bzl` (COPTS), test helper `cc_tests()`
- Use `tfm::format` (tinyformat), not bare `format`
- Prefer fixing root causes over suppressing warnings
- Preserve existing comments when rewriting files — don't silently delete comments, speed logs, or explanatory notes
- Don't add includes speculatively — only if the build actually fails
- System includes go after project includes, sorted alphabetically
- Mark const arguments as const
- In tests, use `PENTAGO_ASSERT_EQ/LT/GT/LE/GE` instead of gtest `ASSERT_EQ` etc. to avoid sign-compare and dangling-else warnings
- Never use the `std::` prefix when a `using` declaration suffices
- Prefer `Array`/`RawArray` over `vector` for POD types
- Even trivial destructors should be declared in the header and defined in the .cc to reduce code size
- Order function arguments with slowly-varying parameters first
- Use unnamed namespaces for file-local types; use `static` for file-local functions
- Do not commit until the user has reviewed the code
- No `__attribute__((target(...)))` — we compile with `-march=native` (in COPTS) so AVX2 is always available
- Don't allocate large intermediate buffers when direct access suffices
- Merge consecutive `GEODE_ASSERT`s into one (each assert has overhead, so `GEODE_ASSERT(a && b)` is better than two separate calls)
- Validate untrusted input upfront (e.g. assert stream lengths sum correctly) rather than clamping during use
- When changing a serialization format, commit the format change first with determinism hashes, then optimize — hashes must not change during optimization
- `-march=native` is already in COPTS in `pentago/pentago.bzl`, so `--copt=-march=native` is not needed on the command line
- Simplify loops: iterate over containers directly instead of indices into them, and avoid intermediate variables when the expression is clear (e.g. `for (const auto& r : readers)` not `for (const int i : range(readers.size())) { ... readers[i] ... }`)
Write scratch/throwaway C++ programs to tmp/ in the repo (not /tmp). The sandbox allows writing within the repo but blocks /tmp/claude-1000 for g++ output. Compile and run from there:
g++ -O2 -std=c++20 -march=native -o tmp/foo tmp/foo.cc && tmp/foo
Use perf for line-level profiling. Requires sudo sysctl kernel.perf_event_paranoid=-1 first (sandbox blocks this, so the user must run it via !). Then:
perf record -g -o /tmp/claude-1000/perf.data bazel-bin/pentago/data/some_test --gtest_filter='...'
perf annotate -i /tmp/claude-1000/perf.data -s 'pentago::function_name' --stdio
perf report -i /tmp/claude-1000/perf.data --stdio --sort=symbol --no-children
The perf record command must run with dangerouslyDisableSandbox: true in Claude Code. perf report and perf annotate can run inside the sandbox.
- On Cascade Lake (this machine), `vpmulld` (`mullo_epi32`) is 10 cycles latency — the dominant bottleneck in rANS encode/decode. On Zen2+ it's 5 cycles.
- `permutevar8x32` (3c) is better than `cmpeq`+`blendv` chains for 3-element table lookups. Don't try to replace it with comparison-based selection — more ops at lower latency still loses.
- Derive values instead of looking them up when cheap: e.g. `xmax = freq << 17` saves one `permutevar`.
- For encoder renorm (emit bytes), SIMD extract+blend beats scalar spill/reload because `state8` feeds directly into the encode step.
- For decoder renorm (read bytes), scalar spill/reload beats SIMD cmpeq/blend per lane — the byte reads are inherently scalar anyway.
- Scatter-write to precomputed offsets (8 indexed byte stores per group) is cheaper than a bulk 160-byte transpose pass. Same for scatter-read on the encoder side.
- Benchmark with min-of-N iterations (N=10) for stable numbers. Single runs have ~15% noise on this machine.
- `#pragma GCC unroll N` is critical for loops over `constexpr` arrays in SIMD code — GCC won't constant-fold array indices through unrolled iterations without it. Measured 44% speedup for forward8.
- Write tests that measure the actual use case, not proxy metrics. E.g. batch diversity (how many distinct positions per training batch) is more relevant than chi-squared uniformity for ML training quality.
- Profile with `perf annotate` before guessing at bottlenecks — intuition about what's slow is often wrong (e.g. the 160-byte transpose was assumed cheap but was 15% of decoder time).