Automated GPU optimization pipeline for gRASPA, a CUDA Monte Carlo simulator for molecular adsorption.
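Given a specific simulation (inputs + force field + framework CIF), this pipeline produces a specialized, optimized binary that runs roughly 4-18% faster on the five bundled examples. Every change is verified automatically: move and molecule counts must match the baseline exactly, and energies must agree within a 1e-5 relative tolerance.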
This is a Claude Code skill. An agent (you or Claude Code) loads SKILL.md for full context, then runs the 3-phase pipeline on a test case.
# Setup
WORK=$HOME/autojit-work
mkdir -p "$WORK" && cd "$WORK"
cp -r <AUTOJIT_REPO>/scripts/* .
cp -r <AUTOJIT_REPO>/src .
mkdir -p test_cases baselines
cp -r /path/to/your/simulation test_cases/MyCase
# Reduce cycle count in test_cases/MyCase/simulation.input for fast iteration
# (e.g., NumberOfInitializationCycles 100000 instead of 2000000)
# Phase 1: Deterministic dead code removal (no AI)
python3 specialize.py test_cases/MyCase
# Phase 2: LLM-driven micro-optimizations (uses `claude` CLI, no API key needed)
bash phase2_cli.sh --cycles 30 --test-case MyCase
# (Optional) Phase 3: Structural optimizations
python3 run_phase3.py --task on-the-fly-translation --test-case MyCase
┌─────────────────────────────────────────────────────────────────────┐
│ Phase 1 (no AI) — Deterministic dead code stripping │
│ analyze.py → feature_manifest.json (what features are used) │
│ build_graph.py → strip_plan.json (safe code removals) │
│ specialize.py → verify+apply each strip (auto-revert on failure) │
│ Typical: 0-3% speedup │
├─────────────────────────────────────────────────────────────────────┤
│ Phase 2 (AI) — Per-kernel micro-optimizations │
│ phase2_cli.sh loops: │
│ propose 1 CUDA change → apply → compile → run → score │
│ keep if correct+faster, revert otherwise │
│ Typical: 5-10% speedup over Phase 1 │
├─────────────────────────────────────────────────────────────────────┤
│ Phase 3 (AI) — Structural multi-file changes │
│ run_phase3.py per task (CUDA graphs, kernel fusion, etc.) │
│ Multi-turn debug loop (up to 10 compile-fix iterations) │
│ Typical: 0-10% (highly case-dependent) │
└─────────────────────────────────────────────────────────────────────┘
The safety net: score.py verifies correctness after every change. Move counts must match exactly (same RNG seed → same moves), and energies must agree within a relative tolerance of 1e-5. Any failure triggers an auto-revert.
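The keep-or-revert rule from the Phase 2 box can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the logic of phase2_cli.sh: `Candidate`, `Evaluate`, and `run_cycles` are hypothetical names, each candidate is given as a complete source text, and the compile + run + score step is reduced to a single callable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class Candidate:
    description: str  # one-line summary of the proposed change
    source: str       # full source text after applying the change


# Hypothetical stand-in for compile + run + score:
# returns (passed_correctness_gate, runtime_in_seconds).
Evaluate = Callable[[str], Tuple[bool, float]]


def run_cycles(baseline_src: str, candidates: Iterable[Candidate], evaluate: Evaluate):
    """Keep a candidate only if it is correct AND faster than the current best."""
    best_src = baseline_src
    ok, best_time = evaluate(best_src)
    if not ok:
        raise ValueError("baseline must pass the correctness gate")

    log = []
    for cand in candidates:
        correct, runtime = evaluate(cand.source)
        if correct and runtime < best_time:
            best_src, best_time = cand.source, runtime
            log.append((cand.description, "kept"))
        else:
            # Reverting is implicit: best_src is simply left unchanged.
            reason = "incorrect" if not correct else "not faster"
            log.append((cand.description, f"reverted: {reason}"))
    return best_src, best_time, log
```

Because a rejected candidate never replaces `best_src`, a failed or slower change cannot affect later cycles.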
autojit-graspa/
├── README.md ← this file
├── SKILL.md ← main agent instructions (read this first)
├── scripts/ ← the pipeline scripts
│ ├── analyze.py # Phase 1a: feature detection
│ ├── build_graph.py # Phase 1b: code graph + strip plan
│ ├── strip.py # brace-counting transformer
│ ├── specialize.py # Phase 1c orchestrator
│ ├── score.py # correctness gate (6 checks)
│ ├── benchmark.sh # compile+run+score
│ ├── run_agent.py # Phase 2 via Anthropic API
│ ├── phase2_cli.sh # Phase 2 via `claude` CLI (no API key)
│ ├── phase3_cli.sh # Phase 3 via `claude` CLI
│ ├── run_phase3.py # Phase 3 via API
│ ├── analyze_logs.py # extract Phase 2→3 task ideas
│ ├── setup_specialize.sh # setup helper
│ ├── run_all.sh # batch multi-case runner
│ └── program.md # LLM instruction manual (appended to prompts)
├── src/ ← original gRASPA source (copied to working dir)
│ ├── *.cu, *.cpp, *.h, *.cuh (~30 files)
│ └── NVC_COMPILE # nvc++ compile script
└── examples/ ← fully-worked case studies
  ├── Ar_MgMOF74_UFF/ # +17.9% (Phase 1+2+3): VDW-only, kernel-launch bound
  │ ├── README.md # what happened, how to reproduce
  │ ├── input/ # test case files
  │ ├── feature_manifest.json # Phase 1a artifact
  │ ├── strip_plan.json # Phase 1b artifact
  │ ├── phase1_strip_log.jsonl # per-step log
  │ ├── REPORT.md # full optimization report
  │ ├── baseline_output.txt # reference for correctness
  │ └── optimized_src/ # final optimized source
  ├── CO2_MgMOF74_UFF/ # +9.1% (Phase 1+2): Ewald, large compute grid
  │ ├── README.md
  │ ├── input/
  │ ├── feature_manifest.json
  │ ├── strip_plan.json
  │ ├── phase1_strip_log.jsonl
  │ ├── phase2_experiment_log.jsonl
  │ ├── phase3_log.jsonl
  │ ├── PHASE2_NOTES.md
  │ ├── baseline_output.txt
  │ └── optimized_src/
  ├── CO2_MFI/ # +12% (Phase 1+2): Ewald, small zeolite
  │ ├── README.md
  │ ├── input/
  │ ├── feature_manifest.json
  │ ├── strip_plan.json
  │ ├── phase1_strip_log.jsonl
  │ ├── baseline_output.txt
  │ └── optimized_src/
  ├── Henrys_coefficient/ # +8.8% (Phase 1+2): driven by Codex CLI (not Claude)
  │ ├── README.md
  │ ├── input/
  │ ├── feature_manifest.json
  │ ├── strip_plan.json
  │ ├── phase1_strip_log.jsonl
  │ ├── phase2_experiment_log.jsonl
  │ ├── baseline_output.txt
  │ └── optimized_src/
  └── NU2000_pX_LinkerRotations/ # +3.9% (Phase 1+3): CUDA Graph capture win
    ├── README.md
    ├── input/ # includes Framework_Component_1.def (linker)
    ├── feature_manifest.json
    ├── strip_plan.json
    ├── phase1_strip_log.jsonl
    ├── baseline_output.txt
    └── optimized_src/
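The tree lists strip.py as a brace-counting transformer. The general idea can be sketched as follows: starting from an opening brace, track nesting depth until it returns to zero, so that a whole block can be removed in one piece. This is not the strip.py implementation; `find_block_end` and `strip_block` are hypothetical names, and the sketch assumes no braces appear inside strings, character literals, or comments, which a real transformer would have to handle.

```python
def find_block_end(src: str, open_idx: int) -> int:
    """Return the index of the '}' that matches the '{' at open_idx."""
    if src[open_idx] != "{":
        raise ValueError("open_idx must point at '{'")
    depth = 0
    for i in range(open_idx, len(src)):
        if src[i] == "{":
            depth += 1
        elif src[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces")


def strip_block(src: str, header: str) -> str:
    """Remove the first `header { ... }` block, nested braces included."""
    start = src.find(header)
    if start < 0:
        return src  # nothing to strip
    open_idx = src.find("{", start)
    if open_idx < 0:
        raise ValueError("no block after header")
    end = find_block_end(src, open_idx)
    return src[:start] + src[end + 1:]
```

Counting depth rather than searching for the next `}` is what keeps nested blocks inside the stripped region from ending the removal early.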
| Case | Driver | Adsorbate | Framework | Charges | Speedup | Winner |
|---|---|---|---|---|---|---|
| Ar_MgMOF74_UFF | Claude Code | Argon (1 atom) | MgMOF-74 1×1×1 | No | +17.9% | Phase 3 kernel elim |
| CO2_MFI | Claude Code | CO2 (3 atoms) | MFI zeolite 2×2×2 | Yes | +12% | Phase 2 micro-opts |
| CO2_MgMOF74_UFF | Claude Code | CO2 (3 atoms) | MgMOF-74 5×3×3 | Yes | +9.1% | Phase 2 micro-opts |
| Henrys_coefficient | Codex CLI | CO2 (3 atoms) | MgMOF-74 (Widom) | Yes | +8.8% | Phase 2 micro-opts |
| NU2000_pX_LinkerRotations | Claude Code | p-xylene (10 atoms) | NU-2000 4×2×2, linker rotations | Yes | +3.9% | Phase 3 CUDA Graph capture |
See examples/*/README.md for per-case details. The five cases span the optimization space: VDW-only vs Ewald, small vs large compute grids, simple translations vs special linker-rotation moves vs Widom-only — illustrating when each phase pays off.
The pipeline is agent-agnostic. Four examples were driven by Claude Code, one (Henrys_coefficient) by OpenAI Codex CLI. Both agents independently rediscovered the BlockEnergy/Blocksum dead-init class of optimizations, which suggests the wins reflect real inefficiencies in the code rather than LLM artifacts. See SKILL.md for the agent-agnostic Phase 2 contract.
- Linux with NVIDIA GPU
- NVIDIA HPC SDK with nvc++ (tested with 24.5). Path: `/opt/nvidia/hpc_sdk/Linux_x86_64/24.5/compilers/bin/nvc++`
- Python 3 (stdlib only, no pip packages required)
- For Phase 2/3 (any of these works; the pipeline is agent-agnostic):
  - `ANTHROPIC_API_KEY` for `run_agent.py` / `run_phase3.py`
  - Claude Code CLI as `claude` for `phase2_cli.sh` / `phase3_cli.sh`
  - OpenAI Codex CLI as `codex` (see `examples/Henrys_coefficient/README.md` for the loop pattern)
  - Any other agent that can read source files, apply edits, and emit a `DESCRIPTION:` line
Every optimization (deterministic or AI-proposed) is automatically verified via score.py before being kept. The 6 checks:
- Move counts exact (translation/rotation/insertion/deletion) — same RNG seed → same moves
- Molecule counts exact at end of simulation
- Energy tolerance 1e-5 relative on VDW / Coulomb / Ewald / tail correction
- Energy drift per-component < 3e-5
- GPU vs CPU drift not worse than baseline
- Structure factor consistency < 1e-4
If any check fails, the change is auto-reverted. No broken code is ever kept.
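The exact-count and relative-tolerance checks above can be sketched as a single gate function. This is a hedged illustration, not score.py itself: the dictionary layout (`move_counts`, `molecule_counts`, `energies`) and the name `check_result` are assumptions made for the example, and the drift and structure-factor checks are omitted.

```python
import math

REL_TOL = 1e-5  # relative energy tolerance quoted in the checklist


def check_result(baseline: dict, candidate: dict, rel_tol: float = REL_TOL) -> list:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []

    # Counts must match exactly: the same RNG seed should yield the same moves.
    for key in ("move_counts", "molecule_counts"):
        if baseline[key] != candidate[key]:
            failures.append(f"{key} differ: {baseline[key]} vs {candidate[key]}")

    # Energies are compared per component with a relative tolerance.
    for name, ref in baseline["energies"].items():
        val = candidate["energies"].get(name)
        if val is None or not math.isclose(val, ref, rel_tol=rel_tol, abs_tol=0.0):
            failures.append(f"energy {name}: {val} vs {ref}")

    return failures
```

A caller would treat any non-empty return value as a reason to revert the change.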
- You propose optimizations in plain language and Claude translates them to code. The LLM handles the nvc++ idioms, shared memory layout, warp-shuffle intrinsics, etc.
- Failed optimizations are logged with reasons, so the next cycle doesn't repeat them.
- You can walk away. Set `--cycles 30`, come back in an hour. Everything that stuck is in `best_src_<CASE>/`.
- Per-case tuning is automatic. The same pipeline works for single-atom VDW-only systems, multi-atom Ewald systems, large frameworks, and small frameworks.
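Logging failed attempts so the next cycle can avoid them can be sketched with an append-only JSONL file, the same file type as the `phase2_experiment_log.jsonl` artifacts in the examples. The record fields (`description`, `kept`, `reason`) and the function names are assumptions for illustration; the actual log schema is not shown in this README.

```python
import json
from pathlib import Path


def log_attempt(log_path: Path, description: str, kept: bool, reason: str = "") -> None:
    """Append one attempt as a single JSON line."""
    record = {"description": description, "kept": kept, "reason": reason}
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def failed_attempts(log_path: Path) -> list:
    """Summarize reverted attempts, e.g. for inclusion in the next prompt."""
    if not log_path.exists():
        return []
    summaries = []
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            rec = json.loads(line)
            if not rec["kept"]:
                summaries.append(f'{rec["description"]} ({rec["reason"]})')
    return summaries
```

One record per line keeps the log safe to append to from a long-running loop and easy to read back incrementally.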
- gRASPA — the simulator being optimized
- Claude Code — the AI coding CLI used for Phase 2/3