AutoJIT-gRASPA

Automated GPU optimization pipeline for gRASPA, a CUDA Monte Carlo simulator for molecular adsorption.

Given a specific simulation (inputs + force field + framework CIF), this pipeline produces a specialized, optimized binary that typically runs 5-20% faster (3.9-17.9% in the case studies below). Every change is automatically verified against the baseline: move counts must match exactly and energies must agree within tolerance.

Quick Start

This is a Claude Code skill. An agent (you or Claude Code) loads SKILL.md for full context, then runs the 3-phase pipeline on a test case.

# Setup
AUTOJIT_REPO=/path/to/autojit-graspa   # your clone of this repository
WORK=$HOME/autojit-work
mkdir -p "$WORK" && cd "$WORK"
cp -r "$AUTOJIT_REPO"/scripts/* .
cp -r "$AUTOJIT_REPO"/src .
mkdir -p test_cases baselines
cp -r /path/to/your/simulation test_cases/MyCase

# Reduce cycle count in test_cases/MyCase/simulation.input for fast iteration
# (e.g., NumberOfInitializationCycles 100000 instead of 2000000)

# Phase 1: Deterministic dead code removal (no AI)
python3 specialize.py test_cases/MyCase

# Phase 2: LLM-driven micro-optimizations (uses `claude` CLI, no API key needed)
bash phase2_cli.sh --cycles 30 --test-case MyCase

# (Optional) Phase 3: Structural optimizations
python3 run_phase3.py --task on-the-fly-translation --test-case MyCase

The 3-Phase Pipeline

┌─────────────────────────────────────────────────────────────────────┐
│ Phase 1 (no AI) — Deterministic dead code stripping                 │
│   analyze.py  →  feature_manifest.json (what features are used)     │
│   build_graph.py → strip_plan.json (safe code removals)             │
│   specialize.py → verify+apply each strip (auto-revert on failure)  │
│   Typical: 0-3% speedup                                             │
├─────────────────────────────────────────────────────────────────────┤
│ Phase 2 (AI) — Per-kernel micro-optimizations                       │
│   phase2_cli.sh loops:                                              │
│     propose 1 CUDA change → apply → compile → run → score           │
│     keep if correct+faster, revert otherwise                        │
│   Typical: 5-10% speedup over Phase 1                               │
├─────────────────────────────────────────────────────────────────────┤
│ Phase 3 (AI) — Structural multi-file changes                        │
│   run_phase3.py per task (CUDA graphs, kernel fusion, etc.)         │
│   Multi-turn debug loop (up to 10 compile-fix iterations)           │
│   Typical: 0-10% (highly case-dependent)                            │
└─────────────────────────────────────────────────────────────────────┘

The safety net: score.py checks correctness against the baseline after every change. Move counts must match exactly (same RNG seed → same moves), and energies must agree to within a relative tolerance of 1e-5. Any failure triggers an automatic revert.
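
The keep-or-revert rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in phase2_cli.sh: the `Result` fields, the `evaluate` callback, and the in-memory `Source` dict (file name to contents) are all hypothetical and exist only to make the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Source = Dict[str, str]  # hypothetical: file name -> file contents


@dataclass
class Result:
    correct: bool    # did the correctness gate pass?
    seconds: float   # wall-clock time of the benchmark run


def optimize(
    source: Source,
    proposals: List[Callable[[Source], Source]],
    evaluate: Callable[[Source], Result],
) -> Tuple[Source, List[bool]]:
    """Apply proposals one at a time; keep each only if correct and faster."""
    best = dict(source)
    best_time = evaluate(best).seconds
    kept = []
    for propose in proposals:
        candidate = propose(dict(best))  # work on a copy so reverting is free
        result = evaluate(candidate)
        if result.correct and result.seconds < best_time:
            best, best_time = candidate, result.seconds
            kept.append(True)
        else:
            kept.append(False)           # revert: 'best' is left untouched
    return best, kept
```

Each proposal is measured against the current best rather than the original baseline, so a change only sticks if it improves on everything already accepted.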

Repository Layout

autojit-graspa/
├── README.md                     ← this file
├── SKILL.md                      ← main agent instructions (read this first)
├── scripts/                      ← the pipeline scripts
│   ├── analyze.py                  # Phase 1a: feature detection
│   ├── build_graph.py              # Phase 1b: code graph + strip plan
│   ├── strip.py                    # brace-counting transformer
│   ├── specialize.py               # Phase 1c orchestrator
│   ├── score.py                    # correctness gate (6 checks)
│   ├── benchmark.sh                # compile+run+score
│   ├── run_agent.py                # Phase 2 via Anthropic API
│   ├── phase2_cli.sh               # Phase 2 via `claude` CLI (no API key)
│   ├── phase3_cli.sh               # Phase 3 via `claude` CLI
│   ├── run_phase3.py               # Phase 3 via API
│   ├── analyze_logs.py             # extract Phase 2→3 task ideas
│   ├── setup_specialize.sh         # setup helper
│   ├── run_all.sh                  # batch multi-case runner
│   └── program.md                  # LLM instruction manual (appended to prompts)
├── src/                          ← original gRASPA source (copied to working dir)
│   ├── *.cu, *.cpp, *.h, *.cuh     (~30 files)
│   └── NVC_COMPILE                 # nvc++ compile script
└── examples/                     ← fully-worked case studies
    ├── Ar_MgMOF74_UFF/             # +17.9% (Phase 1+2+3) — VDW-only, kernel-launch bound
    │   ├── README.md               # what happened, how to reproduce
    │   ├── input/                  # test case files
    │   ├── feature_manifest.json   # Phase 1a artifact
    │   ├── strip_plan.json         # Phase 1b artifact
    │   ├── phase1_strip_log.jsonl  # per-step log
    │   ├── REPORT.md               # full optimization report
    │   ├── baseline_output.txt     # reference for correctness
    │   └── optimized_src/          # final optimized source
    ├── CO2_MgMOF74_UFF/            # +9.1% (Phase 1+2) — Ewald, large compute grid
    │   ├── README.md
    │   ├── input/
    │   ├── feature_manifest.json
    │   ├── strip_plan.json
    │   ├── phase1_strip_log.jsonl
    │   ├── phase2_experiment_log.jsonl
    │   ├── phase3_log.jsonl
    │   ├── PHASE2_NOTES.md
    │   ├── baseline_output.txt
    │   └── optimized_src/
    ├── CO2_MFI/                    # +12% (Phase 1+2) — Ewald, small zeolite
    │   ├── README.md
    │   ├── input/
    │   ├── feature_manifest.json
    │   ├── strip_plan.json
    │   ├── phase1_strip_log.jsonl
    │   ├── baseline_output.txt
    │   └── optimized_src/
    ├── Henrys_coefficient/         # +8.8% (Phase 1+2) — driven by Codex CLI (not Claude)
    │   ├── README.md
    │   ├── input/
    │   ├── feature_manifest.json
    │   ├── strip_plan.json
    │   ├── phase1_strip_log.jsonl
    │   ├── phase2_experiment_log.jsonl
    │   ├── baseline_output.txt
    │   └── optimized_src/
    └── NU2000_pX_LinkerRotations/  # +3.9% (Phase 1+3) — CUDA Graph capture win
        ├── README.md
        ├── input/                  # includes Framework_Component_1.def (linker)
        ├── feature_manifest.json
        ├── strip_plan.json
        ├── phase1_strip_log.jsonl
        ├── baseline_output.txt
        └── optimized_src/

Case Study Results

Case	Driver	Adsorbate	Framework	Charges	Speedup	Winner
Ar_MgMOF74_UFF	Claude Code	Argon (1 atom)	MgMOF-74 1×1×1	No	+17.9%	Phase 3 kernel elim
CO2_MFI	Claude Code	CO2 (3 atoms)	MFI zeolite 2×2×2	Yes	+12%	Phase 2 micro-opts
CO2_MgMOF74_UFF	Claude Code	CO2 (3 atoms)	MgMOF-74 5×3×3	Yes	+9.1%	Phase 2 micro-opts
Henrys_coefficient	Codex CLI	CO2 (3 atoms)	MgMOF-74 (Widom)	Yes	+8.8%	Phase 2 micro-opts
NU2000_pX_LinkerRotations	Claude Code	p-xylene (10 atoms)	NU-2000 4×2×2, linker rotations	Yes	+3.9%	Phase 3 CUDA Graph capture

See examples/*/README.md for per-case details. The five cases span the optimization space (VDW-only vs Ewald, small vs large compute grids, simple translations vs special linker-rotation moves vs Widom-only) and illustrate when each phase pays off.

The pipeline is agent-agnostic. Four examples were driven by Claude Code and one (Henrys_coefficient) by OpenAI Codex CLI. Both agents independently rediscovered the BlockEnergy/Blocksum dead-init class of optimizations, which suggests the wins come from the code itself rather than from the quirks of one model. See SKILL.md for the agent-agnostic Phase 2 contract.

Prerequisites

- Linux with an NVIDIA GPU
- NVIDIA HPC SDK with nvc++ (tested with 24.5). Path: /opt/nvidia/hpc_sdk/Linux_x86_64/24.5/compilers/bin/nvc++
- Python 3 (stdlib only; no pip packages required)
- For Phase 2/3, any one of the following (the pipeline is agent-agnostic):
  - ANTHROPIC_API_KEY for run_agent.py / run_phase3.py
  - Claude Code CLI as claude for phase2_cli.sh / phase3_cli.sh
  - OpenAI Codex CLI as codex (see examples/Henrys_coefficient/README.md for the loop pattern)
  - Any other agent that can read source files, apply edits, and emit a DESCRIPTION: line
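
As an illustration of the last requirement, a driver might pull the DESCRIPTION: line out of an agent's output like this. The exact contract lives in SKILL.md; this sketch assumes that the marker starts a line and that the last occurrence wins. The function name `extract_description` is hypothetical.

```python
from typing import Optional


def extract_description(agent_output: str) -> Optional[str]:
    """Return the text of the last 'DESCRIPTION:' line, or None if absent."""
    found = None
    for line in agent_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("DESCRIPTION:"):
            # Assumption: everything after the marker is the summary.
            found = stripped[len("DESCRIPTION:"):].strip()
    return found
```

For example, `extract_description("edit done\nDESCRIPTION: hoist loop invariant\n")` returns `"hoist loop invariant"`, and output without the marker returns `None`.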

Correctness Guarantee

Every optimization (deterministic or AI-proposed) is automatically verified via score.py before being kept. The 6 checks:

- Move counts exact (translation/rotation/insertion/deletion): same RNG seed → same moves
- Molecule counts exact at end of simulation
- Energy tolerance 1e-5 relative on VDW / Coulomb / Ewald / tail correction
- Energy drift per-component < 3e-5
- GPU vs CPU drift not worse than baseline
- Structure factor consistency < 1e-4

If any check fails, the change is auto-reverted. No broken code is ever kept.
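
The first three checks can be sketched as a comparison between a baseline run and a candidate run. This is not score.py itself: the dictionary layout (`move_counts`, `molecule_counts`, `energies`) and the function name `compare_runs` are assumptions for illustration, and the drift and structure-factor checks are omitted.

```python
from typing import Dict, List

REL_TOL = 1e-5  # relative energy tolerance quoted in the list above


def compare_runs(baseline: Dict, candidate: Dict) -> List[str]:
    """Return a list of failure messages; an empty list means the run passes."""
    failures = []
    # Counts must match exactly (same RNG seed, same moves).
    for key in ("move_counts", "molecule_counts"):
        if baseline[key] != candidate[key]:
            failures.append(f"{key} differ")
    # Energies are compared with a relative tolerance rather than exactly.
    for name, ref in baseline["energies"].items():
        got = candidate["energies"][name]
        scale = max(abs(ref), abs(got), 1e-300)  # avoid dividing by zero
        if abs(got - ref) / scale > REL_TOL:
            failures.append(f"energy {name}: {got} vs {ref}")
    return failures
```

A caller would revert the change whenever the returned list is non-empty.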

How This Is Different From Manual Optimization

- You propose optimizations in plain language and Claude translates them into code. The LLM handles the nvc++ idioms, shared memory layout, warp-shuffle intrinsics, and so on.
- Failed optimizations are logged with reasons, so the next cycle doesn't repeat them.
- You can walk away. Set --cycles 30 and come back in an hour. Everything that stuck is in best_src_<CASE>/.
- Per-case tuning is automatic. The same pipeline works for single-atom VDW-only systems, multi-atom Ewald systems, large frameworks, and small frameworks.
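
One way to feed logged failures into the next cycle is to summarize the rejected entries of a JSONL experiment log and add them to the prompt. The sketch below assumes hypothetical field names (`description`, `kept`, `reason`); the real schema of files such as phase2_experiment_log.jsonl may differ.

```python
import json
from typing import Iterable, List


def failed_attempts(log_lines: Iterable[str]) -> List[str]:
    """Summarize rejected experiments from JSONL lines, de-duplicated in order."""
    seen: List[str] = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue                  # tolerate blank lines in the log
        entry = json.loads(line)
        if entry.get("kept"):         # hypothetical field: True if the change stuck
            continue
        summary = f"{entry['description']} (rejected: {entry['reason']})"
        if summary not in seen:
            seen.append(summary)
    return seen
```

The resulting list can be joined with newlines and placed under a "do not retry" heading in the next prompt.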
