Zhaoli2042/AutoJIT-gRASPA

AutoJIT-gRASPA

Automated GPU optimization pipeline for gRASPA, a CUDA Monte Carlo simulator for molecular adsorption.

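Given a specific simulation (inputs + force field + framework CIF), this pipeline produces a specialized, optimized binary that typically runs 5-20% faster (3.9% to 17.9% in the case studies below) while reproducing the baseline: move and molecule counts match exactly and energies agree within a relative tolerance of 1e-5. Every change is verified automatically.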

Quick Start

This is a Claude Code skill. An agent (you or Claude Code) loads SKILL.md for full context, then runs the 3-phase pipeline on a test case.

# Setup
WORK=$HOME/autojit-work
mkdir -p "$WORK" && cd "$WORK"
cp -r <AUTOJIT_REPO>/scripts/* .
cp -r <AUTOJIT_REPO>/src .
mkdir -p test_cases baselines
cp -r /path/to/your/simulation test_cases/MyCase

# Reduce cycle count in test_cases/MyCase/simulation.input for fast iteration
# (e.g., NumberOfInitializationCycles 100000 instead of 2000000)

# Phase 1: Deterministic dead code removal (no AI)
python3 specialize.py test_cases/MyCase

# Phase 2: LLM-driven micro-optimizations (uses `claude` CLI, no API key needed)
bash phase2_cli.sh --cycles 30 --test-case MyCase

# (Optional) Phase 3: Structural optimizations
python3 run_phase3.py --task on-the-fly-translation --test-case MyCase
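
The cycle-count reduction mentioned in the setup comments can also be scripted. The sketch below assumes simulation.input uses whitespace-separated `Keyword value` lines, as in the NumberOfInitializationCycles example above. The helper name `set_cycles` is hypothetical and not part of the pipeline.

```python
import re


def set_cycles(text: str, key: str, value: int) -> str:
    """Return `text` with the integer that follows `key` replaced by `value`.

    Assumes one whitespace-separated `Keyword value` pair per line.
    """
    pattern = re.compile(rf"^(\s*{re.escape(key)}\s+)\d+", flags=re.MULTILINE)
    new_text, count = pattern.subn(rf"\g<1>{value}", text)
    if count == 0:
        raise KeyError(f"{key} not found in input")
    return new_text


# Example on an in-memory string; SomeOtherKeyword is a made-up line.
sample = "NumberOfInitializationCycles 2000000\nSomeOtherKeyword 42\n"
shortened = set_cycles(sample, "NumberOfInitializationCycles", 100000)
```

Raising on a missing keyword is a deliberate choice here: silently leaving a 2,000,000-cycle run in place would defeat the purpose of the fast-iteration setup.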

The 3-Phase Pipeline

┌─────────────────────────────────────────────────────────────────────┐
│ Phase 1 (no AI) — Deterministic dead code stripping                 │
│   analyze.py  →  feature_manifest.json (what features are used)     │
│   build_graph.py → strip_plan.json (safe code removals)             │
│   specialize.py → verify+apply each strip (auto-revert on failure)  │
│   Typical: 0-3% speedup                                             │
├─────────────────────────────────────────────────────────────────────┤
│ Phase 2 (AI) — Per-kernel micro-optimizations                       │
│   phase2_cli.sh loops:                                              │
│     propose 1 CUDA change → apply → compile → run → score           │
│     keep if correct+faster, revert otherwise                        │
│   Typical: 5-10% speedup over Phase 1                               │
├─────────────────────────────────────────────────────────────────────┤
│ Phase 3 (AI) — Structural multi-file changes                        │
│   run_phase3.py per task (CUDA graphs, kernel fusion, etc.)         │
│   Multi-turn debug loop (up to 10 compile-fix iterations)           │
│   Typical: 0-10% (highly case-dependent)                            │
└─────────────────────────────────────────────────────────────────────┘

The safety net: score.py verifies correctness after every change. Move counts must match exactly (same RNG seed → same moves), and energies must agree within a relative tolerance of 1e-5. Any failure triggers an automatic revert.
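
The Phase 2 loop in the diagram (propose, apply, compile, run, score, then keep or revert) can be sketched in a few lines. This is an illustration of the control flow only, not the contents of phase2_cli.sh: the callables `propose`, `apply_change`, `revert`, and `benchmark`, and the `Result` type, are hypothetical stand-ins for the agent call, the source edit, the rollback, and the compile+run+score step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Result:
    correct: bool   # did the run pass the correctness gate?
    seconds: float  # measured wall-clock time of the run


def optimize(
    propose: Callable[[], str],
    apply_change: Callable[[str], None],
    revert: Callable[[], None],
    benchmark: Callable[[], Optional[Result]],
    baseline_seconds: float,
    cycles: int,
) -> Tuple[float, List[str]]:
    """Run `cycles` propose/score rounds and return (best time, kept changes)."""
    best = baseline_seconds
    kept: List[str] = []
    for _ in range(cycles):
        change = propose()
        apply_change(change)
        result = benchmark()  # None models a compile or runtime failure
        if result is not None and result.correct and result.seconds < best:
            best = result.seconds  # keep: correct and faster than best so far
            kept.append(change)
        else:
            revert()  # incorrect, slower, or failed to build
    return best, kept
```

Comparing against the best time so far, rather than the original baseline, means each kept change has to improve on everything already accepted.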

Repository Layout

autojit-graspa/
├── README.md                     ← this file
├── SKILL.md                      ← main agent instructions (read this first)
├── scripts/                      ← the pipeline scripts
│   ├── analyze.py                  # Phase 1a: feature detection
│   ├── build_graph.py              # Phase 1b: code graph + strip plan
│   ├── strip.py                    # brace-counting transformer
│   ├── specialize.py               # Phase 1c orchestrator
│   ├── score.py                    # correctness gate (6 checks)
│   ├── benchmark.sh                # compile+run+score
│   ├── run_agent.py                # Phase 2 via Anthropic API
│   ├── phase2_cli.sh               # Phase 2 via `claude` CLI (no API key)
│   ├── phase3_cli.sh               # Phase 3 via `claude` CLI
│   ├── run_phase3.py               # Phase 3 via API
│   ├── analyze_logs.py             # extract Phase 2→3 task ideas
│   ├── setup_specialize.sh         # setup helper
│   ├── run_all.sh                  # batch multi-case runner
│   └── program.md                  # LLM instruction manual (appended to prompts)
├── src/                          ← original gRASPA source (copied to working dir)
│   ├── *.cu, *.cpp, *.h, *.cuh     (~30 files)
│   └── NVC_COMPILE                 # nvc++ compile script
└── examples/                     ← fully-worked case studies
    ├── Ar_MgMOF74_UFF/             # +17.9% (Phase 1+2+3) — VDW-only, kernel-launch bound
    │   ├── README.md               # what happened, how to reproduce
    │   ├── input/                  # test case files
    │   ├── feature_manifest.json   # Phase 1a artifact
    │   ├── strip_plan.json         # Phase 1b artifact
    │   ├── phase1_strip_log.jsonl  # per-step log
    │   ├── REPORT.md               # full optimization report
    │   ├── baseline_output.txt     # reference for correctness
    │   └── optimized_src/          # final optimized source
    ├── CO2_MgMOF74_UFF/            # +9.1% (Phase 1+2) — Ewald, large compute grid
    │   ├── README.md
    │   ├── input/
    │   ├── feature_manifest.json
    │   ├── strip_plan.json
    │   ├── phase1_strip_log.jsonl
    │   ├── phase2_experiment_log.jsonl
    │   ├── phase3_log.jsonl
    │   ├── PHASE2_NOTES.md
    │   ├── baseline_output.txt
    │   └── optimized_src/
    ├── CO2_MFI/                    # +12% (Phase 1+2) — Ewald, small zeolite
    │   ├── README.md
    │   ├── input/
    │   ├── feature_manifest.json
    │   ├── strip_plan.json
    │   ├── phase1_strip_log.jsonl
    │   ├── baseline_output.txt
    │   └── optimized_src/
    ├── Henrys_coefficient/         # +8.8% (Phase 1+2) — driven by Codex CLI (not Claude)
    │   ├── README.md
    │   ├── input/
    │   ├── feature_manifest.json
    │   ├── strip_plan.json
    │   ├── phase1_strip_log.jsonl
    │   ├── phase2_experiment_log.jsonl
    │   ├── baseline_output.txt
    │   └── optimized_src/
    └── NU2000_pX_LinkerRotations/  # +3.9% (Phase 1+3) — CUDA Graph capture win
        ├── README.md
        ├── input/                  # includes Framework_Component_1.def (linker)
        ├── feature_manifest.json
        ├── strip_plan.json
        ├── phase1_strip_log.jsonl
        ├── baseline_output.txt
        └── optimized_src/
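
The layout above lists strip.py as a brace-counting transformer. As an illustration of that general technique (not the actual strip.py logic), the sketch below finds the closing brace that matches a given opening brace in C/CUDA source. For brevity it does not skip braces that appear inside string literals or comments, which a real transformer would have to handle. The name `find_block_end` is hypothetical.

```python
def find_block_end(source: str, open_index: int) -> int:
    """Return the index of the '}' that closes the '{' at `open_index`."""
    if source[open_index] != "{":
        raise ValueError("open_index must point at '{'")
    depth = 0
    for i in range(open_index, len(source)):
        ch = source[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces")


# Example: empty out the body of f() while leaving the rest untouched.
code = "void f() { if (x) { y(); } }\nint g;"
start = code.index("{")
end = find_block_end(code, start)
stripped = code[:start] + "{}" + code[end + 1:]
```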

Case Study Results

| Case | Driver | Adsorbate | Framework | Charges | Speedup | Winner |
|---|---|---|---|---|---|---|
| Ar_MgMOF74_UFF | Claude Code | Argon (1 atom) | MgMOF-74 1×1×1 | No | +17.9% | Phase 3 kernel elimination |
| CO2_MFI | Claude Code | CO2 (3 atoms) | MFI zeolite 2×2×2 | Yes | +12% | Phase 2 micro-opts |
| CO2_MgMOF74_UFF | Claude Code | CO2 (3 atoms) | MgMOF-74 5×3×3 | Yes | +9.1% | Phase 2 micro-opts |
| Henrys_coefficient | Codex CLI | CO2 (3 atoms) | MgMOF-74 (Widom) | Yes | +8.8% | Phase 2 micro-opts |
| NU2000_pX_LinkerRotations | Claude Code | p-xylene (10 atoms) | NU-2000 4×2×2, linker rotations | Yes | +3.9% | Phase 3 CUDA Graph capture |

See examples/*/README.md for per-case details. The five cases span the optimization space: VDW-only vs Ewald, small vs large compute grids, simple translations vs special linker-rotation moves vs Widom-only — illustrating when each phase pays off.

The pipeline is agent-agnostic. Four examples were driven by Claude Code and one (Henrys_coefficient) by OpenAI Codex CLI. Both agents independently rediscovered the BlockEnergy/Blocksum dead-init class of optimizations, which suggests these wins reflect real inefficiencies in the code rather than quirks of a particular LLM. See SKILL.md for the agent-agnostic Phase 2 contract.

Prerequisites

  • Linux with NVIDIA GPU
  • NVIDIA HPC SDK with nvc++ (tested with 24.5). Path: /opt/nvidia/hpc_sdk/Linux_x86_64/24.5/compilers/bin/nvc++
  • Python 3 (stdlib only — no pip packages required)
  • For Phase 2/3 (any of these works — pipeline is agent-agnostic):
    • ANTHROPIC_API_KEY for run_agent.py / run_phase3.py
    • Claude Code CLI as claude for phase2_cli.sh / phase3_cli.sh
    • OpenAI Codex CLI as codex (see examples/Henrys_coefficient/README.md for the loop pattern)
    • Any other agent that can read source files, apply edits, and emit a DESCRIPTION: line

Correctness Guarantee

Every optimization (deterministic or AI-proposed) is automatically verified via score.py before being kept. The 6 checks:

  1. Move counts exact (translation/rotation/insertion/deletion) — same RNG seed → same moves
  2. Molecule counts exact at end of simulation
  3. Energy tolerance 1e-5 relative on VDW / Coulomb / Ewald / tail correction
  4. Energy drift per-component < 3e-5
  5. GPU vs CPU drift not worse than baseline
  6. Structure factor consistency < 1e-4

If any check fails, the change is auto-reverted, so no change that fails verification is ever kept.
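
Checks 1 and 3 from the list above can be sketched as follows. This is an illustration under assumed data shapes (plain dictionaries keyed by move type and energy component), not the actual score.py, and the function names are hypothetical. The 1e-5 relative tolerance is the one stated in the list.

```python
import math


def moves_match(baseline: dict, candidate: dict) -> bool:
    """Check 1: every move counter must be exactly equal."""
    return baseline == candidate


def energies_match(baseline: dict, candidate: dict, rel_tol: float = 1e-5) -> bool:
    """Check 3: each energy component agrees within a relative tolerance.

    math.isclose treats two exact zeros as equal, so a component that is
    identically zero in both runs (e.g. Coulomb in a VDW-only case) passes.
    """
    if baseline.keys() != candidate.keys():
        return False
    return all(
        math.isclose(candidate[name], baseline[name], rel_tol=rel_tol)
        for name in baseline
    )


def passes_gate(baseline: dict, candidate: dict) -> bool:
    """Combine both checks; any failure means the change should be reverted."""
    return (
        moves_match(baseline["moves"], candidate["moves"])
        and energies_match(baseline["energies"], candidate["energies"])
    )
```

The exact comparison for move counts reflects the fixed RNG seed: if the optimized binary takes a different sequence of moves, something other than floating-point rounding has changed.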

How This Is Different From Manual Optimization

  • You propose optimizations in plain language, and Claude (or another agent) translates them into code. The LLM handles the nvc++ idiom, shared memory layout, warp-shuffle intrinsics, etc.
  • Failed optimizations are logged with reasons, so the next cycle doesn't repeat them.
  • You can walk away. Set --cycles 30, come back in an hour. Everything that stuck is in best_src_<CASE>/.
  • Per-case tuning is automatic. Same pipeline works for single-atom VDW-only systems, multi-atom Ewald systems, large frameworks, small frameworks.
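
One way to keep failed attempts from being repeated is to append each outcome to a JSON Lines file and feed the failures back into the next prompt. The sketch below assumes hypothetical field names (`description`, `kept`, `reason`) and hypothetical helper names; the actual schema of phase2_experiment_log.jsonl is not specified here.

```python
import json
from pathlib import Path
from typing import List


def log_attempt(path: str, description: str, kept: bool, reason: str = "") -> None:
    """Append one experiment record as a single JSON line."""
    record = {"description": description, "kept": kept, "reason": reason}
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def failed_descriptions(path: str) -> List[str]:
    """Return 'description (reason)' for every attempt that was reverted."""
    log = Path(path)
    if not log.exists():
        return []
    failures = []
    with log.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if not record["kept"]:
                failures.append(f'{record["description"]} ({record["reason"]})')
    return failures
```

An append-only log survives interrupted runs: each line is a complete record, so a later cycle can read whatever was written before the interruption.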

Related

  • gRASPA — the simulator being optimized
  • Claude Code — the AI coding CLI used for Phase 2/3

About

Agentic optimization for your gRASPA before you run a million jobs
