Can we predict which models will reward-hack before they actually do it?
This project develops and validates methods to detect latent reward hacking propensity—specifically using prefill elicitation and importance sampling—on the djinn exploitable coding testbed.
Target: ICML 2026 submission (Jan 28 deadline)
Prefill sensitivity + importance sampling can reveal which models (or which prompts) are at risk of reward hacking before training actually induces it.
- Prefill as leading indicator: Does prefill sensitivity (the shift in exploit-completion probability when the model is seeded with exploit-like reasoning) predict post-fine-tuning exploit rates? (See the sketch after this list.)
- Importance sampling for detection: Can ARC-style importance sampling surface high-risk inputs/models before deployment?
- Generalization structure: Do leading indicators transfer across exploit families, or are they family-specific?
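As a concrete reading of the first question, prefill sensitivity can be estimated by sampling completions once with an exploit-style prefill and once with a neutral prefill, then differencing the exploit rates a verifier flags. A minimal sketch, with hypothetical `sample_completions` and `is_exploit` helpers standing in for model sampling and the djinn verifier (not the repository's actual API):

```python
# Sketch of the prefill-sensitivity metric: the shift in exploit rate when the
# assistant turn is seeded with exploit-like reasoning vs. a neutral opener.
# `sample_completions` and `is_exploit` are hypothetical stand-ins.

def prefill_sensitivity(prompt, exploit_prefill, neutral_prefill,
                        sample_completions, is_exploit, n=64):
    """Estimate P(exploit | exploit prefill) - P(exploit | neutral prefill)."""
    def exploit_rate(prefill):
        completions = sample_completions(prompt, prefill=prefill, n=n)
        return sum(is_exploit(c) for c in completions) / n

    return exploit_rate(exploit_prefill) - exploit_rate(neutral_prefill)
```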
```bash
# Clone and install
git clone https://github.com/EleutherAI/rh-indicators
cd rh-indicators

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install with djinn dependency
pip install -e ".[dev]"

# Install djinn from local path (development)
pip install -e /path/to/djinn
```
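After installation, a quick check that both packages import (the `djinn` module name is assumed here from the dependency name):

```python
# Minimal post-install sanity check; module names assumed from the install steps above.
import djinn           # exploitable-coding testbed dependency
import rh_indicators   # this package

print("imports ok")
```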
Repository layout:

```
rh-indicators/
├── src/rh_indicators/
│   └── run_utils/       # Experiment logging for reproducibility
├── scripts/             # Experiment scripts (see below)
├── docs/
│   └── decisions/       # Architecture Decision Records
├── results/             # Experiment outputs (gitignored)
└── README.md
```
All experiment scripts are in scripts/:
| Script | Purpose |
|---|---|
| `train_sft_checkpoints.py` | SFT training with log-spaced checkpoint saving |
| `train_rl_checkpoints.py` | GRPO RL training |
| `eval_prefill_sensitivity.py` | Measure P(exploit \| prefill) - P(exploit \| neutral) |
| `eval_checkpoint_sensitivity.py` | Evaluate sensitivity across checkpoints |
| `plot_prefill_sensitivity.py` | Visualize prefill sensitivity results |
| `survival_analysis.py` | Analyze when models start reward hacking |
| `test_prefill_completions.py` | Debug script for prefill completions |
The `run_utils` module (`src/rh_indicators/run_utils/`) provides experiment logging for reproducibility:
- `run_context` - Context manager for experiment runs with automatic setup and status tracking
- `select_checkpoints.py` - Select log-spaced checkpoints from training runs
```python
from pathlib import Path

from rh_indicators.run_utils import run_context

# `args`, `run_experiment`, and `save_json` are stand-ins for your argument
# parser, your experiment code, and a JSON-writing helper.
with run_context(Path("results/my_exp"), run_prefix="eval", config_args=vars(args)) as run_dir:
    # Your experiment code here
    results = run_experiment()
    save_json(run_dir / "metrics" / "results.json", results)
```

Each run creates a timestamped directory with:
```
results/my_exp/eval-20251203-120000-abc1234/
├── config.yaml       # Full command + args
├── metadata.json     # Git commit, Python version, CUDA info, pip freeze
├── status.json       # Final status (success/failed/exit_reason)
├── logs/
├── plots/
├── artifacts/
├── samples/
└── metrics/
```
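Downstream analysis can then read a run back from this layout. A minimal sketch that loads the newest run's status and metrics, assuming a consistent `run_prefix` (so timestamped names sort lexicographically) and the `results.json` written in the example above:

```python
import json
from pathlib import Path

exp_dir = Path("results/my_exp")
# Timestamped run names sort lexicographically when the run_prefix is the same.
latest_run = max(exp_dir.iterdir(), key=lambda p: p.name)

status = json.loads((latest_run / "status.json").read_text())
metrics = json.loads((latest_run / "metrics" / "results.json").read_text())
print(latest_run.name, status, metrics)
```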
- djinn-framework: Exploitable coding problems and verifiers
- Dataset: `djinn/problems` (v0.9+) on HuggingFace
- Anthropic (2024). "Natural emergent misalignment from reward hacking"
- ARC (2024). "Importance sampling for AI control" (arXiv:2410.13211)