

Reproduction Scripts

Shell wrappers that reproduce the LLMSurgeon results reported in the paper. Launch every script from the repo root, e.g.:

bash exp_scripts/OLMo-1B.sh

Each script sets `CUDA_VISIBLE_DEVICES`; edit it to match your hardware. A few of the multi-GPU DDP scripts contain an absolute `cd` path from our cluster; replace it with your own checkout path before running.
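One way to spot the machine-specific lines before launching anything is to grep for them. The sketch below is only an illustration: `find_machine_specific` is a hypothetical helper name, not something shipped in the repo, and it assumes the scripts set the GPU list and the cluster path on lines of their own.

```shell
# find_machine_specific [DIR]
# Print (file:line:text) every line in DIR/*.sh that either changes into an
# absolute path or mentions CUDA_VISIBLE_DEVICES. DIR defaults to exp_scripts.
# Hypothetical helper for illustration; not part of the repository.
find_machine_specific() {
    dir="${1:-exp_scripts}"
    # -n: line numbers, -H: always print the file name, -E: extended regex
    grep -nHE '^[[:space:]]*cd[[:space:]]+/|CUDA_VISIBLE_DEVICES' "$dir"/*.sh
}
```

Run from the repo root, `find_machine_specific exp_scripts` lists the lines to review; each hit still has to be edited by hand for your GPUs and checkout path.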

Expect generation and scoring to take from a few minutes (1–7B models, coarse) to several hours (65B or StarCoder, fine-grained). See the paper appendix for exact runtime numbers.

LLMSurgeon (main method — Table 2)

| Script | Target model | Paper table / figure |
| --- | --- | --- |
| `OLMo-1B.sh` | `allenai/OLMo-1B` | Table 2 (coarse) |
| `LLaMA1-7B.sh` | `huggyllama/llama-7b` | Table 2 (coarse) |
| `Amber-13B.sh` | `LLM360/Amber` | Table 2 (coarse) |
| `pythia.sh` | `EleutherAI/pythia-2.8b` / `gpt-neo-2.7B` | Table 2 (mid) |
| `olmo3.sh` | `allenai/Olmo-3-1025-7B` | Held-out generalization table |
| `starcoder.sh` | `bigcode/starcoder` | Table 2 (fine) |
| `closedapi.sh` | OpenAI + Gemini APIs | Closed-source models appendix |

Baselines (Table 2 competing methods)

All baselines operate on Pythia / GPT-Neo unless otherwise noted.

| Script | Baseline | Paper table |
| --- | --- | --- |
| `mink.sh` | Min-K% | Table 2 |
| `minkpp.sh` | Min-K%++ | Table 2 |
| `zlib.sh` | zlib log-prob ratio | Table 2 |
| `recall.sh` | ReCaLL | Table 2 |
| `neighborhood.sh` | Neighborhood attack | Table 2 |
| `dcpdd.sh` | DC-PDD | Table 2 |
| `duci_olmo.sh` | DUCI on OLMo-1B | Table 2 |
| `duci_llama.sh` | DUCI on LLaMA-7B | Table 2 |
| `duci_amber.sh` | DUCI on Amber-13B | Table 2 |
| `duci_pythia.sh` | DUCI on Pythia / GPT-Neo | Table 2 |
| `duci_starcoder.sh` | DUCI on StarCoder | Table 2 (fine) |
| `starcoder_mink.sh` / `starcoder_mink_ddp.sh` | Min-K% on StarCoder (single / multi-GPU) | Table 2 (fine) |
| `starcoder_minkpp.sh` / `starcoder_minkpp_ddp.sh` | Min-K%++ on StarCoder (single / multi-GPU) | Table 2 (fine) |
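
To run several of the baselines above back to back, a small loop with per-script logs can help. This is a minimal sketch, not part of the repo: `run_scripts` and the `logs/` directory are assumptions made for illustration, and it assumes each wrapper exits non-zero when it fails.

```shell
# run_scripts SCRIPT_DIR LOG_DIR NAME...
# Run SCRIPT_DIR/NAME.sh for each NAME in order, writing combined
# stdout/stderr to LOG_DIR/NAME.log. Stops at the first failing script.
# Hypothetical helper for illustration; not part of the repository.
run_scripts() {
    script_dir="$1"; log_dir="$2"; shift 2
    mkdir -p "$log_dir" || return 1
    for name in "$@"; do
        echo "== $name =="
        bash "$script_dir/$name.sh" > "$log_dir/$name.log" 2>&1 || {
            echo "$name failed; see $log_dir/$name.log" >&2
            return 1
        }
    done
}
```

From the repo root, `run_scripts exp_scripts logs mink minkpp zlib recall neighborhood dcpdd` would run the Pythia / GPT-Neo baselines sequentially. Running them one at a time avoids two scripts competing for the same GPUs.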

Evaluation

`evaluation.sh` is a minimal example of scoring a single `out/<run_name>/` directory against a YAML ground-truth spec under `bench/specs/`. See the top-level README for the `benchmark_evaluation.py` CLI.
