Purpose. RFCN-Bench is a research-instrument repository for building a benchmark and measurement study of tool/function-calling robustness in LLMs and agent scaffolds. It is designed to be:
- Measurable: explicit metrics + deterministic evaluation harness
- Reproducible: one-command “Reproduce Table 1” pipeline (sample dataset included)
- Extensible: modular perturbation suites, providers, and baselines
- Publishable as an artifact: dataset + analysis + harness, even before new methods
This repository is minimal but academically submission-ready: pinned env, configs, provenance logging, tests, CI, dataset/model cards, and a paper skeleton.
Tate Lyman (Independent Researcher, [email protected])
Conda
conda env create -f environment.yml
conda activate rfcn

(Alternative) pip
python -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"

make reproduce_table1

Outputs:
- `data/processed/rfcn_sample_v0.2.jsonl`
- `results/runs/ollama_llama31_8b_q4_1_v0_seed1337/run.json` (provenance example)
- `results/runs/ollama_llama31_8b_q4_1_v0_seed1337/metrics.json` (example)
- `results/tables/table1_runs.csv` (per-run)
- `results/tables/table1.csv` (grouped means + 95% CI)
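
As a rough illustration of how per-run rows could be collapsed into grouped means with a 95% CI, here is a minimal sketch. The column names `model` and `tsr` are hypothetical, and the interval uses a normal approximation (z = 1.96); the repository's own aggregation may use different columns or a t-based interval.

```python
import math
import statistics
from collections import defaultdict


def grouped_mean_ci(rows, group_key="model", value_key="tsr", z=1.96):
    """Return {group: (mean, ci_low, ci_high)} for a list of per-run rows.

    `group_key` and `value_key` are hypothetical column names.
    The interval is a normal approximation: mean +/- z * sd / sqrt(n).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(float(row[value_key]))

    summary = {}
    for name, values in groups.items():
        mean = statistics.fmean(values)
        if len(values) > 1:
            half_width = z * statistics.stdev(values) / math.sqrt(len(values))
        else:
            # A single run gives no variance estimate, so the bounds are undefined.
            half_width = float("nan")
        summary[name] = (mean, mean - half_width, mean + half_width)
    return summary
```

With only a handful of seeds per group, the normal approximation understates the interval width; a t-distribution quantile would be the more conservative choice.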
- `research_contract.md` (single source of truth for claims & eval)
- `configs/` (every run is a YAML config)
- `scripts/build_dataset.py` (your dataset construction rules)
- `src/rfcn/perturbations.py` (noise suites)
- `paper/` (write from day 1)
- `configs/baseline_heuristic.yaml`: deterministic regex tool caller (sanity check)
- `configs/baseline_random.yaml`: random tool + args (lower bound)
- `configs/baseline_oracle.yaml`: oracle upper bound (uses ground truth)
- `configs/baseline_openai.yaml`: OpenAI tool-calling baseline (requires API key)
- `configs/baseline_ollama.yaml`: Ollama local baseline (e.g., Llama 3.1 8B)
- TSR: tool correctness + JSON validity + schema validity + args exact
- Robustness AUC: area under TSR vs. noise level curve (normalized)
- Efficiency: wall time, samples/sec; optional tokens/calls if provided by a model
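
To make the first two metrics concrete, here is a minimal sketch under stated assumptions. It reads the `+` in the TSR definition as a conjunction (an instance counts only if all four checks pass), reduces schema validity to a required-keys check, and normalizes the AUC by the noise range so that a constant TSR of 1.0 yields 1.0. The function names and the `name`/`arguments` call format are hypothetical, not the harness's actual API.

```python
import json


def call_succeeds(raw_output, expected_tool, expected_args, required_keys):
    """True only if the output is valid JSON, names the right tool,
    has the required argument keys, and matches the expected args exactly."""
    try:
        call = json.loads(raw_output)  # JSON validity
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    args = call.get("arguments")
    if call.get("name") != expected_tool or not isinstance(args, dict):
        return False  # tool correctness
    if any(key not in args for key in required_keys):
        return False  # simplified stand-in for schema validity
    return args == expected_args  # exact argument match


def tsr(success_flags):
    """Fraction of instances for which every check passed."""
    return sum(success_flags) / len(success_flags) if success_flags else 0.0


def robustness_auc(noise_levels, tsr_values):
    """Trapezoidal area under the TSR-vs-noise curve, divided by the noise range."""
    points = sorted(zip(noise_levels, tsr_values))
    if len(points) < 2 or points[-1][0] == points[0][0]:
        raise ValueError("need at least two distinct noise levels")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (points[-1][0] - points[0][0])
```

Under this normalization, a model whose TSR falls linearly from 1.0 to 0.0 across the noise range scores 0.5, which makes curves measured on different noise grids comparable.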
- Every run writes a `run.json` in its run directory (example: `results/runs/ollama_llama31_8b_q4_1_v0_seed1337/run.json`)
- Dataset and config SHA256 hashes are recorded in the run provenance
- Seeds are captured and injected into Python/NumPy RNGs
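
The provenance points above could be implemented along the following lines. This is a sketch only: the function names and the `run.json` field names (`seed`, `config_sha256`, `dataset_sha256`) are assumptions, and the actual schema written by the harness may differ.

```python
import hashlib
import json
import random
from pathlib import Path


def sha256_file(path):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def seed_everything(seed):
    """Seed Python's RNG, and NumPy's global RNG if NumPy is installed."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return
    np.random.seed(seed)


def write_provenance(run_dir, config_path, dataset_path, seed):
    """Write a run.json with hypothetical field names and return the record."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "seed": seed,
        "config_sha256": sha256_file(config_path),
        "dataset_sha256": sha256_file(dataset_path),
    }
    (run_dir / "run.json").write_text(json.dumps(record, indent=2, sort_keys=True))
    return record
```

Hashing the config and dataset, rather than recording only their paths, lets a later reader detect that a file was edited after the run was produced.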
- Conda: use `conda-lock` to pin a reproducible environment snapshot.
- pip: use `pip-compile` (pip-tools) to generate a `requirements.lock.txt`.
- Create an API key at https://platform.openai.com/.
- Export the key: `export OPENAI_API_KEY=sk-REDACTED`
- Run: `make run_baseline_openai`
- Install and start the Ollama server: `ollama serve`
- Pull the model: `ollama pull llama3.1:8b-instruct-q4_1`
- Run: `make run_baseline_ollama`
- If local inference is slow, use `make run_baseline_ollama_fast` first (50 instances) and then run the full baseline overnight.
- The Ollama configs enable `resume: true`, so rerunning continues from the last completed instance in the same `run_id` directory.
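
One plausible way to implement the resume behavior is to append one JSON line per completed instance and skip already-recorded ids on restart. The sketch below assumes a hypothetical `predictions.jsonl` file with an `id` field per line; the harness's real file layout and function names are not specified here.

```python
import json
from pathlib import Path


def completed_ids(predictions_path):
    """Collect ids of instances already written to the predictions file."""
    path = Path(predictions_path)
    if not path.exists():
        return set()
    done = set()
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # ignore a line truncated by an interrupted run
    return done


def _ends_with_newline(path):
    """True if the file is missing, empty, or ends in a newline."""
    if not path.exists() or path.stat().st_size == 0:
        return True
    with path.open("rb") as fh:
        fh.seek(-1, 2)
        return fh.read(1) == b"\n"


def run_with_resume(instances, predict, predictions_path):
    """Run `predict` only on instances not yet recorded; return how many were skipped."""
    path = Path(predictions_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    done = completed_ids(path)
    needs_newline = not _ends_with_newline(path)
    with path.open("a") as fh:
        if needs_newline:
            fh.write("\n")  # keep a truncated trailing line from fusing with new output
        for instance in instances:
            if instance["id"] in done:
                continue
            record = {"id": instance["id"], "output": predict(instance)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # make each completed instance durable before the next call
    return len(done)
```

Flushing after every instance is what makes an overnight run safe to interrupt: at most the in-flight instance is lost.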
Run the ablation suite (heuristic baseline, fast):
make run_ablations

Outputs:
- `results/ablations/ablation_no_schema_validation_seed1337/` (example)
- `results/tables/table2.csv`
Render robustness curves + failure taxonomy:
make plot_figures

Outputs:
- `results/figures/robustness_curves.png`
- `results/figures/failure_taxonomy.png`
Each instance contains:
- natural language instruction
- tool registry (function name + JSON schema + docstring)
- expected tool call(s) and/or expected post-tool answer
- perturbation metadata (if any)
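
For orientation, an instance with the components listed above might look like the following. Every field name and value here is illustrative and hypothetical; consult the sample dataset and `docs/dataset_card.md` for the actual schema.

```python
import json

# Hypothetical instance; field names are assumptions, not the dataset's schema.
example_instance = {
    "id": "sample-0001",
    "instruction": "What is the weather in Paris in celsius?",
    "tools": [
        {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {  # JSON schema for the arguments
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        }
    ],
    "expected_calls": [
        {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
    ],
    "expected_answer": None,  # optional post-tool answer
    "perturbation": None,     # metadata is present only for perturbed instances
}


def validate_instance(instance):
    """Return a list of problems; an empty list means the instance looks complete."""
    problems = [
        f"missing field: {name}"
        for name in ("instruction", "tools")
        if name not in instance
    ]
    if not instance.get("expected_calls") and instance.get("expected_answer") is None:
        problems.append("need expected_calls and/or expected_answer")
    for tool in instance.get("tools", []):
        for key in ("name", "description", "parameters"):
            if key not in tool:
                problems.append(f"tool missing key: {key}")
    return problems


# One instance per line, as in a .jsonl file.
jsonl_line = json.dumps(example_instance)
```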
RFCN extends standard function-calling benchmarks (e.g., BFCL) by introducing controlled, reproducible noise and reporting robustness curves (performance vs. noise level), plus failure taxonomies.
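
To illustrate what "controlled, reproducible noise" can mean in practice, here is a small sketch of a seeded character-swap perturbation whose strength is set by a noise level in [0, 1]. It is a hypothetical example, not one of the suites in `src/rfcn/perturbations.py`.

```python
import random


def swap_typos(text, noise_level, seed):
    """Swap adjacent letters with probability `noise_level`, using a seeded RNG.

    The same (text, noise_level, seed) always yields the same output, and the
    result is a permutation of the input characters.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < noise_level:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so a character moves at most once
        else:
            i += 1
    return "".join(chars)
```

Sweeping `noise_level` over a fixed grid with a fixed seed gives the x-axis of a robustness curve; the evaluation at each level is then directly comparable across models.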
If you publish using RFCN-Bench, please use CITATION.cff for citation metadata.
- Code: Apache-2.0 (see `LICENSE`)
- Dataset: CC BY 4.0 for the sample dataset (see `docs/dataset_card.md`).