RFCN-Bench: Robust Function Calling under Noise

Purpose. RFCN-Bench is a research repository for building a benchmark and measurement study of tool/function-calling robustness in LLMs and agent scaffolds. It is designed to be:

Measurable: explicit metrics + deterministic evaluation harness
Reproducible: one-command “Reproduce Table 1” pipeline (sample dataset included)
Extensible: modular perturbation suites, providers, and baselines
Publishable as an artifact: dataset + analysis + harness, even before any new methods are proposed

This repository is deliberately minimal but ready for academic submission: it ships a pinned environment, configs, provenance logging, tests, CI, dataset and model cards, and a paper skeleton.

Quickstart (local, no API keys required)

Maintainer

Tate Lyman (Independent Researcher)

1) Create environment

Conda

conda env create -f environment.yml
conda activate rfcn

(Alternative) pip

python -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"

2) Build the sample dataset + run baselines + aggregate Table 1

make reproduce_table1

Outputs:

data/processed/rfcn_sample_v0.2.jsonl
results/runs/ollama_llama31_8b_q4_1_v0_seed1337/run.json (provenance example)
results/runs/ollama_llama31_8b_q4_1_v0_seed1337/metrics.json (example)
results/tables/table1_runs.csv (per-run)
results/tables/table1.csv (grouped means + 95% CI)
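
The grouped table reports a mean and a 95% confidence interval per group. The sketch below shows one way such an aggregate could be computed from per-run scores. It assumes a normal-approximation interval; the helper name `mean_ci95` and the sample scores are hypothetical, and the repository's own aggregation may use a different interval (for example, Student's t or a bootstrap).

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, half_width) for a normal-approximation 95% CI.

    Hypothetical helper for illustration only.
    """
    n = len(values)
    mean = statistics.fmean(values)
    if n < 2:
        return mean, 0.0  # a single run gives no spread estimate
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return mean, 1.96 * sem


# Made-up per-run scores for one baseline across three seeds.
scores = [0.5, 0.75, 1.0]
mean, half = mean_ci95(scores)
print(f"mean={mean:.3f} ci95=+/-{half:.3f}")
```

With only a handful of seeds per group, a t-based interval would be wider than this approximation, so the choice is worth stating alongside the table.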

What to edit first (recommended order)

research_contract.md (single source of truth for claims & eval)
configs/ (every run is a YAML config)
scripts/build_dataset.py (your dataset construction rules)
src/rfcn/perturbations.py (noise suites)
paper/ (write from day 1)

Baselines included in the sample

configs/baseline_heuristic.yaml — deterministic regex tool caller (sanity check)
configs/baseline_random.yaml — random tool + args (lower bound)
configs/baseline_oracle.yaml — oracle upper bound (uses ground truth)
configs/baseline_openai.yaml — OpenAI tool-calling baseline (requires API key)
configs/baseline_ollama.yaml — Ollama local baseline (e.g., Llama 3.1 8B)
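
To make the idea of a deterministic regex tool caller concrete, here is a minimal sketch of such a heuristic. It is an assumption about how a sanity-check baseline of this kind might work, not the repository's implementation: the function name `heuristic_call`, the tool dictionary shape, and the matching rule are all hypothetical.

```python
import re


def heuristic_call(instruction, tools):
    """Pick the first tool whose name tokens all occur in the instruction.

    Hypothetical rule: "get_weather" matches if both "get" and "weather"
    appear as words in the instruction. Arguments are left empty.
    """
    words = set(re.findall(r"[a-z0-9]+", instruction.lower()))
    for tool in tools:
        tokens = re.findall(r"[a-z0-9]+", tool["name"].lower())
        if tokens and all(t in words for t in tokens):
            return {"name": tool["name"], "arguments": {}}
    return None  # no tool matched


tools = [{"name": "send_email"}, {"name": "get_weather"}]
call = heuristic_call("Please get the weather in Paris", tools)
print(call)
```

Because the rule has no randomness, repeated runs give identical outputs, which is what makes a baseline like this useful for checking the harness itself.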

Metrics (from `scripts/run_baseline.py`)

TSR: tool correctness + JSON validity + schema validity + exact argument match
Robustness AUC: area under TSR vs. noise level curve (normalized)
Efficiency: wall time, samples/sec; optional token and call counts if the model reports them
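
The two accuracy metrics can be sketched as follows. This is an illustration under stated assumptions, not the code in `scripts/run_baseline.py`: it assumes an instance counts toward TSR only when all four checks pass, and that "normalized" means dividing the trapezoidal area by the noise range. The function names are hypothetical.

```python
def tsr(results):
    """Fraction of instances for which every check passed.

    Each result is a tuple of booleans:
    (tool_correct, json_valid, schema_valid, args_exact).
    Requiring all four is an assumption for this sketch.
    """
    return sum(all(r) for r in results) / len(results)


def robustness_auc(noise_levels, tsr_values):
    """Trapezoidal area under TSR vs. noise level, divided by the noise range.

    With this normalization, a constant TSR of 1.0 yields an AUC of 1.0.
    """
    pts = sorted(zip(noise_levels, tsr_values))
    area = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )
    return area / (pts[-1][0] - pts[0][0])


# Made-up values: two instances at one noise level, then a three-point curve.
rate = tsr([(True, True, True, True), (True, True, False, True)])
auc = robustness_auc([0.0, 0.5, 1.0], [1.0, 0.5, 0.0])
print(rate, auc)
```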

Reproducibility & provenance

Every run writes a run.json in its run directory (example: results/runs/ollama_llama31_8b_q4_1_v0_seed1337/run.json)
Dataset and config SHA256 hashes are recorded in the run provenance
Seeds are captured and injected into the Python and NumPy RNGs
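
A minimal sketch of what such a provenance record could look like, using only the standard library. The field names (`seed`, `dataset_sha256`, `config_sha256`) and the helper names are assumptions for illustration; the actual run.json schema may differ, and NumPy seeding is omitted here to keep the example dependency-free.

```python
import hashlib
import json
import random
import tempfile
from pathlib import Path


def sha256_file(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_provenance(run_dir, dataset_path, config_path, seed):
    """Seed Python's RNG and write run.json (hypothetical field names)."""
    random.seed(seed)  # a NumPy-based harness would also seed NumPy here
    record = {
        "seed": seed,
        "dataset_sha256": sha256_file(dataset_path),
        "config_sha256": sha256_file(config_path),
    }
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "run.json").write_text(json.dumps(record, indent=2))
    return record


# Demo with throwaway files standing in for a dataset and a config.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "sample.jsonl"
    config = Path(tmp) / "config.yaml"
    dataset.write_bytes(b'{"id": 1}\n')
    config.write_bytes(b"seed: 1337\n")
    record = write_provenance(Path(tmp) / "run", dataset, config, seed=1337)
    saved = json.loads((Path(tmp) / "run" / "run.json").read_text())
```

Hashing the exact bytes of the dataset and config lets a later reader confirm that a reported number came from the same inputs.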

Lockfile strategy

Conda: use conda-lock to pin a reproducible environment snapshot.
pip: use pip-compile (pip-tools) to generate a requirements.lock.txt.

LLM baseline setup (OpenAI)

Create an API key at https://platform.openai.com/.
Export the key: export OPENAI_API_KEY=sk-REDACTED
Run: make run_baseline_openai

LLM baseline setup (Ollama)

Install and start the Ollama server: ollama serve
Pull the model: ollama pull llama3.1:8b-instruct-q4_1
Run: make run_baseline_ollama
If local inference is slow, use make run_baseline_ollama_fast first (50 instances) and then run the full baseline overnight.
The Ollama configs enable resume: true, so rerunning continues from the last completed instance in the same run_id directory.
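
Resuming can be pictured as skipping instances that already have a stored prediction. The sketch below is an assumption about how such a mechanism might work, not the harness's actual logic: the `predictions.jsonl` file name, the `id` field, and both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def completed_ids(predictions_path):
    """Collect the instance IDs already present in a JSONL predictions file."""
    path = Path(predictions_path)
    if not path.exists():
        return set()
    with path.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def run_with_resume(instances, predictions_path, predict):
    """Append one JSON line per new instance; return how many were skipped."""
    done = completed_ids(predictions_path)
    with open(predictions_path, "a") as f:
        for inst in instances:
            if inst["id"] in done:
                continue
            row = {"id": inst["id"], "output": predict(inst)}
            f.write(json.dumps(row) + "\n")
            f.flush()  # limit loss to the in-flight instance on interruption
    return len(done)


# Demo: an "interrupted" run over two instances, then a rerun over all three.
calls = []


def fake_predict(inst):
    calls.append(inst["id"])  # record which instances reach the model
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "predictions.jsonl"
    data = [{"id": i} for i in range(3)]
    skipped_first = run_with_resume(data[:2], out, fake_predict)
    skipped_second = run_with_resume(data, out, fake_predict)
```

In the demo, the rerun sends only the third instance to the model, which is the behavior that makes an overnight run safe to restart.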

Ablations (Table 2)

Run the ablation suite (heuristic baseline, fast):

make run_ablations

Outputs:

results/ablations/ablation_no_schema_validation_seed1337/ (example)
results/tables/table2.csv

Figures

Render robustness curves + failure taxonomy:

make plot_figures

Outputs:

results/figures/robustness_curves.png
results/figures/failure_taxonomy.png

Core idea (benchmark structure)

Each instance contains:

natural language instruction
tool registry (function name + JSON schema + docstring)
expected tool call(s) and/or expected post-tool answer
perturbation metadata (if any)
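
As an illustration of the four parts listed above, a single instance might look like the following when written as a Python dictionary and serialized to one JSONL line. Every field name and value here is hypothetical; the authoritative format is whatever the repository's schemas and dataset card define.

```python
import json

# Hypothetical instance: all keys and values are assumptions for illustration.
instance = {
    "id": "sample-0001",
    "instruction": "What is the weather in Paris in celsius?",
    "tools": [
        {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {  # JSON Schema for the arguments
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        }
    ],
    "expected_calls": [
        {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
    ],
    "perturbation": {"type": "none", "level": 0.0},
}

line = json.dumps(instance, sort_keys=True)  # one instance per JSONL line
restored = json.loads(line)
```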

RFCN extends standard function-calling benchmarks (e.g., BFCL) by introducing controlled, reproducible noise and reporting robustness curves (performance vs. noise level), plus failure taxonomies.
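
To show what "controlled, reproducible noise" can mean in practice, here is a minimal sketch of a seeded typo perturbation whose strength is set by a noise level in [0, 1]. It is not taken from `src/rfcn/perturbations.py`; the function name `typo_noise` and the adjacent-swap rule are assumptions chosen for illustration.

```python
import random


def typo_noise(text, level, seed):
    """Swap adjacent letters with probability `level`, using a seeded RNG.

    The same (text, level, seed) always yields the same output, and the
    characters are only reordered, never added or removed.
    """
    rng = random.Random(seed)  # local RNG, so global state is untouched
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < level:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


clean = "book a table for two"
print(typo_noise(clean, 0.0, seed=1337))  # level 0 leaves the text unchanged
print(typo_noise(clean, 0.3, seed=1337))
print(typo_noise("ab cd", 1.0, seed=0))   # prints "ba dc"
```

Sweeping `level` over a fixed grid with a fixed seed gives the x-axis of a robustness curve: each point is the benchmark re-scored on the same instances at a known noise strength.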

Citation

If you publish using RFCN-Bench, please use CITATION.cff for citation metadata.

License

Code: Apache-2.0 (see LICENSE)
Dataset: CC BY 4.0 for the sample dataset (see docs/dataset_card.md).

