TateLyman/rfcn-bench

RFCN-Bench: Robust Function Calling under Noise

Purpose. RFCN-Bench is a research-instrument repository for building a benchmark and measurement study of tool/function-calling robustness in LLMs and agent scaffolds. It is designed to be:

  • Measurable: explicit metrics + deterministic evaluation harness
  • Reproducible: one-command “Reproduce Table 1” pipeline (sample dataset included)
  • Extensible: modular perturbation suites, providers, and baselines
  • Publishable as an artifact: dataset + analysis + harness, even before new methods

This repository is minimal but academically submission-ready: pinned env, configs, provenance logging, tests, CI, dataset/model cards, and a paper skeleton.

Maintainer

Tate Lyman (Independent Researcher, [email protected])

Quickstart (local, no API keys required)

1) Create environment

Conda

conda env create -f environment.yml
conda activate rfcn

(Alternative) pip

python -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"

2) Build the sample dataset + run baselines + aggregate Table 1

make reproduce_table1

Outputs:

  • data/processed/rfcn_sample_v0.2.jsonl
  • results/runs/ollama_llama31_8b_q4_1_v0_seed1337/run.json (provenance example)
  • results/runs/ollama_llama31_8b_q4_1_v0_seed1337/metrics.json (example)
  • results/tables/table1_runs.csv (per-run)
  • results/tables/table1.csv (grouped means + 95% CI)
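
The "grouped means + 95% CI" step can be illustrated with a minimal sketch. It assumes per-run rows carry hypothetical `baseline` and `tsr` fields and uses a normal approximation (z = 1.96); the repository's aggregation script may use different column names or a different interval (for example a t-interval or a bootstrap).

```python
import math
import statistics
from collections import defaultdict


def mean_ci95(values):
    """Return (mean, half_width) of a 95% CI under a normal approximation."""
    n = len(values)
    mean = statistics.fmean(values)
    if n < 2:
        return mean, 0.0  # a single run gives no spread estimate
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return mean, 1.96 * sem


def group_runs(rows, key="baseline", metric="tsr"):
    """Group per-run rows by `key` and summarize `metric` per group.

    The field names are hypothetical, chosen for this illustration only.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[metric]))
    return {name: mean_ci95(vals) for name, vals in groups.items()}


# Toy per-run rows (made-up values, not results from the benchmark).
rows = [
    {"baseline": "heuristic", "tsr": 0.60},
    {"baseline": "heuristic", "tsr": 0.70},
    {"baseline": "oracle", "tsr": 1.0},
]
summary = group_runs(rows)
```

With only a handful of seeds per group, a t-based interval would be wider than this approximation, so treat the sketch as a lower bound on the reported uncertainty.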

What to edit first (recommended order)

  1. research_contract.md (single source of truth for claims & eval)
  2. configs/ (every run is a YAML config)
  3. scripts/build_dataset.py (your dataset construction rules)
  4. src/rfcn/perturbations.py (noise suites)
  5. paper/ (write from day 1)

Baselines included in the sample

  • configs/baseline_heuristic.yaml — deterministic regex tool caller (sanity check)
  • configs/baseline_random.yaml — random tool + args (lower bound)
  • configs/baseline_oracle.yaml — oracle upper bound (uses ground truth)
  • configs/baseline_openai.yaml — OpenAI tool-calling baseline (requires API key)
  • configs/baseline_ollama.yaml — Ollama local baseline (e.g., Llama 3.1 8B)

Metrics (from scripts/run_baseline.py)

  • TSR: tool correctness + JSON validity + schema validity + args exact
  • Robustness AUC: area under TSR vs. noise level curve (normalized)
  • Efficiency: wall time, samples/sec; optional tokens/calls if provided by a model
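
As a rough illustration of the first two metrics, the sketch below scores one model output and integrates a TSR curve. It rests on assumptions not confirmed by this README: that the four TSR components are combined as a conjunction, that the model emits a JSON object with `name` and `arguments` keys, that the schema check can be reduced to required keys, and that the AUC is normalized by the noise range. The function names are hypothetical and are not taken from scripts/run_baseline.py.

```python
import json


def is_success(raw_output, expected, required_keys):
    """Return True only if all four checks pass (assumed conjunction)."""
    try:
        call = json.loads(raw_output)  # JSON validity
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    if call.get("name") != expected["name"]:  # tool correctness
        return False
    args = call.get("arguments")
    # Simplified stand-in for schema validation: required keys must be present.
    if not isinstance(args, dict) or not set(required_keys).issubset(args):
        return False
    return args == expected["arguments"]  # exact argument match


def robustness_auc(noise_levels, tsr_values):
    """Trapezoidal area under TSR vs. noise level, divided by the noise range."""
    pts = sorted(zip(noise_levels, tsr_values))
    area = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )
    return area / (pts[-1][0] - pts[0][0])


expected = {"name": "get_weather", "arguments": {"city": "Paris"}}
ok = is_success('{"name": "get_weather", "arguments": {"city": "Paris"}}',
                expected, ["city"])
auc = robustness_auc([0.0, 0.5, 1.0], [1.0, 0.8, 0.4])
```

Under this normalization a model whose TSR stays at 1.0 across all noise levels scores an AUC of 1.0, which makes curves with different noise ranges comparable.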

Reproducibility & provenance

  • Every run writes a run.json in its run directory (example: results/runs/ollama_llama31_8b_q4_1_v0_seed1337/run.json)
  • Dataset and config SHA256 hashes are recorded in the run provenance
  • Seeds are captured and injected into Python/NumPy RNGs
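
The provenance steps above can be sketched as follows. This is a minimal illustration using only the standard library; the record keys (`seed`, `dataset_sha256`, `config_sha256`) and the function names are assumptions, and the actual run.json written by the harness may contain more fields or use different names.

```python
import hashlib
import json
import random
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_provenance(run_dir, dataset_path, config_path, seed):
    """Seed the RNG and write a run.json with input hashes (hypothetical layout)."""
    # A NumPy-based harness would also call numpy.random.seed(seed) here.
    random.seed(seed)
    record = {
        "seed": seed,
        "dataset_sha256": sha256_file(dataset_path),
        "config_sha256": sha256_file(config_path),
    }
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "run.json").write_text(json.dumps(record, indent=2))
    return record
```

Recording the hashes alongside the seed lets a later reader confirm that a reported number came from exactly the dataset file and config that are checked in.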

Lockfile strategy

  • Conda: use conda-lock to pin a reproducible environment snapshot.
  • pip: use pip-compile (pip-tools) to generate a requirements.lock.txt.

LLM baseline setup (OpenAI)

  1. Create an API key at https://platform.openai.com/.
  2. Export the key: export OPENAI_API_KEY=sk-REDACTED
  3. Run: make run_baseline_openai

LLM baseline setup (Ollama)

  1. Install and start the Ollama server: ollama serve
  2. Pull the model: ollama pull llama3.1:8b-instruct-q4_1
  3. Run: make run_baseline_ollama
  4. If local inference is slow, use make run_baseline_ollama_fast first (50 instances) and then run the full baseline overnight.
  5. The Ollama configs enable resume: true, so rerunning continues from the last completed instance in the same run_id directory.
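
One way resume behavior like this can work is sketched below. It assumes predictions are appended one JSON object per line to a file in the run directory and that each record carries an `id`; the file layout, field names, and function names are hypothetical and may differ from what the harness does when `resume: true` is set.

```python
import json
from pathlib import Path


def completed_ids(predictions_path):
    """Collect ids of instances already written to the predictions file."""
    path = Path(predictions_path)
    if not path.exists():
        return set()
    done = set()
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:  # assumes every non-empty line is a complete JSON record
                done.add(json.loads(line)["id"])
    return done


def run_with_resume(instances, predict, predictions_path):
    """Run `predict` only on instances that have no saved record yet."""
    done = completed_ids(predictions_path)
    with open(predictions_path, "a") as fh:
        for inst in instances:
            if inst["id"] in done:
                continue  # finished in an earlier invocation
            record = {"id": inst["id"], "output": predict(inst)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # keep progress on disk if the run is interrupted
```

Flushing after every instance trades a little throughput for the ability to stop an overnight run at any point without losing completed work.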

Ablations (Table 2)

Run the ablation suite (heuristic baseline, fast):

make run_ablations

Outputs:

  • results/ablations/ablation_no_schema_validation_seed1337/ (example)
  • results/tables/table2.csv

Figures

Render robustness curves + failure taxonomy:

make plot_figures

Outputs:

  • results/figures/robustness_curves.png
  • results/figures/failure_taxonomy.png

Core idea (benchmark structure)

Each instance contains:

  • natural language instruction
  • tool registry (function name + JSON schema + docstring)
  • expected tool call(s) and/or expected post-tool answer
  • perturbation metadata (if any)
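
For orientation, a single instance with these four parts might look like the sketch below. Every field name and value here is hypothetical and chosen for illustration; the authoritative layout is whatever scripts/build_dataset.py writes to the JSONL file.

```python
import json

# Hypothetical instance: field names are illustrative, not the dataset's schema.
instance = {
    "id": "sample-0001",
    "instruction": "What is the weather in Paris in celsius?",
    "tools": [
        {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {  # JSON Schema for the tool arguments
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        }
    ],
    "expected_calls": [
        {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
    ],
    "perturbation": {"type": "none", "level": 0.0},
}

# One instance per line is the usual JSONL convention.
line = json.dumps(instance)
```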

RFCN extends standard function-calling benchmarks (e.g., BFCL) by introducing controlled, reproducible noise and reporting robustness curves (performance vs. noise level), plus failure taxonomies.
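
To make "controlled, reproducible noise" concrete, here is a minimal sketch of one possible perturbation: swapping adjacent letters with a probability equal to the noise level, driven by a seeded generator. It is an illustrative stand-in and not one of the suites in src/rfcn/perturbations.py; the function name and the choice of character swaps are assumptions.

```python
import random


def swap_noise(text, level, seed):
    """Swap adjacent letters with probability `level`, using a seeded RNG.

    The same (text, level, seed) triple always yields the same output,
    which is what makes the perturbation reproducible.
    """
    rng = random.Random(seed)  # local generator, leaves global RNG state alone
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < level:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so a letter moves at most once
        else:
            i += 1
    return "".join(chars)


clean = "Book a table for two at seven"
noisy = swap_noise(clean, level=0.3, seed=1337)
```

Sweeping `level` over a fixed grid with a fixed seed yields the x-axis of a robustness curve, with level 0 returning the clean instruction unchanged.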

Citation

If you publish using RFCN-Bench, please use CITATION.cff for citation metadata.

License

  • Code: Apache-2.0 (see LICENSE)
  • Dataset: CC BY 4.0 for the sample dataset (see docs/dataset_card.md).

About

Research instrument and baseline evaluation pipeline for RFCN-Bench (Python 3.11, conda-forge).