Description
🎯 Motivation
flexeval currently writes all model responses and their aggregated evaluation metrics to a single outputs.jsonl file only after the entire evaluation run completes. If the process crashes or is manually interrupted (e.g. due to network hiccups, provider rate-limit errors, or OOM), all partially computed data is lost and the experiment must be restarted from scratch. This is costly when:
- Hundreds/thousands of prompts are being scored
- We rely on paid API calls
- Evaluations run for hours/days
Reliable incremental persistence + resume would dramatically improve UX and resource efficiency.
🛠️ Desired behavior
- Incremental checkpointing
  - After each batch of prompts, append the individual results to `outputs.jsonl`.
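A minimal sketch of what incremental checkpointing could look like, assuming each result is a plain dict with an `id` field; `append_result` is a hypothetical helper, not existing flexeval API, and flexeval's actual record schema may differ:

```python
import json
import os


def append_result(path: str, record: dict) -> None:
    """Append one evaluation record as a JSON line and flush it to disk.

    Flushing and fsync-ing after each write means a crash loses at most
    the line currently being written, never earlier results.
    """
    line = json.dumps(record, ensure_ascii=False)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the appended line onto stable storage
```

Appending one line per result keeps the file valid JSON Lines at every point in time, so a partially written run remains parseable.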
- Automatic resumable runs
  - If `outputs.jsonl` exists and `metrics.json` is absent, the previous evaluation is deemed incomplete.
    - Parse `outputs.jsonl` to collect the IDs (or line positions) of prompts that are already finished.
    - Skip those prompts and continue the evaluation with the remaining inputs, preserving their original order.
    - Append new results to the same `outputs.jsonl`, using atomic writes to prevent corruption.
  - If both files are present, assume the run finished successfully and start a fresh evaluation unless the user explicitly opts to overwrite (e.g. with a `--force` flag).
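The resume decision above could be sketched as follows. `plan_resume`, the per-prompt `"id"` key, and the file names passed in are illustrative assumptions, not existing flexeval API:

```python
import json
import os


def plan_resume(output_path: str, metrics_path: str,
                inputs: list, force: bool = False) -> list:
    """Return the inputs that still need evaluating, in original order.

    - No outputs.jsonl, or --force given: run everything fresh.
    - outputs.jsonl and metrics.json both present: previous run finished,
      so start a fresh evaluation.
    - outputs.jsonl present but metrics.json absent: previous run was
      incomplete, so skip prompts whose IDs already appear in the file.
    """
    if force or not os.path.exists(output_path):
        return list(inputs)
    if os.path.exists(metrics_path):
        return list(inputs)

    done = set()
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                continue  # tolerate a torn final line from a crash
    return [x for x in inputs if x["id"] not in done]
```

Tolerating a malformed trailing line matters because a crash mid-write can leave a truncated record at the end of the file; that prompt is simply re-run.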