[Feature Request] Add Robust Retry & Checkpointing Functionality #216

@ryokan0123

Description

🎯 Motivation

flexeval currently writes all model responses to outputs.jsonl and the aggregated evaluation metrics to metrics.json only after the entire evaluation run completes. If the process crashes or is interrupted (e.g. by network hiccups, provider rate-limit errors, OOM, or a manual stop), all partially computed results are lost and the experiment must be restarted from scratch. This is especially costly when:

  • Hundreds/thousands of prompts are being scored
  • We rely on paid API calls
  • Evaluations run for hours/days

Reliable incremental persistence + resume would dramatically improve UX and resource efficiency.

🛠️ Desired behavior

  1. Incremental checkpointing

    • After each batch of prompts, append the individual results to outputs.jsonl (a minimal sketch of this follows the list).
  2. Automatic resumable runs

    • If outputs.jsonl exists and metrics.json is absent, the previous evaluation is deemed incomplete. In that case (see the second sketch below):
      • Parse outputs.jsonl to collect the IDs (or line positions) of prompts that are already finished.
      • Skip those prompts and continue the evaluation with the remaining inputs, preserving their original order.
      • Append new results to the same outputs.jsonl, using atomic writes to prevent corruption.
    • If both files are present, assume the run finished successfully and start a fresh evaluation unless the user explicitly opts to overwrite (e.g. with a --force flag).
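As a rough illustration of item 1, incremental checkpointing could be as simple as the sketch below. This is a hypothetical helper, not existing flexeval code; the name `append_results` and the record layout are placeholders:

```python
import json
import os

def append_results(path: str, records: list[dict]) -> None:
    """Append one JSON object per line to outputs.jsonl and force it
    to disk, so a crash loses at most the batch currently in flight."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the lines survive a hard crash
```

Because each record is a self-contained line, a partially written final line (from a crash mid-write) can simply be detected and dropped on resume.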
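For item 2, resume detection could work roughly like this sketch. It assumes each JSONL record carries an "id" field; `load_finished_ids` and `remaining_prompts` are hypothetical names, not part of flexeval's API:

```python
import json
import os

def load_finished_ids(output_path: str) -> set:
    """Collect the IDs already recorded in outputs.jsonl, ignoring a
    possibly half-written final line left behind by a crash."""
    finished = set()
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            try:
                finished.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                continue  # truncated tail line: re-run that prompt instead
    return finished

def remaining_prompts(prompts: list[dict],
                      output_path: str = "outputs.jsonl",
                      metrics_path: str = "metrics.json") -> list[dict]:
    """Return the prompts that still need to be evaluated."""
    if os.path.exists(output_path) and not os.path.exists(metrics_path):
        # Incomplete previous run: skip finished prompts while
        # preserving the original input order.
        finished = load_finished_ids(output_path)
        return [p for p in prompts if p["id"] not in finished]
    # No previous run, or a completed one: evaluate everything
    # (the overwrite policy for completed runs, e.g. --force, lives elsewhere).
    return prompts
```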
