Description
🎯 Motivation
flexeval currently writes all model responses and their aggregated evaluation metrics to a single outputs.jsonl file only after the entire evaluation run completes. If the process crashes or is manually interrupted (e.g. due to network hiccups, provider rate-limit errors, or OOM), all partially computed data is lost and the experiment must be restarted from scratch. This is costly when:
- Hundreds/thousands of prompts are being scored
- We rely on paid API calls
- Evaluations run for hours/days
Reliable incremental persistence + resume would dramatically improve UX and resource efficiency.
🛠️ Desired behavior
- Incremental checkpointing
  - After each batch of prompts, append the individual results to `outputs.jsonl`.
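A minimal sketch of what incremental checkpointing could look like, assuming each result is a plain dict with an `id` field; `append_result` is a hypothetical helper, not existing flexeval API, and flexeval's actual record schema may differ:

```python
import json
import os


def append_result(path: str, record: dict) -> None:
    """Append one evaluation record as a JSON line and flush it to disk.

    Flushing and fsync-ing after each write means a crash loses at most
    the line currently being written, never earlier results.
    """
    line = json.dumps(record, ensure_ascii=False)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the appended line onto stable storage
```

Appending one line per result keeps the file valid JSON Lines at every point in time, so a partially written run remains parseable.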
- Automatic resumable runs
  - If `outputs.jsonl` exists and `metrics.json` is absent, the previous evaluation is deemed incomplete.
    - Parse `outputs.jsonl` to collect the IDs (or line positions) of prompts that are already finished.
    - Skip those prompts and continue the evaluation with the remaining inputs, preserving their original order.
    - Append new results to the same `outputs.jsonl`, using atomic writes to prevent corruption.
  - If both files are present, assume the run finished successfully and start a fresh evaluation unless the user explicitly opts to overwrite (e.g. with a `--force` flag).
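The resume decision above could be sketched as follows. `plan_resume`, the per-prompt `"id"` key, and the file names passed in are illustrative assumptions, not existing flexeval API:

```python
import json
import os


def plan_resume(output_path: str, metrics_path: str,
                inputs: list, force: bool = False) -> list:
    """Return the inputs that still need evaluating, in original order.

    - No outputs.jsonl, or --force given: run everything fresh.
    - outputs.jsonl and metrics.json both present: previous run finished,
      so start a fresh evaluation.
    - outputs.jsonl present but metrics.json absent: previous run was
      incomplete, so skip prompts whose IDs already appear in the file.
    """
    if force or not os.path.exists(output_path):
        return list(inputs)
    if os.path.exists(metrics_path):
        return list(inputs)

    done = set()
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                continue  # tolerate a torn final line from a crash
    return [x for x in inputs if x["id"] not in done]
```

Tolerating a malformed trailing line matters because a crash mid-write can leave a truncated record at the end of the file; that prompt is simply re-run.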