
Add job resubmission feature for failed datasets #424

Draft
Copilot wants to merge 4 commits into main from copilot/add-job-resubmission-feature



Copilot AI commented Nov 4, 2025

Implements automatic tracking and selective resubmission of failed jobs when using --process-separately. Failed dataset/group names are saved to failed_jobs.json for reprocessing without re-running successful jobs.

Changes

  • pocket_coffea/utils/utils.py: Added save_failed_jobs() and load_failed_jobs() functions with error handling for I/O and JSON parsing
  • pocket_coffea/scripts/runner.py:
    • New --resubmit-failed CLI flag (requires --process-separately)
    • Track failed jobs during processing and save to failed_jobs.json
    • Filter filesets to only failed groups/datasets when resubmitting
  • tests/test_runner.py: Added tests for tracking behavior and validation requirements
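The PR lists save_failed_jobs() and load_failed_jobs() in utils.py but does not show their code. The following is a minimal sketch of how such helpers could look, assuming they take the output directory as their first argument (the signatures, the FAILED_JOBS_FILENAME constant, and the "return [] on any read problem" policy are assumptions, not the PR's actual implementation). It writes the plain JSON array format described in this PR and treats a missing or malformed file as "nothing to resubmit".

```python
import json
import logging
import os
from typing import List, Optional

logger = logging.getLogger(__name__)

# File name taken from the PR description; the constant itself is hypothetical.
FAILED_JOBS_FILENAME = "failed_jobs.json"


def save_failed_jobs(outputdir: str, failed_jobs: List[str]) -> Optional[str]:
    """Write failed dataset/group names to <outputdir>/failed_jobs.json.

    Hypothetical signature. Returns the file path, or None if writing failed.
    """
    path = os.path.join(outputdir, FAILED_JOBS_FILENAME)
    # Drop duplicates while keeping the original order of failures.
    names = list(dict.fromkeys(failed_jobs))
    try:
        os.makedirs(outputdir, exist_ok=True)
        with open(path, "w") as f:
            json.dump(names, f)  # e.g. ["dataset_A", "group_XYZ"]
    except OSError as err:
        logger.error("Could not write %s: %s", path, err)
        return None
    return path


def load_failed_jobs(outputdir: str) -> List[str]:
    """Read failed_jobs.json; return [] if it is missing, unreadable, or malformed.

    Hypothetical signature.
    """
    path = os.path.join(outputdir, FAILED_JOBS_FILENAME)
    if not os.path.exists(path):
        return []
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError) as err:
        logger.error("Could not read %s: %s", path, err)
        return []
    # Only accept the documented format: a JSON array of strings.
    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        logger.error("%s is not a JSON array of names", path)
        return []
    return data
```

Returning an empty list instead of raising keeps a resubmission run from crashing on a corrupted file; a stricter design could raise instead so that the user notices the problem.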

Usage

# Initial run - failed jobs automatically saved
pocket-coffea run --cfg config.py -o output/ --process-separately
# WARNING: 2 job(s) failed. Failed jobs saved to output/failed_jobs.json

# Resubmit only failed jobs
pocket-coffea run --cfg config.py -o output/ --process-separately --resubmit-failed
# INFO: Resubmitting 2 failed jobs: ['dataset_A', 'group_XYZ']
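The warning in the first run comes from the per-job "try and log error" logic used with --process-separately. A minimal sketch of that pattern follows; run_separately and process_one are hypothetical stand-ins for the runner's loop and for whatever actually runs the processor on one dataset/group, not code from runner.py. The caller would pass the returned failed list to save_failed_jobs().

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger(__name__)


def run_separately(
    filesets: Dict[str, dict],
    process_one: Callable[[str, dict], object],  # hypothetical per-job runner
) -> Tuple[Dict[str, object], List[str]]:
    """Process each dataset/group independently and collect the names that fail."""
    outputs: Dict[str, object] = {}
    failed: List[str] = []
    for name, fileset in filesets.items():
        try:
            outputs[name] = process_one(name, fileset)
        except Exception as err:
            # Log and move on, so one bad dataset does not stop the others.
            logger.error("Job %s failed: %s", name, err)
            failed.append(name)
    if failed:
        logger.warning("%d job(s) failed.", len(failed))
    return outputs, failed
```

Catching the broad Exception is deliberate here: the goal is to isolate any per-dataset failure, while successful outputs are still returned.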

The failed_jobs.json file contains a plain JSON array of the failed dataset/group names:

["dataset_A", "group_XYZ"]
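When resubmitting, the fileset dictionary is restricted to the names in this array, and --resubmit-failed is rejected unless --process-separately is also given. A minimal sketch of both steps follows, assuming the fileset dictionary is keyed directly by the same dataset/group names stored in failed_jobs.json. The helper names filter_failed and check_resubmit_flags are hypothetical.

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


def check_resubmit_flags(process_separately: bool, resubmit_failed: bool) -> None:
    """Hypothetical check: --resubmit-failed only works with --process-separately."""
    if resubmit_failed and not process_separately:
        raise ValueError("--resubmit-failed requires --process-separately")


def filter_failed(filesets: Dict[str, dict], failed_jobs: List[str]) -> Dict[str, dict]:
    """Keep only the fileset entries whose key is listed in failed_jobs."""
    missing = [name for name in failed_jobs if name not in filesets]
    if missing:
        # Names saved earlier that the current configuration no longer defines.
        logger.warning("Not found in filesets, skipping: %s", missing)
    wanted = set(failed_jobs)
    return {name: fs for name, fs in filesets.items() if name in wanted}
```

Iterating over filesets (not over failed_jobs) keeps the original processing order and silently ignores duplicate names in the JSON file.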
Original prompt

This section details the original issue you should resolve

<issue_title>Job resubmission feature</issue_title>
<issue_description>It would be nice to have a feature that exploits the "try and log error" logic (in conjunction with the --process-separately functionality) recently implemented in the runner.py script in order to save a json file failed_jobs.json in the output folder containing the names of the datasets or groups that have failed.
An argument --resubmit-failed can be introduced in the runner such that, if the failed_jobs.json file exists, the processor is run again, but only on the failed datasets/groups listed in the json file.
In other words, --resubmit-failed should filter the filesets dictionary, keeping only the keys of the failed datasets.</issue_description>

<agent_instructions>You should implement the new feature described in the issue and provide a PR.
You should modify as few files as possible: modify runner.py first. If needed, you can define external functions of general interest in utils/ to reduce the amount of code in runner.py.</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits November 4, 2025 21:44
Co-authored-by: mmarchegiani <37423409+mmarchegiani@users.noreply.github.com>
Co-authored-by: mmarchegiani <37423409+mmarchegiani@users.noreply.github.com>
Co-authored-by: mmarchegiani <37423409+mmarchegiani@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add job resubmission feature to runner" to "Add job resubmission feature for failed datasets" on Nov 4, 2025
Copilot AI requested a review from mmarchegiani November 4, 2025 21:53

valsdav commented Dec 15, 2025

@mmarchegiani should we merge this?



Development

Successfully merging this pull request may close these issues.

Job resubmission feature

3 participants