How many times do we retry a single job, on average #141

@jenimal

WMCore retries jobs multiple times to work around transient issues within the system.
I can't find any monitoring that tells me how many times a job gets resubmitted on average. The vast majority of jobs obviously succeed in the end, but are they succeeding on the 1st try? The 3rd try?
The ability to break this down by site and exit code would also be useful, so we can monitor which sites are causing more retries than others and for what reasons. This would help us better understand the effectiveness of each site: if a particular site stands out as always needing many retries, it may have an undetected issue that is affecting efficiency.
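
To make the request concrete, here is a rough sketch of the numbers I have in mind. It assumes one record per job attempt with hypothetical `job_id`, `site`, and `exit_code` fields, and that an exit code of 0 means success. The field names, site names, and exit codes below are made up for illustration and are not taken from WMCore or any existing monitoring source.

```python
from collections import Counter
from statistics import mean

# Hypothetical input: one record per job attempt (not a real WMCore schema).
attempts = [
    {"job_id": "j1", "site": "SITE_A", "exit_code": 1},
    {"job_id": "j1", "site": "SITE_A", "exit_code": 1},
    {"job_id": "j1", "site": "SITE_B", "exit_code": 0},
    {"job_id": "j2", "site": "SITE_B", "exit_code": 0},
    {"job_id": "j3", "site": "SITE_A", "exit_code": 2},
    {"job_id": "j3", "site": "SITE_A", "exit_code": 0},
]


def attempts_per_job(records):
    """Count how many attempts each job needed."""
    return Counter(r["job_id"] for r in records)


def mean_attempts(records):
    """Average number of attempts per job (0.0 if there are no records)."""
    counts = attempts_per_job(records)
    return mean(counts.values()) if counts else 0.0


def first_try_fraction(records):
    """Fraction of jobs that needed exactly one attempt."""
    counts = attempts_per_job(records)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n == 1) / len(counts)


def failures_by_site_and_code(records):
    """Count failed attempts per (site, exit_code), assuming 0 means success."""
    return Counter(
        (r["site"], r["exit_code"]) for r in records if r["exit_code"] != 0
    )


print(mean_attempts(attempts))
print(first_try_fraction(attempts))
print(failures_by_site_and_code(attempts).most_common())
```

Sorting the last breakdown by count would put the sites and exit codes responsible for the most retries at the top, which is the view I'm after.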

Jen
