Gracefully handle HTCondor Collector restarts #376

@maxfischer2781

Description

When an HTCondor Collector restarts, it loses all information on daemons and only gradually re-adds them as updates are received. This creates a window of at least 10 min (we observed up to 1 h in production) until all daemons are known again.

This means that when TARDIS goes looking for drones, they might be alive but not yet known to a freshly restarted Collector. Consequently, TARDIS can kill drones because it thinks they are gone when in fact the Collector has only temporarily "forgotten" about them.

As far as I can tell, the only remedy is to add a delay at some point when querying this information. Either a drone must be unknown for long enough to count as dead, or we must wait for collectors to live long enough to consider them trustworthy.
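
To make the two options concrete, here is a rough sketch in plain Python. It is not TARDIS code and does not call any HTCondor API: `MissingDroneTracker`, `collector_is_trustworthy`, the parameter names, and the 1 h defaults are all hypothetical, chosen only to match the window described above. How the set of known drones and the Collector start time are obtained is left open.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class MissingDroneTracker:
    """Option 1 (hypothetical helper): a drone counts as dead only after
    it has been unknown to the Collector for at least `grace_period` seconds."""

    grace_period: float = 3600.0  # assumption: covers the observed 1 h window
    _missing_since: Dict[str, float] = field(default_factory=dict)

    def update(self, expected: Iterable[str], known: Set[str], now: float) -> Set[str]:
        """Return the drones that have been missing long enough to count as dead."""
        dead = set()
        for drone in expected:
            if drone in known:
                # Drone is visible again: forget any earlier "missing" timestamp.
                self._missing_since.pop(drone, None)
                continue
            # Remember when the drone was first seen missing.
            first_missing = self._missing_since.setdefault(drone, now)
            if now - first_missing >= self.grace_period:
                dead.add(drone)
        return dead


def collector_is_trustworthy(
    collector_start: float, now: float, min_uptime: float = 3600.0
) -> bool:
    """Option 2 (hypothetical helper): only trust a Collector's view once it
    has been running for at least `min_uptime` seconds."""
    return now - collector_start >= min_uptime
```

With option 1 the delay applies per drone, so a single lost drone is also cleaned up later than today. With option 2 the delay only applies after a Collector restart, but it requires knowing when the Collector started.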

Labels: enhancement (New feature or request)
