Open
Labels
enhancement: New feature or request
Description
When an HTCondor Collector restarts, it loses all information on daemons and only gradually re-adds them as updates are received. This creates a window of at least 10 min (we observed up to 1 h in production) until all daemons are known again.
This means that when TARDIS goes looking for drones, they might be alive but not known to a restarting collector. Consequently, TARDIS can kill drones because it thinks they are gone when in fact the Collector temporarily "forgot" about them.
As far as I can tell, the only remedy is to add a delay at some point when querying this information. Either a drone must be unknown for long enough to count as dead, or we must wait for collectors to live long enough to consider them trustworthy.
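To make the two options concrete, here is a minimal sketch in plain Python. It is not existing TARDIS code: the names `MissingDroneTracker`, `observe`, and `collector_is_trustworthy` are hypothetical, and the one-hour defaults are an assumption based only on the worst case we observed. How the collector start time would be obtained is left open.

```python
import time
from typing import Callable, Dict


class MissingDroneTracker:
    """Option 1: a drone counts as dead only after being unknown long enough.

    Hypothetical helper, not part of TARDIS or HTCondor.
    """

    def __init__(
        self,
        grace_period: float = 3600.0,  # assumed default: the 1 h worst case
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.grace_period = grace_period
        self._clock = clock
        # Maps drone id to the time it was first reported as unknown.
        self._first_missing: Dict[str, float] = {}

    def observe(self, drone_id: str, known_to_collector: bool) -> bool:
        """Record one query result; return True if the drone may be treated as dead."""
        now = self._clock()
        if known_to_collector:
            # The drone showed up again, so any pending countdown is reset.
            self._first_missing.pop(drone_id, None)
            return False
        first_missing = self._first_missing.setdefault(drone_id, now)
        return now - first_missing >= self.grace_period


def collector_is_trustworthy(
    collector_start: float, now: float, min_uptime: float = 3600.0
) -> bool:
    """Option 2: only trust a collector that has been up for at least min_uptime seconds."""
    return now - collector_start >= min_uptime
```

With the first option every drone is delayed individually, so a genuinely dead drone is also detected later. With the second option, queries against a freshly restarted collector would simply be skipped or treated as inconclusive.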