Description
Additional context
Please read about the job scheduler first: https://github.com/AbsaOSS/hyperdrive-trigger/wiki/How-the-scheduler-works
A PR for this issue already exists, see #415. The logic is fine, but the tests should not be commented out; they should be fixed if needed.
Describe the bug
Currently, the scheduler instance does not necessarily update its heartbeat in every iteration.
In JobScheduler, if runningAssignWorkflows is not completed, the heartbeat is not updated. If runningAssignWorkflows takes more time than the configured lagThreshold, the instance will be wrongly determined to be lagging behind and deactivated by another instance. An instance should only be deactivated if it isn't responding at all (e.g. due to network problems), but not if it's just under high load.
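To illustrate why a slow runningAssignWorkflows leads to deactivation, here is a minimal sketch of a lag check. The object name, the method signature and the example durations are assumptions made for illustration; the actual check in the project may look different.

```scala
import java.time.{Duration, Instant}

// Hypothetical sketch of a lag check; not the project's actual implementation.
object LagCheckSketch {
  // An instance counts as lagging when its last heartbeat is older than lagThreshold.
  def isLagging(lastHeartbeat: Instant, now: Instant, lagThreshold: Duration): Boolean =
    Duration.between(lastHeartbeat, now).compareTo(lagThreshold) > 0
}
```

With an assumed lagThreshold of 20 seconds, an instance whose last heartbeat is 30 seconds old because it is still waiting for runningAssignWorkflows would be flagged as lagging, even though it is alive.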
To Reproduce
This issue is hard to reproduce, as in practice it only occurs if the database connection is very slow. For testing purposes, it could be reproduced by instrumenting the code, e.g. by adding an application property that makes the WorkflowBalancer.getAssignedWorkflows method sleep for 5 seconds.
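The instrumentation could look roughly like the sketch below. The ArtificialDelay helper and the property name are hypothetical, and a JVM system property stands in for the application property; this is for testing only.

```scala
import scala.concurrent.{ExecutionContext, Future, blocking}
import scala.util.Try

// Hypothetical test-only helper; the names here are assumptions, not part of the project.
object ArtificialDelay {
  // Reads the delay in milliseconds from a JVM system property; 0 if unset or invalid.
  def configuredMillis(propertyName: String): Long =
    sys.props.get(propertyName)
      .flatMap(value => Try(value.trim.toLong).toOption)
      .filter(_ > 0)
      .getOrElse(0L)

  // Runs `body` only after sleeping for `millis`; with 0 the body runs immediately.
  def delayed[T](millis: Long)(body: => Future[T])(implicit ec: ExecutionContext): Future[T] =
    if (millis <= 0) body
    else Future(blocking(Thread.sleep(millis))).flatMap(_ => body)
}
```

Wrapping the body of getAssignedWorkflows in `ArtificialDelay.delayed(...)` with a delay of 5000 ms would make the assignment outlast one scheduler iteration.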
Expected behavior
The scheduler instance heartbeat should be written to the database every 5 seconds, even if runningAssignWorkflows is not finished yet.
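A minimal sketch of the expected behavior follows. SchedulerSketch is a hypothetical, simplified stand-in for JobScheduler: the heartbeat update and the assignment are passed in as plain functions, and the loop is assumed to be driven externally on a fixed interval.

```scala
import scala.concurrent.{Future, Promise}

// Hypothetical, simplified stand-in for the scheduler loop; not the project's actual JobScheduler.
final class SchedulerSketch(updateHeartbeat: () => Unit, assignWorkflows: () => Future[Unit]) {
  // Starts out completed so that the first iteration kicks off an assignment.
  private var runningAssignWorkflows: Future[Unit] = Future.successful(())

  // One scheduler iteration, assumed to be invoked on a fixed interval (e.g. every 5 seconds).
  def iterate(): Unit = {
    // The heartbeat is written unconditionally, before looking at the assignment state.
    updateHeartbeat()
    // A new assignment is only started once the previous one has finished.
    if (runningAssignWorkflows.isCompleted) {
      runningAssignWorkflows = assignWorkflows()
    }
  }
}

object SchedulerSketchDemo {
  // Runs three iterations against an assignment that never completes (a very slow database)
  // and returns the number of heartbeats written and assignments started.
  def run(): (Int, Int) = {
    var heartbeats = 0
    var assignments = 0
    val slowAssignment = Promise[Unit]() // intentionally never completed
    val scheduler = new SchedulerSketch(
      () => heartbeats += 1,
      () => { assignments += 1; slowAssignment.future }
    )
    (1 to 3).foreach(_ => scheduler.iterate())
    (heartbeats, assignments)
  }
}
```

In this sketch, three iterations write three heartbeats while only one assignment is started, so a slow assignment no longer delays the heartbeat.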