Description
At the moment the flush interval for checkin.NewBulk is 10 * time.Second. As I understand it, this means that after a check-in the Elastic Agent document may not be updated for up to 10 seconds. I believe this could be an issue for ensuring that the .fleet-agents document for that Elastic Agent is actually correct by the next check-in.
The following scenario could occur, especially when there are multiple Fleet Servers. Let's assume there is a Fleet Server 1 and a Fleet Server 2.
- Elastic Agent checks-in on Fleet Server 1.
- It has a status that needs to be written and updated, but it also has actions that it needs to handle.
- Fleet Server 1 sends a response as it has an action.
- At this point the check-in bulk write has not occurred (it runs every 10 seconds, and the exact timing depends on when Fleet Server started).
- Elastic Agent checks-in on Fleet Server 2 (round-robin).
- It has a different status that needs to be written, no actions to handle so it stays connected.
- Another check-in bulk write now needs to occur, but it happens on Fleet Server 2's own 10 second interval, which is not in sync with Fleet Server 1's.
This means it is possible that Fleet Server 2 performs its sync before Fleet Server 1, and Fleet Server 1 then performs its sync afterwards. The first check-in then overwrites the second when it shouldn't: the second check-in should take precedence over the first.
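The ordering problem in the scenario above can be reproduced with a small simulation. This is only a sketch: the `server`, `store`, and `agentDoc` types, the `Seq` field, and the status strings are hypothetical stand-ins, not Fleet Server code, and the two flushes are called by hand in the problematic order instead of being driven by timers.

```go
package main

import "fmt"

// agentDoc is a simplified, hypothetical stand-in for an agent's
// document in .fleet-agents.
type agentDoc struct {
	Status string
	Seq    int // order in which the agent actually checked in
}

// store is a last-write-wins stand-in for the index.
type store map[string]agentDoc

// server buffers check-ins and only writes them when flush is called,
// mimicking a periodic bulk flush.
type server struct {
	pending map[string]agentDoc
}

func newServer() *server {
	return &server{pending: map[string]agentDoc{}}
}

// checkIn buffers the latest document for the agent on this server.
func (s *server) checkIn(agentID string, doc agentDoc) {
	s.pending[agentID] = doc
}

// flush writes all buffered documents, overwriting whatever is stored.
func (s *server) flush(st store) {
	for id, doc := range s.pending {
		st[id] = doc
	}
	s.pending = map[string]agentDoc{}
}

// simulate replays the scenario: check-in 1 on server 1, check-in 2 on
// server 2, then server 2 flushes before server 1.
func simulate() agentDoc {
	st := store{}
	fs1, fs2 := newServer(), newServer()

	fs1.checkIn("agent-1", agentDoc{Status: "degraded", Seq: 1})
	fs2.checkIn("agent-1", agentDoc{Status: "healthy", Seq: 2})

	fs2.flush(st) // the newer check-in is written first
	fs1.flush(st) // the older check-in overwrites it

	return st["agent-1"]
}

func main() {
	fmt.Printf("%+v\n", simulate()) // prints {Status:degraded Seq:1}
}
```

The stored document ends up reflecting the first check-in even though the second one is newer, which is the overwrite described above.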
I believe we might need to ensure that on check-in the document is written as soon as possible, which would prevent this. The periodic writes that show the Elastic Agent is still connected to the long-poll endpoint can keep using the 10 second window, but the initial check-in should not.
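One possible shape for this is a buffer that flushes on a timer but also flushes immediately when the check-in is the initial one. The sketch below is not the actual checkin.NewBulk implementation: the `bulk` type, the `writeFn` callback, and the `initial` parameter are all assumptions made for illustration, and the status strings are made up.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// writeFn persists a batch of agentID -> status. It is a hypothetical
// stand-in for a bulk update against .fleet-agents.
type writeFn func(batch map[string]string)

// bulk buffers check-ins and writes them in batches.
type bulk struct {
	mu      sync.Mutex
	pending map[string]string
	write   writeFn
}

func newBulk(write writeFn) *bulk {
	return &bulk{pending: map[string]string{}, write: write}
}

// CheckIn buffers the status. When initial is true, the buffer is
// flushed right away instead of waiting for the next tick.
func (b *bulk) CheckIn(agentID, status string, initial bool) {
	b.mu.Lock()
	b.pending[agentID] = status
	b.mu.Unlock()
	if initial {
		b.Flush()
	}
}

// Flush writes everything buffered so far and resets the buffer.
func (b *bulk) Flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending = map[string]string{}
	b.mu.Unlock()
	if len(batch) > 0 {
		b.write(batch)
	}
}

// Run flushes on every tick until ctx is cancelled, then flushes once more.
func (b *bulk) Run(ctx context.Context, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			b.Flush()
		case <-ctx.Done():
			b.Flush()
			return
		}
	}
}

func main() {
	written := map[string]string{}
	b := newBulk(func(batch map[string]string) {
		for id, s := range batch {
			written[id] = s
		}
	})

	b.CheckIn("agent-1", "healthy", true) // initial check-in: written immediately
	fmt.Println(written["agent-1"])

	b.CheckIn("agent-1", "online", false) // keep-alive: waits for the next flush
	fmt.Println(written["agent-1"])

	b.Flush() // what Run would do on the next tick
	fmt.Println(written["agent-1"])
}
```

In a real server `Run` would be started in a goroutine with the 10 second interval; here `Flush` is called by hand so the example stays deterministic. Note that in this sketch an immediate flush also writes any other agents' buffered entries, which may or may not be desirable.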