Description
Describe the bug
We observe that querying quite often stops working when the scheduler restarts.
We run Loki in microservice mode on Nomad with use_scheduler_ring = true.
After adding many log statements to Loki, we observed that the frontend reconnects to the scheduler before the scheduler's shouldRun state is set to true. shouldRun is initially false because we use use_scheduler_ring = true. For some timing reason (memberlist state?) the frontend reconnects before that point, which causes the scheduler to skip the response to the frontend's INIT.
As a result, the frontend piles up in-progress query requests until they time out.
As I haven't found an existing issue for this yet and it happens quite often, there might be something wonky in our config. Still, I think this can be considered a bug, as the system locks up in a strange, unresponsive way.
To Reproduce
I'm still struggling to reproduce this deterministically, as I haven't yet understood the full memberlist / scheduler ring / frontend / scheduler connect protocol. Happy to get some pointers here.
Expected behavior
The scheduler restarts, the frontend reconnects, and the frontend continues scheduling queries.
Environment:
- Loki 3.5.7 (also experienced earlier), microservice mode, memberlist, use_scheduler_ring
- Infrastructure: Nomad, Consul, Vault
- Deployment tool: Nomad Job files :)