queries are not processed if frontend connects to scheduler too early #19528

@mxab

Describe the bug
We observe that, quite often, querying stops working when the scheduler restarts.
We run Loki in microservice mode with use_scheduler_ring = true on Nomad.

After adding a lot of log statements to Loki, we observed that the frontend reconnects to the scheduler before the scheduler's shouldRun state is set to true. That state is initially false because we use use_scheduler_ring = true. For some timing reason (memberlist state?) the frontend reconnects before the state flips, which causes the scheduler to skip the response to the frontend's INIT.
As a result, the frontend piles up in-progress query requests until they time out.

As I haven't found an existing issue for this yet and it happens quite often for us, there might be something wonky in our config. Still, I think this can be considered a bug, as the system locks up in a strange, unresponsive way.

To Reproduce
I'm still struggling to reproduce this deterministically, as I haven't fully understood the memberlist/scheduler ring/frontend/scheduler connect protocol. Happy to get some pointers here.

Expected behavior
The scheduler restarts, the frontend reconnects, and the frontend continues scheduling queries.

Environment:

  • Loki 3.5.7 (also experienced earlier), microservice mode, memberlist, use_scheduler_ring
  • Infrastructure: Nomad, Consul, Vault
  • Deployment tool: Nomad Job files :)
