@mxab mxab commented Oct 17, 2025

What this PR does / why we need it:
There can be a race condition when the frontend worker connects to the scheduler while the scheduler is in a RUNNING state but not yet in a runState, because it is not in the ReplicationSet yet.

This leads to a state where the frontend does not receive an OK on its INIT request, so it hangs and does not forward query requests.

Which issue(s) this PR fixes:
Fixes #19528

Special notes for your reviewer:

This PR currently has two commits mitigating the issue:

  1. 382efed: Adds a timeout to the receive that waits for the scheduler's INIT response. When the timeout fires, another INIT attempt is made.
  2. 176e354: When the scheduler is in this strange state where it is RUNNING but not in shouldRun (which does not mean it is shutting down), it replies to the INIT with an ERROR, which also triggers the frontend to retry the connection.

Notes regarding the first solution:

  • The test actually passes without the fix ( 😦 ). For some reason, the receive in the loop is canceled or times out after ~1m21 in the test run. This is not the behavior in our prod system, but I'm quite sure this is the issue. If you have any idea why this happens, please let me know. It may be due to different gRPC server setups: the test uses grpc.NewServer, while the main server has a lot of extra config.
  • The timeout should probably be made configurable.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

CLAassistant commented Oct 17, 2025

CLA assistant check
All committers have signed the CLA.

@mxab mxab force-pushed the fix/scheduler-frontend-init branch from 686104d to 176e354 Compare October 20, 2025 11:54

Development

Successfully merging this pull request may close these issues.

queries are not processed if frontend connects to scheduler too early
