network resilience #263
Open
orlandohohmeier
wants to merge
15
commits into
alpha
from
orlandohohmeier/network-resilience
With the improved stream guarantees there is no need for a time-based mechanism to catch broken sends. Thus we're removing the timeouts from `SendModel`/`SendUpdate`, so send actions no longer fail due to timeouts while _slowly_ but steadily sending. Co-Authored-By: ChatGPT <[email protected]>
l45k
reviewed
Dec 23, 2025
18502ef to
6b73984
Compare
Replace SSE receive with a single-shot HTTP response that returns one pointer or 204, and pass the scheduler timeout through to the bridge as an idle deadline. The bridge now handles waiting for the first item and the executor treats an empty response as a timeout. Co-Authored-By: ChatGPT <[email protected]>
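The single-shot receive described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's code: `ReceiveResponse`, `receive_once`, `ExecutorError`, and `executor_receive` are hypothetical names, and a `std::sync::mpsc` channel stands in for the bridge's internal stream.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical model of the two possible HTTP outcomes.
#[derive(Debug, PartialEq)]
enum ReceiveResponse {
    /// 200 with exactly one pointer in the body.
    Pointer(String),
    /// 204: nothing arrived before the idle deadline.
    NoContent,
}

/// Bridge side: wait for the first item, but no longer than `idle_deadline`.
fn receive_once(items: &Receiver<String>, idle_deadline: Duration) -> ReceiveResponse {
    match items.recv_timeout(idle_deadline) {
        Ok(pointer) => ReceiveResponse::Pointer(pointer),
        // Assumption: a closed stream is reported the same way as an idle one.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
            ReceiveResponse::NoContent
        }
    }
}

#[derive(Debug, PartialEq)]
enum ExecutorError {
    Timeout,
}

/// Executor side: an empty response is treated as a timeout.
fn executor_receive(response: ReceiveResponse) -> Result<String, ExecutorError> {
    match response {
        ReceiveResponse::Pointer(pointer) => Ok(pointer),
        ReceiveResponse::NoContent => Err(ExecutorError::Timeout),
    }
}
```

Because the deadline is applied in one place (the bridge), the executor only has to map "no content" to its existing timeout path.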
Change retry strategy for gateway connection attempts. Use FixedInterval with retry interval based on configured RTT and increased retry count.
Increase offer and lease timeouts for more resilient scheduling.
Adjust the buffer size for request/response handlers to avoid blocking in congested, high-RTT cases.
Increase network action buffer size from 5 to 64 to resolve blocks in congested high RTT cases.
Spawn handler request processing in a separate task so that it doesn't block.
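A minimal sketch of the buffering and spawning idea from the commits above, using only the standard library. OS threads stand in for async tasks here, and `run_handler` and the `u64` "request" (a simulated processing time in milliseconds) are hypothetical; only the buffer size of 64 comes from the commit message.

```rust
use std::sync::mpsc::{Receiver, SyncSender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Buffer size named in the commit message (previously 5).
const ACTION_BUFFER: usize = 64;

/// Receives requests and processes each one in its own thread, so a slow
/// request does not hold up the ones queued behind it.
fn run_handler(requests: Receiver<u64>, responses: SyncSender<u64>) -> JoinHandle<()> {
    thread::spawn(move || {
        let mut workers = Vec::new();
        for delay_ms in requests {
            let responses = responses.clone();
            workers.push(thread::spawn(move || {
                // Simulated work: a slow link or a long computation.
                thread::sleep(Duration::from_millis(delay_ms));
                // The receiver may already be gone; ignore that case here.
                let _ = responses.send(delay_ms);
            }));
        }
        // The request channel is closed; wait for in-flight work to finish.
        for worker in workers {
            let _ = worker.join();
        }
    })
}
```

With bounded channels created via `std::sync::mpsc::sync_channel(ACTION_BUFFER)`, senders only block once 64 items are queued, and the receive loop itself never waits on request processing.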
Extend worker handler to retry lease renewal.
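The fixed-interval retry described for gateway connection attempts might look roughly like the sketch below. It is standard-library only and does not reproduce the project's actual `FixedInterval` usage; `fixed_interval`, `retry`, and the `multiplier` parameter are assumptions made for illustration. The source does not say whether lease renewal uses the same strategy.

```rust
use std::time::Duration;

/// Hypothetical helper: `retries` identical delays derived from the configured RTT.
fn fixed_interval(rtt: Duration, multiplier: u32, retries: usize) -> impl Iterator<Item = Duration> {
    std::iter::repeat(rtt * multiplier).take(retries)
}

/// Runs `op` once, then once more after each delay until it succeeds
/// or the delays are exhausted. Returns the last result.
fn retry<T, E>(
    delays: impl Iterator<Item = Duration>,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last = op();
    for delay in delays {
        if last.is_ok() {
            break;
        }
        std::thread::sleep(delay);
        last = op();
    }
    last
}
```

Tying the interval to the RTT keeps retries from firing faster than a round trip can complete, which is presumably why the commit bases it on the configured value.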
6b73984 to
f246c74
Compare
Track aggregated/applied updates so ApplyUpdate only follows broadcast and workers don’t churn; reset aggregation flags on broadcast and error paths. Normalize short/long timeouts for ~100ms RTT, align PS action retries with scheduler cadence. Fix training sleep precision to avoid truncation delays. Co-Authored-By: ChatGPT <[email protected]>
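The source does not show the sleep precision bug itself. One common way such truncation arises is casting fractional seconds to an integer before building a `Duration`; the sketch below contrasts that with a precision-preserving conversion. Both function names are hypothetical.

```rust
use std::time::Duration;

/// Lossy: the cast drops the fractional part, so 0.75 s becomes 0 s.
fn sleep_duration_truncated(seconds: f64) -> Duration {
    Duration::from_secs(seconds as u64)
}

/// Keeps sub-second precision. Negative, non-finite, or overflowing input
/// falls back to zero instead of panicking.
fn sleep_duration_precise(seconds: f64) -> Duration {
    Duration::try_from_secs_f64(seconds).unwrap_or(Duration::ZERO)
}
```

For example, `sleep_duration_truncated(0.75)` yields a zero duration, while `sleep_duration_precise(0.75)` yields 750 ms.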
Add a simple script to aid with local network simulation to test different network conditions.
f246c74 to
45b6d05
Compare
45b6d05 to
3975087
Compare
145ac6a to
45b6d05
Compare
This PR hardens the framework such that it still functions with higher RTT and slower links by tuning request/response timeouts, buffer sizes, and retry strategies across network, scheduler, and workers. It also stabilizes scheduler round transitions and parameter server aggregation timing to prevent idle/send-update ping‑pong and stalls. Together, these changes reduce connection churn and improve training progress under simulated latency while keeping behavior consistent for normal RTT.