@orlandohohmeier
Member

This PR hardens the framework such that it still functions with higher RTT and slower links by tuning request/response timeouts, buffer sizes, and retry strategies across network, scheduler, and workers. It also stabilizes scheduler round transitions and parameter server aggregation timing to prevent idle/send-update ping‑pong and stalls. Together, these changes reduce connection churn and improve training progress under simulated latency while keeping behavior consistent for normal RTT.

With the improved stream guarantees there is no need for a time-based mechanism to catch broken sends. Thus we're removing the timeouts from `SendModel`/`SendUpdate`, so that
send actions no longer fail due to timeouts while _slowly_ but
steadily sending.

Co-Authored-By: ChatGPT <[email protected]>
@orlandohohmeier orlandohohmeier requested review from l45k and nfnt and removed request for l45k December 23, 2025 15:29
@orlandohohmeier orlandohohmeier force-pushed the orlandohohmeier/network-resilience branch 2 times, most recently from 18502ef to 6b73984 Compare December 23, 2025 20:13
orlandohohmeier and others added 12 commits December 23, 2025 21:44
Replace SSE receive with a single-shot HTTP response that returns one
pointer or 204, and pass the scheduler timeout through to the bridge
as an idle deadline. The bridge now handles waiting for the first item
and the executor treats an empty response as a timeout.

Co-Authored-By: ChatGPT <[email protected]>
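
The single-shot receive described in the commit above could be interpreted on the executor side roughly as follows. This is a minimal sketch, not the PR's code: `parse_receive_response` and `ReceiveTimeout` are hypothetical names, and it assumes the pointer arrives as the body of a 200 response.

```python
class ReceiveTimeout(Exception):
    """Hypothetical error: the bridge saw no item before the idle deadline."""


def parse_receive_response(status: int, body: str) -> str:
    """Map a single-shot HTTP receive response to a pointer or a timeout.

    Assumption: 200 carries exactly one pointer in the body, 204 means the
    bridge waited until the idle deadline without receiving anything.
    """
    if status == 204:
        # Empty response: treat it as a timeout rather than as an error.
        raise ReceiveTimeout("no item arrived before the idle deadline")
    if status == 200 and body:
        return body
    raise ValueError(f"unexpected receive response: status={status}")
```

Unlike a long-lived SSE stream, each call yields either one result or one timeout, so the caller decides whether to issue another receive.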
Change the retry strategy for gateway connection attempts. Use FixedInterval with a retry interval based
on the configured RTT and an increased retry count.
Increase offer and lease timeouts for more resilient scheduling.
Adjust the buffer size for request/response handlers to avoid blocking in congested, high-RTT cases.
Increase the network action buffer size from 5 to 64 to resolve blocking in congested, high-RTT cases.
Spawn handler request processing in a separate task so that it doesn't block.
Extend worker handler to retry lease renewal.
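
A fixed-interval retry whose delay is derived from the RTT, as the commit above describes for gateway connections, can be sketched like this. The names `interval_from_rtt` and `retry_fixed`, the multiplier, and the use of `ConnectionError` are assumptions for illustration; the PR's actual values and types are not shown here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def interval_from_rtt(rtt_seconds: float, multiplier: float) -> float:
    """Derive a constant retry delay from the configured RTT (assumed formula)."""
    return rtt_seconds * multiplier


def retry_fixed(
    attempt: Callable[[], T],
    interval: float,
    retries: int,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `attempt` once, then retry up to `retries` times at a fixed interval."""
    for n in range(retries + 1):
        try:
            return attempt()
        except ConnectionError:
            if n == retries:
                # Retry budget exhausted: surface the last failure.
                raise
            sleep(interval)
    raise RuntimeError("unreachable")
```

A fixed interval keeps the delay proportional to the link's latency, whereas an exponential backoff would quickly grow far beyond it on a slow but healthy connection.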
@orlandohohmeier orlandohohmeier force-pushed the orlandohohmeier/network-resilience branch from 6b73984 to f246c74 Compare December 23, 2025 20:53
orlandohohmeier and others added 2 commits December 23, 2025 21:55
Track aggregated/applied updates so that ApplyUpdate only follows a broadcast and
workers don’t churn; reset aggregation flags on
broadcast and error paths. Normalize short/long timeouts for ~100ms RTT and
align PS action retries with the scheduler cadence.
Fix training sleep precision to avoid truncation delays.

Co-Authored-By: ChatGPT <[email protected]>
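
The commit above does not say how the sleep duration was being truncated. As a general illustration only, converting a fractional duration to whole seconds discards the sub-second part, while converting to milliseconds preserves it. Both function names below are hypothetical.

```python
def sleep_ms_truncated(seconds: float) -> int:
    """Lossy conversion: drops the fractional part before scaling."""
    return int(seconds) * 1000


def sleep_ms_precise(seconds: float) -> int:
    """Scale first, then round, so sub-second durations survive."""
    return round(seconds * 1000)


# A 250 ms sleep collapses to 0 ms when truncated to whole seconds.
short_truncated = sleep_ms_truncated(0.25)
short_precise = sleep_ms_precise(0.25)
```

With short per-step sleeps, this kind of loss changes the effective pacing of a training loop, which matters once timeouts are tuned around an RTT of about 100 ms.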
Add a simple script to aid with local network simulation to test different network conditions.
@orlandohohmeier orlandohohmeier force-pushed the orlandohohmeier/network-resilience branch from f246c74 to 45b6d05 Compare December 23, 2025 20:57
@l45k l45k force-pushed the orlandohohmeier/network-resilience branch from 45b6d05 to 3975087 Compare December 23, 2025 21:13
@orlandohohmeier orlandohohmeier force-pushed the orlandohohmeier/network-resilience branch 2 times, most recently from 145ac6a to 45b6d05 Compare December 23, 2025 21:14
3 participants