network resilience #263

orlandohohmeier · 2025-12-23T15:29:26Z

This PR hardens the framework such that it still functions with higher RTT and slower links by tuning request/response timeouts, buffer sizes, and retry strategies across network, scheduler, and workers. It also stabilizes scheduler round transitions and parameter server aggregation timing to prevent idle/send-update ping‑pong and stalls. Together, these changes reduce connection churn and improve training progress under simulated latency while keeping behavior consistent for normal RTT.

With the improved stream guarantees, there is no need for a time-based mechanism to catch broken sends. We are therefore removing the timeouts from `SendModel`/`SendUpdate`, so send actions no longer fail due to timeouts while _slowly_ but steadily sending. Co-Authored-By: ChatGPT <[email protected]>
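A rough estimate shows why a wall-clock timeout on sends is fragile on slow links. The sketch below is illustrative only: the payload size, link bandwidth, and 30-second budget are assumed numbers, not values taken from this PR, and `transfer_seconds` and `would_time_out` are hypothetical helper names.

```python
def transfer_seconds(size_bytes: int, bandwidth_bps: float, rtt_s: float) -> float:
    """Rough lower bound: one round trip plus serialization time on the link."""
    return rtt_s + (size_bytes * 8) / bandwidth_bps


def would_time_out(size_bytes: int, bandwidth_bps: float, rtt_s: float, timeout_s: float) -> bool:
    """True when a healthy but slow transfer would exceed a wall-clock timeout."""
    return transfer_seconds(size_bytes, bandwidth_bps, rtt_s) > timeout_s


# Assumed numbers: a 100 MB payload over a 10 Mbit/s link with 100 ms RTT.
estimate = transfer_seconds(100_000_000, 10_000_000, 0.1)
```

With these assumed numbers the transfer needs roughly 80 seconds, so any fixed budget below that would abort a send that is still making progress.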

codecov · 2025-12-23T15:31:47Z

Codecov Report

❌ Patch coverage is 14.38356% with 250 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/worker/src/executor/bridge.rs | 0.00% | 74 Missing ⚠️ |
| crates/scheduler/src/scheduling/batch_scheduler.rs | 51.85% | 39 Missing ⚠️ |
| crates/worker/src/connector/mod.rs | 0.00% | 36 Missing ⚠️ |
| crates/worker/src/executor/parameter_server.rs | 0.00% | 36 Missing ⚠️ |
| crates/network/src/request_response.rs | 0.00% | 16 Missing ⚠️ |
| crates/scheduler/src/worker.rs | 0.00% | 16 Missing ⚠️ |
| crates/scheduler/src/network.rs | 0.00% | 9 Missing ⚠️ |
| crates/data/src/network.rs | 0.00% | 7 Missing ⚠️ |
| crates/worker/src/network.rs | 0.00% | 7 Missing ⚠️ |
| crates/data/src/bin/hypha-data.rs | 0.00% | 3 Missing ⚠️ |

... and 3 more


crates/data/src/bin/hypha-data.rs

executors/accelerate/src/hypha/accelerate_executor/api.py

Replace SSE receive with a single-shot HTTP response that returns one pointer or 204, and pass the scheduler timeout through to the bridge as an idle deadline. The bridge now handles waiting for the first item and the executor treats an empty response as a timeout. Co-Authored-By: ChatGPT <[email protected]>
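The single-shot receive described above can be sketched roughly as follows. This is not the code from `api.py` or the bridge: `receive_once`, `executor_receive`, and the in-memory queue standing in for the bridge's pending items are all assumptions made for illustration.

```python
import queue


def receive_once(pending: "queue.Queue[str]", idle_deadline_s: float):
    """Bridge side (hypothetical): wait for the first item, else report 204."""
    try:
        pointer = pending.get(timeout=idle_deadline_s)
    except queue.Empty:
        return 204, None  # No Content: nothing arrived before the idle deadline
    return 200, pointer


def executor_receive(pending: "queue.Queue[str]", scheduler_timeout_s: float) -> str:
    """Executor side (hypothetical): an empty response is treated as a timeout."""
    status, pointer = receive_once(pending, scheduler_timeout_s)
    if status == 204:
        raise TimeoutError("no item arrived before the idle deadline")
    return pointer
```

Compared with a long-lived event stream, each call has a single outcome: one pointer or an empty response, and the waiting happens in one place.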

Track aggregated/applied updates so ApplyUpdate only follows broadcast and workers don’t churn; reset aggregation flags on broadcast and error paths. Normalize short/long timeouts for ~100ms RTT, align PS action retries with scheduler cadence. Fix training sleep precision to avoid truncation delays. Co-Authored-By: ChatGPT <[email protected]>
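One way to picture the flag tracking is a small state holder that allows an apply step only after a broadcast. The class below is a hypothetical sketch, not the parameter server's actual implementation; `AggregationFlags` and its method names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class AggregationFlags:
    aggregated: bool = False      # updates for the current round have been merged
    apply_pending: bool = False   # a broadcast happened that has not been applied yet

    def on_aggregated(self) -> None:
        self.aggregated = True

    def on_broadcast(self) -> None:
        # Reset the aggregation flag and permit exactly one apply.
        self.aggregated = False
        self.apply_pending = True

    def on_error(self) -> None:
        # Error paths clear both flags so a stale round cannot trigger an apply.
        self.aggregated = False
        self.apply_pending = False

    def try_apply(self) -> bool:
        """Return True only for the first apply after a broadcast."""
        if not self.apply_pending:
            return False
        self.apply_pending = False
        return True
```

Under this sketch, a repeated apply request without a new broadcast is rejected instead of sending the worker through another round trip.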

orlandohohmeier requested review from l45k and nfnt and removed request for l45k December 23, 2025 15:29

l45k reviewed Dec 23, 2025

crates/data/src/bin/hypha-data.rs (outdated, resolved)

executors/accelerate/src/hypha/accelerate_executor/api.py (outdated, resolved)

orlandohohmeier force-pushed the orlandohohmeier/network-resilience branch 2 times, most recently from 18502ef to 6b73984 Compare December 23, 2025 20:13

orlandohohmeier and others added 12 commits December 23, 2025 21:44

feat: set request timeout based on RTT

e6c480e

fix: set idle connection timeout to 30s

3411309

refactor: adjust gateway connect retry strategy

0934b06

Change retry strategy for gateway connection attempts. Use FixedInterval with retry interval based on configured RTT and increased retry count.
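A fixed-interval retry derived from RTT might look like the sketch below. The multiplier of 2.0, the retry count, and the names `fixed_interval` and `connect_with_retry` are assumptions for illustration; the actual strategy lives in the Rust crates and is not reproduced here.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def fixed_interval(rtt_s: float, retries: int, multiplier: float = 2.0) -> List[float]:
    """Same wait before every retry, scaled from the configured RTT."""
    return [rtt_s * multiplier] * retries


def connect_with_retry(connect: Callable[[], T], rtt_s: float, retries: int,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect`, retrying on ConnectionError at a fixed interval."""
    delays = fixed_interval(rtt_s, retries)
    for attempt in range(retries + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Unlike exponential backoff, the wait never grows, so a gateway that becomes reachable is found within about one interval.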

fix: increase offer/lease timeouts

12b5d52

Increase offer and lease timeouts for more resilient scheduling.

fix(request_response): increase buffer size

16b8c93

Adjust the buffer size for request/response handlers to avoid blocking in congested, high-RTT cases.

fix: increase network action buffer size

921e5e7

Increase the network action buffer size from 5 to 64 to avoid blocking in congested, high-RTT cases.
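The effect of the buffer size can be illustrated with a bounded queue that nobody drains for a moment, as happens when the consumer is waiting on a slow link. The burst size of 32 and the helper name `enqueue_burst` are assumptions; only the capacities 5 and 64 come from the commit message.

```python
import queue


def enqueue_burst(capacity: int, burst: int) -> int:
    """Count how many actions fit before a sender would have to block."""
    actions: "queue.Queue[int]" = queue.Queue(maxsize=capacity)
    accepted = 0
    for action in range(burst):
        try:
            actions.put_nowait(action)
            accepted += 1
        except queue.Full:
            break  # a blocking sender would stall at this point
    return accepted


small = enqueue_burst(5, 32)   # old capacity
large = enqueue_burst(64, 32)  # new capacity
```

With a capacity of 5 the sender stalls after five queued actions, while a capacity of 64 absorbs the whole assumed burst.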

fix(request_response): move request handler into task

82797d2

Spawn request handler processing in a separate task so that it doesn't block.
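The difference between handling requests inline and spawning a task per request can be sketched with `asyncio`. This is an analogy for the pattern, not the Rust handler itself; `handle`, `serve_inline`, `serve_spawned`, and the sleep durations are assumptions.

```python
import asyncio
from typing import List, Tuple


async def handle(name: str, work_s: float, finished: List[str]) -> None:
    """Stand-in for a request handler; the sleep models slow processing."""
    await asyncio.sleep(work_s)
    finished.append(name)


async def serve_inline(requests: List[Tuple[str, float]]) -> List[str]:
    """Await each handler in the receive loop: a slow request delays the rest."""
    finished: List[str] = []
    for name, work_s in requests:
        await handle(name, work_s, finished)
    return finished


async def serve_spawned(requests: List[Tuple[str, float]]) -> List[str]:
    """Spawn one task per request so the receive loop is never held up."""
    finished: List[str] = []
    tasks = [asyncio.create_task(handle(name, work_s, finished))
             for name, work_s in requests]
    await asyncio.gather(*tasks)
    return finished


REQUESTS = [("slow", 0.05), ("fast", 0.0)]
```

Run inline, the fast request finishes only after the slow one; with spawned tasks it finishes first, which is the head-of-line blocking this commit aims to avoid.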

fix(connector): adjust send to send data concurrently

f2c0bfb

fix: retry lease renewal

ad414c6

Extend worker handler to retry lease renewal.

fix: adjust parameter server timeouts

d7e7638

fix: adjust retry for actions

ea41b04

orlandohohmeier force-pushed the orlandohohmeier/network-resilience branch from 6b73984 to f246c74 Compare December 23, 2025 20:53

orlandohohmeier and others added 2 commits December 23, 2025 21:55

chore: add network simulation script

45b6d05

Add a simple script to aid with local network simulation to test different network conditions.
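The script itself is not shown in this conversation. On Linux, one common way to simulate latency and loss locally is `tc` with the `netem` qdisc; the sketch below only builds the command lines and does not run them. The function name, the interface, and the delay and loss values are assumptions, not details of the script added here.

```python
from typing import List


def netem_commands(interface: str, delay_ms: int, loss_percent: float = 0.0) -> List[List[str]]:
    """Build the `tc` commands to add and later remove a netem qdisc."""
    add = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
           "delay", f"{delay_ms}ms"]
    if loss_percent > 0:
        add += ["loss", f"{loss_percent}%"]
    remove = ["tc", "qdisc", "del", "dev", interface, "root"]
    return [add, remove]


# Assumed values: 50 ms of added delay on the loopback interface.
add_cmd, remove_cmd = netem_commands("lo", 50)
```

Applying these commands requires root privileges, and the remove command restores the default qdisc afterwards.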

orlandohohmeier force-pushed the orlandohohmeier/network-resilience branch from f246c74 to 45b6d05 Compare December 23, 2025 20:57

l45k force-pushed the orlandohohmeier/network-resilience branch from 45b6d05 to 3975087 Compare December 23, 2025 21:13

orlandohohmeier force-pushed the orlandohohmeier/network-resilience branch 2 times, most recently from 145ac6a to 45b6d05 Compare December 23, 2025 21:14

3 participants
