Conversation

@orlandohohmeier
Member

No description provided.

orlandohohmeier and others added 3 commits December 22, 2025 17:12
With the improved stream guarantees there is no need for a time-based mechanism to catch broken sends. Thus we're removing the timeouts from `SendModel`/`SendUpdate`, so that
send actions no longer fail due to timeouts while _slowly_ but
steadily sending.

Co-Authored-By: ChatGPT <[email protected]>
Replace SSE receive with a single-shot HTTP response that returns one
pointer or 204, and pass the scheduler timeout through to the bridge
as an idle deadline. The bridge now handles waiting for the first item
and the executor treats an empty response as a timeout.

Co-Authored-By: ChatGPT <[email protected]>
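
The single-shot receive described in the commit above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code from this PR: `ReceiveResponse`, `receive_once`, `ReceiveError`, and `into_executor_result` are hypothetical names, the pointer is modelled as a plain `String`, and an in-process `std::sync::mpsc` channel stands in for whatever source the bridge actually waits on.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical outcome of a single-shot receive: one pointer or nothing.
#[derive(Debug, PartialEq)]
pub enum ReceiveResponse {
    /// Would map to HTTP 200 with exactly one pointer in the body.
    Pointer(String),
    /// Would map to HTTP 204: no item arrived before the idle deadline.
    NoContent,
}

impl ReceiveResponse {
    /// Status code this sketch assumes for each outcome.
    pub fn status_code(&self) -> u16 {
        match self {
            ReceiveResponse::Pointer(_) => 200,
            ReceiveResponse::NoContent => 204,
        }
    }
}

/// Bridge side (assumed): wait for the first item, but no longer than the
/// idle deadline passed through from the scheduler timeout.
pub fn receive_once(items: &Receiver<String>, idle_deadline: Duration) -> ReceiveResponse {
    match items.recv_timeout(idle_deadline) {
        Ok(pointer) => ReceiveResponse::Pointer(pointer),
        // In this sketch a closed channel is reported the same way as an
        // elapsed deadline: an empty response.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
            ReceiveResponse::NoContent
        }
    }
}

/// Hypothetical executor-side error type.
#[derive(Debug, PartialEq)]
pub enum ReceiveError {
    Timeout,
}

/// Executor side (assumed): an empty response is treated as a timeout.
pub fn into_executor_result(response: ReceiveResponse) -> Result<String, ReceiveError> {
    match response {
        ReceiveResponse::Pointer(pointer) => Ok(pointer),
        ReceiveResponse::NoContent => Err(ReceiveError::Timeout),
    }
}
```

In this sketch the waiting happens entirely on the bridge side, so the executor only has to distinguish "got a pointer" from "empty", which matches the division of work the commit message describes.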
Track aggregated updates to ensure `ApplyUpdate` is scheduled after
broadcast, reduce idle waits to keep training progressing, and align
parameter server retries with scheduler timing to avoid stalls.

Co-Authored-By: ChatGPT <[email protected]>
@orlandohohmeier orlandohohmeier requested a review from l45k December 22, 2025 16:15
@codecov

codecov bot commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 10.18519% with 97 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/worker/src/executor/bridge.rs | 0.00% | 73 Missing ⚠️ |
| crates/scheduler/src/scheduling/batch_scheduler.rs | 33.33% | 22 Missing ⚠️ |
| crates/worker/src/executor/parameter_server.rs | 0.00% | 2 Missing ⚠️ |

2 participants