refactor(session): per-session in-flight gate, response passthrough, bounded CPU offload#1468
Open
guapisolo wants to merge 10 commits into
Open
Conversation
…ded CPU thread pool Enforce strict per-session linearity on the single-worker session server while letting independent sessions run concurrently: - In-flight gate: each LinearTrajectory admits one in-flight chat; a second concurrent same-session chat fast-fails 409 (SessionBusyError) at slot-claim, without entering SGLang. chat_completions is restructured into brief lock segments (claim / prepare / commit) with a claimed-guarded, cancellation-safe finally that releases the slot on every exit path; closing (404) beats busy. - expected_num_assistant mismatch is now a logged 500 SessionInvariantError (unreachable under the gate) instead of a silent 200 skip. - build_proxy_response passes the upstream body through unchanged (no second parse / re-serialize), preserving content-type and stripping stale framing headers; applies to all call sites. - Bounded ThreadPoolExecutor (--session-server-cpu-workers, default min(16, os.cpu_count() or 1)) offloads only stateless CPU work (request/response JSON, validation) off the event loop; all session-state mutation stays on the event loop under session.lock. Shut down on app shutdown. - Removed the DEBUG _inflight_chat counter and debug_request_logger middleware. - Single uvicorn worker preserved; multi-process and orjson deferred (documented). Tests: rewrote the same-session concurrency test to the 409 contract; added slot-release-after-error and passthrough-fidelity coverage. 62 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Malformed successful upstream responses could leak raw parser and shape exceptions as 500s while several slot-release and passthrough paths were unverified. Harden the response validator to raise UpstreamResponseError and add focused tests for injected failure exits plus raw proxy response fidelity.
Adds tests/fast/router/bench_session_responsiveness.py (bench_ prefix → not collected by pytest), driving K concurrent large-routed-experts chats across distinct sessions while polling /health, reporting /health p50/p95/p99 under load. Honest result (real numbers, no fabrication): with multi-MiB responses the event loop is dominated by on-loop body I/O (httpx read + uvicorn write), so the CPU-parse offload yields no measurable end-to-end /health speedup at this scale (after ~= before: p50~20ms, p95~58ms, ~18.6 chats/s, 0 errors in both builds). An isolated probe still shows the offload mechanism works - inline ~150ms parse blocks the loop ~741ms vs ~0.07ms offloaded - so offloading helps only when an individual parse is itself large enough to block the loop. GIL means total CPU throughput is unchanged, as designed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add promptness, invariant-mismatch, and real-disconnect coverage for the session in-flight gate, and scale the responsiveness benchmark to 64 sessions while documenting that the current benchmark still lacks internal stage timing.
…ation _parse_and_validate_response previously accepted any value at output_token_logprobs[i][1] (string / float / None / bool token ids, and even a bare string entry whose [1] indexes a character), and treated a bool completion_tokens as 1/0 — silently corrupting the stored trajectory token ids on a malformed-but-HTTP-200 SGLang response. Now each entry must be a (logprob, token_id) pair with a strict int token id, and completion_tokens must be a non-bool int; violations raise UpstreamResponseError (502). Adds TestResponseTokenIdValidation. 76 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…validation A successful (200) upstream chat response whose output_token_logprobs entry has a non-numeric logprob (entry[0]: str/None/bool) was accepted and recorded. That value flows into Sample.rollout_log_probs via openai_endpoint_utils and would corrupt downstream RL training. Validate entry[0] is a real number (int/float, bool excluded) alongside the existing token-id check, and classify violations as UpstreamResponseError (502). SGLang [logprob, token_id, token_text] triples (len > 2) remain valid. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add an HTTP-level one-shot test proving a 200 upstream response with a non-numeric logprob value returns 502, commits no record or accumulated token id, and releases the slot so the next legal chat returns 200. Extend the responsiveness benchmark with JSON output, comparison output, commit/dirty metadata, per-stage CPU timing, and computed health percentiles so before/after runs can persist reviewable artifacts.
…ore/after A 5x5-iteration pooled before/after run (inline parse vs cpu_executor offload, K=64 concurrent ~1.3 MiB responses) shows the offload markedly improves event-loop responsiveness and chat throughput at this scale: /health p95 ~1095ms -> ~234ms, p99 ~1339ms -> ~303ms, throughput ~10.5 -> ~17 chats/s, with tight, non-overlapping per-iteration spread. Mechanism: K inline parses serialize on the one loop (before-p95 ~= K x single-parse cost ~= 64 x 17ms); offloading frees the loop. This corrects the bench's earlier speculative interpretation, which claimed no significant end-to-end change (it reasoned about a single parse vs body I/O and missed concurrent-parse stacking). The GIL still bounds aggregate CPU work, so the gain is responsiveness/tail-latency, not raw CPU throughput. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reject raw upstream NaN and Infinity logprob constants at the session response boundary so malformed successful responses return UpstreamResponseError instead of committing invalid rollout logprobs. Update the responsiveness comparison verdict to distinguish material improvements from noise-level no-regression results.
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
Refactors the standalone Session Server (
miles/rollout/session/) for safe concurrency without changing the single-process / single-uvicorn-worker deployment. The goals: stop a second concurrent same-session chat from racing the per-session trajectory state, remove a redundant parse/re-serialize on the success path, and keep the one asyncio event loop responsive under heavy routed-experts responses.Changes
1. Per-session in-flight gate (one in-flight chat per session)
chat_inflightflag onLinearTrajectory;chat_completionsrestructured into three briefsession.locksegments (claim → prepare → commit) separated by lock-free work.SessionBusyError) without entering the backend;closing(→404) takes priority over busy (→409).claimed-guardedtry/finallyreleases the slot on every exit path (including cancellation); a request that got 404/409 never clears the owner's flag.do_proxyare never awaited while holdingsession.lock(AST-verified).2. Raw response passthrough
build_proxy_responsereturns the upstream bytes directly (preservingcontent-type, stripping stalecontent-length/transfer-encoding/content-encoding), removing the success-path secondjson.loads+JSONResponsere-dump.3. Bounded CPU thread pool
SessionServerowns aThreadPoolExecutor(--session-server-cpu-workers, defaultmin(16, os.cpu_count())), shut down viaapp.router.on_shutdown._parse_request_body,_dump_request_body,_parse_and_validate_response); all session-state mutation stays on the event loop under the lock.4. Stricter upstream-response validation (prevents training-data corruption)
_parse_and_validate_responseclassifies every malformed successful (200) upstream response asUpstreamResponseError/502 instead of leaking raw exceptions or accepting garbage: invalid JSON, badchoices/message/meta_info, length mismatch, non-int/booltoken ids (entry[1]), and non-numeric/bool/non-finite (NaN/±Inf) logprob values (entry[0]). The latter is not inert metadata —openai_endpoint_utilsreads entry[0] intoSample.rollout_log_probs, so a bad value would corrupt downstream RL training. SGLang[logprob, token_id, token_text]triples remain valid.5. Invariant escalation + cleanup
num_assistant != expected_num_assistantat commit → loggedERROR+SessionInvariantError(500) instead of a silent 200-skip._inflight_chatcounter /debug_request_logger.uvicorn.runstays single-worker (multi-process +orjsondeliberately deferred).Event-loop responsiveness (measured)
5×5-iteration pooled before/after, K=64 concurrent ~1.3 MiB routed-experts responses, polling
GET /healthunder load (tests/fast/router/bench_session_responsiveness.py):Per-iteration spread is tight and non-overlapping (before p95 966–1339ms; after 203–254ms). Mechanism: K concurrent inline parses serialize on the one loop (before-p95 ≈ K × single-parse ≈ 64 × 17ms ≈ 1088ms); offloading frees the loop. The GIL still bounds aggregate CPU work, so this is a responsiveness/tail-latency win, not raw CPU throughput.
Testing
pytest tests/fast/router/test_sessions.py tests/fast/router/test_session_race_conditions.py tests/fast/router/test_linear_trajectory.py -q→ 81 passed.🤖 Generated with Claude Code