refactor(session): per-session in-flight gate, response passthrough, bounded CPU offload by guapisolo · Pull Request #1468 · radixark/miles

guapisolo · 2026-06-23T06:58:08Z

What & why

Refactors the standalone Session Server (miles/rollout/session/) for safe concurrency without changing the single-process / single-uvicorn-worker deployment. The goals: stop a second concurrent same-session chat from racing the per-session trajectory state, remove a redundant parse/re-serialize on the success path, and keep the one asyncio event loop responsive under heavy routed-experts responses.

Stacked on #1463. Base branch is refactor/session-routed-experts-retention (the routed-experts retention perf change). This PR's diff is only the 10 concurrency commits on top. Merge #1463 first, or retarget to main after it lands.

Changes

1. Per-session in-flight gate (one in-flight chat per session)

New chat_inflight flag on LinearTrajectory; chat_completions restructured into three brief session.lock segments (claim → prepare → commit) separated by lock-free work.
A second concurrent chat on the same session fast-fails 409 (SessionBusyError) without entering the backend; closing (→404) takes priority over busy (→409).
A claimed-guarded try/finally releases the slot on every exit path (including cancellation); a request that got 404/409 never clears the owner's flag.
The executor and do_proxy are never awaited while holding session.lock (AST-verified).

2. Raw response passthrough

build_proxy_response returns the upstream bytes directly (preserving content-type, stripping stale content-length/transfer-encoding/content-encoding), removing the success-path second json.loads + JSONResponse re-dump.

3. Bounded CPU thread pool

SessionServer owns a ThreadPoolExecutor (--session-server-cpu-workers, default min(16, os.cpu_count())), shut down via app.router.on_shutdown.
Only stateless CPU work is offloaded via three pure helpers (_parse_request_body, _dump_request_body, _parse_and_validate_response); all session-state mutation stays on the event loop under the lock.

4. Stricter upstream-response validation (prevents training-data corruption)

_parse_and_validate_response classifies every malformed successful (200) upstream response as UpstreamResponseError/502 instead of leaking raw exceptions or accepting garbage: invalid JSON, bad choices/message/meta_info, length mismatch, non-int/bool token ids (entry[1]), and non-numeric/bool/non-finite (NaN/±Inf) logprob values (entry[0]). The latter is not inert metadata — openai_endpoint_utils reads entry[0] into Sample.rollout_log_probs, so a bad value would corrupt downstream RL training. SGLang [logprob, token_id, token_text] triples remain valid.

5. Invariant escalation + cleanup

num_assistant != expected_num_assistant at commit → logged ERROR + SessionInvariantError (500) instead of a silent 200-skip.
Removed the DEBUG _inflight_chat counter / debug_request_logger. uvicorn.run stays single-worker (multi-process + orjson deliberately deferred).

Event-loop responsiveness (measured)

5×5-iteration pooled before/after, K=64 concurrent ~1.3 MiB routed-experts responses, polling GET /health under load (tests/fast/router/bench_session_responsiveness.py):

/health under load	before (inline parse)	after (offloaded)
p95	~1095 ms	~234 ms
p99	~1339 ms	~303 ms
chat throughput	~10.5 chats/s	~17 chats/s

Per-iteration spread is tight and non-overlapping (before p95 966–1339ms; after 203–254ms). Mechanism: K concurrent inline parses serialize on the one loop (before-p95 ≈ K × single-parse ≈ 64 × 17ms ≈ 1088ms); offloading frees the loop. The GIL still bounds aggregate CPU work, so this is a responsiveness/tail-latency win, not raw CPU throughput.

Testing

pytest tests/fast/router/test_sessions.py tests/fast/router/test_session_race_conditions.py tests/fast/router/test_linear_trajectory.py -q → 81 passed.
Covers: same-session 409 (with promptness assertion + backend-arrival barrier), concurrent-chats-then-delete, slot release on every enumerated failure path (prepare error, non-200 passthrough, transport 502, state-update error, injected cancel, real client disconnect), invariant-mismatch 500 + ERROR log, passthrough fidelity, and full response-validation rejection (token ids, logprob values incl. non-finite, e2e bad-logprob → 502 with no record/token committed).

🤖 Generated with Claude Code

…ded CPU thread pool Enforce strict per-session linearity on the single-worker session server while letting independent sessions run concurrently: - In-flight gate: each LinearTrajectory admits one in-flight chat; a second concurrent same-session chat fast-fails 409 (SessionBusyError) at slot-claim, without entering SGLang. chat_completions is restructured into brief lock segments (claim / prepare / commit) with a claimed-guarded, cancellation-safe finally that releases the slot on every exit path; closing (404) beats busy. - expected_num_assistant mismatch is now a logged 500 SessionInvariantError (unreachable under the gate) instead of a silent 200 skip. - build_proxy_response passes the upstream body through unchanged (no second parse / re-serialize), preserving content-type and stripping stale framing headers; applies to all call sites. - Bounded ThreadPoolExecutor (--session-server-cpu-workers, default min(16, os.cpu_count() or 1)) offloads only stateless CPU work (request/response JSON, validation) off the event loop; all session-state mutation stays on the event loop under session.lock. Shut down on app shutdown. - Removed the DEBUG _inflight_chat counter and debug_request_logger middleware. - Single uvicorn worker preserved; multi-process and orjson deferred (documented). Tests: rewrote the same-session concurrency test to the 409 contract; added slot-release-after-error and passthrough-fidelity coverage. 62 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Malformed successful upstream responses could leak raw parser and shape exceptions as 500s while several slot-release and passthrough paths were unverified. Harden the response validator to raise UpstreamResponseError and add focused tests for injected failure exits plus raw proxy response fidelity.

Adds tests/fast/router/bench_session_responsiveness.py (bench_ prefix → not collected by pytest), driving K concurrent large-routed-experts chats across distinct sessions while polling /health, reporting /health p50/p95/p99 under load. Honest result (real numbers, no fabrication): with multi-MiB responses the event loop is dominated by on-loop body I/O (httpx read + uvicorn write), so the CPU-parse offload yields no measurable end-to-end /health speedup at this scale (after ~= before: p50~20ms, p95~58ms, ~18.6 chats/s, 0 errors in both builds). An isolated probe still shows the offload mechanism works - inline ~150ms parse blocks the loop ~741ms vs ~0.07ms offloaded - so offloading helps only when an individual parse is itself large enough to block the loop. GIL means total CPU throughput is unchanged, as designed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add promptness, invariant-mismatch, and real-disconnect coverage for the session in-flight gate, and scale the responsiveness benchmark to 64 sessions while documenting that the current benchmark still lacks internal stage timing.

…ation _parse_and_validate_response previously accepted any value at output_token_logprobs[i][1] (string / float / None / bool token ids, and even a bare string entry whose [1] indexes a character), and treated a bool completion_tokens as 1/0 — silently corrupting the stored trajectory token ids on a malformed-but-HTTP-200 SGLang response. Now each entry must be a (logprob, token_id) pair with a strict int token id, and completion_tokens must be a non-bool int; violations raise UpstreamResponseError (502). Adds TestResponseTokenIdValidation. 76 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…validation A successful (200) upstream chat response whose output_token_logprobs entry has a non-numeric logprob (entry[0]: str/None/bool) was accepted and recorded. That value flows into Sample.rollout_log_probs via openai_endpoint_utils and would corrupt downstream RL training. Validate entry[0] is a real number (int/float, bool excluded) alongside the existing token-id check, and classify violations as UpstreamResponseError (502). SGLang [logprob, token_id, token_text] triples (len > 2) remain valid. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add an HTTP-level one-shot test proving a 200 upstream response with a non-numeric logprob value returns 502, commits no record or accumulated token id, and releases the slot so the next legal chat returns 200. Extend the responsiveness benchmark with JSON output, comparison output, commit/dirty metadata, per-stage CPU timing, and computed health percentiles so before/after runs can persist reviewable artifacts.

…ore/after A 5x5-iteration pooled before/after run (inline parse vs cpu_executor offload, K=64 concurrent ~1.3 MiB responses) shows the offload markedly improves event-loop responsiveness and chat throughput at this scale: /health p95 ~1095ms -> ~234ms, p99 ~1339ms -> ~303ms, throughput ~10.5 -> ~17 chats/s, with tight, non-overlapping per-iteration spread. Mechanism: K inline parses serialize on the one loop (before-p95 ~= K x single-parse cost ~= 64 x 17ms); offloading frees the loop. This corrects the bench's earlier speculative interpretation, which claimed no significant end-to-end change (it reasoned about a single parse vs body I/O and missed concurrent-parse stacking). The GIL still bounds aggregate CPU work, so the gain is responsiveness/tail-latency, not raw CPU throughput. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Reject raw upstream NaN and Infinity logprob constants at the session response boundary so malformed successful responses return UpstreamResponseError instead of committing invalid rollout logprobs. Update the responsiveness comparison verdict to distinguish material improvements from noise-level no-regression results.

gemini-code-assist · 2026-06-23T06:58:11Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

guapisolo and others added 10 commits June 22, 2026 20:44

test(session): verify session gate edge cases

3c2aea0

guapisolo requested review from Shi-Dong, Zhichenzzz, fzyzcjy, jybsuper, maocheng23 and yueming-yuan as code owners June 23, 2026 06:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(session): per-session in-flight gate, response passthrough, bounded CPU offload#1468

refactor(session): per-session in-flight gate, response passthrough, bounded CPU offload#1468
guapisolo wants to merge 10 commits into
refactor/session-routed-experts-retentionfrom
refactor/session-server-concurrency

guapisolo commented Jun 23, 2026

Uh oh!

gemini-code-assist Bot commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

guapisolo commented Jun 23, 2026

What & why

Changes

Event-loop responsiveness (measured)

Testing

Uh oh!

gemini-code-assist Bot commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant