Skip to content

refactor(session): per-session in-flight gate, response passthrough, bounded CPU offload#1468

Open
guapisolo wants to merge 10 commits into
refactor/session-routed-experts-retentionfrom
refactor/session-server-concurrency
Open

refactor(session): per-session in-flight gate, response passthrough, bounded CPU offload#1468
guapisolo wants to merge 10 commits into
refactor/session-routed-experts-retentionfrom
refactor/session-server-concurrency

Conversation

@guapisolo

Copy link
Copy Markdown
Collaborator

What & why

Refactors the standalone Session Server (miles/rollout/session/) for safe concurrency without changing the single-process / single-uvicorn-worker deployment. The goals: stop a second concurrent same-session chat from racing the per-session trajectory state, remove a redundant parse/re-serialize on the success path, and keep the one asyncio event loop responsive under heavy routed-experts responses.

Stacked on #1463. Base branch is refactor/session-routed-experts-retention (the routed-experts retention perf change). This PR's diff is only the 10 concurrency commits on top. Merge #1463 first, or retarget to main after it lands.

Changes

1. Per-session in-flight gate (one in-flight chat per session)

  • New chat_inflight flag on LinearTrajectory; chat_completions restructured into three brief session.lock segments (claim → prepare → commit) separated by lock-free work.
  • A second concurrent chat on the same session fast-fails 409 (SessionBusyError) without entering the backend; closing (→404) takes priority over busy (→409).
  • A claimed-guarded try/finally releases the slot on every exit path (including cancellation); a request that got 404/409 never clears the owner's flag.
  • The executor and do_proxy are never awaited while holding session.lock (AST-verified).

2. Raw response passthrough

  • build_proxy_response returns the upstream bytes directly (preserving content-type, stripping stale content-length/transfer-encoding/content-encoding), removing the success-path second json.loads + JSONResponse re-dump.

3. Bounded CPU thread pool

  • SessionServer owns a ThreadPoolExecutor (--session-server-cpu-workers, default min(16, os.cpu_count())), shut down via app.router.on_shutdown.
  • Only stateless CPU work is offloaded via three pure helpers (_parse_request_body, _dump_request_body, _parse_and_validate_response); all session-state mutation stays on the event loop under the lock.

4. Stricter upstream-response validation (prevents training-data corruption)

  • _parse_and_validate_response classifies every malformed successful (200) upstream response as UpstreamResponseError/502 instead of leaking raw exceptions or accepting garbage: invalid JSON, bad choices/message/meta_info, length mismatch, non-int/bool token ids (entry[1]), and non-numeric/bool/non-finite (NaN/±Inf) logprob values (entry[0]). The latter is not inert metadata — openai_endpoint_utils reads entry[0] into Sample.rollout_log_probs, so a bad value would corrupt downstream RL training. SGLang [logprob, token_id, token_text] triples remain valid.

5. Invariant escalation + cleanup

  • num_assistant != expected_num_assistant at commit → logged ERROR + SessionInvariantError (500) instead of a silent 200-skip.
  • Removed the DEBUG _inflight_chat counter / debug_request_logger. uvicorn.run stays single-worker (multi-process + orjson deliberately deferred).

Event-loop responsiveness (measured)

5×5-iteration pooled before/after, K=64 concurrent ~1.3 MiB routed-experts responses, polling GET /health under load (tests/fast/router/bench_session_responsiveness.py):

/health under load before (inline parse) after (offloaded)
p95 ~1095 ms ~234 ms
p99 ~1339 ms ~303 ms
chat throughput ~10.5 chats/s ~17 chats/s

Per-iteration spread is tight and non-overlapping (before p95 966–1339ms; after 203–254ms). Mechanism: K concurrent inline parses serialize on the one loop (before-p95 ≈ K × single-parse ≈ 64 × 17ms ≈ 1088ms); offloading frees the loop. The GIL still bounds aggregate CPU work, so this is a responsiveness/tail-latency win, not raw CPU throughput.

Testing

  • pytest tests/fast/router/test_sessions.py tests/fast/router/test_session_race_conditions.py tests/fast/router/test_linear_trajectory.py -q81 passed.
  • Covers: same-session 409 (with promptness assertion + backend-arrival barrier), concurrent-chats-then-delete, slot release on every enumerated failure path (prepare error, non-200 passthrough, transport 502, state-update error, injected cancel, real client disconnect), invariant-mismatch 500 + ERROR log, passthrough fidelity, and full response-validation rejection (token ids, logprob values incl. non-finite, e2e bad-logprob → 502 with no record/token committed).

🤖 Generated with Claude Code

guapisolo and others added 10 commits June 22, 2026 20:44
…ded CPU thread pool

Enforce strict per-session linearity on the single-worker session server while
letting independent sessions run concurrently:

- In-flight gate: each LinearTrajectory admits one in-flight chat; a second
  concurrent same-session chat fast-fails 409 (SessionBusyError) at slot-claim,
  without entering SGLang. chat_completions is restructured into brief lock
  segments (claim / prepare / commit) with a claimed-guarded, cancellation-safe
  finally that releases the slot on every exit path; closing (404) beats busy.
- expected_num_assistant mismatch is now a logged 500 SessionInvariantError
  (unreachable under the gate) instead of a silent 200 skip.
- build_proxy_response passes the upstream body through unchanged (no second
  parse / re-serialize), preserving content-type and stripping stale framing
  headers; applies to all call sites.
- Bounded ThreadPoolExecutor (--session-server-cpu-workers, default
  min(16, os.cpu_count() or 1)) offloads only stateless CPU work (request/response
  JSON, validation) off the event loop; all session-state mutation stays on the
  event loop under session.lock. Shut down on app shutdown.
- Removed the DEBUG _inflight_chat counter and debug_request_logger middleware.
- Single uvicorn worker preserved; multi-process and orjson deferred (documented).

Tests: rewrote the same-session concurrency test to the 409 contract; added
slot-release-after-error and passthrough-fidelity coverage. 62 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Malformed successful upstream responses could leak raw parser and shape exceptions as 500s while several slot-release and passthrough paths were unverified. Harden the response validator to raise UpstreamResponseError and add focused tests for injected failure exits plus raw proxy response fidelity.
Adds tests/fast/router/bench_session_responsiveness.py (bench_ prefix → not
collected by pytest), driving K concurrent large-routed-experts chats across
distinct sessions while polling /health, reporting /health p50/p95/p99 under load.

Honest result (real numbers, no fabrication): with multi-MiB responses the
event loop is dominated by on-loop body I/O (httpx read + uvicorn write), so the
CPU-parse offload yields no measurable end-to-end /health speedup at this scale
(after ~= before: p50~20ms, p95~58ms, ~18.6 chats/s, 0 errors in both builds).
An isolated probe still shows the offload mechanism works - inline ~150ms parse
blocks the loop ~741ms vs ~0.07ms offloaded - so offloading helps only when an
individual parse is itself large enough to block the loop. GIL means total CPU
throughput is unchanged, as designed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add promptness, invariant-mismatch, and real-disconnect coverage for the session in-flight gate, and scale the responsiveness benchmark to 64 sessions while documenting that the current benchmark still lacks internal stage timing.
…ation

_parse_and_validate_response previously accepted any value at output_token_logprobs[i][1]
(string / float / None / bool token ids, and even a bare string entry whose [1] indexes a
character), and treated a bool completion_tokens as 1/0 — silently corrupting the stored
trajectory token ids on a malformed-but-HTTP-200 SGLang response. Now each entry must be a
(logprob, token_id) pair with a strict int token id, and completion_tokens must be a non-bool
int; violations raise UpstreamResponseError (502). Adds TestResponseTokenIdValidation.

76 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…validation

A successful (200) upstream chat response whose output_token_logprobs entry
has a non-numeric logprob (entry[0]: str/None/bool) was accepted and recorded.
That value flows into Sample.rollout_log_probs via openai_endpoint_utils and
would corrupt downstream RL training. Validate entry[0] is a real number
(int/float, bool excluded) alongside the existing token-id check, and classify
violations as UpstreamResponseError (502). SGLang [logprob, token_id, token_text]
triples (len > 2) remain valid.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add an HTTP-level one-shot test proving a 200 upstream response with a non-numeric logprob value returns 502, commits no record or accumulated token id, and releases the slot so the next legal chat returns 200.

Extend the responsiveness benchmark with JSON output, comparison output, commit/dirty metadata, per-stage CPU timing, and computed health percentiles so before/after runs can persist reviewable artifacts.
…ore/after

A 5x5-iteration pooled before/after run (inline parse vs cpu_executor offload,
K=64 concurrent ~1.3 MiB responses) shows the offload markedly improves event-loop
responsiveness and chat throughput at this scale: /health p95 ~1095ms -> ~234ms,
p99 ~1339ms -> ~303ms, throughput ~10.5 -> ~17 chats/s, with tight, non-overlapping
per-iteration spread. Mechanism: K inline parses serialize on the one loop
(before-p95 ~= K x single-parse cost ~= 64 x 17ms); offloading frees the loop.

This corrects the bench's earlier speculative interpretation, which claimed
no significant end-to-end change (it reasoned about a single parse vs body I/O
and missed concurrent-parse stacking). The GIL still bounds aggregate CPU work,
so the gain is responsiveness/tail-latency, not raw CPU throughput.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reject raw upstream NaN and Infinity logprob constants at the session response boundary so malformed successful responses return UpstreamResponseError instead of committing invalid rollout logprobs. Update the responsiveness comparison verdict to distinguish material improvements from noise-level no-regression results.
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant