perf(session): drop superseded routed_experts/indexer_topk blobs from… by guapisolo · Pull Request #1463 · radixark/miles

guapisolo · 2026-06-22T18:43:16Z

Summary

Drop superseded routed_experts/indexer_topk blobs from linear-trajectory session records.

Motivation

Each chat-completion turn stored the full upstream response in its SessionRecord, including the all-token routed_experts/indexer_topk blob (~1KB/token over the whole prompt+output). Because every turn's prompt is the full accumulated prefix, the blobs overlap and grow with each turn, so retained payload is O(turns × prefix). A 64-trajectory × 50-turn × ~50k-context run retained tens of GB, and the all-token run failed with HTTP 502. The merged training sample and the per-turn output_token_logprobs are unchanged.
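Assume for illustration that each turn adds a fixed c bytes of blob, roughly 1 KB × the tokens added that turn. The turn-t blob is then about c·t, the prefix after T turns is about c·T, and the two retention policies over T turns work out as follows. Here k = MAX_ASSISTANT_ROLLBACK_STEPS + 1 is the number of records whose blobs are kept:

```math
\begin{aligned}
\text{keep-all:}\quad & \sum_{t=1}^{T} c\,t = \frac{c\,T(T+1)}{2} = O(T \cdot \text{prefix}) \\
\text{keep last } k\text{:}\quad & \sum_{t=T-k+1}^{T} c\,t = \frac{c\,k\,(2T-k+1)}{2} \le k\,c\,T = O(\text{prefix})
\end{aligned}
```

With k = 2, the ratio between the two is T(T+1) / (2(2T-1)), which is about 12.9 at T = 50.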

Before / After

Before: every record kept its blob, so retained payload was O(turns × prefix). After: append_record drops the blob from records that can no longer be the merged tail, so retained payload is O(prefix).
What moved where: the strip lives in LinearTrajectory.append_record and retains the blobs of the last MAX_ASSISTANT_ROLLBACK_STEPS + 1 records.
SessionRegistry.__init__ gains an assert that generate_multi_samples is False, because the last-wins reasoning only holds when turns are merged.
The agentic_tool_call_multi_samples test variant is removed accordingly.
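The strip described above can be sketched with plain Python containers. This is a minimal sketch under assumptions, not the PR's code. `SessionRecordSketch` and `LinearTrajectorySketch` are hypothetical stand-ins for SessionRecord and LinearTrajectory. The blob key names are assumed. MAX_ASSISTANT_ROLLBACK_STEPS = 1 is inferred from the "keeps last two" test name rather than read from the source.

```python
from dataclasses import dataclass, field

# Assumption: inferred from test_keeps_last_two_and_preserves_logprobs;
# the real constant is defined in the miles session code.
MAX_ASSISTANT_ROLLBACK_STEPS = 1

# Hypothetical key names for the all-token blobs inside a stored response.
BLOB_KEYS = ("routed_experts", "indexer_topk")


@dataclass
class SessionRecordSketch:
    """Hypothetical stand-in for a per-turn SessionRecord."""
    response: dict
    output_token_logprobs: list


@dataclass
class LinearTrajectorySketch:
    records: list = field(default_factory=list)

    def append_record(self, record: SessionRecordSketch) -> None:
        self.records.append(record)
        # After the append, the record at this index has just left the window
        # of the last MAX_ASSISTANT_ROLLBACK_STEPS + 1 records, so it can no
        # longer become the merged tail.
        idx = len(self.records) - 1 - (MAX_ASSISTANT_ROLLBACK_STEPS + 1)
        if idx >= 0:
            for key in BLOB_KEYS:
                # pop() removes the record's reference to the blob; logprobs stay.
                self.records[idx].response.pop(key, None)


traj = LinearTrajectorySketch()
for turn in range(5):
    traj.append_record(SessionRecordSketch(
        response={"routed_experts": b"\x00" * (turn + 1), "indexer_topk": b"\x01"},
        output_token_logprobs=[-0.1] * (turn + 1),
    ))
print([("routed_experts" in r.response) for r in traj.records])
# [False, False, False, True, True]
```

In this sketch each append strips at most one record, so the per-turn cost is constant. Re-stripping an index that is already empty (for example after a rollback shortened the list) is a no-op, because `pop` is given a default.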

Behavior Preservation

How we know: test_stripping_superseded_records_preserves_merged_r3 replays a 3-turn chain, strips the first two records, then asserts the merged rollout_routed_experts is byte-identical to the all-present baseline.
test_keeps_last_two_and_preserves_logprobs asserts output_token_logprobs survive every strip.
test_single_rollback_leaves_surviving_tail_with_blob asserts a rolled-back tail still carries its blob.
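The property the first test checks can be sketched with a simplified last-wins fold. `make_chain` and `merged_routed_experts` are hypothetical helpers, not the actual test or merge_samples code. As with the real per-turn blobs, each synthetic blob spans the whole prefix.

```python
import copy


def make_chain(n_turns: int, tokens_per_turn: int = 4) -> list:
    """Hypothetical per-turn responses; turn t's blob covers the whole prefix."""
    return [
        {
            "routed_experts": bytes([t]) * (tokens_per_turn * (t + 1)),
            "output_token_logprobs": [-0.1 * (t + 1)] * tokens_per_turn,
        }
        for t in range(n_turns)
    ]


def merged_routed_experts(chain: list):
    """Last-wins fold: the latest turn that still carries a blob wins."""
    merged = None
    for resp in chain:
        if "routed_experts" in resp:
            merged = resp["routed_experts"]
    return merged


baseline = make_chain(3)
stripped = copy.deepcopy(baseline)
for resp in stripped[:2]:  # strip the first two records, keep the tail
    del resp["routed_experts"]

# The merged blob is byte-identical, and per-turn logprobs are untouched.
assert merged_routed_experts(stripped) == merged_routed_experts(baseline)
assert [r["output_token_logprobs"] for r in stripped] == [
    r["output_token_logprobs"] for r in baseline
]
```

If the tail's blob were stripped too, this fold would fall back to an older, shorter blob. That is why the strip must never touch a record that can still become the tail.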

Verification

Existing suite (test_linear_trajectory + test_openai_endpoint_utils + test_sessions + test_session_race_conditions) → 82 passed.

Retention microbenchmark drives the real append_record with synthetic blobs sized to the workload profile (~1KB/token, ~1k tokens/turn, so context reaches ~50k at turn 50). Per trajectory, retained blob bytes grow with turns pre-fix but stay O(prefix) after:

 turns | keep-all (pre-fix) MB | fixed (this PR) MB | reduction
    10 |                  53.7 |               18.6 |     2.9x
    20 |                 205.1 |               38.1 |     5.4x
    30 |                 454.1 |               57.6 |     7.9x
    40 |                 800.8 |               77.1 |    10.4x
    50 |                1245.1 |               96.7 |    12.9x
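The per-trajectory table is consistent with a simple model. The turn-t blob covers t × 1k tokens, keep-all retains every turn's blob, and the fix retains only the last two. The sketch below recomputes that arithmetic without running append_record, unlike the real microbenchmark. It assumes exactly 1 KiB per token, exactly 1,000 tokens per turn, MAX_ASSISTANT_ROLLBACK_STEPS = 1, and MB read as MiB. Under those assumptions it reproduces every row to one decimal.

```python
from typing import Optional

BYTES_PER_TOKEN = 1024   # assumption: "~1KB/token" read as 1 KiB
TOKENS_PER_TURN = 1000   # assumption: "~1k tokens/turn", taken as exact
KEEP = 2                 # MAX_ASSISTANT_ROLLBACK_STEPS + 1, assuming the constant is 1
MIB = 1024 * 1024


def blob_bytes(turn: int) -> int:
    """Blob size at 1-indexed `turn`: it spans the whole accumulated context."""
    return turn * TOKENS_PER_TURN * BYTES_PER_TOKEN


def retained_bytes(turns: int, keep: Optional[int]) -> int:
    """Bytes still held after `turns` turns; keep=None means keep-all."""
    sizes = [blob_bytes(t) for t in range(1, turns + 1)]
    return sum(sizes if keep is None else sizes[-keep:])


rows = []
for turns in (10, 20, 30, 40, 50):
    before = retained_bytes(turns, None) / MIB
    after = retained_bytes(turns, KEEP) / MIB
    rows.append((turns, f"{before:.1f}", f"{after:.1f}", f"{before / after:.1f}x"))

for row in rows:
    print(" | ".join(str(cell) for cell in row))
```

The exact match suggests that the benchmark's "MB" are MiB and that the fixed run holds two blobs per trajectory, in line with the keep-last-two test.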

The session server is a singleton process (router_manager.py:116), so the retained sets of all in-flight sessions coexist in one process. Building 64 concurrent sessions × 50 turns in one process therefore measures the aggregate directly (not extrapolated), using tracemalloc's live Python heap:

 run                | live Python heap | retained blobs
 keep-all (pre-fix) |         77.83 GB |       77.82 GB
 fixed (this PR)    |          6.05 GB |        6.04 GB

The keep-all row substantiates the "tens of GB" behind the 502.

Live heap equals retained blobs in both runs, confirming the popped blobs are released rather than held by a lingering reference. tracemalloc is used instead of RSS because glibc keeps freed chunks resident, which would over-report the fixed run.
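Here is a small sketch of the tracemalloc idea, using synthetic 1 MiB blobs instead of the real session objects. The sizes and counts are illustrative assumptions. Traced memory drops by roughly the size of the popped blobs, which is how "released rather than held by a lingering reference" shows up in this instrument. As a separate consistency check on the aggregate figures, 64 × 1245.1 MiB ≈ 77.82 GiB and 64 × 96.7 MiB ≈ 6.04 GiB. These match the retained-blob column if its GB are read as GiB.

```python
import tracemalloc

BLOB_SIZE = 1 << 20   # 1 MiB per synthetic blob (illustrative, not the real size)
N_RECORDS = 20
KEEP = 2              # assumed MAX_ASSISTANT_ROLLBACK_STEPS + 1

tracemalloc.start()
# bytes(n) allocates a fresh zero-filled object, so each blob is distinct.
records = [{"routed_experts": bytes(BLOB_SIZE)} for _ in range(N_RECORDS)]
held, _ = tracemalloc.get_traced_memory()

# Pop every blob outside the retained window; nothing else references them.
for rec in records[:-KEEP]:
    rec.pop("routed_experts")
after_strip, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"held {held / 2**20:.1f} MiB, after strip {after_strip / 2**20:.1f} MiB")
```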

Review Focus

Scrutinize the append_record strip index len(records) - 1 - (MAX_ASSISTANT_ROLLBACK_STEPS + 1) against _try_detect_and_rollback_to_assistant_checkpoint, which truncates records on a single-step rollback.
Scrutinize the SessionRegistry.__init__ assert: confirm the session-server subprocess receives the same args, so generate_multi_samples=True is rejected before any session runs.
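To make the first review point concrete, here is a list-only sketch of how the strip index interacts with a tail-truncating rollback, along with the shape of the constructor guard from the second point. `append`, `rollback` and `SessionRegistrySketch` are hypothetical stand-ins for append_record, _try_detect_and_rollback_to_assistant_checkpoint and SessionRegistry.__init__. MAX_ASSISTANT_ROLLBACK_STEPS = 1 is an assumed value.

```python
from typing import Optional

MAX_ASSISTANT_ROLLBACK_STEPS = 1  # assumed value


def strip_index(n_records: int) -> Optional[int]:
    """Index whose blob is dropped right after the list grows to n_records."""
    idx = n_records - 1 - (MAX_ASSISTANT_ROLLBACK_STEPS + 1)
    return idx if idx >= 0 else None


def append(records: list, blob: bytes) -> None:
    records.append({"routed_experts": blob})
    idx = strip_index(len(records))
    if idx is not None:
        records[idx].pop("routed_experts", None)  # no-op if already stripped


def rollback(records: list, steps: int) -> None:
    """Hypothetical rollback: truncate the last `steps` records."""
    del records[len(records) - steps:]


class SessionRegistrySketch:
    def __init__(self, generate_multi_samples: bool) -> None:
        # Last-wins stripping is only safe when per-turn samples are merged.
        assert generate_multi_samples is False, "multi-sample output is unsupported"


records: list = []
for turn in range(4):
    append(records, bytes([turn]))
rollback(records, steps=1)
assert "routed_experts" in records[-1]  # promoted tail still has its blob

append(records, b"\x09")
print([r.get("routed_experts") for r in records])
# [None, None, b'\x02', b'\t']

# In this sketch, a rollback deeper than MAX_ASSISTANT_ROLLBACK_STEPS
# exposes a stripped tail.
rollback(records, steps=2)
assert "routed_experts" not in records[-1]

try:
    SessionRegistrySketch(generate_multi_samples=True)
    rejected = False
except AssertionError:
    rejected = True
```

A bare `assert` is skipped when Python runs with `-O`. Raising an explicit exception would keep the guard active in that mode as well.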

… session records

Each chat-completion turn stored the full upstream response in its SessionRecord, including the all-token routed_experts/indexer_topk blob (~1KB/token over the whole prompt+output). Every turn's prompt is the full accumulated prefix, so these blobs overlap and grow per turn; a 64-trajectory x 50-turn x ~50k-context run retained tens of GB and the all-token run failed with 502.

merge_samples folds per-turn samples last-wins, so only the most recent turn's blob is ever read. append_record now drops the blob from records that can no longer be the consumed tail, keeping the last MAX_ASSISTANT_ROLLBACK_STEPS + 1 (so a single rollback's promoted tail still carries its blob) and leaving logprobs and the consumer untouched. Retained size drops from O(turns*prefix) to O(prefix).

This last-wins reasoning only holds when turns are merged, so SessionRegistry asserts generate_multi_samples is False; the agentic_tool_call_multi_samples test variant (session server + multi-sample output) is removed accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


guapisolo requested review from fzyzcjy, jybsuper, maocheng23 and yueming-yuan as code owners June 22, 2026 18:43

guapisolo requested review from Shi-Dong and Zhichenzzz as code owners June 23, 2026 02:27

guapisolo force-pushed the refactor/session-routed-experts-retention branch from 08bde32 to 9cf2a03 Compare June 23, 2026 06:57

guapisolo mentioned this pull request Jun 23, 2026 in refactor(session): per-session in-flight gate, response passthrough, bounded CPU offload #1468 (open)


guapisolo wants to merge 1 commit into main from refactor/session-routed-experts-retention
