perf(session): drop superseded routed_experts/indexer_topk blobs from…#1463

Open
guapisolo wants to merge 1 commit into main from refactor/session-routed-experts-retention

@guapisolo guapisolo commented Jun 22, 2026

Collaborator

Summary

Drop superseded routed_experts/indexer_topk blobs from linear-trajectory session records.

Motivation

Each chat-completion turn stored the full upstream response in its SessionRecord, including the all-token routed_experts/indexer_topk blob (~1KB/token over the whole prompt+output). Every turn's prompt is the full accumulated prefix, so the blobs overlap and grow per turn: retained payload is O(turns × prefix). A 64-trajectory x 50-turn x ~50k-context run retained tens of GB, and the all-token run failed with 502. The merged training sample and the per-turn output_token_logprobs are unchanged by this PR.

Before / After

  • Before, every record kept its blob, so retained payload was O(turns × prefix). After, append_record drops the blob from records that can no longer be the merged tail, so retained payload is O(prefix).
  • What moved where: the strip lives in LinearTrajectory.append_record, retaining the last MAX_ASSISTANT_ROLLBACK_STEPS + 1 records' blobs.
  • SessionRegistry.__init__ now asserts that generate_multi_samples is False; the last-wins reasoning only holds when turns are merged.
  • The agentic_tool_call_multi_samples test variant is removed accordingly.
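
The strip described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `SessionRecord` shape, the `response` dict layout, the `BLOB_KEYS` tuple, and the value `MAX_ASSISTANT_ROLLBACK_STEPS = 1` are all assumptions; only the names `LinearTrajectory.append_record`, `MAX_ASSISTANT_ROLLBACK_STEPS`, and the strip index expression come from this description.

```python
from dataclasses import dataclass, field

# Assumed value; the PR names the constant but does not state its value here.
MAX_ASSISTANT_ROLLBACK_STEPS = 1
# Hypothetical: assumes the blobs sit at the top level of the stored response.
BLOB_KEYS = ("routed_experts", "indexer_topk")


@dataclass
class SessionRecord:
    # Hypothetical stand-in for the real record, which stores the full upstream response.
    response: dict


@dataclass
class LinearTrajectory:
    records: list = field(default_factory=list)

    def append_record(self, record: SessionRecord) -> None:
        self.records.append(record)
        # Newest record that can no longer become the merged tail,
        # even after a rollback of MAX_ASSISTANT_ROLLBACK_STEPS turns.
        strip_idx = len(self.records) - 1 - (MAX_ASSISTANT_ROLLBACK_STEPS + 1)
        if strip_idx >= 0:
            for key in BLOB_KEYS:
                # Only the blobs are dropped; logprobs and everything else stay.
                self.records[strip_idx].response.pop(key, None)


def make_record(turn: int) -> SessionRecord:
    # Synthetic payload whose size grows with the turn number.
    return SessionRecord({
        "routed_experts": b"x" * turn,
        "indexer_topk": b"y" * turn,
        "output_token_logprobs": [-0.1] * turn,
    })


traj = LinearTrajectory()
for turn in range(1, 6):
    traj.append_record(make_record(turn))

has_blob = ["routed_experts" in r.response for r in traj.records]
has_logprobs = ["output_token_logprobs" in r.response for r in traj.records]
```

Because each append strips exactly one record, the cost per turn is constant, and after five turns only the last `MAX_ASSISTANT_ROLLBACK_STEPS + 1` records still carry blobs while every record keeps its logprobs.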

Behavior Preservation

  • How we know: test_stripping_superseded_records_preserves_merged_r3 replays a 3-turn chain, strips the first two records, then asserts the merged rollout_routed_experts is byte-identical to the all-present baseline.
  • test_keeps_last_two_and_preserves_logprobs asserts output_token_logprobs survive every strip.
  • test_single_rollback_leaves_surviving_tail_with_blob asserts a rolled-back tail still carries its blob.

Verification

Existing suite (test_linear_trajectory + test_openai_endpoint_utils + test_sessions + test_session_race_conditions) → 82 passed.

The retention microbenchmark drives the real append_record with synthetic blobs sized to the workload profile (~1KB/token, ~1k tokens/turn, so context reaches ~50k at turn 50). Per trajectory, retained blob bytes grow quadratically with turns before the fix (O(turns × prefix)) but stay O(prefix) after:

 turns | keep-all (pre-fix) MB | fixed (this PR) MB | reduction
    10 |                  53.7 |               18.6 |     2.9x
    20 |                 205.1 |               38.1 |     5.4x
    30 |                 454.1 |               57.6 |     7.9x
    40 |                 800.8 |               77.1 |    10.4x
    50 |                1245.1 |               96.7 |    12.9x
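
The table's growth pattern can be reproduced with plain arithmetic. The sketch below is not the benchmark itself; it assumes exactly 1024 bytes per token, exactly 1000 new tokens per turn, sizes reported in MiB, and that the fix keeps the last two blobs (that is, MAX_ASSISTANT_ROLLBACK_STEPS equal to 1). The function name `retained_mib` is made up for this illustration.

```python
BYTES_PER_TOKEN = 1024   # assumed reading of "~1KB/token"
TOKENS_PER_TURN = 1000   # assumed reading of "~1k tokens/turn"
KEEP = 2                 # assumed: MAX_ASSISTANT_ROLLBACK_STEPS + 1, constant equal to 1
MIB = 1024 * 1024


def retained_mib(turns, keep=None):
    """Blob size retained after `turns` turns; keep=None means keep every blob."""
    first = 1 if keep is None else max(1, turns - keep + 1)
    # The blob of turn k covers the whole prefix, i.e. k * TOKENS_PER_TURN tokens.
    tokens = sum(k * TOKENS_PER_TURN for k in range(first, turns + 1))
    return tokens * BYTES_PER_TOKEN / MIB


for turns in (10, 20, 30, 40, 50):
    keep_all = retained_mib(turns)
    fixed = retained_mib(turns, KEEP)
    print(f"{turns:5d} | {keep_all:8.1f} | {fixed:6.1f} | {keep_all / fixed:5.1f}x")
```

Keep-all sums an arithmetic series, so it grows with the square of the turn count, while the fixed variant sums only the last two terms and grows linearly with the prefix.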

The session server is a singleton process (router_manager.py:116), so all in-flight sessions' retained sets coexist in one process. Building 64 concurrent sessions x 50 turns in one process measures the aggregate directly (not extrapolated), via tracemalloc live Python heap:

                   live Python heap | retained blobs
keep-all (pre-fix)        77.83 GB  |   77.82 GB     (substantiates "tens of GB" -> 502)
fixed (this PR)            6.05 GB  |    6.04 GB

Live heap equals retained blobs in both runs, confirming the popped blobs are released rather than held by a lingering reference. tracemalloc is used instead of RSS because glibc keeps freed chunks resident, which would over-report the fixed run.
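
A small illustration of the measurement approach, using only the standard-library tracemalloc module. It is a sketch with synthetic 1 MiB blobs, not the PR's benchmark; the blob count and sizes are arbitrary choices.

```python
import tracemalloc

tracemalloc.start()

# Eight synthetic 1 MiB blobs, standing in for per-turn routed_experts payloads.
blobs = [bytes(1024 * 1024) for _ in range(8)]
held, _peak = tracemalloc.get_traced_memory()

# Drop the six oldest blobs and keep the last two.
del blobs[:6]
kept, _peak = tracemalloc.get_traced_memory()

tracemalloc.stop()

print(f"live heap before drop: {held / 2**20:.1f} MiB")
print(f"live heap after drop:  {kept / 2**20:.1f} MiB")
```

The traced figure falls as soon as the last reference to a blob goes away, whether or not the allocator has returned the pages to the operating system, which is why it isolates lingering references better than RSS does.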

Review Focus

  • Scrutinize the append_record strip index len(records) - 1 - (MAX_ASSISTANT_ROLLBACK_STEPS + 1) against _try_detect_and_rollback_to_assistant_checkpoint, which truncates records on a single-step rollback.
  • Scrutinize the SessionRegistry.__init__ assert: confirm the session-server subprocess receives the same args, so generate_multi_samples=True is rejected before any session runs.
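
For the second review point, the guard presumably has roughly the following shape. This is a hypothetical stand-in: the constructor signature, the `args` object, and the error message are assumptions, and only the names `SessionRegistry` and `generate_multi_samples` come from this PR.

```python
from types import SimpleNamespace


class SessionRegistry:
    """Hypothetical stand-in; the real constructor signature is not shown in this PR."""

    def __init__(self, args):
        # Blob stripping relies on last-wins merging of turns, which does not
        # hold when every turn is emitted as its own sample.
        assert args.generate_multi_samples is False, (
            "session blob stripping requires generate_multi_samples=False"
        )
        self.args = args
        self.sessions = {}


# Accepted: turns are merged into one sample.
registry = SessionRegistry(SimpleNamespace(generate_multi_samples=False))

# Rejected before any session can be created.
try:
    SessionRegistry(SimpleNamespace(generate_multi_samples=True))
    rejected = False
except AssertionError:
    rejected = True
```

One thing to keep in mind while reviewing: Python skips `assert` statements when run with `-O`, so a guard of this form only protects processes started without that flag.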

… session records

Each chat-completion turn stored the full upstream response in its
SessionRecord, including the all-token routed_experts/indexer_topk blob
(~1KB/token over the whole prompt+output). Every turn's prompt is the full
accumulated prefix, so these blobs overlap and grow per turn — a 64-trajectory
x 50-turn x ~50k-context run retained tens of GB and the all-token run failed
with 502.

merge_samples folds per-turn samples last-wins, so only the most recent turn's
blob is ever read. append_record now drops the blob from records that can no
longer be the consumed tail, keeping the last MAX_ASSISTANT_ROLLBACK_STEPS + 1
(so a single rollback's promoted tail still carries its blob) and leaving
logprobs and the consumer untouched. Retained size drops from O(turns*prefix)
to O(prefix).

This last-wins reasoning only holds when turns are merged, so SessionRegistry
asserts generate_multi_samples is False; the agentic_tool_call_multi_samples
test variant (session server + multi-sample output) is removed accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>