fix(credit-router): define register/release_child_routing methods (latent AttributeError on FORK DAGs)#966
fix(credit-router): define register/release_child_routing methods (latent AttributeError on FORK DAGs)#966ajcasagrande wants to merge 2 commits into
Conversation
… sticky entry BranchOrchestrator already calls `self._sticky_router.register_child_routing(...)` and `release_child_routing(...)` at five sites (_branch_orchestrator_spawn.py:197,257, branch_orchestrator.py:391,455,467) but those methods are never defined on `StickyCreditRouter`. Test suites mock them, which is why CI is green, but the moment a real FORK-mode DAG runs in prod the orchestrator raises `AttributeError` at the first child dispatch. Sticky router - `_sticky_sessions` changes from `dict[str, str]` to `dict[str, _StickyEntry]`. The new dataclass tracks `worker_id`, `ref_count`, and `parent_final_seen` so the parent's sticky binding can outlive its own final turn while DAG children pinning to that worker are still in-flight. - `send_credit` resolves a `routing_key = parent_correlation_id or x_correlation_id` so DAG children pin to their parent's worker (was already the intent; nothing previously enforced it from the router side). Entry creation extends to "final turn that declares DAG spawns" so `register_child_routing` finds an entry to bump. Final-turn lifecycle decrements `ref_count` and flips `parent_final_seen`; eviction still requires both `ref_count <= 0` and `parent_final_seen`, and is suppressed when `has_forks`. - `register_child_routing(parent_corr)`: increments refcount. If the parent has no entry, logs a warning and continues (child routes via least-loaded — loses prefix-cache locality but stays correct). - `release_child_routing(parent_corr)`: decrements refcount; evicts only when both conditions above hold. Hot-path note: the final-turn lifecycle block is gated by `sticky_entry is not None`, not by a fallback `dict.get()`. The dominant single-turn-root-no-forks workload (most AIPerf benchmarks) takes the same dict-lookup count as before — no entry created, no second get, no allocation. ConversationBranchInfo - New `subagent_type: str | None` field. SPAWN-only (validator rejects on FORK). Free-form so upstream-recorded values like "Explore" or "Codex Subagent" pass through verbatim without enum-narrowing. Tests - TestStickyCreditRouterChildRoutingRefcount (6 tests): register/release refcount semantics, eviction conditions, missing-parent warning, idempotency. - TestStickyCreditRouterSendCreditFinalTurnLifecycle (4 tests): single-turn root no-forks (no entry created), final turn with forks (entry survives for register_child_routing), DAG-child final turn (parent entry untouched), multi-turn final turn (entry popped on no-forks). - TestConversationBranchInfoSubagentType (6 tests): SPAWN accepts "Explore" and arbitrary "Codex Subagent", FORK rejects non-None, round-trip. - Existing tests updated to use `_StickyEntry(worker_id=...)` for direct dict pokes (test_sticky_router.py, test_race_conditions.py). Guardrails - `StickyCreditRouter.send_credit` grandfathered in tools/ergonomics_baseline.json (function-size 90 > 80) and tools/ruff_baseline.json (C901, PLR0912) — the hot path is intentionally inlined per the class docstring's "Methods are intentionally large/inlined" note. Baseline regenerations restricted to these two new entries; unrelated stale entries left untouched. Verification: 12,689 unit tests pass + 64 DAG/sticky-routing component_integration tests pass; pre-commit clean. Related: official/ajc/dag4 (abandoned) had an earlier version of this fix bundled with the Weka loader stack and a narrow SubagentType enum. official/ajc/dag5 is the active DAG PR and still carries the same latent bug. If preferred, this can be retargeted onto dag5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c53ea483b3d9e48fd006a82e68385a56108f5a6Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c53ea483b3d9e48fd006a82e68385a56108f5a6Last updated for commit: |
WalkthroughThis PR introduces subagent-type metadata to conversation branching with SPAWN-mode validation, and refactors sticky-session state management to track refcounts and parent final-turn bookkeeping, enabling safe child routing past parent completion through new lifecycle methods. ChangesSticky Router Enhancement and Branch Model Metadata
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/unit/credit/test_sticky_router.py (1)
48-48: 💤 Low valueConsider using
next(iter())for single-element access.Static analysis suggests this minor optimization to avoid creating an intermediate list.
Suggested change
- assert list(router._sticky_sessions.values())[0].worker_id == "worker-2" + assert next(iter(router._sticky_sessions.values())).worker_id == "worker-2"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/credit/test_sticky_router.py` at line 48, Replace the intermediate list creation when asserting the single sticky session: instead of converting router._sticky_sessions.values() to a list and indexing [0], use next(iter(router._sticky_sessions.values())) to retrieve the first (and only) element; update the assertion that checks its worker_id (the existing assertion referencing router._sticky_sessions.values() and worker_id == "worker-2") to use next(iter(...)) for more efficient single-element access.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/credit/test_sticky_router.py`:
- Line 48: Replace the intermediate list creation when asserting the single
sticky session: instead of converting router._sticky_sessions.values() to a list
and indexing [0], use next(iter(router._sticky_sessions.values())) to retrieve
the first (and only) element; update the assertion that checks its worker_id
(the existing assertion referencing router._sticky_sessions.values() and
worker_id == "worker-2") to use next(iter(...)) for more efficient
single-element access.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4cae323f-a088-49ab-8bf3-c12659dac25f
📒 Files selected for processing (7)
src/aiperf/common/models/branch.pysrc/aiperf/credit/sticky_router.pytests/unit/common/models/test_branch_model.pytests/unit/credit/test_sticky_router.pytests/unit/timing/test_race_conditions.pytools/ergonomics_baseline.jsontools/ruff_baseline.json
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…ion / orchestrator (PR-A) Sits on top of 92d1342 (the foundation: CacheBustTarget enum + build_cache_bust_marker). Plumbs the marker plumbing end-to-end without actually injecting markers anywhere - injection at request-build time is deferred to a follow-up that matches the source-of-truth split (cjq/agentx-v0.3 injects at workers/worker.py::_apply_cache_bust, NOT at compose time). This PR is the foundation a worker-layer follow-up will plug into. Config - src/aiperf/config/dataset/content.py: new CacheBustConfig with ``target: CacheBustTarget = NONE`` sub-model on PromptConfig. - src/aiperf/config/dataset/__init__.py: re-export CacheBustConfig. - src/aiperf/config/schema/aiperf-config.schema.json: auto-regenerated. Credit + session - src/aiperf/credit/structs.py: Credit + CreditContext gain ``cache_bust_target: CacheBustTarget`` and ``cache_bust_marker: str | None`` fields with sensible defaults; msgspec ``omit_defaults`` keeps the wire footprint unchanged when cache-bust is disabled. - src/aiperf/timing/conversation_source.py: SampledSession carries the same fields. ``ConversationSource.start_branch_child`` and ``start_pre_session_child`` accept ``cache_bust_marker`` and ``cache_bust_target`` as keyword-only kwargs with NONE/None defaults so existing callers stay working. - src/aiperf/common/models/record_models.py: RecordContext and MetricRecordInfo carry cache_bust_target for raw-JSONL export. Orchestrator - src/aiperf/timing/branch_orchestrator.py: ``_mint_child_marker (child_conversation_id) -> str | None`` method calls ``build_cache_bust_marker(benchmark_id, 0, 0, child_conversation_id, target=self._cache_bust_target)``. New ``cache_bust_target`` ``__init__`` kwarg defaulting to NONE (PhaseRunner construction site wires the real value later - out of PR-A scope). Returns None when target=NONE. - src/aiperf/timing/_branch_orchestrator_spawn.py: spawn path passes ``cache_bust_marker=self._mint_child_marker(child_conv_id)`` to ``start_branch_child`` unconditionally. Architectural note cjq/agentx-v0.3 injects cache-bust markers at the WORKER layer (``workers/worker.py:_apply_cache_bust*`` lines 87/257/858/964) - the composer there only does ISL-budget compensation for marker tokens (``composer/base.py`` lines 130-175). This PR deliberately does not add a compose-time injection path: a worker-layer follow-up will implement the request-build-time injection that matches the source-of- truth flow. PR-A's marker fields exist as fields on Credit/Session; they are populated by the orchestrator (for SPAWN children) but the worker that ultimately consumes them is not modified here. Note on ScenarioSpec: an earlier draft added a minimal ``src/aiperf/common/scenario/{__init__.py,base.py}`` carrying a ``ScenarioSpec`` primitive with a ``require_cache_bust`` field. Removed before merge - the data structure has no consumer in this PR (no ``_run_scenario_validator``, no scenario registry, no CLI surface), so it would be dead code until PR-F (InferenceX AgentX-MVP scenario) introduces an actual scenario that uses it. ScenarioSpec lands with PR-F. Tests - tests/unit/common/config/test_cache_bust_config.py - 19 tests for CacheBustConfig defaults, value validation, round-trip. - tests/unit/credit/test_credit_cache_bust_fields.py - 10 tests for Credit + CreditContext field defaults, explicit values, msgspec wire round-trip with omit_defaults, ``TurnToSend.from_previous_credit`` field propagation. - tests/unit/timing/test_conversation_source_cache_bust.py - 7 tests covering SampledSession field plumbing, ``build_first_turn`` propagation, and the keyword-only kwargs on the two start_*_child signatures. - tests/unit/common/models/test_record_models_cache_bust.py - 7 tests for RecordContext + MetricRecordInfo defaults and orjson round-trip. - tests/unit/timing/test_branch_orchestrator_cache_bust.py - 8 tests covering _mint_child_marker: NONE returns None, deterministic for same (benchmark_id, child_conversation_id, target), distinct per-child_conversation_id and per-benchmark_id. Verification - 12,755 unit tests pass + 79 skipped + 0 failed (up from 12,689 on the post-#966 baseline; +66 net new). - pre-commit run --files <touched> clean (including auto-regenerated config schema and ruff format). Cross-slice integration TODOs (for follow-up PRs) 1. PhaseRunner construction site needs to thread ``cache_bust_target=user_config.input.prompt.cache_bust.target`` and the real ``benchmark_id`` into BranchOrchestrator(). Until done, _mint_child_marker always returns None - safe default. 2. Worker-layer follow-up (out of PR-A scope) needs to port the ``_apply_cache_bust_to_system_message`` + ``_apply_cache_bust`` logic from cjq/agentx-v0.3 (workers/worker.py:87, 257) and read ``credit.cache_bust_marker`` / ``credit.cache_bust_target`` to inject markers into the wire payload at request-build time. 3. The raw-record export path needs to copy ``credit.cache_bust_marker`` onto RecordContext + MetricRecordInfo so JSONL captures the rid for downstream analysis. 4. ScenarioSpec primitive lands with PR-F (the first scenario that uses it). That PR will add ``require_cache_bust`` plus the broader scenario-locking machinery. 5. CLI surface: ``--cache-bust`` flag not yet wired through the v2 converter (``config/flags/_converter_dataset.py``). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
| x_correlation_id = credit.x_correlation_id | ||
| sticky_worker_id = self._sticky_sessions.get(x_correlation_id) | ||
| # DAG children pin to their parent's worker; otherwise pin to self. | ||
| routing_key = credit.parent_correlation_id or credit.x_correlation_id |
There was a problem hiding this comment.
SPAWN child credits also carry parent_correlation_id, so this unconditionally keys them under the parent while the final-turn cleanup skips all parent_correlation_id credits, leaking multi-turn SPAWN sticky sessions and forcing sibling SPAWN children onto one worker. Fix: only use parent_correlation_id and the refcount lifecycle for credit.branch_mode == ConversationBranchMode.FORK; route and clean up SPAWN children by their own x_correlation_id.
Summary
AttributeErrorin main's DAG runtime:BranchOrchestratorcallsself._sticky_router.register_child_routing(...)andrelease_child_routing(...)at five sites (_branch_orchestrator_spawn.py:197,257,branch_orchestrator.py:391,455,467) but those methods were never defined onStickyCreditRouter. Test suites mock them so CI is green — the moment a real FORK-mode DAG runs in prod, the orchestrator raisesAttributeErrorat the first child dispatch._sticky_sessionsfromdict[str, str]todict[str, _StickyEntry]so the parent's sticky binding can outlive its own final turn while DAG children pinning to its worker are still in-flight (ref_count+parent_final_seenlifecycle).subagent_type: str \| Nonefield onConversationBranchInfo(SPAWN-only) so loaders can pass through upstream-recorded values like"Explore"or"Codex Subagent"verbatim without enum-narrowing.send_credit: the dominant single-turn-root-no-forks workload (most AIPerf benchmarks) no longer pays an extradict.getper credit — the final-turn lifecycle gates onsticky_entry is not Nonedirectly instead of falling back to a second dict lookup.Hot-path impact
Hot-path performance was a deliberate constraint (per the
StickyCreditRouterdocstring: "Methods are intentionally large/inlined to avoid function call overhead"). Per-credit cost analysis:sticky_entry.worker_id), +1or-eval on routing key. ~15-25ns. Below noise at typical AIPerf throughputs._StickyEntryallocation, no seconddict.get(thanks to the optimization). Effectively unchanged from main._StickyEntry(slots=True)allocation instead ofdict[k] = str. ~200ns added, once per multi-turn session.awaits insend_credit.Lifecycle (new semantics)
register_child_routing(parent_corr)increments refcount; logs a warning if the parent has no entry (graceful degradation — child routes via least-loaded, loses prefix-cache locality but stays correct).release_child_routing(parent_corr)decrements refcount; evicts the entry only when bothref_count <= 0ANDparent_final_seen.subagent_type field
Added because the Weka kv-cache-tester trace schema already carries
subagent_type: strper recorded subagent (real corpus values:"Explore","Codex Subagent", etc. — free-form, set by upstream Claude Code proxy). Typing the new aiperf-side field asstr \| Nonerather than a narrow enum lets loaders pass these through verbatim without filter/map logic and without rejecting new agent types as corpora evolve.The field validator still enforces the SPAWN-only invariant (FORK children inherit the parent's role, so they have no separate classification).
Test plan
tests/unit/credit/test_sticky_router.py::TestStickyCreditRouterChildRoutingRefcount(6 tests): register/release refcount semantics, eviction conditions, missing-parent warning, no-op-on-unknown-parent.tests/unit/credit/test_sticky_router.py::TestStickyCreditRouterSendCreditFinalTurnLifecycle(4 tests): single-turn root no-forks (no entry created), final turn with forks (entry survives forregister_child_routing), DAG-child final turn (parent untouched), multi-turn final turn (entry popped on no-forks).tests/unit/common/models/test_branch_model.py::TestConversationBranchInfoSubagentType(6 tests): SPAWN accepts"Explore"and arbitrary"Codex Subagent", FORK rejects non-None, round-trip.test_dag_adversarial_timing_modes,test_dag_combined_pathology,test_sticky_routing— 64 passed.pre-commit run --files <touched>clean.AttributeErroron child dispatch.Guardrails
StickyCreditRouter.send_creditgrandfathered intools/ergonomics_baseline.json(function-size 90 > 80) andtools/ruff_baseline.json(C901 cyclomatic 12 > 10, PLR0912 branches 13 > 12). The hot path is intentionally inlined per the class docstring; extracting helpers would add per-credit function-call overhead. Baseline additions restricted to exactly these two new entries; unrelated stale entries left untouched.Related branches
ajc/dag4(1 commit ahead of main, dated May) had an earlier version of this fix bundled with the Weka loader stack and a narrowSubagentTypeenum — appears abandoned, no merges from main since.ajc/dag5(17 commits ahead of main, actively being worked) is the successor of PR feat(dag): conversation DAG benchmarks (dag_jsonl) — FORK / SPAWN / f… #891 and still carries the same latentAttributeError. Happy to retarget this PR ontodag5if you'd prefer to fold the fix into that review cycle.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
subagent_typespecification in conversation branches for improved branch configuration.Tests