fix(credit-router): define register/release_child_routing methods (latent AttributeError on FORK DAGs) by ajcasagrande · Pull Request #966 · ai-dynamo/aiperf

ajcasagrande · 2026-05-20T05:50:32Z

Summary

Fixes a latent AttributeError in main's DAG runtime: BranchOrchestrator calls self._sticky_router.register_child_routing(...) and release_child_routing(...) at five sites (_branch_orchestrator_spawn.py:197,257, branch_orchestrator.py:391,455,467) but those methods were never defined on StickyCreditRouter. Test suites mock them so CI is green — the moment a real FORK-mode DAG runs in prod, the orchestrator raises AttributeError at the first child dispatch.
Refactors _sticky_sessions from dict[str, str] to dict[str, _StickyEntry] so the parent's sticky binding can outlive its own final turn while DAG children pinning to its worker are still in-flight (ref_count + parent_final_seen lifecycle).
Adds a subagent_type: str \| None field on ConversationBranchInfo (SPAWN-only) so loaders can pass through upstream-recorded values like "Explore" or "Codex Subagent" verbatim without enum-narrowing.
Optimizes send_credit: the dominant single-turn-root-no-forks workload (most AIPerf benchmarks) no longer pays an extra dict.get per credit — the final-turn lifecycle gates on sticky_entry is not None directly instead of falling back to a second dict lookup.

Hot-path impact

Hot-path performance was a deliberate constraint (per the StickyCreditRouter docstring: "Methods are intentionally large/inlined to avoid function call overhead"). Per-credit cost analysis:

Sticky-hit common case: +1 attribute deref (sticky_entry.worker_id), +1 or-eval on routing key. ~15-25ns. Below noise at typical AIPerf throughputs.
Single-turn root no-forks (dominant non-DAG workload): no _StickyEntry allocation, no second dict.get (thanks to the optimization). Effectively unchanged from main.
Session creation (cold-ish): _StickyEntry(slots=True) allocation instead of dict[k] = str. ~200ns added, once per multi-turn session.
Atomicity invariant preserved: zero new awaits in send_credit.

Lifecycle (new semantics)

                                       parent_correlation_id?  is_final_turn?  has_forks?
                                       ───────────────────────  ──────────────  ──────────
Single-turn root, no forks              None                    True            False     → no entry created, no allocation
Single-turn root, declares spawns       None                    True            True      → entry created + survives (register_child_routing finds it)
Multi-turn root (any non-final turn)    None                    False           any       → entry created/reused
Multi-turn root, final turn, no forks   None                    True            False     → entry decremented + popped
DAG child credit                        set                     any             any       → pins to parent's worker via routing_key, lifecycle block skipped

register_child_routing(parent_corr) increments refcount; logs a warning if the parent has no entry (graceful degradation — child routes via least-loaded, loses prefix-cache locality but stays correct).

release_child_routing(parent_corr) decrements refcount; evicts the entry only when both ref_count <= 0 AND parent_final_seen.

subagent_type field

Added because the Weka kv-cache-tester trace schema already carries subagent_type: str per recorded subagent (real corpus values: "Explore", "Codex Subagent", etc. — free-form, set by upstream Claude Code proxy). Typing the new aiperf-side field as str \| None rather than a narrow enum lets loaders pass these through verbatim without filter/map logic and without rejecting new agent types as corpora evolve.

The field validator still enforces the SPAWN-only invariant (FORK children inherit the parent's role, so they have no separate classification).

Test plan

tests/unit/credit/test_sticky_router.py::TestStickyCreditRouterChildRoutingRefcount (6 tests): register/release refcount semantics, eviction conditions, missing-parent warning, no-op-on-unknown-parent.
tests/unit/credit/test_sticky_router.py::TestStickyCreditRouterSendCreditFinalTurnLifecycle (4 tests): single-turn root no-forks (no entry created), final turn with forks (entry survives for register_child_routing), DAG-child final turn (parent untouched), multi-turn final turn (entry popped on no-forks).
tests/unit/common/models/test_branch_model.py::TestConversationBranchInfoSubagentType (6 tests): SPAWN accepts "Explore" and arbitrary "Codex Subagent", FORK rejects non-None, round-trip.
Full unit suite: 12,689 passed, 79 skipped, 0 failed.
Component integration (DAG + sticky-routing): test_dag_adversarial_timing_modes, test_dag_combined_pathology, test_sticky_routing — 64 passed.
pre-commit run --files <touched> clean.
Real FORK-DAG benchmark run (any reviewer with hardware): verify the orchestrator no longer raises AttributeError on child dispatch.

Guardrails

StickyCreditRouter.send_credit grandfathered in tools/ergonomics_baseline.json (function-size 90 > 80) and tools/ruff_baseline.json (C901 cyclomatic 12 > 10, PLR0912 branches 13 > 12). The hot path is intentionally inlined per the class docstring; extracting helpers would add per-credit function-call overhead. Baseline additions restricted to exactly these two new entries; unrelated stale entries left untouched.

Related branches

ajc/dag4 (1 commit ahead of main, dated May) had an earlier version of this fix bundled with the Weka loader stack and a narrow SubagentType enum — appears abandoned, no merges from main since.
ajc/dag5 (17 commits ahead of main, actively being worked) is the successor of PR feat(dag): conversation DAG benchmarks (dag_jsonl) — FORK / SPAWN / f… #891 and still carries the same latent AttributeError. Happy to retarget this PR onto dag5 if you'd prefer to fold the fix into that review cycle.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added support for optional subagent_type specification in conversation branches for improved branch configuration.
- Enhanced routing system with reference counting to better manage child session lifecycles in multi-branch conversations.
- Introduced methods for managing child session routing registration and release.
Tests
- Added validation tests for branch configuration constraints.
- Expanded routing tests for complex session lifecycle scenarios.

… sticky entry BranchOrchestrator already calls `self._sticky_router.register_child_routing(...)` and `release_child_routing(...)` at five sites (_branch_orchestrator_spawn.py:197,257, branch_orchestrator.py:391,455,467) but those methods are never defined on `StickyCreditRouter`. Test suites mock them, which is why CI is green, but the moment a real FORK-mode DAG runs in prod the orchestrator raises `AttributeError` at the first child dispatch. Sticky router - `_sticky_sessions` changes from `dict[str, str]` to `dict[str, _StickyEntry]`. The new dataclass tracks `worker_id`, `ref_count`, and `parent_final_seen` so the parent's sticky binding can outlive its own final turn while DAG children pinning to that worker are still in-flight. - `send_credit` resolves a `routing_key = parent_correlation_id or x_correlation_id` so DAG children pin to their parent's worker (was already the intent; nothing previously enforced it from the router side). Entry creation extends to "final turn that declares DAG spawns" so `register_child_routing` finds an entry to bump. Final-turn lifecycle decrements `ref_count` and flips `parent_final_seen`; eviction still requires both `ref_count <= 0` and `parent_final_seen`, and is suppressed when `has_forks`. - `register_child_routing(parent_corr)`: increments refcount. If the parent has no entry, logs a warning and continues (child routes via least-loaded — loses prefix-cache locality but stays correct). - `release_child_routing(parent_corr)`: decrements refcount; evicts only when both conditions above hold. Hot-path note: the final-turn lifecycle block is gated by `sticky_entry is not None`, not by a fallback `dict.get()`. The dominant single-turn-root-no-forks workload (most AIPerf benchmarks) takes the same dict-lookup count as before — no entry created, no second get, no allocation. ConversationBranchInfo - New `subagent_type: str | None` field. SPAWN-only (validator rejects on FORK). Free-form so upstream-recorded values like "Explore" or "Codex Subagent" pass through verbatim without enum-narrowing. Tests - TestStickyCreditRouterChildRoutingRefcount (6 tests): register/release refcount semantics, eviction conditions, missing-parent warning, idempotency. - TestStickyCreditRouterSendCreditFinalTurnLifecycle (4 tests): single-turn root no-forks (no entry created), final turn with forks (entry survives for register_child_routing), DAG-child final turn (parent entry untouched), multi-turn final turn (entry popped on no-forks). - TestConversationBranchInfoSubagentType (6 tests): SPAWN accepts "Explore" and arbitrary "Codex Subagent", FORK rejects non-None, round-trip. - Existing tests updated to use `_StickyEntry(worker_id=...)` for direct dict pokes (test_sticky_router.py, test_race_conditions.py). Guardrails - `StickyCreditRouter.send_credit` grandfathered in tools/ergonomics_baseline.json (function-size 90 > 80) and tools/ruff_baseline.json (C901, PLR0912) — the hot path is intentionally inlined per the class docstring's "Methods are intentionally large/inlined" note. Baseline regenerations restricted to these two new entries; unrelated stale entries left untouched. Verification: 12,689 unit tests pass + 64 DAG/sticky-routing component_integration tests pass; pre-commit clean. Related: official/ajc/dag4 (abandoned) had an earlier version of this fix bundled with the Weka loader stack and a narrow SubagentType enum. official/ajc/dag5 is the active DAG PR and still carries the same latent bug. If preferred, this can be retargeted onto dag5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

github-actions · 2026-05-20T05:50:44Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c53ea483b3d9e48fd006a82e68385a56108f5a6

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c53ea483b3d9e48fd006a82e68385a56108f5a6

Last updated for commit: 2c53ea4 • Browse code

coderabbitai · 2026-05-20T05:54:18Z

Walkthrough

This PR introduces subagent-type metadata to conversation branching with SPAWN-mode validation, and refactors sticky-session state management to track refcounts and parent final-turn bookkeeping, enabling safe child routing past parent completion through new lifecycle methods.

Changes

Sticky Router Enhancement and Branch Model Metadata

Layer / File(s)	Summary
Branch model subagent_type field and validation `src/aiperf/common/models/branch.py`, `tests/unit/common/models/test_branch_model.py`	`ConversationBranchInfo` adds optional `subagent_type` field with field validator rejecting non-None values when `mode` is FORK. Test suite validates default handling, string acceptance, rejection for FORK, and model round-tripping.
Sticky entry data structure `src/aiperf/credit/sticky_router.py`	Introduces `_StickyEntry` dataclass to track sticky-session ownership, ref count, and parent final-turn state for lifecycle-based eviction.
Sticky router storage refactoring `src/aiperf/credit/sticky_router.py`, `tests/unit/credit/test_sticky_router.py`, `tests/unit/timing/test_race_conditions.py`	Replaces `routing_key -> worker_id` mapping with `routing_key -> _StickyEntry`. Updates routing lookup to derive keys from parent or session correlation ID. Test assertions and setup across multiple scenarios updated to reference `_StickyEntry.worker_id`.
Sticky router lifecycle and child routing methods `src/aiperf/credit/sticky_router.py`	`send_credit` now creates/binds entries for non-final or fork scenarios, sets `parent_final_seen` and decrements refcount on final turns, and evicts only when refcount reaches zero with no forks. New `register_child_routing` and `release_child_routing` methods manage refcount for parent correlation IDs, evicting after parent final turn is observed.
Comprehensive sticky router refcount and final-turn lifecycle tests `tests/unit/credit/test_sticky_router.py`	New test suite covering `_StickyEntry` refcount lifecycle for child routing including eviction conditions, and detailed `send_credit` final-turn paths for single-turn roots, fork scenarios, DAG child short-circuits, and multi-turn sessions.
Code quality baseline updates `tools/ergonomics_baseline.json`, `tools/ruff_baseline.json`	Ergonomics and Ruff baselines record violations for expanded `StickyCreditRouter.send_credit` method (function-size, C901, PLR0912).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A branch now knows its type so fine,
Sticky routers count with designs divine,
Entries dance through parent and child,
Refcounts keep sessions reconciled,
DAG routing, finally unfiled!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 47.06% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and specifically describes the main fix: defining missing register/release_child_routing methods to address a latent AttributeError in FORK DAGs.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

tests/unit/credit/test_sticky_router.py (1)

48-48: 💤 Low value

Consider using next(iter()) for single-element access.

Static analysis suggests this minor optimization to avoid creating an intermediate list.

Suggested change

-        assert list(router._sticky_sessions.values())[0].worker_id == "worker-2"
+        assert next(iter(router._sticky_sessions.values())).worker_id == "worker-2"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/credit/test_sticky_router.py` at line 48, Replace the intermediate
list creation when asserting the single sticky session: instead of converting
router._sticky_sessions.values() to a list and indexing [0], use
next(iter(router._sticky_sessions.values())) to retrieve the first (and only)
element; update the assertion that checks its worker_id (the existing assertion
referencing router._sticky_sessions.values() and worker_id == "worker-2") to use
next(iter(...)) for more efficient single-element access.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/unit/credit/test_sticky_router.py`:
- Line 48: Replace the intermediate list creation when asserting the single
sticky session: instead of converting router._sticky_sessions.values() to a list
and indexing [0], use next(iter(router._sticky_sessions.values())) to retrieve
the first (and only) element; update the assertion that checks its worker_id
(the existing assertion referencing router._sticky_sessions.values() and
worker_id == "worker-2") to use next(iter(...)) for more efficient
single-element access.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4cae323f-a088-49ab-8bf3-c12659dac25f

📥 Commits

Reviewing files that changed from the base of the PR and between bb6f421 and 541b93f.

📒 Files selected for processing (7)

src/aiperf/common/models/branch.py
src/aiperf/credit/sticky_router.py
tests/unit/common/models/test_branch_model.py
tests/unit/credit/test_sticky_router.py
tests/unit/timing/test_race_conditions.py
tools/ergonomics_baseline.json
tools/ruff_baseline.json

codecov · 2026-05-20T05:57:42Z

Codecov Report

❌ Patch coverage is 96.15385% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/aiperf/credit/sticky_router.py	95.55%	0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

…ion / orchestrator (PR-A) Sits on top of 92d1342 (the foundation: CacheBustTarget enum + build_cache_bust_marker). Plumbs the marker plumbing end-to-end without actually injecting markers anywhere - injection at request-build time is deferred to a follow-up that matches the source-of-truth split (cjq/agentx-v0.3 injects at workers/worker.py::_apply_cache_bust, NOT at compose time). This PR is the foundation a worker-layer follow-up will plug into. Config - src/aiperf/config/dataset/content.py: new CacheBustConfig with ``target: CacheBustTarget = NONE`` sub-model on PromptConfig. - src/aiperf/config/dataset/__init__.py: re-export CacheBustConfig. - src/aiperf/config/schema/aiperf-config.schema.json: auto-regenerated. Credit + session - src/aiperf/credit/structs.py: Credit + CreditContext gain ``cache_bust_target: CacheBustTarget`` and ``cache_bust_marker: str | None`` fields with sensible defaults; msgspec ``omit_defaults`` keeps the wire footprint unchanged when cache-bust is disabled. - src/aiperf/timing/conversation_source.py: SampledSession carries the same fields. ``ConversationSource.start_branch_child`` and ``start_pre_session_child`` accept ``cache_bust_marker`` and ``cache_bust_target`` as keyword-only kwargs with NONE/None defaults so existing callers stay working. - src/aiperf/common/models/record_models.py: RecordContext and MetricRecordInfo carry cache_bust_target for raw-JSONL export. Orchestrator - src/aiperf/timing/branch_orchestrator.py: ``_mint_child_marker (child_conversation_id) -> str | None`` method calls ``build_cache_bust_marker(benchmark_id, 0, 0, child_conversation_id, target=self._cache_bust_target)``. New ``cache_bust_target`` ``__init__`` kwarg defaulting to NONE (PhaseRunner construction site wires the real value later - out of PR-A scope). Returns None when target=NONE. - src/aiperf/timing/_branch_orchestrator_spawn.py: spawn path passes ``cache_bust_marker=self._mint_child_marker(child_conv_id)`` to ``start_branch_child`` unconditionally. Architectural note cjq/agentx-v0.3 injects cache-bust markers at the WORKER layer (``workers/worker.py:_apply_cache_bust*`` lines 87/257/858/964) - the composer there only does ISL-budget compensation for marker tokens (``composer/base.py`` lines 130-175). This PR deliberately does not add a compose-time injection path: a worker-layer follow-up will implement the request-build-time injection that matches the source-of- truth flow. PR-A's marker fields exist as fields on Credit/Session; they are populated by the orchestrator (for SPAWN children) but the worker that ultimately consumes them is not modified here. Note on ScenarioSpec: an earlier draft added a minimal ``src/aiperf/common/scenario/{__init__.py,base.py}`` carrying a ``ScenarioSpec`` primitive with a ``require_cache_bust`` field. Removed before merge - the data structure has no consumer in this PR (no ``_run_scenario_validator``, no scenario registry, no CLI surface), so it would be dead code until PR-F (InferenceX AgentX-MVP scenario) introduces an actual scenario that uses it. ScenarioSpec lands with PR-F. Tests - tests/unit/common/config/test_cache_bust_config.py - 19 tests for CacheBustConfig defaults, value validation, round-trip. - tests/unit/credit/test_credit_cache_bust_fields.py - 10 tests for Credit + CreditContext field defaults, explicit values, msgspec wire round-trip with omit_defaults, ``TurnToSend.from_previous_credit`` field propagation. - tests/unit/timing/test_conversation_source_cache_bust.py - 7 tests covering SampledSession field plumbing, ``build_first_turn`` propagation, and the keyword-only kwargs on the two start_*_child signatures. - tests/unit/common/models/test_record_models_cache_bust.py - 7 tests for RecordContext + MetricRecordInfo defaults and orjson round-trip. - tests/unit/timing/test_branch_orchestrator_cache_bust.py - 8 tests covering _mint_child_marker: NONE returns None, deterministic for same (benchmark_id, child_conversation_id, target), distinct per-child_conversation_id and per-benchmark_id. Verification - 12,755 unit tests pass + 79 skipped + 0 failed (up from 12,689 on the post-#966 baseline; +66 net new). - pre-commit run --files <touched> clean (including auto-regenerated config schema and ruff format). Cross-slice integration TODOs (for follow-up PRs) 1. PhaseRunner construction site needs to thread ``cache_bust_target=user_config.input.prompt.cache_bust.target`` and the real ``benchmark_id`` into BranchOrchestrator(). Until done, _mint_child_marker always returns None - safe default. 2. Worker-layer follow-up (out of PR-A scope) needs to port the ``_apply_cache_bust_to_system_message`` + ``_apply_cache_bust`` logic from cjq/agentx-v0.3 (workers/worker.py:87, 257) and read ``credit.cache_bust_marker`` / ``credit.cache_bust_target`` to inject markers into the wire payload at request-build time. 3. The raw-record export path needs to copy ``credit.cache_bust_marker`` onto RecordContext + MetricRecordInfo so JSONL captures the rid for downstream analysis. 4. ScenarioSpec primitive lands with PR-F (the first scenario that uses it). That PR will add ``require_cache_bust`` plus the broader scenario-locking machinery. 5. CLI surface: ``--cache-bust`` flag not yet wired through the v2 converter (``config/flags/_converter_dataset.py``). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

dynamo-ops · 2026-05-20T06:50:35Z

-        x_correlation_id = credit.x_correlation_id
-        sticky_worker_id = self._sticky_sessions.get(x_correlation_id)
+        # DAG children pin to their parent's worker; otherwise pin to self.
+        routing_key = credit.parent_correlation_id or credit.x_correlation_id


SPAWN child credits also carry parent_correlation_id, so this unconditionally keys them under the parent while the final-turn cleanup skips all parent_correlation_id credits, leaking multi-turn SPAWN sticky sessions and forcing sibling SPAWN children onto one worker. Fix: only use parent_correlation_id and the refcount lifecycle for credit.branch_mode == ConversationBranchMode.FORK; route and clean up SPAWN children by their own x_correlation_id.

github-actions Bot added the fix label May 20, 2026

ajcasagrande requested review from FrankD412 and debermudez May 20, 2026 05:53

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

ajcasagrande mentioned this pull request May 20, 2026

feat(cache-bust): CacheBustTarget plumbing through config / credit / session / orchestrator (PR-A) #967

Open

3 tasks

dynamo-ops reviewed May 20, 2026

View reviewed changes

ajcasagrande added the qa-bugfix Important fix for QA bugs for release label May 21, 2026

Merge branch 'main' into cjq/agentx-on-main-runtime

2c53ea4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(credit-router): define register/release_child_routing methods (latent AttributeError on FORK DAGs)#966

fix(credit-router): define register/release_child_routing methods (latent AttributeError on FORK DAGs)#966
ajcasagrande wants to merge 2 commits into
mainfrom
cjq/agentx-on-main-runtime

ajcasagrande commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

github-actions Bot commented May 20, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 20, 2026

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov Bot commented May 20, 2026

Uh oh!

dynamo-ops May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ajcasagrande commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Hot-path impact

Lifecycle (new semantics)

subagent_type field

Test plan

Guardrails

Related branches

Summary by CodeRabbit

Uh oh!

github-actions Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

Uh oh!

coderabbitai Bot commented May 20, 2026

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 20, 2026

Codecov Report

Uh oh!

dynamo-ops May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ajcasagrande commented May 20, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented May 20, 2026 •

edited

Loading