Skip to content

fix(credit-router): define register/release_child_routing methods (latent AttributeError on FORK DAGs)#966

Open
ajcasagrande wants to merge 2 commits into
mainfrom
cjq/agentx-on-main-runtime
Open

fix(credit-router): define register/release_child_routing methods (latent AttributeError on FORK DAGs)#966
ajcasagrande wants to merge 2 commits into
mainfrom
cjq/agentx-on-main-runtime

Conversation

@ajcasagrande
Copy link
Copy Markdown
Contributor

@ajcasagrande ajcasagrande commented May 20, 2026

Summary

  • Fixes a latent AttributeError in main's DAG runtime: BranchOrchestrator calls self._sticky_router.register_child_routing(...) and release_child_routing(...) at five sites (_branch_orchestrator_spawn.py:197,257, branch_orchestrator.py:391,455,467) but those methods were never defined on StickyCreditRouter. Test suites mock them so CI is green — the moment a real FORK-mode DAG runs in prod, the orchestrator raises AttributeError at the first child dispatch.
  • Refactors _sticky_sessions from dict[str, str] to dict[str, _StickyEntry] so the parent's sticky binding can outlive its own final turn while DAG children pinning to its worker are still in-flight (ref_count + parent_final_seen lifecycle).
  • Adds a subagent_type: str \| None field on ConversationBranchInfo (SPAWN-only) so loaders can pass through upstream-recorded values like "Explore" or "Codex Subagent" verbatim without enum-narrowing.
  • Optimizes send_credit: the dominant single-turn-root-no-forks workload (most AIPerf benchmarks) no longer pays an extra dict.get per credit — the final-turn lifecycle gates on sticky_entry is not None directly instead of falling back to a second dict lookup.

Hot-path impact

Hot-path performance was a deliberate constraint (per the StickyCreditRouter docstring: "Methods are intentionally large/inlined to avoid function call overhead"). Per-credit cost analysis:

  • Sticky-hit common case: +1 attribute deref (sticky_entry.worker_id), +1 or-eval on routing key. ~15-25ns. Below noise at typical AIPerf throughputs.
  • Single-turn root no-forks (dominant non-DAG workload): no _StickyEntry allocation, no second dict.get (thanks to the optimization). Effectively unchanged from main.
  • Session creation (cold-ish): _StickyEntry(slots=True) allocation instead of dict[k] = str. ~200ns added, once per multi-turn session.
  • Atomicity invariant preserved: zero new awaits in send_credit.

Lifecycle (new semantics)

                                       parent_correlation_id?  is_final_turn?  has_forks?
                                       ───────────────────────  ──────────────  ──────────
Single-turn root, no forks              None                    True            False     → no entry created, no allocation
Single-turn root, declares spawns       None                    True            True      → entry created + survives (register_child_routing finds it)
Multi-turn root (any non-final turn)    None                    False           any       → entry created/reused
Multi-turn root, final turn, no forks   None                    True            False     → entry decremented + popped
DAG child credit                        set                     any             any       → pins to parent's worker via routing_key, lifecycle block skipped

register_child_routing(parent_corr) increments refcount; logs a warning if the parent has no entry (graceful degradation — child routes via least-loaded, loses prefix-cache locality but stays correct).

release_child_routing(parent_corr) decrements refcount; evicts the entry only when both ref_count <= 0 AND parent_final_seen.

subagent_type field

Added because the Weka kv-cache-tester trace schema already carries subagent_type: str per recorded subagent (real corpus values: "Explore", "Codex Subagent", etc. — free-form, set by upstream Claude Code proxy). Typing the new aiperf-side field as str \| None rather than a narrow enum lets loaders pass these through verbatim without filter/map logic and without rejecting new agent types as corpora evolve.

The field validator still enforces the SPAWN-only invariant (FORK children inherit the parent's role, so they have no separate classification).

Test plan

  • tests/unit/credit/test_sticky_router.py::TestStickyCreditRouterChildRoutingRefcount (6 tests): register/release refcount semantics, eviction conditions, missing-parent warning, no-op-on-unknown-parent.
  • tests/unit/credit/test_sticky_router.py::TestStickyCreditRouterSendCreditFinalTurnLifecycle (4 tests): single-turn root no-forks (no entry created), final turn with forks (entry survives for register_child_routing), DAG-child final turn (parent untouched), multi-turn final turn (entry popped on no-forks).
  • tests/unit/common/models/test_branch_model.py::TestConversationBranchInfoSubagentType (6 tests): SPAWN accepts "Explore" and arbitrary "Codex Subagent", FORK rejects non-None, round-trip.
  • Full unit suite: 12,689 passed, 79 skipped, 0 failed.
  • Component integration (DAG + sticky-routing): test_dag_adversarial_timing_modes, test_dag_combined_pathology, test_sticky_routing64 passed.
  • pre-commit run --files <touched> clean.
  • Real FORK-DAG benchmark run (any reviewer with hardware): verify the orchestrator no longer raises AttributeError on child dispatch.

Guardrails

StickyCreditRouter.send_credit grandfathered in tools/ergonomics_baseline.json (function-size 90 > 80) and tools/ruff_baseline.json (C901 cyclomatic 12 > 10, PLR0912 branches 13 > 12). The hot path is intentionally inlined per the class docstring; extracting helpers would add per-credit function-call overhead. Baseline additions restricted to exactly these two new entries; unrelated stale entries left untouched.

Related branches

  • ajc/dag4 (1 commit ahead of main, dated May) had an earlier version of this fix bundled with the Weka loader stack and a narrow SubagentType enum — appears abandoned, no merges from main since.
  • ajc/dag5 (17 commits ahead of main, actively being worked) is the successor of PR feat(dag): conversation DAG benchmarks (dag_jsonl) — FORK / SPAWN / f… #891 and still carries the same latent AttributeError. Happy to retarget this PR onto dag5 if you'd prefer to fold the fix into that review cycle.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added support for optional subagent_type specification in conversation branches for improved branch configuration.
    • Enhanced routing system with reference counting to better manage child session lifecycles in multi-branch conversations.
    • Introduced methods for managing child session routing registration and release.
  • Tests

    • Added validation tests for branch configuration constraints.
    • Expanded routing tests for complex session lifecycle scenarios.

Review Change Stack

… sticky entry

BranchOrchestrator already calls `self._sticky_router.register_child_routing(...)`
and `release_child_routing(...)` at five sites
(_branch_orchestrator_spawn.py:197,257, branch_orchestrator.py:391,455,467) but
those methods are never defined on `StickyCreditRouter`. Test suites mock them,
which is why CI is green, but the moment a real FORK-mode DAG runs in prod the
orchestrator raises `AttributeError` at the first child dispatch.

Sticky router

- `_sticky_sessions` changes from `dict[str, str]` to `dict[str, _StickyEntry]`.
  The new dataclass tracks `worker_id`, `ref_count`, and `parent_final_seen` so
  the parent's sticky binding can outlive its own final turn while DAG children
  pinning to that worker are still in-flight.
- `send_credit` resolves a `routing_key = parent_correlation_id or
  x_correlation_id` so DAG children pin to their parent's worker (was already
  the intent; nothing previously enforced it from the router side). Entry
  creation extends to "final turn that declares DAG spawns" so
  `register_child_routing` finds an entry to bump. Final-turn lifecycle decrements
  `ref_count` and flips `parent_final_seen`; eviction still requires both
  `ref_count <= 0` and `parent_final_seen`, and is suppressed when `has_forks`.
- `register_child_routing(parent_corr)`: increments refcount. If the parent has
  no entry, logs a warning and continues (child routes via least-loaded — loses
  prefix-cache locality but stays correct).
- `release_child_routing(parent_corr)`: decrements refcount; evicts only when
  both conditions above hold.

Hot-path note: the final-turn lifecycle block is gated by `sticky_entry is not
None`, not by a fallback `dict.get()`. The dominant single-turn-root-no-forks
workload (most AIPerf benchmarks) takes the same dict-lookup count as before —
no entry created, no second get, no allocation.

ConversationBranchInfo

- New `subagent_type: str | None` field. SPAWN-only (validator rejects on FORK).
  Free-form so upstream-recorded values like "Explore" or "Codex Subagent" pass
  through verbatim without enum-narrowing.

Tests

- TestStickyCreditRouterChildRoutingRefcount (6 tests): register/release
  refcount semantics, eviction conditions, missing-parent warning, idempotency.
- TestStickyCreditRouterSendCreditFinalTurnLifecycle (4 tests): single-turn root
  no-forks (no entry created), final turn with forks (entry survives for
  register_child_routing), DAG-child final turn (parent entry untouched),
  multi-turn final turn (entry popped on no-forks).
- TestConversationBranchInfoSubagentType (6 tests): SPAWN accepts "Explore" and
  arbitrary "Codex Subagent", FORK rejects non-None, round-trip.
- Existing tests updated to use `_StickyEntry(worker_id=...)` for direct dict
  pokes (test_sticky_router.py, test_race_conditions.py).

Guardrails

- `StickyCreditRouter.send_credit` grandfathered in tools/ergonomics_baseline.json
  (function-size 90 > 80) and tools/ruff_baseline.json (C901, PLR0912) — the
  hot path is intentionally inlined per the class docstring's "Methods are
  intentionally large/inlined" note. Baseline regenerations restricted to these
  two new entries; unrelated stale entries left untouched.

Verification: 12,689 unit tests pass + 64 DAG/sticky-routing component_integration
tests pass; pre-commit clean.

Related: official/ajc/dag4 (abandoned) had an earlier version of this fix bundled
with the Weka loader stack and a narrow SubagentType enum. official/ajc/dag5 is
the active DAG PR and still carries the same latent bug. If preferred, this can
be retargeted onto dag5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
@github-actions github-actions Bot added the fix label May 20, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 20, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c53ea483b3d9e48fd006a82e68385a56108f5a6

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c53ea483b3d9e48fd006a82e68385a56108f5a6

Last updated for commit: 2c53ea4Browse code

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Walkthrough

This PR introduces subagent-type metadata to conversation branching with SPAWN-mode validation, and refactors sticky-session state management to track refcounts and parent final-turn bookkeeping, enabling safe child routing past parent completion through new lifecycle methods.

Changes

Sticky Router Enhancement and Branch Model Metadata

Layer / File(s) Summary
Branch model subagent_type field and validation
src/aiperf/common/models/branch.py, tests/unit/common/models/test_branch_model.py
ConversationBranchInfo adds optional subagent_type field with field validator rejecting non-None values when mode is FORK. Test suite validates default handling, string acceptance, rejection for FORK, and model round-tripping.
Sticky entry data structure
src/aiperf/credit/sticky_router.py
Introduces _StickyEntry dataclass to track sticky-session ownership, ref count, and parent final-turn state for lifecycle-based eviction.
Sticky router storage refactoring
src/aiperf/credit/sticky_router.py, tests/unit/credit/test_sticky_router.py, tests/unit/timing/test_race_conditions.py
Replaces routing_key -> worker_id mapping with routing_key -> _StickyEntry. Updates routing lookup to derive keys from parent or session correlation ID. Test assertions and setup across multiple scenarios updated to reference _StickyEntry.worker_id.
Sticky router lifecycle and child routing methods
src/aiperf/credit/sticky_router.py
send_credit now creates/binds entries for non-final or fork scenarios, sets parent_final_seen and decrements refcount on final turns, and evicts only when refcount reaches zero with no forks. New register_child_routing and release_child_routing methods manage refcount for parent correlation IDs, evicting after parent final turn is observed.
Comprehensive sticky router refcount and final-turn lifecycle tests
tests/unit/credit/test_sticky_router.py
New test suite covering _StickyEntry refcount lifecycle for child routing including eviction conditions, and detailed send_credit final-turn paths for single-turn roots, fork scenarios, DAG child short-circuits, and multi-turn sessions.
Code quality baseline updates
tools/ergonomics_baseline.json, tools/ruff_baseline.json
Ergonomics and Ruff baselines record violations for expanded StickyCreditRouter.send_credit method (function-size, C901, PLR0912).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🐰 A branch now knows its type so fine,
Sticky routers count with designs divine,
Entries dance through parent and child,
Refcounts keep sessions reconciled,
DAG routing, finally unfiled!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 47.06% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main fix: defining missing register/release_child_routing methods to address a latent AttributeError in FORK DAGs.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/unit/credit/test_sticky_router.py (1)

48-48: 💤 Low value

Consider using next(iter()) for single-element access.

Static analysis suggests this minor optimization to avoid creating an intermediate list.

Suggested change
-        assert list(router._sticky_sessions.values())[0].worker_id == "worker-2"
+        assert next(iter(router._sticky_sessions.values())).worker_id == "worker-2"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/credit/test_sticky_router.py` at line 48, Replace the intermediate
list creation when asserting the single sticky session: instead of converting
router._sticky_sessions.values() to a list and indexing [0], use
next(iter(router._sticky_sessions.values())) to retrieve the first (and only)
element; update the assertion that checks its worker_id (the existing assertion
referencing router._sticky_sessions.values() and worker_id == "worker-2") to use
next(iter(...)) for more efficient single-element access.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/unit/credit/test_sticky_router.py`:
- Line 48: Replace the intermediate list creation when asserting the single
sticky session: instead of converting router._sticky_sessions.values() to a list
and indexing [0], use next(iter(router._sticky_sessions.values())) to retrieve
the first (and only) element; update the assertion that checks its worker_id
(the existing assertion referencing router._sticky_sessions.values() and
worker_id == "worker-2") to use next(iter(...)) for more efficient
single-element access.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4cae323f-a088-49ab-8bf3-c12659dac25f

📥 Commits

Reviewing files that changed from the base of the PR and between bb6f421 and 541b93f.

📒 Files selected for processing (7)
  • src/aiperf/common/models/branch.py
  • src/aiperf/credit/sticky_router.py
  • tests/unit/common/models/test_branch_model.py
  • tests/unit/credit/test_sticky_router.py
  • tests/unit/timing/test_race_conditions.py
  • tools/ergonomics_baseline.json
  • tools/ruff_baseline.json

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 96.15385% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/aiperf/credit/sticky_router.py 95.55% 0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

ajcasagrande added a commit that referenced this pull request May 20, 2026
…ion / orchestrator (PR-A)

Sits on top of 92d1342 (the foundation: CacheBustTarget enum +
build_cache_bust_marker). Plumbs the marker plumbing end-to-end without
actually injecting markers anywhere - injection at request-build time
is deferred to a follow-up that matches the source-of-truth split
(cjq/agentx-v0.3 injects at workers/worker.py::_apply_cache_bust, NOT
at compose time). This PR is the foundation a worker-layer follow-up
will plug into.

Config

- src/aiperf/config/dataset/content.py: new CacheBustConfig with
  ``target: CacheBustTarget = NONE`` sub-model on PromptConfig.
- src/aiperf/config/dataset/__init__.py: re-export CacheBustConfig.
- src/aiperf/config/schema/aiperf-config.schema.json: auto-regenerated.

Credit + session

- src/aiperf/credit/structs.py: Credit + CreditContext gain
  ``cache_bust_target: CacheBustTarget`` and ``cache_bust_marker:
  str | None`` fields with sensible defaults; msgspec ``omit_defaults``
  keeps the wire footprint unchanged when cache-bust is disabled.
- src/aiperf/timing/conversation_source.py: SampledSession carries
  the same fields. ``ConversationSource.start_branch_child`` and
  ``start_pre_session_child`` accept ``cache_bust_marker`` and
  ``cache_bust_target`` as keyword-only kwargs with NONE/None defaults
  so existing callers stay working.
- src/aiperf/common/models/record_models.py: RecordContext and
  MetricRecordInfo carry cache_bust_target for raw-JSONL export.

Orchestrator

- src/aiperf/timing/branch_orchestrator.py: ``_mint_child_marker
  (child_conversation_id) -> str | None`` method calls
  ``build_cache_bust_marker(benchmark_id, 0, 0, child_conversation_id,
  target=self._cache_bust_target)``. New ``cache_bust_target`` ``__init__``
  kwarg defaulting to NONE (PhaseRunner construction site wires the
  real value later - out of PR-A scope). Returns None when target=NONE.
- src/aiperf/timing/_branch_orchestrator_spawn.py: spawn path passes
  ``cache_bust_marker=self._mint_child_marker(child_conv_id)`` to
  ``start_branch_child`` unconditionally.

Architectural note

cjq/agentx-v0.3 injects cache-bust markers at the WORKER layer
(``workers/worker.py:_apply_cache_bust*`` lines 87/257/858/964) - the
composer there only does ISL-budget compensation for marker tokens
(``composer/base.py`` lines 130-175). This PR deliberately does not
add a compose-time injection path: a worker-layer follow-up will
implement the request-build-time injection that matches the source-of-
truth flow. PR-A's marker fields exist as fields on Credit/Session;
they are populated by the orchestrator (for SPAWN children) but the
worker that ultimately consumes them is not modified here.

Note on ScenarioSpec: an earlier draft added a minimal
``src/aiperf/common/scenario/{__init__.py,base.py}`` carrying a
``ScenarioSpec`` primitive with a ``require_cache_bust`` field. Removed
before merge - the data structure has no consumer in this PR (no
``_run_scenario_validator``, no scenario registry, no CLI surface), so
it would be dead code until PR-F (InferenceX AgentX-MVP scenario)
introduces an actual scenario that uses it. ScenarioSpec lands with PR-F.

Tests

- tests/unit/common/config/test_cache_bust_config.py - 19 tests for
  CacheBustConfig defaults, value validation, round-trip.
- tests/unit/credit/test_credit_cache_bust_fields.py - 10 tests for
  Credit + CreditContext field defaults, explicit values, msgspec
  wire round-trip with omit_defaults, ``TurnToSend.from_previous_credit``
  field propagation.
- tests/unit/timing/test_conversation_source_cache_bust.py - 7 tests
  covering SampledSession field plumbing, ``build_first_turn``
  propagation, and the keyword-only kwargs on the two start_*_child
  signatures.
- tests/unit/common/models/test_record_models_cache_bust.py - 7 tests
  for RecordContext + MetricRecordInfo defaults and orjson round-trip.
- tests/unit/timing/test_branch_orchestrator_cache_bust.py - 8 tests
  covering _mint_child_marker: NONE returns None, deterministic for
  same (benchmark_id, child_conversation_id, target), distinct
  per-child_conversation_id and per-benchmark_id.

Verification

- 12,755 unit tests pass + 79 skipped + 0 failed (up from 12,689 on
  the post-#966 baseline; +66 net new).
- pre-commit run --files <touched> clean (including auto-regenerated
  config schema and ruff format).

Cross-slice integration TODOs (for follow-up PRs)

1. PhaseRunner construction site needs to thread
   ``cache_bust_target=user_config.input.prompt.cache_bust.target`` and
   the real ``benchmark_id`` into BranchOrchestrator(). Until done,
   _mint_child_marker always returns None - safe default.
2. Worker-layer follow-up (out of PR-A scope) needs to port the
   ``_apply_cache_bust_to_system_message`` + ``_apply_cache_bust``
   logic from cjq/agentx-v0.3 (workers/worker.py:87, 257) and read
   ``credit.cache_bust_marker`` / ``credit.cache_bust_target`` to
   inject markers into the wire payload at request-build time.
3. The raw-record export path needs to copy
   ``credit.cache_bust_marker`` onto RecordContext + MetricRecordInfo
   so JSONL captures the rid for downstream analysis.
4. ScenarioSpec primitive lands with PR-F (the first scenario that
   uses it). That PR will add ``require_cache_bust`` plus the broader
   scenario-locking machinery.
5. CLI surface: ``--cache-bust`` flag not yet wired through the v2
   converter (``config/flags/_converter_dataset.py``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
x_correlation_id = credit.x_correlation_id
sticky_worker_id = self._sticky_sessions.get(x_correlation_id)
# DAG children pin to their parent's worker; otherwise pin to self.
routing_key = credit.parent_correlation_id or credit.x_correlation_id
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SPAWN child credits also carry parent_correlation_id, so this unconditionally keys them under the parent while the final-turn cleanup skips all parent_correlation_id credits, leaking multi-turn SPAWN sticky sessions and forcing sibling SPAWN children onto one worker. Fix: only use parent_correlation_id and the refcount lifecycle for credit.branch_mode == ConversationBranchMode.FORK; route and clean up SPAWN children by their own x_correlation_id.

@ajcasagrande ajcasagrande added the qa-bugfix Important fix for QA bugs for release label May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix qa-bugfix Important fix for QA bugs for release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants