feat(dataset): add CodingContentGenerator for pseudo-realistic coding-trace replay by ajcasagrande · Pull Request #968 · ai-dynamo/aiperf

ajcasagrande · 2026-05-20T07:32:38Z

Summary

Adds the runtime coding-text content layer that completes the agentic-coding-trace pipeline. Main already ships the session structure synthesizer (src/aiperf/dataset/agentic_code_gen/, PRs #839/#943) — hash_ids + input/output lengths + delays in Mooncake-style JSONL — but no coding-text generator on the runtime side. Without this PR, sessions produced by agentic_code_gen are executed against the default Shakespeare PromptGenerator and produce unrepresentative MoE expert-routing patterns.

Why it matters (MoE, not tokenizers)

Mixture-of-Experts models route tokens to different experts based on content. Shakespeare routes to a narrow subset of experts (English-prose experts dominate). Real agentic-coding traffic activates the broader expert set hit by code, bash, JSON, error tracebacks, git diffs, configs, markdown, etc. So MoE serving metrics (per-expert load, hot/cold imbalance, throughput under realistic expert pressure) measured under Shakespeare are systematically biased away from coding-agent production. This PR replaces the corpus when an agentic-coding trace loader runs.

What's in the PR

src/aiperf/dataset/generator/coding_content.py — CodingContentGenerator (4082 lines: ~673 vocab tuples + ~1386 lines inside 81 triple-quoted code template literals + ~1.3k lines of generator scaffolding; the file is mostly corpus data, not logic).
src/aiperf/common/hash_id_random_generator.py — deterministic per-hash-id RNG used by the tool-pool sampler.
src/aiperf/dataset/generator/prompt.py — module-level sample_tokens_from_corpus() helper so CodingContentGenerator can build its pools without sharing PromptGenerator state.
PromptCorpus enum (sonnet / coding) in aiperf.common.enums.
corpus field on PromptConfig (drops the parent-prefix to match the existing isl/osl/block_size/batch_size convention on PromptConfig).
Flat CLI flag --prompt-corpus (cli_config.py:prompt_corpus) — stays prefixed to match the existing prompt_batch_size / prompt_sequence_distribution flat-CLI convention.
default_prompt_corpus field on Custom/PublicDatasetLoaderMetadata (declared as Literal["sonnet", "coding"] in schemas.py to dodge the aiperf.common.enums → aiperf.plugin chain circular import; mirrors PromptCorpus.value).
composer/custom.py wiring: when an is_trace loader runs and the effective corpus is CODING, swap in CodingContentGenerator instead of the default Shakespeare PromptGenerator. Reads corpus from self._synthetic_prompts when set; otherwise falls back to loader_metadata.default_prompt_corpus (which defaults to sonnet, preserving existing behavior).
tests/unit/dataset/generator/test_coding_content_generator.py — 76 unit tests covering init, generate(), token-sequence building, sampling, and per-language template families.
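
The corpus fallback described in the composer/custom.py item can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `SyntheticPrompts`, `LoaderMetadata`, and `resolve_effective_corpus` are hypothetical stand-ins, and only the two corpus names and the fallback order are taken from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PromptCorpus(str, Enum):
    """Mirrors the two corpus names listed in the PR description."""

    SONNET = "sonnet"
    CODING = "coding"


@dataclass
class SyntheticPrompts:
    """Hypothetical stand-in for the synthetic prompts config."""

    corpus: Optional[PromptCorpus] = None


@dataclass
class LoaderMetadata:
    """Hypothetical stand-in for loader plugin metadata."""

    default_prompt_corpus: str = "sonnet"


def resolve_effective_corpus(
    prompts: Optional[SyntheticPrompts], metadata: LoaderMetadata
) -> PromptCorpus:
    # An explicit corpus on the prompts config wins.
    if prompts is not None and prompts.corpus is not None:
        return prompts.corpus
    # Otherwise fall back to the loader's declared default ("sonnet" unless set).
    return PromptCorpus(metadata.default_prompt_corpus)
```

Under this sketch, a loader that never sets `default_prompt_corpus` keeps resolving to `sonnet`, which matches the "pure additive change" claim in the Notes section.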

What's out of scope (deliberately deferred to follow-up PRs)

composer/public.py trace-loader wiring — main's PublicDatasetLoaderMetadata has no is_trace field; the branch's _inject_trace_kwargs depends on it. Adding is_trace to public-loader metadata is a separate, broader change.
Weka / semianalysis_cc_traces_weka_no_subagents HF loaders that would set default_prompt_corpus: coding. These are the consumers of this generator; they'll land alongside the Weka trace replay work.
The Weka-specific BPE-stable-terminator + hash-id-RNG threading additions to PromptGenerator that live on the source branch — those are Weka trace-replay concerns, not coding-content concerns.

Notes

Pure additive change. With default_prompt_corpus defaulting to sonnet everywhere and no loader on main setting coding, the new code path is opt-in via --prompt-corpus coding only. Behavior is unchanged unless the user (or a future loader plugin) requests it.
tools/ergonomics_baseline.json was regenerated to accept the corpus-as-data file/function sizes in coding_content.py. The 4082-line file is overwhelmingly hand-written templates and vocabulary tuples; splitting it doesn't make the data smaller and would obscure the per-template-family co-location.

Test plan

uv run pytest tests/unit/ -n auto — 12749 passed, 79 skipped
make validate-plugin-schemas — 33 categories, 219 plugins clean
make generate-all-plugin-files — schemas regenerated, no drift
pre-commit run — all hooks pass
Smoke-tested CodingContentGenerator(...).generate_prompt(num_tokens=64) against a real gpt2 tokenizer end-to-end
Reviewer to spot-check a representative sample of generated templates by language family (_gen_python_*, _gen_go_*, _gen_rust_*, _gen_ts_*, _gen_tool_*)

Summary by CodeRabbit

New Features
- Added --prompt-corpus CLI flag to choose synthetic prompt source (sonnet or coding)
- Added support for a “coding” prompt corpus and deterministic per-trace/hash prompt generation
Documentation
- Documented the new --prompt-corpus option and loader-default behavior
Schema
- Added plugin metadata field to declare a loader’s default prompt corpus
Tests
- Added comprehensive tests for coding prompt generation and composer behavior

…-trace replay

Adds a token-pool prompt generator that produces structurally plausible, template-filled coding content (code, bash output, JSON, error tracebacks, git diffs, CI output, configs, markdown, test output, user prompts). The output is not real code; it is filler whose token distribution approximates real coding-agent traffic closely enough to drive realistic expert-routing patterns on Mixture-of-Experts models.

PromptGenerator's Shakespeare corpus routes to a narrow subset of experts (English-prose experts dominate), which makes it a poor benchmark proxy for production coding workloads. Coding-flavored content activates the broader set of experts that real agentic-coding traffic hits, so MoE serving metrics (per-expert load, hot/cold imbalance, throughput under realistic expert pressure) measured under this corpus are representative of agentic-coding production rather than literature.

This completes #839/#943: main already ships the agentic_code_gen session SYNTHESIZER (hash_ids + input/output lengths + delays in Mooncake-style JSONL), but has no coding-text generator on the runtime side. Without this PR, synthesized sessions executed against a real model use Shakespeare filler and the resulting MoE expert-routing pattern is unrepresentative.

Added:
- src/aiperf/dataset/generator/coding_content.py: CodingContentGenerator (4082 lines; ~673 vocab tuples + ~1386 lines inside triple-quoted code template literals + ~1.3k lines of generator scaffolding)
- src/aiperf/common/hash_id_random_generator.py: deterministic per-hash-id RNG used by the generator's tool-pool sampling
- tests/unit/dataset/generator/test_coding_content_generator.py: 76 unit tests covering init, generate(), token sequence building, sampling, and per-language template families
- PromptCorpus enum (sonnet/coding) in aiperf.common.enums
- corpus field on PromptConfig with CLI flag --prompt-corpus (the Pydantic field drops the parent-prefix to match the existing isl/osl/block_size/batch_size convention; the flat CLI field stays prompt_corpus to match prompt_batch_size / prompt_sequence_distribution)
- default_prompt_corpus field on Custom/PublicDatasetLoaderMetadata (declared as Literal["sonnet", "coding"] in schemas.py to avoid the aiperf.common.enums -> aiperf.plugin chain circular import)
- sample_tokens_from_corpus() module-level helper on prompt.py (used by CodingContentGenerator to build its tool/text pools without sharing PromptGenerator state)
- composer/custom.py wiring: when an is_trace loader runs and the effective corpus is CODING, swap in CodingContentGenerator instead of the default Shakespeare PromptGenerator. Reads corpus from the synthetic-dataset prompts config when present and falls back to the loader plugin's default_prompt_corpus.

Not included (out of scope for this PR, can land separately):
- composer/public.py trace-loader wiring (main's PublicDatasetLoaderMetadata has no is_trace field; branch's _inject_trace_kwargs depends on it)
- Weka/semianalysis trace loaders that would set default_prompt_corpus=coding
- The Weka-specific BPE-stable-terminator + hash-id RNG threading work on PromptGenerator (Weka trace replay concern, not coding-content)

Verified:
- pytest tests/unit/ -n auto: 12749 passed, 79 skipped
- make validate-plugin-schemas: 33 categories, 219 plugins clean
- make generate-all-plugin-files: schemas regenerated
- pre-commit run: all hooks pass (ergonomics baseline regenerated to accept the corpus-as-data file/function sizes)

Signed-off-by: Anthony Casagrande <[email protected]>

copy-pr-bot · 2026-05-20T07:32:41Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-20T07:32:47Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@7e8edc5215229d2b8cccc73800d22854b4e6ba9e

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@7e8edc5215229d2b8cccc73800d22854b4e6ba9e

Last updated for commit: 7e8edc5

github-actions · 2026-05-20T07:33:40Z

Fern Docs Preview: https://nvidia-preview-c8399d85-e8ca-439e-a700-62596e8f81af.docs.buildwithfern.com/aiperf/dev

coderabbitai · 2026-05-20T07:36:54Z

Walkthrough

This PR introduces a --prompt-corpus CLI option to select synthetic prompt corpus sources (sonnet or coding) for dataset generation. The change flows from CLI flags through configuration objects, plugin metadata defaults, dataset composer logic, and a token-sampling helper, and includes deterministic per-hash-ID RNG support and tests.

Changes

Prompt Corpus Selection and Dataset Generation

Layer / File(s)	Summary
PromptCorpus enum and type definitions `src/aiperf/common/enums/enums.py`, `src/aiperf/common/enums/__init__.py`, `src/aiperf/config/schema/aiperf-config.schema.json`	New `PromptCorpus` enum with `SONNET` and `CODING` members; re-exported via enums package; schema definition added to config.
CLI and configuration integration `src/aiperf/config/flags/cli_config.py`, `src/aiperf/config/dataset/content.py`, `src/aiperf/config/flags/_converter_dataset.py`, `src/aiperf/config/flags/_section_fields.py`, `docs/cli-options.md`	`CLIConfig` and `PromptConfig` gain `prompt_corpus` fields; converters and input field tracking updated; CLI docs describe corpus options and loader-default fallback behavior.
Plugin metadata schema for corpus defaults `src/aiperf/plugin/schema/schemas.py`, `src/aiperf/plugin/schema/plugins.schema.json`	Both JSON and Python schemas add `default_prompt_corpus` to loader metadata, specifying which corpus a plugin uses when not explicitly overridden.
HashIdRandomGenerator for deterministic trace processing `src/aiperf/common/hash_id_random_generator.py`	New class providing per-(trace_id, hash_id) deterministic random sequences via SHA-256 re-seeding, with NumPy RNG disabling.
Token sampling utility for corpus-based generation `src/aiperf/dataset/generator/prompt.py`	New `sample_tokens_from_corpus` function samples token sequences with optional separator prepending and wrap-around when span exceeds corpus length.
Trace dataset loader routing by corpus `src/aiperf/dataset/composer/custom.py`	CustomDatasetComposer conditionally initializes `CodingContentGenerator` for `CODING` corpus or uses existing prompt generator otherwise.
BaseDatasetComposer prompt-generator factory `src/aiperf/dataset/composer/base.py`	Adds `_create_prompt_generator` helper and updates `prompt_generator` typing/initialization to support `CodingContentGenerator` when corpus is `CODING`.
Synthetic composer unit tests `tests/unit/dataset/composer/test_synthetic_composer.py`	Tests verifying `CodingContentGenerator` selection and that coding-corpus options populate coding-specific prompt fields.
CodingContentGenerator unit tests `tests/unit/dataset/generator/test_coding_content_generator.py`	Comprehensive test suite covering initialization, generation, token sequences, caching, sampling, templates, language pools, conversation patterns, and seed determinism.
Ergonomics baseline updates `tools/ergonomics_baseline.json`	Baseline entries added for pre-existing code-quality violations in CodingContentGenerator and schema modules.
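
The per-(trace_id, hash_id) re-seeding summarized in the HashIdRandomGenerator row can be illustrated with a small standalone sketch. The helper names `derive_seed` and `rng_for`, the key layout, and the 8-byte truncation are assumptions made for this example; the walkthrough only states that SHA-256 re-seeding is used, so the class's actual scheme may differ.

```python
import hashlib
import random


def derive_seed(base_seed: int, trace_id: str, hash_id: int) -> int:
    """Hypothetical helper: fold the three inputs into a 64-bit seed."""
    key = f"{base_seed}:{trace_id}:{hash_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep the first 8 bytes so the seed stays in the 0..2**64-1 range.
    return int.from_bytes(digest[:8], "big")


def rng_for(base_seed: int, trace_id: str, hash_id: int) -> random.Random:
    """Return a fresh Python RNG that depends only on the three inputs."""
    return random.Random(derive_seed(base_seed, trace_id, hash_id))


# Two workers that never share state still draw the same values for one block.
worker_a = rng_for(42, "trace-001", 7)
worker_b = rng_for(42, "trace-001", 7)
same_draws = [worker_a.randrange(1000) for _ in range(5)] == [
    worker_b.randrange(1000) for _ in range(5)
]
```

Because the seed depends only on its inputs and not on call order, separate worker processes can regenerate identical content for the same hash_id, which is the property the summary row describes.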

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

🐰 A corpus now flows through configs and CLI with ease,
With hash-IDs seeded for reproducible peace—
Sonnet or coding, the loader will choose,
Deterministic traces, no randomness to lose!
Tests verify pools and prompts from start to end.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 9.84% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the main change: adding CodingContentGenerator to enable pseudo-realistic coding-trace replay.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/common/hash_id_random_generator.py`:
- Line 28: Add missing type hints: annotate the __getattr__ method signature as
def __getattr__(self, name: str) -> Any (import Any from typing if not already),
and annotate the constructor as def __init__(self, ...) -> None (add -> None to
its signature). Update imports to include Any when adding the __getattr__ return
type.
- Around line 48-49: from_base_rng incorrectly treats base_rng.seed == 0 as
falsy; change the logic in from_base_rng to check explicitly for None (e.g., use
"if base_rng.seed is not None then base_seed = base_rng.seed else base_seed =
base_rng.randrange(0, 2**64)") so a legitimate seed of 0 is preserved; update
the return path that constructs the instance (reference: class method
from_base_rng and attribute base_rng.seed) to use that explicit None check
instead of "or".

In `@src/aiperf/config/flags/_converter_dataset.py`:
- Around line 82-83: The CLI `--prompt-corpus` value is being added to the local
`prompts` dict (when `cli.prompt_corpus` is set) but later `_apply_dataset_type`
removes `prompts` for file datasets, making `--prompt-corpus` a no-op; update
`_apply_dataset_type` (or the dataset conversion flow that strips `prompts`) to
either preserve `prompts["corpus"]` for supported file/trace loaders (check the
loader type/format and pass `prompts` through to the file/trace loader
constructors) or validate early and raise a clear error when a provided
`cli.prompt_corpus` is incompatible with the chosen dataset type; locate
references to `prompts`, `cli.prompt_corpus`, and `_apply_dataset_type` in
_converter_dataset.py and implement the preservation or explicit rejection
accordingly.

In `@src/aiperf/dataset/generator/prompt.py`:
- Around line 53-60: The wrap logic may return fewer than num_tokens when
num_tokens > corpus_len because it only wraps once; change the logic in the
token-selection section (around start = rng_to_use.randrange(corpus_len), end =
start + num_tokens and the tokens.extend(...) branches) to guarantee exactly
num_tokens elements by iterating or using modulo arithmetic over corpus indices
(e.g., append corpus[(start + i) % corpus_len] for i in range(num_tokens) or
loop extending slices until len(tokens) == num_tokens) so tokens always reaches
the requested length regardless of num_tokens relative to corpus_len.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 23f3a421-fa90-4ad9-aa86-e9e3f3a24bb3

📥 Commits

Reviewing files that changed from the base of the PR and between bb6f421 and efdba69.

📒 Files selected for processing (16)

docs/cli-options.md
src/aiperf/common/enums/__init__.py
src/aiperf/common/enums/enums.py
src/aiperf/common/hash_id_random_generator.py
src/aiperf/config/dataset/content.py
src/aiperf/config/flags/_converter_dataset.py
src/aiperf/config/flags/_section_fields.py
src/aiperf/config/flags/cli_config.py
src/aiperf/config/schema/aiperf-config.schema.json
src/aiperf/dataset/composer/custom.py
src/aiperf/dataset/generator/coding_content.py
src/aiperf/dataset/generator/prompt.py
src/aiperf/plugin/schema/plugins.schema.json
src/aiperf/plugin/schema/schemas.py
tests/unit/dataset/generator/test_coding_content_generator.py
tools/ergonomics_baseline.json

coderabbitai · 2026-05-20T07:36:57Z

+class _DisabledNumpyRNG:
+    """Raises on any attribute access to prevent NumPy RNG usage."""
+
+    def __getattr__(self, name):


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n src/aiperf/common/hash_id_random_generator.py

Repository: ai-dynamo/aiperf

Length of output: 3480

Add missing function type hints.

__getattr__ on line 28 is missing type annotations for the name parameter and return type. __init__ on line 51 is missing the -> None return type annotation. Per coding guidelines, all functions require complete type hints (params and return).

Proposed fix

+from typing import Never
+
 class _DisabledNumpyRNG:
     """Raises on any attribute access to prevent NumPy RNG usage."""

-    def __getattr__(self, name):
+    def __getattr__(self, name: str) -> Never:
         raise RuntimeError(
             "HashIdRandomGenerator does not support NumPy RNG operations. "
             "Use Python RNG methods (randrange, choice, etc.) instead."
         )


 class HashIdRandomGenerator(RandomGenerator):
     """RandomGenerator that re-seeds deterministically per (trace_id, hash_id).

     Designed for parallel processing where multiple workers need to generate
     identical content for the same hash_id within a trace file.

     Thread Safety:
         NOT thread-safe. Each worker process must have its own instance.
     """

     @classmethod
     def from_base_rng(cls, base_rng: RandomGenerator) -> "HashIdRandomGenerator":
         """Create from a base RandomGenerator (typically from rng.derive())."""
         base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
         return cls(base_seed, _internal=True)

-    def __init__(self, base_seed: int, *, _internal: bool = False):
+    def __init__(self, base_seed: int, *, _internal: bool = False) -> None:
         super().__init__(base_seed, _internal=_internal)
         self._numpy_rng = _DisabledNumpyRNG()
         self._trace_id: str = ""

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/common/hash_id_random_generator.py` at line 28, Add missing type hints: annotate the __getattr__ method signature as def __getattr__(self, name: str) -> Any (import Any from typing if not already), and annotate the constructor as def __init__(self, ...) -> None (add -> None to its signature). Update imports to include Any when adding the __getattr__ return type.
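
The guard-object pattern under review can be shown in isolation. This is a minimal sketch with hypothetical names (`DisabledRNG`, `Sampler`), not the PR's classes. It uses `typing.NoReturn`, which is available on older Python versions; `typing.Never`, used in the proposed fix, was added in Python 3.11.

```python
from typing import NoReturn


class DisabledRNG:
    """Hypothetical guard object: any missing-attribute lookup raises."""

    def __getattr__(self, name: str) -> NoReturn:
        # __getattr__ is only consulted when normal attribute lookup fails,
        # and this class defines no other attributes, so every access such
        # as .integers or .normal ends up here.
        raise RuntimeError(
            f"NumPy RNG operation {name!r} is disabled; use the Python RNG instead."
        )


class Sampler:
    """Hypothetical owner that swaps its NumPy RNG slot for the guard."""

    def __init__(self) -> None:
        self._numpy_rng = DisabledRNG()
```

The annotation documents that the method never returns normally, so a type checker can treat code after such an attribute access as unreachable.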

coderabbitai · 2026-05-20T07:36:57Z

+        base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
+        return cls(base_seed, _internal=True)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
seed = 0
fallback = 123
print(seed or fallback)  # demonstrates current behavior picks fallback instead of 0
PY

Repository: ai-dynamo/aiperf

Length of output: 64

🏁 Script executed:

cat -n src/aiperf/common/hash_id_random_generator.py

Repository: ai-dynamo/aiperf

Length of output: 3480

🏁 Script executed:

rg "class RandomGenerator" -A 20 src/aiperf/

Repository: ai-dynamo/aiperf

Length of output: 1811

🏁 Script executed:

rg "self\.seed" src/aiperf/common/random_generator.py -B 2 -A 2

Repository: ai-dynamo/aiperf

Length of output: 42

🏁 Script executed:

cat -n src/aiperf/common/random_generator.py | head -80

Repository: ai-dynamo/aiperf

Length of output: 3380

🏁 Script executed:

cat -n src/aiperf/common/random_generator.py | sed -n '81,150p'

Repository: ai-dynamo/aiperf

Length of output: 3100

🏁 Script executed:

cat -n src/aiperf/common/random_generator.py | sed -n '118,122p'

Repository: ai-dynamo/aiperf

Length of output: 232

Fix seed value 0 being treated as falsy in from_base_rng.

Line 48 uses base_rng.seed or base_rng.randrange(0, 2**64), which treats seed value 0 as falsy. Since 0 is documented as a valid seed (per RandomGenerator's docstring: "Optional random seed (0 to 2^64-1)"), this breaks reproducibility when seed is legitimately 0. Replace the or operator with an explicit is not None check.

Proposed fix

-        base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
+        base_seed = base_rng.seed if base_rng.seed is not None else base_rng.randrange(0, 2**64)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
+        base_seed = base_rng.seed if base_rng.seed is not None else base_rng.randrange(0, 2**64)
         return cls(base_seed, _internal=True)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/common/hash_id_random_generator.py` around lines 48 - 49, from_base_rng incorrectly treats base_rng.seed == 0 as falsy; change the logic in from_base_rng to check explicitly for None (e.g., use "if base_rng.seed is not None then base_seed = base_rng.seed else base_seed = base_rng.randrange(0, 2**64)") so a legitimate seed of 0 is preserved; update the return path that constructs the instance (reference: class method from_base_rng and attribute base_rng.seed) to use that explicit None check instead of "or".
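
The difference between the two seed checks is easy to reproduce outside the codebase. The two helpers below are hypothetical and exist only to contrast the `or` fallback with the explicit `is not None` check discussed in this comment.

```python
from typing import Optional


def pick_seed_with_or(seed: Optional[int], fallback: int) -> int:
    # `or` falls through on every falsy value, so a seed of 0 is lost.
    return seed or fallback


def pick_seed_explicit(seed: Optional[int], fallback: int) -> int:
    # Only a missing seed (None) triggers the fallback; 0 is preserved.
    return seed if seed is not None else fallback
```

With `seed=0`, the first helper returns the fallback while the second returns 0; for `seed=None` both return the fallback.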

coderabbitai · 2026-05-20T07:36:57Z

+    if "prompt_corpus" in s and cli.prompt_corpus is not None:
+        prompts["corpus"] = cli.prompt_corpus


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

--prompt-corpus is silently discarded for file-dataset paths

Line 82 stores prompt_corpus under prompts, but file datasets later remove prompts in _apply_dataset_type (lines 330-338). That makes --prompt-corpus a no-op for --input-file flows instead of applying or failing fast. Please preserve corpus for supported file/trace loaders (or explicitly reject it for unsupported file formats) rather than dropping it silently.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/config/flags/_converter_dataset.py` around lines 82 - 83, The CLI `--prompt-corpus` value is being added to the local `prompts` dict (when `cli.prompt_corpus` is set) but later `_apply_dataset_type` removes `prompts` for file datasets, making `--prompt-corpus` a no-op; update `_apply_dataset_type` (or the dataset conversion flow that strips `prompts`) to either preserve `prompts["corpus"]` for supported file/trace loaders (check the loader type/format and pass `prompts` through to the file/trace loader constructors) or validate early and raise a clear error when a provided `cli.prompt_corpus` is incompatible with the chosen dataset type; locate references to `prompts`, `cli.prompt_corpus`, and `_apply_dataset_type` in _converter_dataset.py and implement the preservation or explicit rejection accordingly.
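
One way to read the two options in this comment is sketched below. It is not the converter's real code: the function name, the dict layout, and the `is_file` / `supports_corpus` parameters are all assumptions introduced for illustration.

```python
from typing import Any


def apply_dataset_type(
    dataset: dict[str, Any], *, is_file: bool, supports_corpus: bool
) -> dict[str, Any]:
    """Hypothetical converter step: strip prompts for file datasets, keep corpus."""
    result = dict(dataset)
    if not is_file:
        # Synthetic datasets keep their prompts section untouched.
        return result

    prompts = result.pop("prompts", None) or {}
    corpus = prompts.get("corpus")
    if corpus is None:
        return result
    if not supports_corpus:
        # Fail fast instead of silently discarding the user's flag.
        raise ValueError("--prompt-corpus is not supported for this dataset type")
    # Preserve only the corpus selection for file/trace loaders that honor it.
    result["prompts"] = {"corpus": corpus}
    return result
```

Either branch makes the flag's effect visible to the user: it is applied, or the run stops with an explicit error.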

coderabbitai · 2026-05-20T07:36:57Z

+    start = rng_to_use.randrange(corpus_len)
+    end = start + num_tokens
+
+    if end <= corpus_len:
+        tokens.extend(corpus[start:end])
+    else:
+        tokens.extend(corpus[start:])
+        tokens.extend(corpus[: end - corpus_len])


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix wrap logic to always return the requested token count.

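The current branch only wraps once. When start + num_tokens exceeds 2 × corpus_len, which requires num_tokens > corpus_len, the second slice is capped at corpus_len elements and the helper returns fewer tokens than requested, which breaks its “exact length” contract.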

Suggested fix

-    start = rng_to_use.randrange(corpus_len)
-    end = start + num_tokens
-
-    if end <= corpus_len:
-        tokens.extend(corpus[start:end])
-    else:
-        tokens.extend(corpus[start:])
-        tokens.extend(corpus[: end - corpus_len])
+    if num_tokens <= 0:
+        return tokens
+    if corpus_len == 0:
+        raise ValueError("corpus must be non-empty when num_tokens > 0")
+
+    start = rng_to_use.randrange(corpus_len)
+    tokens.extend(
+        corpus[(start + i) % corpus_len]
+        for i in range(num_tokens)
+    )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/dataset/generator/prompt.py` around lines 53 - 60, The wrap logic may return fewer than num_tokens when num_tokens > corpus_len because it only wraps once; change the logic in the token-selection section (around start = rng_to_use.randrange(corpus_len), end = start + num_tokens and the tokens.extend(...) branches) to guarantee exactly num_tokens elements by iterating or using modulo arithmetic over corpus indices (e.g., append corpus[(start + i) % corpus_len] for i in range(num_tokens) or loop extending slices until len(tokens) == num_tokens) so tokens always reaches the requested length regardless of num_tokens relative to corpus_len.
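
The shortfall is easiest to see with a tiny corpus. The two functions below are hypothetical reductions of the quoted branch and of the modulo-based suggestion; they take `start` as an explicit argument (the original draws it from an RNG) so the comparison is deterministic.

```python
def sample_single_wrap(corpus: list[int], num_tokens: int, start: int) -> list[int]:
    """Mirrors the single-wrap branch quoted in the review."""
    corpus_len = len(corpus)
    end = start + num_tokens
    tokens: list[int] = []
    if end <= corpus_len:
        tokens.extend(corpus[start:end])
    else:
        tokens.extend(corpus[start:])
        # This slice can never yield more than corpus_len items.
        tokens.extend(corpus[: end - corpus_len])
    return tokens


def sample_modulo(corpus: list[int], num_tokens: int, start: int) -> list[int]:
    """Mirrors the suggested fix: index modulo the corpus length."""
    if num_tokens <= 0:
        return []
    corpus_len = len(corpus)
    if corpus_len == 0:
        raise ValueError("corpus must be non-empty when num_tokens > 0")
    return [corpus[(start + i) % corpus_len] for i in range(num_tokens)]


corpus = [0, 1, 2, 3, 4]
short = sample_single_wrap(corpus, 12, start=3)  # 7 tokens, not 12
exact = sample_modulo(corpus, 12, start=3)       # exactly 12 tokens
```

For requests that fit within one wrap, both versions return the same tokens, so the modulo form only changes behavior in the case that was previously short.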

codecov · 2026-05-20T07:42:06Z

Codecov Report

❌ Patch coverage is 83.05085% with 10 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/aiperf/dataset/composer/custom.py	50.00%	3 Missing and 1 partial ⚠️
src/aiperf/dataset/generator/prompt.py	69.23%	2 Missing and 2 partials ⚠️
src/aiperf/common/hash_id_random_generator.py	90.00%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Route synthetic non-trace prompt generation through CodingContentGenerator when requested so MoE-oriented coding content applies beyond trace replay.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Anthony Casagrande <[email protected]>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/dataset/composer/test_synthetic_composer.py`:
- Line 33: The test names don't follow the required
test_<function>_<scenario>_<expected> convention; rename the function
test_initialization_with_coding_corpus and the other test mentioned at line 46
to follow that pattern (for example
test_initialization_with_coding_corpus_success or
test_initialization_with_coding_corpus_validates_tokenization) so they
explicitly state the function under test, the scenario, and the expected
outcome; update any references/imports in the test module (e.g., the class
containing these test methods and usages of mock_tokenizer) to use the new
names.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dfe021d5-aa5d-473c-b2b9-5f40ebb304ad

📥 Commits

Reviewing files that changed from the base of the PR and between efdba69 and 7e8edc5.

📒 Files selected for processing (3)

src/aiperf/dataset/composer/base.py
src/aiperf/dataset/generator/coding_content.py
tests/unit/dataset/composer/test_synthetic_composer.py

coderabbitai · 2026-05-22T06:37:28Z

        assert composer.include_image is False
        assert composer.include_audio is False

+    def test_initialization_with_coding_corpus(self, mock_tokenizer):


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename these tests to match the required naming contract.

Line 33 and Line 46 do not follow the test_<function>_<scenario>_<expected> convention.

Proposed rename

-    def test_initialization_with_coding_corpus(self, mock_tokenizer):
+    def test_initialization_coding_corpus_uses_coding_content_generator(self, mock_tokenizer):
@@
-    def test_coding_corpus_generates_context_prompts(self, mock_tokenizer):
+    def test_create_dataset_coding_corpus_populates_context_prompts(self, mock_tokenizer):

As per coding guidelines: "Test naming convention: test_<function>_<scenario>_<expected>".

Also applies to: 46-46

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/dataset/composer/test_synthetic_composer.py` at line 33, The test names don't follow the required test_<function>_<scenario>_<expected> convention; rename the function test_initialization_with_coding_corpus and the other test mentioned at line 46 to follow that pattern (for example test_initialization_with_coding_corpus_success or test_initialization_with_coding_corpus_validates_tokenization) so they explicitly state the function under test, the scenario, and the expected outcome; update any references/imports in the test module (e.g., the class containing these test methods and usages of mock_tokenizer) to use the new names.

github-actions Bot added the feat label May 20, 2026

coderabbitai Bot reviewed May 20, 2026

coderabbitai Bot reviewed May 22, 2026

ajcasagrande added the AgentX Feature for AgentX label May 22, 2026

ajcasagrande wants to merge 2 commits into main from ajc/coding-content-generator
