feat(dataset): add CodingContentGenerator for pseudo-realistic coding-trace replay #968
ajcasagrande wants to merge 2 commits into
…-trace replay

Adds a token-pool prompt generator that produces structurally plausible, template-filled coding content (code, bash output, JSON, error tracebacks, git diffs, CI output, configs, markdown, test output, user prompts). The output is not real code; it's filler whose token distribution approximates real coding-agent traffic closely enough to drive realistic expert-routing patterns on Mixture-of-Experts models.

PromptGenerator's Shakespeare corpus routes to a narrow subset of experts (English-prose experts dominate), which is a poor benchmark proxy for production coding workloads. Coding-flavored content activates the broader set of experts that real agentic-coding traffic hits, so MoE serving metrics (per-expert load, hot/cold imbalance, throughput under realistic expert pressure) measured under this corpus are representative of agentic-coding production rather than literature.

This completes #839/#943: main already ships the agentic_code_gen session SYNTHESIZER (hash_ids + input/output lengths + delays in Mooncake-style JSONL), but has no coding-text generator on the runtime side. Without this PR, synthesized sessions executed against a real model use Shakespeare filler and the resulting MoE expert-routing pattern is unrepresentative.

Added:
- src/aiperf/dataset/generator/coding_content.py: CodingContentGenerator (4082 lines; ~673 vocab tuples + ~1386 lines inside triple-quoted code template literals + ~1.3k lines of generator scaffolding)
- src/aiperf/common/hash_id_random_generator.py: deterministic per-hash-id RNG used by the generator's tool-pool sampling
- tests/unit/dataset/generator/test_coding_content_generator.py: 76 unit tests covering init, generate(), token sequence building, sampling, and per-language template families
- PromptCorpus enum (sonnet/coding) in aiperf.common.enums
- corpus field on PromptConfig with CLI flag --prompt-corpus (the Pydantic field drops the parent-prefix to match the existing isl/osl/block_size/batch_size convention; the flat CLI field stays prompt_corpus to match prompt_batch_size / prompt_sequence_distribution)
- default_prompt_corpus field on Custom/Public DatasetLoaderMetadata (declared as Literal["sonnet", "coding"] in schemas.py to avoid the aiperf.common.enums -> aiperf.plugin chain circular import)
- sample_tokens_from_corpus() module-level helper on prompt.py (used by CodingContentGenerator to build its tool/text pools without sharing PromptGenerator state)
- composer/custom.py wiring: when an is_trace loader runs and the effective corpus is CODING, swap in CodingContentGenerator instead of the default Shakespeare PromptGenerator. Reads corpus from the synthetic-dataset prompts config when present and falls back to the loader plugin's default_prompt_corpus.

Not included (out of scope for this PR, can land separately):
- composer/public.py trace-loader wiring (main's PublicDatasetLoaderMetadata has no is_trace field; branch's _inject_trace_kwargs depends on it)
- Weka/semianalysis trace loaders that would set default_prompt_corpus=coding
- The Weka-specific BPE-stable-terminator + hash-id RNG threading work on PromptGenerator (Weka trace replay concern, not coding-content)

Verified:
- pytest tests/unit/ -n auto: 12749 passed, 79 skipped
- make validate-plugin-schemas: 33 categories, 219 plugins clean
- make generate-all-plugin-files: schemas regenerated
- pre-commit run: all hooks pass (ergonomics baseline regenerated to accept the corpus-as-data file/function sizes)

Signed-off-by: Anthony Casagrande <[email protected]>
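The corpus-selection precedence described in this commit message can be sketched roughly as follows. This is an illustration only, not the code in composer/custom.py: the helper names `resolve_effective_corpus` and `should_use_coding_generator` are hypothetical, and the enum below is a stand-in that borrows only the `sonnet`/`coding` values and the "explicit config first, loader default second" order from the text above.

```python
from enum import Enum
from typing import Optional


class PromptCorpus(str, Enum):
    """Stand-in for the PromptCorpus enum named in the PR (values from the description)."""

    SONNET = "sonnet"
    CODING = "coding"


def resolve_effective_corpus(
    configured: Optional[PromptCorpus],
    loader_default: str = "sonnet",
) -> PromptCorpus:
    """Hypothetical helper: an explicit config value wins, else the loader default applies."""
    if configured is not None:
        return configured
    return PromptCorpus(loader_default)


def should_use_coding_generator(is_trace: bool, corpus: PromptCorpus) -> bool:
    """Hypothetical predicate mirroring the described swap condition for trace loaders."""
    return is_trace and corpus is PromptCorpus.CODING
```

With a default of `sonnet`, nothing changes unless either the config or the loader metadata asks for `coding`, which matches the opt-in behavior the commit message describes.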
Try out this PR
Quick install:
pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@7e8edc5215229d2b8cccc73800d22854b4e6ba9e
Recommended with virtual environment (using uv):
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@7e8edc5215229d2b8cccc73800d22854b4e6ba9e
Walkthrough
This PR introduces a
Changes
Prompt Corpus Selection and Dataset Generation
🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/aiperf/common/hash_id_random_generator.py`:
- Line 28: Add missing type hints: annotate the __getattr__ method signature as
def __getattr__(self, name: str) -> Any (import Any from typing if not already),
and annotate the constructor as def __init__(self, ...) -> None (add -> None to
its signature). Update imports to include Any when adding the __getattr__ return
type.
- Around line 48-49: from_base_rng incorrectly treats base_rng.seed == 0 as
falsy; change the logic in from_base_rng to check explicitly for None (e.g., use
"if base_rng.seed is not None then base_seed = base_rng.seed else base_seed =
base_rng.randrange(0, 2**64)") so a legitimate seed of 0 is preserved; update
the return path that constructs the instance (reference: class method
from_base_rng and attribute base_rng.seed) to use that explicit None check
instead of "or".
In `@src/aiperf/config/flags/_converter_dataset.py`:
- Around line 82-83: The CLI `--prompt-corpus` value is being added to the local
`prompts` dict (when `cli.prompt_corpus` is set) but later `_apply_dataset_type`
removes `prompts` for file datasets, making `--prompt-corpus` a no-op; update
`_apply_dataset_type` (or the dataset conversion flow that strips `prompts`) to
either preserve `prompts["corpus"]` for supported file/trace loaders (check the
loader type/format and pass `prompts` through to the file/trace loader
constructors) or validate early and raise a clear error when a provided
`cli.prompt_corpus` is incompatible with the chosen dataset type; locate
references to `prompts`, `cli.prompt_corpus`, and `_apply_dataset_type` in
_converter_dataset.py and implement the preservation or explicit rejection
accordingly.
In `@src/aiperf/dataset/generator/prompt.py`:
- Around line 53-60: The wrap logic may return fewer than num_tokens when
num_tokens > corpus_len because it only wraps once; change the logic in the
token-selection section (around start = rng_to_use.randrange(corpus_len), end =
start + num_tokens and the tokens.extend(...) branches) to guarantee exactly
num_tokens elements by iterating or using modulo arithmetic over corpus indices
(e.g., append corpus[(start + i) % corpus_len] for i in range(num_tokens) or
loop extending slices until len(tokens) == num_tokens) so tokens always reaches
the requested length regardless of num_tokens relative to corpus_len.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 23f3a421-fa90-4ad9-aa86-e9e3f3a24bb3
📒 Files selected for processing (16)
- docs/cli-options.md
- src/aiperf/common/enums/__init__.py
- src/aiperf/common/enums/enums.py
- src/aiperf/common/hash_id_random_generator.py
- src/aiperf/config/dataset/content.py
- src/aiperf/config/flags/_converter_dataset.py
- src/aiperf/config/flags/_section_fields.py
- src/aiperf/config/flags/cli_config.py
- src/aiperf/config/schema/aiperf-config.schema.json
- src/aiperf/dataset/composer/custom.py
- src/aiperf/dataset/generator/coding_content.py
- src/aiperf/dataset/generator/prompt.py
- src/aiperf/plugin/schema/plugins.schema.json
- src/aiperf/plugin/schema/schemas.py
- tests/unit/dataset/generator/test_coding_content_generator.py
- tools/ergonomics_baseline.json
    class _DisabledNumpyRNG:
        """Raises on any attribute access to prevent NumPy RNG usage."""

        def __getattr__(self, name):
🧩 Analysis chain
🏁 Script executed:
cat -n src/aiperf/common/hash_id_random_generator.py
Repository: ai-dynamo/aiperf
Length of output: 3480
Add missing function type hints.
__getattr__ on line 28 is missing type annotations for the name parameter and return type. __init__ on line 51 is missing the -> None return type annotation. Per coding guidelines, all functions require complete type hints (params and return).
Proposed fix
+from typing import Never
+
 class _DisabledNumpyRNG:
     """Raises on any attribute access to prevent NumPy RNG usage."""

-    def __getattr__(self, name):
+    def __getattr__(self, name: str) -> Never:
         raise RuntimeError(
             "HashIdRandomGenerator does not support NumPy RNG operations. "
             "Use Python RNG methods (randrange, choice, etc.) instead."
         )


 class HashIdRandomGenerator(RandomGenerator):
     """RandomGenerator that re-seeds deterministically per (trace_id, hash_id).

     Designed for parallel processing where multiple workers need to generate
     identical content for the same hash_id within a trace file.

     Thread Safety:
         NOT thread-safe. Each worker process must have its own instance.
     """

     @classmethod
     def from_base_rng(cls, base_rng: RandomGenerator) -> "HashIdRandomGenerator":
         """Create from a base RandomGenerator (typically from rng.derive())."""
         base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
         return cls(base_seed, _internal=True)

-    def __init__(self, base_seed: int, *, _internal: bool = False):
+    def __init__(self, base_seed: int, *, _internal: bool = False) -> None:
         super().__init__(base_seed, _internal=_internal)
         self._numpy_rng = _DisabledNumpyRNG()
         self._trace_id: str = ""
        base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
        return cls(base_seed, _internal=True)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
seed = 0
fallback = 123
print(seed or fallback)  # demonstrates current behavior picks fallback instead of 0
PY
Repository: ai-dynamo/aiperf
Length of output: 64
🏁 Script executed:
cat -n src/aiperf/common/hash_id_random_generator.py
Repository: ai-dynamo/aiperf
Length of output: 3480
🏁 Script executed:
rg "class RandomGenerator" -A 20 src/aiperf/
Repository: ai-dynamo/aiperf
Length of output: 1811
🏁 Script executed:
rg "self\.seed" src/aiperf/common/random_generator.py -B 2 -A 2
Repository: ai-dynamo/aiperf
Length of output: 42
🏁 Script executed:
cat -n src/aiperf/common/random_generator.py | head -80
Repository: ai-dynamo/aiperf
Length of output: 3380
🏁 Script executed:
cat -n src/aiperf/common/random_generator.py | sed -n '81,150p'
Repository: ai-dynamo/aiperf
Length of output: 3100
🏁 Script executed:
cat -n src/aiperf/common/random_generator.py | sed -n '118,122p'
Repository: ai-dynamo/aiperf
Length of output: 232
Fix seed value 0 being treated as falsy in from_base_rng.
Line 48 uses base_rng.seed or base_rng.randrange(0, 2**64), which treats seed value 0 as falsy. Since 0 is documented as a valid seed (per RandomGenerator's docstring: "Optional random seed (0 to 2^64-1)"), this breaks reproducibility when seed is legitimately 0. Replace the or operator with an explicit is not None check.
Proposed fix
-        base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
+        base_seed = base_rng.seed if base_rng.seed is not None else base_rng.randrange(0, 2**64)
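The difference between the two expressions can be seen in isolation. The sketch below uses the standard library's `random.Random` as a stand-in for the project's `RandomGenerator`; the function names `pick_seed_with_or` and `pick_seed_explicit` are hypothetical and exist only to contrast the two forms.

```python
import random
from typing import Optional


def pick_seed_with_or(seed: Optional[int], rng: random.Random) -> int:
    """Buggy form: 0 is falsy, so a seed of 0 is replaced by a random draw."""
    return seed or rng.randrange(0, 2**64)


def pick_seed_explicit(seed: Optional[int], rng: random.Random) -> int:
    """Fixed form: only a missing seed (None) triggers the random draw."""
    return seed if seed is not None else rng.randrange(0, 2**64)
```

Both functions agree for every non-zero seed and for `None`; they differ only for `seed == 0`, which is exactly the case the review comment flags.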
    if "prompt_corpus" in s and cli.prompt_corpus is not None:
        prompts["corpus"] = cli.prompt_corpus
--prompt-corpus is silently discarded for file-dataset paths
Line [82] stores prompt_corpus under prompts, but file datasets later remove prompts in _apply_dataset_type (Lines [330]-[338]). That makes --prompt-corpus a no-op for --input-file flows instead of applying or failing fast. Please preserve corpus for supported file/trace loaders (or explicitly reject it for unsupported file formats) rather than dropping it silently.
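One way to read the "explicitly reject" option is a small guard that runs before `prompts` is stripped. The sketch below is purely illustrative: `check_prompt_corpus`, the `dataset_kind` strings, and the set of corpus-aware kinds are all assumptions made for this example, not names taken from `_converter_dataset.py`.

```python
from typing import Optional

# Assumption: dataset kinds whose loaders can honor a corpus choice.
CORPUS_AWARE_KINDS = frozenset({"synthetic", "trace"})


def check_prompt_corpus(dataset_kind: str, prompt_corpus: Optional[str]) -> None:
    """Hypothetical guard: fail fast instead of silently dropping --prompt-corpus."""
    if prompt_corpus is None:
        return  # flag not given, nothing to validate
    if dataset_kind not in CORPUS_AWARE_KINDS:
        raise ValueError(
            f"--prompt-corpus={prompt_corpus} has no effect for "
            f"dataset kind {dataset_kind!r}; remove the flag or use a "
            "corpus-aware dataset."
        )
```

The alternative the comment mentions, preserving `prompts["corpus"]` for supported loaders, would instead pass the value through rather than raising; which of the two fits depends on the loader formats the project wants to support.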
    start = rng_to_use.randrange(corpus_len)
    end = start + num_tokens

    if end <= corpus_len:
        tokens.extend(corpus[start:end])
    else:
        tokens.extend(corpus[start:])
        tokens.extend(corpus[: end - corpus_len])
Fix wrap logic to always return the requested token count.
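The current branch wraps only once: the second slice, corpus[: end - corpus_len], can contribute at most corpus_len items. Whenever start + num_tokens exceeds 2 * corpus_len, the helper returns fewer tokens than requested, which breaks its “exact length” contract.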
Suggested fix
-    start = rng_to_use.randrange(corpus_len)
-    end = start + num_tokens
-
-    if end <= corpus_len:
-        tokens.extend(corpus[start:end])
-    else:
-        tokens.extend(corpus[start:])
-        tokens.extend(corpus[: end - corpus_len])
+    if num_tokens <= 0:
+        return tokens
+    if corpus_len == 0:
+        raise ValueError("corpus must be non-empty when num_tokens > 0")
+
+    start = rng_to_use.randrange(corpus_len)
+    tokens.extend(
+        corpus[(start + i) % corpus_len]
+        for i in range(num_tokens)
+    )
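The modulo-based approach from the suggested fix can be tried as a stand-alone function. This is a sketch under stated assumptions: `sample_wrapped` is a hypothetical name, it uses `random.Random` rather than the project's RNG, and it does not reproduce the real signature of `sample_tokens_from_corpus()`.

```python
import random
from typing import List, Sequence


def sample_wrapped(
    corpus: Sequence[int], num_tokens: int, rng: random.Random
) -> List[int]:
    """Return exactly num_tokens items, reading the corpus circularly from a random start."""
    if num_tokens <= 0:
        return []
    if not corpus:
        raise ValueError("corpus must be non-empty when num_tokens > 0")
    corpus_len = len(corpus)
    start = rng.randrange(corpus_len)
    # Modulo indexing wraps as many times as needed, so the output length
    # never depends on how num_tokens compares with corpus_len.
    return [corpus[(start + i) % corpus_len] for i in range(num_tokens)]
```

For example, with a 5-token corpus and `num_tokens=13` the result always has 13 items and cycles through the corpus in order, whereas the single-wrap version can return at most `corpus_len - start + corpus_len` items.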
Route synthetic non-trace prompt generation through CodingContentGenerator when requested so MoE-oriented coding content applies beyond trace replay.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Anthony Casagrande <[email protected]>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unit/dataset/composer/test_synthetic_composer.py`:
- Line 33: The test names don't follow the required
test_<function>_<scenario>_<expected> convention; rename the function
test_initialization_with_coding_corpus and the other test mentioned at line 46
to follow that pattern (for example
test_initialization_with_coding_corpus_success or
test_initialization_with_coding_corpus_validates_tokenization) so they
explicitly state the function under test, the scenario, and the expected
outcome; update any references/imports in the test module (e.g., the class
containing these test methods and usages of mock_tokenizer) to use the new
names.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: dfe021d5-aa5d-473c-b2b9-5f40ebb304ad
📒 Files selected for processing (3)
- src/aiperf/dataset/composer/base.py
- src/aiperf/dataset/generator/coding_content.py
- tests/unit/dataset/composer/test_synthetic_composer.py
        assert composer.include_image is False
        assert composer.include_audio is False

    def test_initialization_with_coding_corpus(self, mock_tokenizer):
Rename these tests to match the required naming contract.
Line 33 and Line 46 do not follow the test_<function>_<scenario>_<expected> convention.
Proposed rename
-    def test_initialization_with_coding_corpus(self, mock_tokenizer):
+    def test_initialization_coding_corpus_uses_coding_content_generator(self, mock_tokenizer):
@@
-    def test_coding_corpus_generates_context_prompts(self, mock_tokenizer):
+    def test_create_dataset_coding_corpus_populates_context_prompts(self, mock_tokenizer):
As per coding guidelines: "Test naming convention: test_<function>_<scenario>_<expected>".
Also applies to: 46-46
Summary
Adds the runtime coding-text content layer that completes the agentic-coding-trace pipeline. Main already ships the session structure synthesizer (`src/aiperf/dataset/agentic_code_gen/`, PRs #839/#943), which emits hash_ids + input/output lengths + delays in Mooncake-style JSONL, but no coding-text generator on the runtime side. Without this PR, sessions produced by `agentic_code_gen` are executed against the default Shakespeare `PromptGenerator` and produce unrepresentative MoE expert-routing patterns.

Why it matters (MoE, not tokenizers)
Mixture-of-Experts models route tokens to different experts based on content. Shakespeare routes to a narrow subset of experts (English-prose experts dominate). Real agentic-coding traffic activates the broader expert set hit by code, bash, JSON, error tracebacks, git diffs, configs, markdown, etc. So MoE serving metrics (per-expert load, hot/cold imbalance, throughput under realistic expert pressure) measured under Shakespeare are systematically biased away from coding-agent production. This PR replaces the corpus when an agentic-coding trace loader runs.

What's in the PR
- `src/aiperf/dataset/generator/coding_content.py`: `CodingContentGenerator` (4082 lines: ~673 vocab tuples + ~1386 lines inside 81 triple-quoted code template literals + ~1.3k lines of generator scaffolding; the file is mostly corpus data, not logic).
- `src/aiperf/common/hash_id_random_generator.py`: deterministic per-hash-id RNG used by the tool-pool sampler.
- `src/aiperf/dataset/generator/prompt.py`: module-level `sample_tokens_from_corpus()` helper so `CodingContentGenerator` can build its pools without sharing `PromptGenerator` state.
- `PromptCorpus` enum (sonnet/coding) in `aiperf.common.enums`.
- `corpus` field on `PromptConfig` (drops the parent-prefix to match the existing `isl/osl/block_size/batch_size` convention on `PromptConfig`).
- `--prompt-corpus` (`cli_config.py:prompt_corpus`): stays prefixed to match the existing `prompt_batch_size/prompt_sequence_distribution` flat-CLI convention.
- `default_prompt_corpus` field on `Custom/PublicDatasetLoaderMetadata` (declared as `Literal["sonnet", "coding"]` in `schemas.py` to dodge the `aiperf.common.enums` → `aiperf.plugin` chain circular import; mirrors `PromptCorpus.value`).
- `composer/custom.py` wiring: when an `is_trace` loader runs and the effective corpus is `CODING`, swap in `CodingContentGenerator` instead of the default Shakespeare `PromptGenerator`. Reads `corpus` from `self._synthetic_prompts` when set; otherwise falls back to `loader_metadata.default_prompt_corpus` (which defaults to `sonnet`, preserving existing behavior).
- `tests/unit/dataset/generator/test_coding_content_generator.py`: 76 unit tests covering init, generate(), token-sequence building, sampling, and per-language template families.

What's out of scope (deliberately deferred to follow-up PRs)
- `composer/public.py` trace-loader wiring: main's `PublicDatasetLoaderMetadata` has no `is_trace` field; the branch's `_inject_trace_kwargs` depends on it. Adding `is_trace` to public-loader metadata is a separate, broader change.
- `semianalysis_cc_traces_weka_no_subagents` HF loaders that would set `default_prompt_corpus: coding`. These are the consumers of this generator; they'll land alongside the Weka trace replay work.
- `PromptGenerator` that live on the source branch: those are Weka trace-replay concerns, not coding-content concerns.

Notes
- `default_prompt_corpus` defaulting to `sonnet` everywhere and no loader on `main` setting `coding`, the new code path is opt-in via `--prompt-corpus coding` only. Behavior is unchanged unless the user (or a future loader plugin) requests it.
- `tools/ergonomics_baseline.json` was regenerated to accept the corpus-as-data file/function sizes in `coding_content.py`. The 4082-line file is overwhelmingly hand-written templates and vocabulary tuples; splitting it doesn't make the data smaller and would obscure the per-template-family co-location.

Test plan
- `uv run pytest tests/unit/ -n auto`: 12749 passed, 79 skipped
- `make validate-plugin-schemas`: 33 categories, 219 plugins clean
- `make generate-all-plugin-files`: schemas regenerated, no drift
- `pre-commit run`: all hooks pass
- `CodingContentGenerator(...).generate_prompt(num_tokens=64)` against a real `gpt2` tokenizer end-to-end
- `_gen_python_*`, `_gen_go_*`, `_gen_rust_*`, `_gen_ts_*`, `_gen_tool_*`)

Summary by CodeRabbit
New Features
- `--prompt-corpus` CLI flag to choose synthetic prompt source (`sonnet` or `coding`)
Documentation
- `--prompt-corpus` option and loader-default behavior
Schema
Tests