
feat(dataset): add CodingContentGenerator for pseudo-realistic coding-trace replay #968

Open: ajcasagrande wants to merge 2 commits into main from ajc/coding-content-generator

@ajcasagrande
Contributor

@ajcasagrande ajcasagrande commented May 20, 2026

Summary

Adds the runtime coding-text content layer that completes the agentic-coding-trace pipeline. Main already ships the session structure synthesizer (src/aiperf/dataset/agentic_code_gen/, PRs #839/#943) — hash_ids + input/output lengths + delays in Mooncake-style JSONL — but no coding-text generator on the runtime side. Without this PR, sessions produced by agentic_code_gen are executed against the default Shakespeare PromptGenerator and produce unrepresentative MoE expert-routing patterns.

Why it matters (MoE, not tokenizers)

Mixture-of-Experts models route tokens to different experts based on content. Shakespeare routes to a narrow subset of experts (English-prose experts dominate). Real agentic-coding traffic activates the broader expert set hit by code, bash, JSON, error tracebacks, git diffs, configs, markdown, etc. So MoE serving metrics (per-expert load, hot/cold imbalance, throughput under realistic expert pressure) measured under Shakespeare are systematically biased away from coding-agent production. This PR replaces the corpus when an agentic-coding trace loader runs.
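To make "hot/cold imbalance" concrete, here is a minimal sketch of one possible imbalance measure: the busiest expert's load divided by the mean load. The metric definition, the function name, and the routing lists are illustrative assumptions, not something this PR or aiperf defines.

```python
from collections import Counter


def expert_load_imbalance(assignments: list[int], num_experts: int) -> float:
    """Return max expert load / mean expert load (1.0 means perfectly even)."""
    if not assignments:
        raise ValueError("assignments must be non-empty")
    counts = Counter(assignments)
    loads = [counts.get(expert, 0) for expert in range(num_experts)]
    mean_load = sum(loads) / num_experts
    return max(loads) / mean_load


# Hypothetical routing decisions for 8 tokens over 4 experts.
prose_like = [0, 0, 0, 0, 0, 0, 1, 1]  # concentrated on two experts
code_like = [0, 1, 2, 3, 0, 1, 2, 3]   # spread across all four

print(expert_load_imbalance(prose_like, 4))
print(expert_load_imbalance(code_like, 4))
```

A ratio near 1.0 indicates evenly loaded experts; larger values indicate that a few experts absorb most of the traffic.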

What's in the PR

  • src/aiperf/dataset/generator/coding_content.py: CodingContentGenerator (4082 lines: ~673 vocab tuples + ~1386 lines inside 81 triple-quoted code template literals + ~1.3k lines of generator scaffolding; the file is mostly corpus data, not logic).
  • src/aiperf/common/hash_id_random_generator.py — deterministic per-hash-id RNG used by the tool-pool sampler.
  • src/aiperf/dataset/generator/prompt.py — module-level sample_tokens_from_corpus() helper so CodingContentGenerator can build its pools without sharing PromptGenerator state.
  • PromptCorpus enum (sonnet / coding) in aiperf.common.enums.
  • corpus field on PromptConfig (drops the parent-prefix to match the existing isl/osl/block_size/batch_size convention on PromptConfig).
  • Flat CLI flag --prompt-corpus (cli_config.py:prompt_corpus) — stays prefixed to match the existing prompt_batch_size / prompt_sequence_distribution flat-CLI convention.
  • default_prompt_corpus field on Custom/PublicDatasetLoaderMetadata (declared as Literal["sonnet", "coding"] in schemas.py to dodge the aiperf.common.enums -> aiperf.plugin chain circular import; mirrors PromptCorpus.value).
  • composer/custom.py wiring: when an is_trace loader runs and the effective corpus is CODING, swap in CodingContentGenerator instead of the default Shakespeare PromptGenerator. Reads corpus from self._synthetic_prompts when set; otherwise falls back to loader_metadata.default_prompt_corpus (which defaults to sonnet, preserving existing behavior).
  • tests/unit/dataset/generator/test_coding_content_generator.py — 76 unit tests covering init, generate(), token-sequence building, sampling, and per-language template families.
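The corpus fallback described in the list above can be sketched roughly as follows. This is a minimal illustration, assuming a plain string-valued enum and a free function; the real PromptCorpus base class and the composer code in aiperf may look different, and `resolve_corpus` and `CorpusName` are hypothetical names.

```python
from enum import Enum
from typing import Literal, Optional


class PromptCorpus(str, Enum):
    """Stand-in for the PR's enum (assumed to be string-valued)."""

    SONNET = "sonnet"
    CODING = "coding"


# Mirrors the enum values without importing the enum module,
# the same trick the PR describes for schemas.py.
CorpusName = Literal["sonnet", "coding"]


def resolve_corpus(
    explicit: Optional[PromptCorpus],
    loader_default: CorpusName = "sonnet",
) -> PromptCorpus:
    """Prefer an explicitly configured corpus, else the loader's default."""
    if explicit is not None:
        return explicit
    return PromptCorpus(loader_default)


print(resolve_corpus(None))                          # loader default applies
print(resolve_corpus(PromptCorpus.CODING, "sonnet"))  # explicit choice wins
```

Because the default is "sonnet" at every level, the coding path is only reached when something asks for it explicitly.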

What's out of scope (deliberately deferred to follow-up PRs)

  • composer/public.py trace-loader wiring — main's PublicDatasetLoaderMetadata has no is_trace field; the branch's _inject_trace_kwargs depends on it. Adding is_trace to public-loader metadata is a separate, broader change.
  • Weka / semianalysis_cc_traces_weka_no_subagents HF loaders that would set default_prompt_corpus: coding. These are the consumers of this generator; they'll land alongside the Weka trace replay work.
  • The Weka-specific BPE-stable-terminator + hash-id-RNG threading additions to PromptGenerator that live on the source branch — those are Weka trace-replay concerns, not coding-content concerns.

Notes

  • Pure additive change. With default_prompt_corpus defaulting to sonnet everywhere and no loader on main setting coding, the new code path is opt-in via --prompt-corpus coding only. Behavior is unchanged unless the user (or a future loader plugin) requests it.
  • tools/ergonomics_baseline.json was regenerated to accept the corpus-as-data file/function sizes in coding_content.py. The 4082-line file is overwhelmingly hand-written templates and vocabulary tuples; splitting it doesn't make the data smaller and would obscure the per-template-family co-location.

Test plan

  • uv run pytest tests/unit/ -n auto — 12749 passed, 79 skipped
  • make validate-plugin-schemas — 33 categories, 219 plugins clean
  • make generate-all-plugin-files — schemas regenerated, no drift
  • pre-commit run — all hooks pass
  • Smoke-tested CodingContentGenerator(...).generate_prompt(num_tokens=64) against a real gpt2 tokenizer end-to-end
  • Reviewer to spot-check a representative sample of generated templates by language family (_gen_python_*, _gen_go_*, _gen_rust_*, _gen_ts_*, _gen_tool_*)

Summary by CodeRabbit

  • New Features

    • Added --prompt-corpus CLI flag to choose synthetic prompt source (sonnet or coding)
    • Added support for a “coding” prompt corpus and deterministic per-trace/hash prompt generation
  • Documentation

    • Documented the new --prompt-corpus option and loader-default behavior
  • Schema

    • Added plugin metadata field to declare a loader’s default prompt corpus
  • Tests

    • Added comprehensive tests for coding prompt generation and composer behavior

Review Change Stack

…-trace replay

Adds a token-pool prompt generator that produces structurally plausible,
template-filled coding content (code, bash output, JSON, error tracebacks,
git diffs, CI output, configs, markdown, test output, user prompts). The
output is not real code — it's filler whose token distribution approximates
real coding-agent traffic closely enough to drive realistic expert-routing
patterns on Mixture-of-Experts models. PromptGenerator's Shakespeare corpus
routes to a narrow subset of experts (English-prose experts dominate), which
is a poor benchmark proxy for production coding workloads — coding-flavored
content activates the broader set of experts that real agentic-coding
traffic hits, so MoE serving metrics (per-expert load, hot/cold imbalance,
throughput under realistic expert pressure) measured under this corpus are
representative of agentic-coding production rather than literature.

This completes #839/#943: main already ships the agentic_code_gen session
SYNTHESIZER (hash_ids + input/output lengths + delays in Mooncake-style
JSONL), but has no coding-text generator on the runtime side. Without this
PR, synthesized sessions executed against a real model use Shakespeare
filler and the resulting MoE expert-routing pattern is unrepresentative.

Added:
- src/aiperf/dataset/generator/coding_content.py: CodingContentGenerator
  (4082 lines; ~673 vocab tuples + ~1386 lines inside triple-quoted code
  template literals + ~1.3k lines of generator scaffolding)
- src/aiperf/common/hash_id_random_generator.py: deterministic per-hash-id
  RNG used by the generator's tool-pool sampling
- tests/unit/dataset/generator/test_coding_content_generator.py: 76 unit
  tests covering init, generate(), token sequence building, sampling, and
  per-language template families
- PromptCorpus enum (sonnet/coding) in aiperf.common.enums
- corpus field on PromptConfig with CLI flag --prompt-corpus (the Pydantic
  field drops the parent-prefix to match the existing isl/osl/block_size/
  batch_size convention; the flat CLI field stays prompt_corpus to match
  prompt_batch_size / prompt_sequence_distribution)
- default_prompt_corpus field on Custom/Public DatasetLoaderMetadata
  (declared as Literal["sonnet", "coding"] in schemas.py to avoid the
  aiperf.common.enums -> aiperf.plugin chain circular import)
- sample_tokens_from_corpus() module-level helper on prompt.py (used by
  CodingContentGenerator to build its tool/text pools without sharing
  PromptGenerator state)
- composer/custom.py wiring: when an is_trace loader runs and the
  effective corpus is CODING, swap in CodingContentGenerator instead of
  the default Shakespeare PromptGenerator. Reads corpus from the
  synthetic-dataset prompts config when present and falls back to the
  loader plugin's default_prompt_corpus.

Not included (out of scope for this PR, can land separately):
- composer/public.py trace-loader wiring (main's PublicDatasetLoaderMetadata
  has no is_trace field; branch's _inject_trace_kwargs depends on it)
- Weka/semianalysis trace loaders that would set default_prompt_corpus=coding
- The Weka-specific BPE-stable-terminator + hash-id RNG threading work on
  PromptGenerator (Weka trace replay concern, not coding-content)

Verified:
- pytest tests/unit/ -n auto: 12749 passed, 79 skipped
- make validate-plugin-schemas: 33 categories, 219 plugins clean
- make generate-all-plugin-files: schemas regenerated
- pre-commit run: all hooks pass (ergonomics baseline regenerated to
  accept the corpus-as-data file/function sizes)

Signed-off-by: Anthony Casagrande <[email protected]>
@copy-pr-bot

copy-pr-bot Bot commented May 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented May 20, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@7e8edc5215229d2b8cccc73800d22854b4e6ba9e

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@7e8edc5215229d2b8cccc73800d22854b4e6ba9e

Last updated for commit: 7e8edc5

@github-actions github-actions Bot added the feat label May 20, 2026
@coderabbitai

coderabbitai Bot commented May 20, 2026

Walkthrough

This PR introduces a --prompt-corpus CLI option to select synthetic prompt corpus sources (sonnet or coding) for dataset generation. The change flows from CLI flags through configuration objects, plugin metadata defaults, dataset composer logic, and a token-sampling helper, and includes deterministic per-hash-ID RNG support and tests.

Changes

Prompt Corpus Selection and Dataset Generation

| Layer | File(s) | Summary |
| --- | --- | --- |
| PromptCorpus enum and type definitions | src/aiperf/common/enums/enums.py, src/aiperf/common/enums/__init__.py, src/aiperf/config/schema/aiperf-config.schema.json | New PromptCorpus enum with SONNET and CODING members; re-exported via enums package; schema definition added to config. |
| CLI and configuration integration | src/aiperf/config/flags/cli_config.py, src/aiperf/config/dataset/content.py, src/aiperf/config/flags/_converter_dataset.py, src/aiperf/config/flags/_section_fields.py, docs/cli-options.md | CLIConfig and PromptConfig gain prompt_corpus fields; converters and input field tracking updated; CLI docs describe corpus options and loader-default fallback behavior. |
| Plugin metadata schema for corpus defaults | src/aiperf/plugin/schema/schemas.py, src/aiperf/plugin/schema/plugins.schema.json | Both JSON and Python schemas add default_prompt_corpus to loader metadata, specifying which corpus a plugin uses when not explicitly overridden. |
| HashIdRandomGenerator for deterministic trace processing | src/aiperf/common/hash_id_random_generator.py | New class providing per-(trace_id, hash_id) deterministic random sequences via SHA-256 re-seeding, with NumPy RNG disabling. |
| Token sampling utility for corpus-based generation | src/aiperf/dataset/generator/prompt.py | New sample_tokens_from_corpus function samples token sequences with optional separator prepending and wrap-around when span exceeds corpus length. |
| Trace dataset loader routing by corpus | src/aiperf/dataset/composer/custom.py | CustomDatasetComposer conditionally initializes CodingContentGenerator for CODING corpus or uses existing prompt generator otherwise. |
| BaseDatasetComposer prompt-generator factory | src/aiperf/dataset/composer/base.py | Adds _create_prompt_generator helper and updates prompt_generator typing/initialization to support CodingContentGenerator when corpus is CODING. |
| Synthetic composer unit tests | tests/unit/dataset/composer/test_synthetic_composer.py | Tests verifying CodingContentGenerator selection and that coding-corpus options populate coding-specific prompt fields. |
| CodingContentGenerator unit tests | tests/unit/dataset/generator/test_coding_content_generator.py | Comprehensive test suite covering initialization, generation, token sequences, caching, sampling, templates, language pools, conversation patterns, and seed determinism. |
| Ergonomics baseline updates | tools/ergonomics_baseline.json | Baseline entries added for pre-existing code-quality violations in CodingContentGenerator and schema modules. |
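The "per-(trace_id, hash_id) deterministic random sequences via SHA-256 re-seeding" idea from the summary above can be sketched as below. This is only an illustration of the general technique: the key layout (`base_seed:trace_id:hash_id`), the truncation to 64 bits, and the function names are assumptions, not the actual HashIdRandomGenerator implementation.

```python
import hashlib
import random


def derive_seed(base_seed: int, trace_id: str, hash_id: int) -> int:
    """Map (base_seed, trace_id, hash_id) to a stable 64-bit seed."""
    material = f"{base_seed}:{trace_id}:{hash_id}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # Keep the first 8 bytes so the seed fits in the 0..2**64-1 range.
    return int.from_bytes(digest[:8], "big")


def rng_for(base_seed: int, trace_id: str, hash_id: int) -> random.Random:
    """Return a fresh RNG whose stream depends only on the three inputs."""
    return random.Random(derive_seed(base_seed, trace_id, hash_id))


# Two independent workers asking for the same block get the same stream.
worker_a = rng_for(42, "trace-a", 7)
worker_b = rng_for(42, "trace-a", 7)
print(worker_a.randrange(1000) == worker_b.randrange(1000))
```

Because the seed depends only on the inputs and not on call order, workers can process hash IDs in any order and still generate identical content for the same block.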

🎯 2 (Simple) | ⏱️ ~12 minutes


🐰 A corpus now flows through configs and CLI with ease,
With hash-IDs seeded for reproducible peace—
Sonnet or coding, the loader will choose,
Deterministic traces, no randomness to lose!
Tests verify pools and prompts from start to end.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.84% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main change: adding CodingContentGenerator to enable pseudo-realistic coding-trace replay. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/common/hash_id_random_generator.py`:
- Line 28: Add missing type hints: annotate the __getattr__ method signature as
def __getattr__(self, name: str) -> Any (import Any from typing if not already),
and annotate the constructor as def __init__(self, ...) -> None (add -> None to
its signature). Update imports to include Any when adding the __getattr__ return
type.
- Around line 48-49: from_base_rng incorrectly treats base_rng.seed == 0 as
falsy; change the logic in from_base_rng to check explicitly for None (e.g., use
"if base_rng.seed is not None then base_seed = base_rng.seed else base_seed =
base_rng.randrange(0, 2**64)") so a legitimate seed of 0 is preserved; update
the return path that constructs the instance (reference: class method
from_base_rng and attribute base_rng.seed) to use that explicit None check
instead of "or".

In `@src/aiperf/config/flags/_converter_dataset.py`:
- Around line 82-83: The CLI `--prompt-corpus` value is being added to the local
`prompts` dict (when `cli.prompt_corpus` is set) but later `_apply_dataset_type`
removes `prompts` for file datasets, making `--prompt-corpus` a no-op; update
`_apply_dataset_type` (or the dataset conversion flow that strips `prompts`) to
either preserve `prompts["corpus"]` for supported file/trace loaders (check the
loader type/format and pass `prompts` through to the file/trace loader
constructors) or validate early and raise a clear error when a provided
`cli.prompt_corpus` is incompatible with the chosen dataset type; locate
references to `prompts`, `cli.prompt_corpus`, and `_apply_dataset_type` in
_converter_dataset.py and implement the preservation or explicit rejection
accordingly.

In `@src/aiperf/dataset/generator/prompt.py`:
- Around line 53-60: The wrap logic may return fewer than num_tokens when
num_tokens > corpus_len because it only wraps once; change the logic in the
token-selection section (around start = rng_to_use.randrange(corpus_len), end =
start + num_tokens and the tokens.extend(...) branches) to guarantee exactly
num_tokens elements by iterating or using modulo arithmetic over corpus indices
(e.g., append corpus[(start + i) % corpus_len] for i in range(num_tokens) or
loop extending slices until len(tokens) == num_tokens) so tokens always reaches
the requested length regardless of num_tokens relative to corpus_len.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 23f3a421-fa90-4ad9-aa86-e9e3f3a24bb3

📥 Commits

Reviewing files that changed from the base of the PR and between bb6f421 and efdba69.

📒 Files selected for processing (16)
  • docs/cli-options.md
  • src/aiperf/common/enums/__init__.py
  • src/aiperf/common/enums/enums.py
  • src/aiperf/common/hash_id_random_generator.py
  • src/aiperf/config/dataset/content.py
  • src/aiperf/config/flags/_converter_dataset.py
  • src/aiperf/config/flags/_section_fields.py
  • src/aiperf/config/flags/cli_config.py
  • src/aiperf/config/schema/aiperf-config.schema.json
  • src/aiperf/dataset/composer/custom.py
  • src/aiperf/dataset/generator/coding_content.py
  • src/aiperf/dataset/generator/prompt.py
  • src/aiperf/plugin/schema/plugins.schema.json
  • src/aiperf/plugin/schema/schemas.py
  • tests/unit/dataset/generator/test_coding_content_generator.py
  • tools/ergonomics_baseline.json

class _DisabledNumpyRNG:
    """Raises on any attribute access to prevent NumPy RNG usage."""

    def __getattr__(self, name):

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n src/aiperf/common/hash_id_random_generator.py

Repository: ai-dynamo/aiperf

Length of output: 3480


Add missing function type hints.

__getattr__ on line 28 is missing type annotations for the name parameter and return type. __init__ on line 51 is missing the -> None return type annotation. Per coding guidelines, all functions require complete type hints (params and return).

Proposed fix
+from typing import Never
+
 class _DisabledNumpyRNG:
     """Raises on any attribute access to prevent NumPy RNG usage."""
 
-    def __getattr__(self, name):
+    def __getattr__(self, name: str) -> Never:
         raise RuntimeError(
             "HashIdRandomGenerator does not support NumPy RNG operations. "
             "Use Python RNG methods (randrange, choice, etc.) instead."
         )
 
 class HashIdRandomGenerator(RandomGenerator):
     """RandomGenerator that re-seeds deterministically per (trace_id, hash_id).
 
     Designed for parallel processing where multiple workers need to generate
     identical content for the same hash_id within a trace file.
 
     Thread Safety:
         NOT thread-safe. Each worker process must have its own instance.
     """
 
     @classmethod
     def from_base_rng(cls, base_rng: RandomGenerator) -> "HashIdRandomGenerator":
         """Create from a base RandomGenerator (typically from rng.derive())."""
         base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
         return cls(base_seed, _internal=True)
 
-    def __init__(self, base_seed: int, *, _internal: bool = False):
+    def __init__(self, base_seed: int, *, _internal: bool = False) -> None:
         super().__init__(base_seed, _internal=_internal)
         self._numpy_rng = _DisabledNumpyRNG()
         self._trace_id: str = ""
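As a standalone illustration of the guard pattern in the diff above, the sketch below shows an object that raises on any attribute access, with full type hints. It uses `typing.NoReturn`, which also works on Python versions older than 3.11 where `typing.Never` is unavailable; the class name and message are illustrative, not the PR's exact code.

```python
from typing import NoReturn


class DisabledRNG:
    """Placeholder that fails loudly if anything tries to use it."""

    def __getattr__(self, name: str) -> NoReturn:
        # __getattr__ only runs when normal lookup fails, which is
        # every lookup here because the class defines no other attributes.
        raise RuntimeError(
            f"RNG operation {name!r} is disabled; use Python RNG methods instead."
        )


guard = DisabledRNG()
try:
    guard.integers(0, 10)
except RuntimeError as exc:
    print(exc)
```

One consequence of raising RuntimeError rather than AttributeError: `hasattr()` on such an object also raises, because `hasattr` only swallows AttributeError.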

Comment on lines +48 to +49
        base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
        return cls(base_seed, _internal=True)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
seed = 0
fallback = 123
print(seed or fallback)  # demonstrates current behavior picks fallback instead of 0
PY

Repository: ai-dynamo/aiperf

Length of output: 64


🏁 Script executed:

cat -n src/aiperf/common/hash_id_random_generator.py

Repository: ai-dynamo/aiperf

Length of output: 3480


🏁 Script executed:

rg "class RandomGenerator" -A 20 src/aiperf/

Repository: ai-dynamo/aiperf

Length of output: 1811


🏁 Script executed:

rg "self\.seed" src/aiperf/common/random_generator.py -B 2 -A 2

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

cat -n src/aiperf/common/random_generator.py | head -80

Repository: ai-dynamo/aiperf

Length of output: 3380


🏁 Script executed:

cat -n src/aiperf/common/random_generator.py | sed -n '81,150p'

Repository: ai-dynamo/aiperf

Length of output: 3100


🏁 Script executed:

cat -n src/aiperf/common/random_generator.py | sed -n '118,122p'

Repository: ai-dynamo/aiperf

Length of output: 232


Fix seed value 0 being treated as falsy in from_base_rng.

Line 48 uses base_rng.seed or base_rng.randrange(0, 2**64), which treats seed value 0 as falsy. Since 0 is documented as a valid seed (per RandomGenerator's docstring: "Optional random seed (0 to 2^64-1)"), this breaks reproducibility when seed is legitimately 0. Replace the or operator with an explicit is not None check.

Proposed fix
-        base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
+        base_seed = base_rng.seed if base_rng.seed is not None else base_rng.randrange(0, 2**64)
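The difference between the two forms is easy to reproduce in isolation. The helper names below are hypothetical and exist only to contrast the `or` fallback with the explicit `is not None` check.

```python
from typing import Optional


def pick_seed_with_or(seed: Optional[int], fallback: int) -> int:
    """Buggy variant: a seed of 0 is falsy, so the fallback wins."""
    return seed or fallback


def pick_seed_explicit(seed: Optional[int], fallback: int) -> int:
    """Fixed variant: only a missing seed (None) triggers the fallback."""
    return seed if seed is not None else fallback


print(pick_seed_with_or(0, 123))      # 123: the valid seed 0 is lost
print(pick_seed_explicit(0, 123))     # 0: the seed is preserved
print(pick_seed_explicit(None, 123))  # 123: fallback only when unset
```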
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
-        return cls(base_seed, _internal=True)
+        base_seed = base_rng.seed if base_rng.seed is not None else base_rng.randrange(0, 2**64)
+        return cls(base_seed, _internal=True)

Comment on lines +82 to +83
    if "prompt_corpus" in s and cli.prompt_corpus is not None:
        prompts["corpus"] = cli.prompt_corpus

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

--prompt-corpus is silently discarded for file-dataset paths

Line [82] stores prompt_corpus under prompts, but file datasets later remove prompts in _apply_dataset_type (Lines [330]-[338]). That makes --prompt-corpus a no-op for --input-file flows instead of applying or failing fast. Please preserve corpus for supported file/trace loaders (or explicitly reject it for unsupported file formats) rather than dropping it silently.
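One way to read the "preserve or reject" request is sketched below. Everything here is hypothetical: `finalize_dataset`, the `is_file_dataset` and `loader_supports_corpus` flags, and the dict layout are stand-ins chosen for illustration and do not reflect the actual `_apply_dataset_type` signature or config shape.

```python
from typing import Any, Optional


def finalize_dataset(
    dataset: dict[str, Any],
    *,
    is_file_dataset: bool,
    loader_supports_corpus: bool,
) -> dict[str, Any]:
    """Drop synthetic prompt settings for file datasets, but never silently
    discard an explicitly requested corpus."""
    result = dict(dataset)
    prompts = dict(result.pop("prompts", None) or {})
    if not is_file_dataset:
        if prompts:
            result["prompts"] = prompts
        return result
    corpus: Optional[str] = prompts.get("corpus")
    if corpus is None:
        return result  # nothing requested; prompts are dropped as before
    if not loader_supports_corpus:
        raise ValueError(
            "--prompt-corpus is not supported for this file dataset format"
        )
    result["prompts"] = {"corpus": corpus}  # keep only the corpus choice
    return result


config = {"type": "file", "prompts": {"corpus": "coding", "isl": 128}}
print(finalize_dataset(config, is_file_dataset=True, loader_supports_corpus=True))
```

The key design point is that the flag either takes effect or produces an error; the silent no-op is the only outcome ruled out.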


Comment on lines +53 to +60
    start = rng_to_use.randrange(corpus_len)
    end = start + num_tokens

    if end <= corpus_len:
        tokens.extend(corpus[start:end])
    else:
        tokens.extend(corpus[start:])
        tokens.extend(corpus[: end - corpus_len])

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix wrap logic to always return the requested token count.

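The current branch wraps at most once. When start + num_tokens exceeds twice the corpus length (which is always the case once num_tokens is more than double the corpus length), it returns fewer tokens than requested, which breaks the helper’s “exact length” contract.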

Suggested fix
-    start = rng_to_use.randrange(corpus_len)
-    end = start + num_tokens
-
-    if end <= corpus_len:
-        tokens.extend(corpus[start:end])
-    else:
-        tokens.extend(corpus[start:])
-        tokens.extend(corpus[: end - corpus_len])
+    if num_tokens <= 0:
+        return tokens
+    if corpus_len == 0:
+        raise ValueError("corpus must be non-empty when num_tokens > 0")
+
+    start = rng_to_use.randrange(corpus_len)
+    tokens.extend(
+        corpus[(start + i) % corpus_len]
+        for i in range(num_tokens)
+    )
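The shortfall is easy to reproduce on a tiny corpus. The sketch below re-implements both variants as standalone functions that take `start` as a parameter for determinism (the real helper draws it from an RNG); the function names are hypothetical.

```python
def sample_wrap_once(corpus: list[int], num_tokens: int, start: int) -> list[int]:
    """Original logic: one slice to the end, then one slice from the front."""
    end = start + num_tokens
    if end <= len(corpus):
        return corpus[start:end]
    return corpus[start:] + corpus[: end - len(corpus)]


def sample_modulo(corpus: list[int], num_tokens: int, start: int) -> list[int]:
    """Suggested logic: index modulo the corpus length, any number of wraps."""
    if num_tokens <= 0:
        return []
    if not corpus:
        raise ValueError("corpus must be non-empty when num_tokens > 0")
    n = len(corpus)
    return [corpus[(start + i) % n] for i in range(num_tokens)]


corpus = list(range(5))
print(len(sample_wrap_once(corpus, 12, 3)))  # falls short of 12
print(len(sample_modulo(corpus, 12, 3)))     # exactly 12
```

With a 5-token corpus, a start of 3, and 12 requested tokens, the single-wrap version yields 2 tokens from the tail plus at most 5 from the front, while the modulo version cycles as often as needed.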

@codecov

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 83.05085% with 10 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/aiperf/dataset/composer/custom.py | 50.00% | 3 Missing and 1 partial ⚠️ |
| src/aiperf/dataset/generator/prompt.py | 69.23% | 2 Missing and 2 partials ⚠️ |
| src/aiperf/common/hash_id_random_generator.py | 90.00% | 2 Missing ⚠️ |

📢 Thoughts on this report? Let us know!

Route synthetic non-trace prompt generation through CodingContentGenerator when requested so MoE-oriented coding content applies beyond trace replay.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Anthony Casagrande <[email protected]>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/dataset/composer/test_synthetic_composer.py`:
- Line 33: The test names don't follow the required
test_<function>_<scenario>_<expected> convention; rename the function
test_initialization_with_coding_corpus and the other test mentioned at line 46
to follow that pattern (for example
test_initialization_with_coding_corpus_success or
test_initialization_with_coding_corpus_validates_tokenization) so they
explicitly state the function under test, the scenario, and the expected
outcome; update any references/imports in the test module (e.g., the class
containing these test methods and usages of mock_tokenizer) to use the new
names.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dfe021d5-aa5d-473c-b2b9-5f40ebb304ad

📥 Commits

Reviewing files that changed from the base of the PR and between efdba69 and 7e8edc5.

📒 Files selected for processing (3)
  • src/aiperf/dataset/composer/base.py
  • src/aiperf/dataset/generator/coding_content.py
  • tests/unit/dataset/composer/test_synthetic_composer.py

        assert composer.include_image is False
        assert composer.include_audio is False

    def test_initialization_with_coding_corpus(self, mock_tokenizer):

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename these tests to match the required naming contract.

Line 33 and Line 46 do not follow the test_<function>_<scenario>_<expected> convention.

Proposed rename
-    def test_initialization_with_coding_corpus(self, mock_tokenizer):
+    def test_initialization_coding_corpus_uses_coding_content_generator(self, mock_tokenizer):
@@
-    def test_coding_corpus_generates_context_prompts(self, mock_tokenizer):
+    def test_create_dataset_coding_corpus_populates_context_prompts(self, mock_tokenizer):

As per coding guidelines: "Test naming convention: test_<function>_<scenario>_<expected>".

Also applies to: 46-46


@ajcasagrande ajcasagrande added the AgentX Feature for AgentX label May 22, 2026

Labels

AgentX Feature for AgentX feat
