feat(dataset): add CodingContentGenerator for pseudo-realistic coding-trace replay #968
base: main
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,77 @@ | ||||||||||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||||||||||
| # SPDX-License-Identifier: Apache-2.0 | ||||||||||
|
|
||||||||||
| """Hash-ID-based random generator for parallel processing with reproducibility. | ||||||||||
|
|
||||||||||
| Enables parallel processing of traces with hash_ids while maintaining | ||||||||||
| reproducibility. Each (trace_id, hash_id) pair produces a deterministic random | ||||||||||
| sequence regardless of worker count or processing order. | ||||||||||
|
|
||||||||||
| Architecture: | ||||||||||
| Global Seed -> Base RNG -> (trace_id, hash_id) -> Deterministic tokens | ||||||||||
|
|
||||||||||
| The trace_id (typically a content hash of the trace file) ensures that different | ||||||||||
| trace files with overlapping hash_id values produce different content, while the | ||||||||||
| same trace file always produces identical results. | ||||||||||
| """ | ||||||||||
|
|
||||||||||
| import hashlib | ||||||||||
|
|
||||||||||
| from aiperf.common.random_generator import RandomGenerator | ||||||||||
|
|
||||||||||
| __all__ = ["HashIdRandomGenerator"] | ||||||||||
|
|
||||||||||
|
|
||||||||||
| class _DisabledNumpyRNG: | ||||||||||
| """Raises on any attribute access to prevent NumPy RNG usage.""" | ||||||||||
|
|
||||||||||
| def __getattr__(self, name): | ||||||||||
| raise RuntimeError( | ||||||||||
| "HashIdRandomGenerator does not support NumPy RNG operations. " | ||||||||||
| "Use Python RNG methods (randrange, choice, etc.) instead." | ||||||||||
| ) | ||||||||||
|
|
||||||||||
|
|
||||||||||
| class HashIdRandomGenerator(RandomGenerator): | ||||||||||
| """RandomGenerator that re-seeds deterministically per (trace_id, hash_id). | ||||||||||
|
|
||||||||||
| Designed for parallel processing where multiple workers need to generate | ||||||||||
| identical content for the same hash_id within a trace file. | ||||||||||
|
|
||||||||||
| Thread Safety: | ||||||||||
| NOT thread-safe. Each worker process must have its own instance. | ||||||||||
| """ | ||||||||||
|
|
||||||||||
| @classmethod | ||||||||||
| def from_base_rng(cls, base_rng: RandomGenerator) -> "HashIdRandomGenerator": | ||||||||||
| """Create from a base RandomGenerator (typically from rng.derive()).""" | ||||||||||
| base_seed = base_rng.seed or base_rng.randrange(0, 2**64) | ||||||||||
| return cls(base_seed, _internal=True) | ||||||||||
|
Comment on lines +48 to +49

🧩 Analysis chain

🏁 Script executed:
#!/bin/bash
python - <<'PY'
seed = 0
fallback = 123
print(seed or fallback)  # demonstrates current behavior picks fallback instead of 0
PY
Repository: ai-dynamo/aiperf
Length of output: 64

🏁 Script executed:
cat -n src/aiperf/common/hash_id_random_generator.py
Repository: ai-dynamo/aiperf
Length of output: 3480

🏁 Script executed:
rg "class RandomGenerator" -A 20 src/aiperf/
Repository: ai-dynamo/aiperf
Length of output: 1811

🏁 Script executed:
rg "self\.seed" src/aiperf/common/random_generator.py -B 2 -A 2
Repository: ai-dynamo/aiperf
Length of output: 42

🏁 Script executed:
cat -n src/aiperf/common/random_generator.py | head -80
Repository: ai-dynamo/aiperf
Length of output: 3380

🏁 Script executed:
cat -n src/aiperf/common/random_generator.py | sed -n '81,150p'
Repository: ai-dynamo/aiperf
Length of output: 3100

🏁 Script executed:
cat -n src/aiperf/common/random_generator.py | sed -n '118,122p'
Repository: ai-dynamo/aiperf
Length of output: 232

Fix seed value 0 being treated as falsy in `from_base_rng`.

Line 48 uses `or` to choose the fallback, so a seed of 0 is treated as falsy and replaced by a random value.

Proposed fix:
- base_seed = base_rng.seed or base_rng.randrange(0, 2**64)
+ base_seed = base_rng.seed if base_rng.seed is not None else base_rng.randrange(0, 2**64)
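The falsy-seed pitfall flagged in this comment can be reproduced in isolation. The sketch below is illustrative only: `pick_seed_buggy` and `pick_seed_fixed` are hypothetical helpers that use the standard-library `random.Random`, not the project's `RandomGenerator`.

```python
import random
from typing import Optional


def pick_seed_buggy(seed: Optional[int], rng: random.Random) -> int:
    # `or` treats 0 as falsy, so an explicit seed of 0 is silently
    # replaced by a value drawn from the fallback RNG.
    return seed or rng.randrange(0, 2**64)


def pick_seed_fixed(seed: Optional[int], rng: random.Random) -> int:
    # Fall back only when no seed was supplied at all.
    return seed if seed is not None else rng.randrange(0, 2**64)


buggy = pick_seed_buggy(0, random.Random(123))  # seed 0 is lost
fixed = pick_seed_fixed(0, random.Random(123))  # seed 0 is preserved
```

With the explicit `is not None` check, a user who asks for seed 0 gets seed 0, so the run stays reproducible.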
||||||||||
|
|
||||||||||
| def __init__(self, base_seed: int, *, _internal: bool = False): | ||||||||||
| super().__init__(base_seed, _internal=_internal) | ||||||||||
| self._numpy_rng = _DisabledNumpyRNG() | ||||||||||
| self._trace_id: str = "" | ||||||||||
|
|
||||||||||
| def set_trace_id(self, trace_id: str) -> None: | ||||||||||
| """Set trace identifier to scope hash_ids to a specific trace file. | ||||||||||
|
|
||||||||||
| Args: | ||||||||||
| trace_id: Content hash or unique identifier for the trace file. | ||||||||||
| Different trace files must use different trace_ids. | ||||||||||
| """ | ||||||||||
| self._trace_id = trace_id | ||||||||||
|
|
||||||||||
| def reseed_for_hash_id(self, hash_id: int) -> None: | ||||||||||
| """Re-seed RNG deterministically for a specific hash_id. | ||||||||||
|
|
||||||||||
| After calling, all random operations use the derived seed until | ||||||||||
| the next reseed_for_hash_id call. | ||||||||||
|
|
||||||||||
| Args: | ||||||||||
| hash_id: KV block hash ID from trace data. | ||||||||||
| """ | ||||||||||
| seed_bytes = hashlib.sha256( | ||||||||||
| f"{self.seed}:{self._trace_id}:{hash_id}".encode() | ||||||||||
| ).digest() | ||||||||||
| self._python_rng.seed(int.from_bytes(seed_bytes[:8], "big")) | ||||||||||
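As a rough illustration of why per-`(trace_id, hash_id)` reseeding makes output independent of worker count and processing order, here is a standalone sketch. It mirrors the SHA-256 derivation in the diff above but uses plain `random.Random`; the helper names `derive_seed` and `tokens_for`, the token range, and the trace IDs are assumptions made for the example, not part of the PR.

```python
import hashlib
import random


def derive_seed(base_seed: int, trace_id: str, hash_id: int) -> int:
    """Hash (base_seed, trace_id, hash_id) into a 64-bit integer seed."""
    digest = hashlib.sha256(f"{base_seed}:{trace_id}:{hash_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def tokens_for(base_seed: int, trace_id: str, hash_id: int, n: int = 4) -> list[int]:
    """Generate n pseudo-random token IDs that depend only on the inputs."""
    rng = random.Random(derive_seed(base_seed, trace_id, hash_id))
    return [rng.randrange(0, 50_000) for _ in range(n)]


# Visiting hash_ids in a different order yields the same content per hash_id.
forward = {h: tokens_for(42, "trace-a", h) for h in (1, 2, 3)}
backward = {h: tokens_for(42, "trace-a", h) for h in (3, 2, 1)}

# A different trace_id changes the derived seed for the same hash_id.
seed_a = derive_seed(42, "trace-a", 1)
seed_b = derive_seed(42, "trace-b", 1)
```

Because each block's seed is a pure function of its inputs, two workers that never communicate still produce identical content for the same block.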
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -79,6 +79,8 @@ def _build_prompts(cli: CLIConfig) -> dict[str, Any]: | |
| prompts["block_size"] = cli.prompt_input_tokens_block_size | ||
| if "prompt_batch_size" in s: | ||
| prompts["batch_size"] = cli.prompt_batch_size | ||
| if "prompt_corpus" in s and cli.prompt_corpus is not None: | ||
| prompts["corpus"] = cli.prompt_corpus | ||
|
Comment on lines +82 to +83
Line [82] stores
||
| return prompts | ||
|
|
||
|
|
||
|
|
||
🧩 Analysis chain
🏁 Script executed:
Repository: ai-dynamo/aiperf
Length of output: 3480
Add missing function type hints.
`__getattr__` on line 28 is missing type annotations for the `name` parameter and return type. `__init__` on line 51 is missing the `-> None` return type annotation. Per coding guidelines, all functions require complete type hints (params and return).
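One plausible shape for the requested annotations is sketched below. The bot's actual proposed fix is not shown on this page, so this is an assumption: `DisabledRNG` and `Generator` are hypothetical stand-ins for `_DisabledNumpyRNG` and `HashIdRandomGenerator`, and the use of `typing.NoReturn` for a method that always raises is a choice made for this example.

```python
from typing import NoReturn


class DisabledRNG:
    """Hypothetical stand-in for the PR's _DisabledNumpyRNG."""

    def __getattr__(self, name: str) -> NoReturn:
        # Called only for attributes that are not found normally; always raises.
        raise RuntimeError(f"NumPy RNG operations are disabled (attempted: {name!r}).")


class Generator:
    """Hypothetical stand-in; the real class extends RandomGenerator."""

    def __init__(self, base_seed: int) -> None:
        self.seed = base_seed
        self._numpy_rng = DisabledRNG()
```

`NoReturn` tells type checkers that the call never completes normally; annotating the return as `None` would also satisfy a rule that only requires complete hints.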