feat(media-mix): add support for weighted multimodal request archetypes#938
matthewkotila wants to merge 18 commits
…types
AIPerf's existing multimodal benchmarking is all-or-nothing: if images are
enabled, every request gets images. This commit adds a media_mix config
that defines weighted request archetypes, so a single benchmark can model
realistic mixed-modality traffic (e.g., 60% image-and-audio, 30%
video-analysis, 10% text-only) with per-modality dimensional variation.
New src/aiperf/common/config/media_mix_config.py:
- MediaMixArchetype, ModalityEntry, and per-modality profile configs
(ImageProfileConfig, AudioProfileConfig, VideoProfileConfig)
- TextOverrideConfig for per-archetype ISL/OSL overrides; unspecified
fields fall back to the global PromptConfig
- parse_media_mix() for CLI shorthand like "image:0.6,video:0.4"
New src/aiperf/dataset/composer/media_mix_resolver.py:
- ResolvedTurn dataclass carrying per-turn generator selections
- MediaMixResolver pre-creates per-(archetype, profile) generators with
unique RNG namespaces, then on each turn samples an archetype by
weight, a profile per modality by weight, and returns the resolved
generators for SyntheticDatasetComposer to invoke
Generator changes:
- ImageGenerator/AudioGenerator/VideoGenerator gain optional
rng_namespace param to keep per-profile RNG streams independent
InputConfig:
- New media_mix field with model_validator(mode="before") that parses
the shorthand string and inflates {modality, weight} sentinels into
full archetype dicts using the sibling image/audio/video config
SyntheticDatasetComposer._create_turn dispatches to a new
_create_media_mix_turn when the resolver is present, with helper
methods for populating per-modality payloads and applying turn delay
and resolved sequence-length overrides.
Tests cover config validation, shorthand parsing, resolver sampling
distribution, profile-modality matching, per-archetype text overrides,
and the composer integration end to end.
Signed-off-by: Matthew Kotila <[email protected]>
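A minimal sketch of the two-level weighted sampling this commit message describes (archetype by weight, then one profile per modality by weight). The `Profile` and `Archetype` dataclasses and the `resolve_turn` function are hypothetical stand-ins, not the classes in this branch, and a single shared `random.Random` replaces the per-(archetype, profile) RNG namespaces for brevity.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Profile:
    # Hypothetical stand-in for a per-modality profile config.
    name: str
    weight: float


@dataclass
class Archetype:
    # Hypothetical stand-in for an archetype: modality -> weighted profiles.
    name: str
    weight: float
    modalities: dict[str, list[Profile]] = field(default_factory=dict)


def sample_weighted(rng: random.Random, items):
    """Pick one item with probability weight / sum(weights)."""
    return rng.choices(items, weights=[item.weight for item in items], k=1)[0]


def resolve_turn(rng: random.Random, archetypes: list[Archetype]):
    """Sample an archetype, then one profile for each modality it declares."""
    archetype = sample_weighted(rng, archetypes)
    profiles = {
        modality: sample_weighted(rng, options).name
        for modality, options in archetype.modalities.items()
    }
    return archetype.name, profiles
```

Because `random.choices` normalizes by the sum of the weights, raw weights such as 1 and 3 behave the same as 0.25 and 0.75.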
…lumbing
- Replace stale `_CLI_GROUP` reference (removed from InputConfig in cyclopts 3.14 fix #878/#879) with `Groups.INPUT` for the new --media-mix field. Was causing import-time NameError.
- Extract shorthand inflation helpers from input_config.py to media_mix_config.py to satisfy check-ergonomics (file size <500 lines) and check-ruff-baselined (function complexity <=10).
- Extract per-modality population + delay + sequence-length caching from _create_media_mix_turn into helper methods to satisfy check-ruff-baselined (complexity <=10).
- Add archetype_name to ResolvedTurn, Turn, and MetricRecordMetadata so per-archetype metrics can be grouped during reporting (Step 7a foundation).
All 8961 unit tests pass, all pre-commit hooks pass.
Signed-off-by: Matthew Kotila <[email protected]>
Per-archetype metrics processing uses MediaMixArchetype.name as the dict
key for grouping records. Two issues that this validator fixes:
1. Multiple unnamed archetypes (name: None) would all collide under the
same key, conflating distinct request types.
2. Two archetypes intentionally given the same name would silently merge,
producing meaningless per-archetype output.
Add an InputConfig model_validator(mode="after") that runs after media_mix
shorthand inflation:
- Auto-assigns _archetype_{i} to any archetype with name=None
- Rejects remaining duplicate names with a clear error
This guarantees every archetype has a unique non-None name by the time
the resolver and the upcoming archetype results processor see it.
Tests in tests/unit/common/config/test_media_mix_config.py.
Signed-off-by: Matthew Kotila <[email protected]>
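A minimal sketch of the naming rule described in this commit message: fill in `_archetype_{i}` for unnamed archetypes, then reject duplicates. `assign_archetype_names` is a hypothetical helper that works on a plain list of names; the validator in the branch operates on the config model instead.

```python
from collections import Counter
from typing import Optional


def assign_archetype_names(names: list[Optional[str]]) -> list[str]:
    """Fill in missing names by position, then reject any name used twice."""
    resolved = [
        name if name is not None else f"_archetype_{i}"
        for i, name in enumerate(names)
    ]
    duplicates = sorted(n for n, count in Counter(resolved).items() if count > 1)
    if duplicates:
        raise ValueError(f"Duplicate archetype names: {duplicates}")
    return resolved
```

Checking for duplicates after auto-naming also catches a user-supplied name that happens to equal a generated one.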
Two data-model additions for upcoming per-archetype metrics (media mix):
- ProfileResults: new archetype_metric_results field carrying dict[str, list[MetricResult]] alongside the existing records and timeslice_metric_results. Each key is a MediaMixArchetype.name; each value is the list-of-MetricResult shape used by the aggregate records.
- JsonExportData: new ArchetypeData class with extra=allow so dynamic metric fields can be populated at runtime via setattr (same pattern as JsonExportData itself). archetypes: list[ArchetypeData] | None added to JsonExportData. SCHEMA_VERSION bumped 1.1 -> 1.2.
Cross-referencing the full archetype config from output JSON is done by joining archetypes[i].archetype_name against input_config.input.media_mix[] (which is already serialized in the export today).
Update test_metrics_json_exporter test assertion for the new version.
Signed-off-by: Matthew Kotila <[email protected]>
…ricRecordMetadata
The base MetricResultsProcessor.get_instances_map / get_results methods previously took request_start_ns, which implicitly assumed the only grouping dimension was timeslice. Generalizing the parameter to the full MetricRecordMetadata lets subclasses extract whatever grouping key they need (timeslice index today, archetype name in an upcoming commit, others in future) without further base-class changes.
TimesliceMetricResultsProcessor pulls request_start_ns out of the metadata as before; behavior is identical.
Updates the corresponding tests to construct MetricRecordMetadata instead of bare ints when calling get_instances_map / get_results directly.
Signed-off-by: Matthew Kotila <[email protected]>
The existing RecordsManager._process_results dispatch used isinstance on the summarize() return value: list -> records, dict -> timeslice. That collides as soon as a second dict-returning processor (like the upcoming ArchetypeMetricResultsProcessor with dict[str, list]) is plugged in, since both subclasses' returns are Python dicts.
Replace with a class-attribute discriminator:
- MetricResultsProcessor.result_kind = 'records'
- TimesliceMetricResultsProcessor.result_kind = 'timeslice'
- (next commit will add 'archetype' kind)
RecordsManager wraps each summarize() call to return (kind, payload) and routes the payload by kind. Unknown kinds are logged and dropped, not silently merged into an existing bucket.
Tests in tests/unit/records/test_records_manager.py cover the kind declarations and confirm subclasses must explicitly override.
Signed-off-by: Matthew Kotila <[email protected]>
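A minimal sketch of dispatching by a class-level discriminator instead of by return type. The processor classes and the `dispatch` function below are hypothetical toy stand-ins, not the classes in this branch; the sketch collects unknown kinds in a list where the real manager is described as logging them.

```python
from typing import Any, ClassVar


class RecordsProcessor:
    # Each processor declares where its summarize() payload belongs.
    result_kind: ClassVar[str] = "records"

    def summarize(self) -> Any:
        return ["request_latency"]


class TimesliceProcessor(RecordsProcessor):
    result_kind: ClassVar[str] = "timeslice"

    def summarize(self) -> Any:
        return {0: ["request_latency"]}


class ArchetypeProcessor(RecordsProcessor):
    result_kind: ClassVar[str] = "archetype"

    def summarize(self) -> Any:
        return {"image-only": ["request_latency"]}


class UnknownProcessor(RecordsProcessor):
    result_kind: ClassVar[str] = "mystery"


def dispatch(processors):
    """Route each payload by declared kind; never guess from its Python type."""
    buckets = {"records": [], "timeslice": {}, "archetype": {}}
    dropped = []
    for processor in processors:
        kind, payload = processor.result_kind, processor.summarize()
        if kind not in buckets:
            dropped.append(kind)  # dropped, not merged into another bucket
            continue
        if kind == "records":
            buckets[kind].extend(payload)
        else:
            buckets[kind].update(payload)
    return buckets, dropped
```

Both dict-returning processors land in separate buckets even though `isinstance(payload, dict)` is true for each.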
Per-archetype metric aggregation for media mix benchmarks. Groups
incoming MetricRecordsData by metadata.archetype_name (set by the
SyntheticDatasetComposer during dataset generation) and computes
metrics independently per archetype, mirroring how the timeslice
processor groups by time-window index.
Architecture mirrors TimesliceMetricResultsProcessor:
- defaultdict[str, dict[MetricTagT, BaseMetric]] for per-archetype
metric instances (auto-allocated on first record)
- defaultdict[str, MetricResultsDict] for per-archetype results
- Overrides get_instances_map/get_results to route by archetype_name
- summarize() returns dict[str, list[MetricResult]]; the
RecordsManager dispatches it via result_kind='archetype'
Self-disables when InputConfig.media_mix is unconfigured, so users
running non-media-mix benchmarks see no behavioral change.
Registered in plugins.yaml as results_processor.archetype.
Baseline regenerated to include the BLE001 entry mirroring the
existing pattern in TimesliceMetricResultsProcessor.update_derived_metrics.
Tests in tests/unit/post_processors/test_archetype_metric_results_processor.py
cover the self-disable, the per-archetype separation, the summarize
shape, and synthetic _archetype_{i} naming.
Signed-off-by: Matthew Kotila <[email protected]>
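A minimal sketch of the grouping idea: a `defaultdict` keyed on archetype name, with metric state allocated on first use. `summarize_by_archetype`, the `(archetype_name, tag, value)` record tuples, and the averaged output are hypothetical simplifications; the processor in the branch works with `MetricRecordsData` and `BaseMetric` instances.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_archetype(records):
    """Group (archetype_name, metric_tag, value) records and average per group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for archetype_name, tag, value in records:
        if archetype_name is None:
            # Assumption: a record without an archetype is treated as an error.
            raise ValueError("record has no archetype_name")
        grouped[archetype_name][tag].append(value)
    return {
        name: [
            {"tag": tag, "avg": mean(values), "count": len(values)}
            for tag, values in sorted(metrics.items())
        ]
        for name, metrics in sorted(grouped.items())
    }
```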
Extend the RecordsManager dispatch loop to route 'archetype' kind payloads into a new archetype_metric_results bucket, then pass it into ProfileResults at the end of the run. Falls back to None (rather than an empty dict) when no archetype processor ran, keeping the JSON output free of an empty 'archetypes' section for non-media-mix benchmarks.
The dispatch loop got extracted into _dispatch_processor_outcomes to keep _process_results within the branch budget enforced by check-ruff-baselined.
Signed-off-by: Matthew Kotila <[email protected]>
When ProfileResults.archetype_metric_results is populated (media mix mode), MetricsJsonExporter emits an additional 'archetypes' array alongside the existing top-level aggregate metrics. Each entry carries the archetype's identity (archetype_name + archetype_weight) plus the same dynamic metric fields the top level uses, via extra=allow.
Non-media-mix benchmarks see no change: getattr+exclude_none keep the array out of the output entirely.
Cross-referencing the full archetype config (profiles, dimensions, formats) is done by joining archetypes[i].archetype_name against input_config.input.media_mix[] which is already in the export.
Tests added: archetypes array populated correctly, and archetypes field is absent when media mix is unconfigured.
Signed-off-by: Matthew Kotila <[email protected]>
Tidy/long-format CSV export of per-archetype metrics for media mix benchmarks. Schema: Archetype,Metric,Unit,Stat,Value, one row per (archetype, metric, stat) tuple. Optimal input format for downstream pandas/Tableau/ggplot analysis.
Mirrors TimesliceMetricsCsvExporter's shape and conventions exactly. Self-disables when ProfileResults.archetype_metric_results is None, so non-media-mix benchmarks don't get an empty CSV file.
New default file path: profile_export_aiperf_archetypes.csv, matching the _timeslices.csv naming convention. Registered in plugins.yaml as data_exporter.archetype_csv. The existing --profile-export-prefix suffix-stripping list now recognizes _archetypes.csv so custom prefixes work cleanly.
Signed-off-by: Matthew Kotila <[email protected]>
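A minimal sketch of the tidy layout named above, one row per (archetype, metric, stat). The input shape (a dict of archetype name to metric dicts with `tag`, `unit`, and `stats` keys) and the helper names are assumptions made for illustration; only the header row comes from the commit message.

```python
import csv
import io


def archetype_rows(results):
    """Yield one row per (archetype, metric, stat), archetypes in sorted order."""
    for archetype in sorted(results):
        for metric in results[archetype]:
            for stat, value in metric["stats"].items():
                yield [archetype, metric["tag"], metric["unit"], stat, value]


def to_csv(results) -> str:
    """Render the rows under the Archetype,Metric,Unit,Stat,Value header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Archetype", "Metric", "Unit", "Stat", "Value"])
    writer.writerows(archetype_rows(results))
    return buffer.getvalue()
```

A long-format file like this can be filtered or pivoted downstream without knowing in advance which stats each metric carries.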
Renders one Rich table per archetype for media mix benchmarks. Sits alongside (not replacing) the existing ConsoleMetricsExporter which still renders the across-archetype aggregate table. Each archetype's table uses the same column set, metric ordering, and formatting as the aggregate so users learn one table layout and see it N+1 times.
Table title carries the archetype name and its configured traffic share, e.g.:
NVIDIA AIPerf | LLM Metrics: image-only (40% of traffic)
Inherits from ConsoleMetricsExporter to reuse the table-building, flag-filtering, sorting, and row-formatting logic. The export() method drives the per-archetype loop directly.
Self-disables when ProfileResults.archetype_metric_results is missing so users running non-media-mix benchmarks see no behavioral change.
Registered in plugins.yaml as console_exporter.archetype_metrics.
Signed-off-by: Matthew Kotila <[email protected]>
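A minimal sketch of building the title suffix from raw weights, assuming the share is each weight divided by the sum of all weights. `traffic_share_titles` is a hypothetical helper, not the exporter's method.

```python
def traffic_share_titles(weights: dict[str, float]) -> dict[str, str]:
    """Map archetype name -> table title with its normalized traffic share."""
    total = sum(weights.values())
    return {
        name: f"NVIDIA AIPerf | LLM Metrics: {name} ({weight / total:.0%} of traffic)"
        for name, weight in sorted(weights.items())
    }
```

Normalizing at display time keeps raw weights such as 3 and 7 untouched in the config while the titles read 30% and 70%.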
- New docs/tutorials/media-mix.md walks through YAML config, weighted archetypes, profile/batch_size distributions, per-archetype text overrides, archetype naming rules, and how to read the per-archetype output in console/JSON/CSV. Linked from README's Endpoint Types tutorial index.
- docs/reference/json-export-schema.md documents schema 1.2 (added archetypes array) and shows the join pattern for cross-referencing per-archetype metric blocks against input_config.input.media_mix.
Signed-off-by: Matthew Kotila <[email protected]>
…inery
The --media-mix CLI flag was non-functional: Cyclopts treats the string value as a list element to be coerced to MediaMixArchetype, never invoking parse_media_mix(). Per team direction, the shorthand was a 'future enhancement' in the original plan; a broken flag is worse than no flag.
Remove:
- CLIParameter on the media_mix field
- parse_media_mix(), normalize_media_mix_input(), _as_dict()
- _build_image/audio/video_modality_entry() and _MODALITY_BUILDERS
- inflate_shorthand_archetypes(), _inflate_shorthand_entry(), is_shorthand_list()
- VALID_MODALITIES constant
- InputConfig.inflate_media_mix_shorthand model_validator
- TestParseMediaMix class
- test_media_mix_shorthand_inflation in the composer tests
- 'CLI shorthand: ...' wording from the field description
The media_mix field stays; YAML continues to populate it directly via Pydantic. The name-uniqueness validator and per-archetype text override logic are untouched.
Follow-up needed: aiperf profile doesn't currently expose --user-config-file, so this commit leaves YAML configs reachable only via the verbose 'aiperf service --type system_controller' path. Adding --user-config-file to profile is the next commit.
Signed-off-by: Matthew Kotila <[email protected]>
Try out this PR
Quick install:
pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b6d2a8deea9a04afe77d234be3ea01d941890bca
Recommended with virtual environment (using uv):
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@b6d2a8deea9a04afe77d234be3ea01d941890bca
    CLI never touched stay as the YAML loaded them — most importantly the
    `media_mix` array, since no CLI flag targets it.
    """
    _overlay(base, cli)
The CLI/YAML merge mutates an already-validated UserConfig and returns it without re-running model validators, so individually valid file and CLI options can combine into an invalid final config. Fix: Rebuild and validate a fresh UserConfig from the merged data before running the controller, or merge raw dictionaries before constructing UserConfig.
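A minimal sketch of the second option in this comment: merge raw dictionaries first, then validate the merged result once. `deep_merge`, `validate`, and `load_config` are hypothetical helpers, and `validate` is a plain-Python stand-in for constructing `UserConfig`; the `endpoint.url` key is illustrative, and `cli_data` is assumed to hold only the fields the user explicitly set.

```python
from typing import Any


def deep_merge(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where overlay values win; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def validate(config: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for model validation: cross-field rules see the merged data."""
    input_section = config.get("input", {})
    if input_section.get("media_mix") and input_section.get("custom_dataset_type"):
        raise ValueError("media_mix cannot be combined with custom_dataset_type")
    return config


def load_config(file_data: dict[str, Any], cli_data: dict[str, Any]) -> dict[str, Any]:
    """Merge first, validate second, so the final config is what gets checked."""
    return validate(deep_merge(file_data, cli_data))
```

With this ordering, a file and a set of CLI options that are each valid alone still fail at load time if their combination is not.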
        raise ValueError(
            "The --public-dataset and --custom-dataset-type options cannot be set together"
        )
    if self.media_mix and (
media_mix still allows input.file, so DatasetManager will use the custom composer and produce records without archetype_name, causing ArchetypeMetricResultsProcessor to fail at runtime. Fix: Treat self.file as a non-synthetic dataset source and reject it whenever media_mix is configured.
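A minimal sketch of the suggested guard, treating `file` as one more non-synthetic source. `check_media_mix_sources` is a hypothetical free function written for illustration; in the branch this logic would live inside the existing config validator.

```python
def check_media_mix_sources(
    media_mix, file=None, public_dataset=None, custom_dataset_type=None
):
    """Reject media_mix when any non-synthetic dataset source is also set."""
    if not media_mix:
        return
    sources = {
        "file": file,
        "public_dataset": public_dataset,
        "custom_dataset_type": custom_dataset_type,
    }
    conflicts = [name for name, value in sources.items() if value is not None]
    if conflicts:
        raise ValueError(
            f"media_mix requires synthetic data; conflicting options: {conflicts}"
        )
```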
        if resolved.input_tokens_mean is not None
        else self.config.input.prompt.input_tokens.mean
    )
    self._turn_sequence_cache[id(turn)] = (isl, resolved.output_tokens_mean)
Per-archetype output_tokens overrides are cached here, but _set_max_tokens only reads that cache when sequence_distribution is active, so normal media_mix runs ignore the documented OSL override. Fix: Have _set_max_tokens consult the per-turn cache before falling back to global output tokens, or set turn.max_tokens directly from the resolved override.
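A minimal sketch of the first suggested fix: consult the per-turn cache before falling back to the global output-token mean. `resolve_max_tokens` is a hypothetical helper; the cache shape `{turn_id: (isl, osl_or_None)}` follows the tuple stored in the diff line above.

```python
from typing import Optional


def resolve_max_tokens(
    turn_id: int,
    cache: dict[int, tuple[int, Optional[int]]],
    global_osl_mean: Optional[int],
) -> Optional[int]:
    """Prefer the cached per-archetype OSL; otherwise use the global value."""
    cached = cache.get(turn_id)
    if cached is not None and cached[1] is not None:
        return cached[1]
    return global_osl_mean
```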
    modality: Literal["image", "audio", "video"] = Field(
        description="Media type: image, audio, or video.",
    )
    batch_size: int = Field(
batch_size is declared as a fixed int with ge=1, so the documented distribution form and min: 0 optional-media case fail validation. Fix: Add a batch-size distribution config, sample it in MediaMixResolver, and allow zero when configured.
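A minimal sketch of one way to accept both a fixed batch size and a sampled one. The uniform integer range, the `BatchSizeRange` dataclass, and `sample_batch_size` are assumptions for illustration; the documented distribution form may differ.

```python
import random
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class BatchSizeRange:
    # Hypothetical distribution config: uniform over [min, max], min may be 0.
    min: int
    max: int

    def __post_init__(self):
        if self.min < 0 or self.max < self.min:
            raise ValueError("require 0 <= min <= max")


def sample_batch_size(rng: random.Random, spec: Union[int, BatchSizeRange]) -> int:
    """Return the fixed size, or draw uniformly from the configured range."""
    if isinstance(spec, int):
        if spec < 1:
            raise ValueError("fixed batch_size must be >= 1")
        return spec
    return rng.randint(spec.min, spec.max)
```

A draw of 0 would model the optional-media case, where a request from the archetype carries no items for that modality.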
Media Mix — weighted multimodal request archetypes
Summary
This branch adds media mix: a way to specify weighted request archetypes (image-only, audio-only, text-only, multi-modal, …) inside a single AIPerf run, with per-archetype metrics broken out alongside the aggregate. Each archetype defines which modalities appear, with what dimensional profiles, and (optionally) with overridden ISL/OSL. Output flows through the existing exporters with a new per-archetype section in the JSON and a per-archetype table in the console.
It is the v1 cut described in the AIP-814 design doc — synthetic-only, no image-pool / reuse-rate, no per-archetype × per-timeslice cross-product.
Why
Today's synthetic generator produces one kind of request per run, so realistic multimodal workloads ("70% text chats + 20% small images + 10% large images") need multiple runs and manual stitching. The per-modality latency story is also lost in the aggregate. Media mix gives users one config that:
Configuration shape
Each archetype carries an optional `text:` block: `false` to disable text, a `TextOverrideConfig` to override ISL/OSL just for this archetype, or `None`/`True` to inherit the global prompt config. `modalities` is a list of weighted profiles per modality; `batch_size` controls items-per-request.
Architecture
flowchart LR
    A[YAML media_mix] --> B[InputConfig.validate_*]
    B --> C[SyntheticDatasetComposer]
    C --> D[MediaMixResolver.resolve_turn]
    D --> E[Turn with archetype_name]
    E --> F[Worker → Records]
    F --> G[MetricRecordMetadata.archetype_name]
    G --> H[MetricResultsProcessor]
    G --> I[ArchetypeMetricResultsProcessor]
    H --> J[ProfileResults.records]
    I --> K[ProfileResults.archetype_metric_results]
    J --> L[Aggregate console / JSON / CSV]
    K --> M[Per-archetype console / JSON / CSV]
Three planes
- `MediaMixResolver` samples an archetype by weight, then samples one profile per modality entry. Per-archetype text overrides are surfaced as a `ResolvedTurn` so the existing prompt generator path picks them up.
- `MetricRecordMetadata` gains an `archetype_name` field. The single field is the only cross-plane plumbing; everything else falls out of it.
- `ArchetypeMetricResultsProcessor` mirrors `TimesliceMetricResultsProcessor`: same template, same goodput handling, just keyed on archetype name instead of timeslice index.
Dispatch by `result_kind`
The records manager previously dispatched processor outputs by Python type: `list` → records, `dict` → timeslice. That collides the moment a second `dict`-returning processor exists (archetype's `dict[str, ...]` vs timeslice's `dict[int, ...]`). Refactored to a `result_kind: ClassVar[str]` discriminator on the processor base class. Each subclass declares its kind (`"records"`, `"timeslice"`, `"archetype"`); the manager routes by string match. Future processors slot in by setting one field.
Schema 1.2
`profile_export_aiperf.json` bumps to `version: 1.2`, additive only. The new `archetypes` array appears alongside the existing top-level metric block; consumers that ignore unknown fields are unaffected. `ArchetypeData` uses `extra="allow"` to carry dynamic per-metric fields the same way `JsonExportData` does.
CLI surface
Media mix is YAML-only. There is no `--media-mix` flag; the nested shape (archetypes → modalities → profiles, with optional text overrides) does not fit a flat CLI, and an `image:0.6,audio:0.4`-style shorthand would re-invent half the schema without expressing per-archetype overrides at all.
The entry point is `aiperf profile --user-config-file <path>`, which loads the full `UserConfig` from YAML or JSON. Individual CLI flags can be combined with the file:
- CLI flags override global fields (`--url`, `--isl`, `--concurrency`, etc.).
- Per-archetype fields keep their YAML values: `media_mix[i].text.input_tokens.mean` is a strictly finer scope than `--isl`, and no CLI flag targets it. This is implemented by a recursive `model_fields_set` walk: only fields the user explicitly typed propagate from the CLI-built `UserConfig` onto the YAML-built one.
Output
Aggregate output is unchanged: the existing console table, CSV, and JSON top-level metric block all render byte-for-byte identical when `media_mix` is absent or contains a single archetype.
When media mix is configured:
- Console: each per-archetype table carries a `({normalized_share}% of traffic)` title suffix.
- JSON: an `archetypes` array sorted by name, each entry carrying `archetype_name`, `archetype_weight` (raw, as configured), and the same dynamic per-metric fields the top-level block uses.
All three exporters self-disable when `archetype_metric_results` is missing, so non-media-mix runs see zero behavioral change.
Validation
- `weight: gt=0` on every weighted field (archetype, profile).
- `min_length=1` on `profiles`.
- Profile type must match `modality` (`image` ↔ `ImageProfileConfig`, etc.).
- `text=False` plus empty `modalities` is rejected.
- Unnamed archetypes are auto-named `_archetype_0`, `_archetype_1`, …; duplicate names are rejected (they would silently merge metric buckets).
- `media_mix` combined with `--public-dataset` or `--custom-dataset-type` is rejected at config load. Without this guard, custom-loaded conversations (no `archetype_name`) reach the archetype processor and the benchmark hangs forever at PROFILING.
Test coverage
- `tests/unit/common/config/test_media_mix_config.py`: config validation (profile/modality/archetype shape, weights, name uniqueness).
- `tests/unit/dataset/composer/test_media_mix.py`: resolver weighting, batch-size preservation, text-override propagation, integration through `SyntheticDatasetComposer`.
- `tests/unit/post_processors/test_archetype_metric_results_processor.py`: per-archetype grouping, goodput, error pass-through.
- `tests/unit/records/test_records_manager.py::TestRecordsManagerDispatchByResultKind`: guard against future processors collapsing into the wrong bucket.
- `tests/unit/exporters/*_archetype_*`: sort order, weight normalization, self-disable on absent results.
Full suite: 9832 unit tests passing locally (one pre-existing OTel-fanout pickling failure unrelated to this branch).
Tutorial
`docs/tutorials/media-mix.md` walks through the YAML shape, the precedence model (CLI vs YAML for global vs per-archetype fields), and how to read the per-archetype output.
Notes / learnings along the way
A few things I would have planned differently from the start, captured here so the next iteration doesn't redo the same work:
- The `--media-mix` shorthand was wishful. The original plan included an `image:0.6,audio:0.4` CLI shorthand "for quick experimentation." In practice Cyclopts destructures `--media-mix "image:0.6"` into `media_mix[0] = "image:0.6"` and Pydantic rejects the string per-element before any model-validator could parse it. The shorthand parser was wired into a model_validator that never fires. Removed entirely (~200 lines of dead code); there's no version of the shorthand that expresses per-archetype text overrides, so it would have been a permanent half-feature even if Cyclopts cooperated.
- `--user-config-file` was missing on `aiperf profile`. The plan assumed YAML was the entry point, but no one wired up the flag to load it. `aiperf service --user-config-file` existed; `aiperf profile` did not. The tutorial pointed at a non-existent flag. Fixed by mirroring the `service` pattern plus a `model_fields_set`-based merge so CLI flags can override global YAML fields without trampling per-archetype overrides.
- "CLI overrides config" needs a scope qualifier. The conventional rule is "CLI wins, config loses." For media mix that framing is wrong: `media_mix[i].text.input_tokens.mean: 2000` is a finer scope than `--isl 100`, not a different source for the same value. The user wrote the per-archetype override because they wanted to deviate from the global. The correct rule is "more specific scope wins, regardless of source", and the implementation gets it for free because no CLI flag targets `media_mix[]` at all.
- Type-based dispatch doesn't scale to two `dict`-returning processors. The records manager originally routed `summarize()` results by Python type (`list` → records, `dict` → timeslice). Adding archetype results (also a `dict`) silently collided with timeslice output. Refactored to a `result_kind: ClassVar[str]` discriminator on the processor base class. Cheap to add, future-proof.
- Weights need normalization at the display layer, not the config layer. Users naturally write `weight: 3` / `weight: 7` and expect "30% / 70%". The resolver was doing this correctly (`_sample_weighted` divides by `sum(weights)`), but the console exporter was displaying `weight * 100` and printing "300% / 700% of traffic." The resolver doesn't need to normalize the weights (that would just hide the user's input), but anything that displays a percentage does.
- Insertion-order iteration is non-deterministic across runs. The JSON exporter iterated `archetype_results.items()`, which is the order the first record per archetype arrived at the processor: different across runs, and different from the CSV exporter's `sorted(keys())`. Same record set, different output ordering. Fixed by sorting in both places.
- Validation that prevents a hang is worth more than its line count. `media_mix` plus `custom_dataset_type` was silently accepted at config load. Custom-loaded turns have no `archetype_name`, so the first record reaching the archetype processor raised `ValueError` from a ZMQ pull-client task. The exception was logged as "Task exception was never retrieved" but never propagated. The benchmark logged PROFILING forever and required SIGKILL. Eight lines in `validate_dataset_type` turn that into a clean config-load error.
What's deliberately out of scope