
feat(mock-server): add ResponsesRequest model with full dispatch plumbing#1000

Open: FrankD412 wants to merge 2 commits into main from fdinatale/mock-server-request-recorder

Conversation

@FrankD412 FrankD412 commented May 27, 2026

Summary

Follow-up to #962. Introduces ResponsesRequest as a first-class member of RequestT so the mock-server request recorder can capture Responses-specific fields (max_output_tokens, reasoning_effort, stream) instead of the synthetic ChatCompletionRequest the /v1/responses handler builds for the latency simulator.

  • New ResponsesRequest model with a prompt_text property that flattens the Responses input shape (str | list[str|dict] | list[content-block]) into a single string. Flattener logic moved verbatim from app._extract_responses_prompt into models._flatten_responses_input so recorder, tokenizer, and handler share one source of truth.
  • Dispatch wired through models.RequestT, tokens._extract_request_content, tokens._extract_osl_fingerprint, utils._create_request_id (resp-{uuid} prefix), request_recorder._encode_request_prompt_ids, and the app.responses handler signature (req: dict -> req: ResponsesRequest).
  • JSONL schema decision: Responses' max_output_tokens is canonicalized into the existing max_completion_tokens column rather than introducing a new field. Both name the same semantic (the OSL cap); preserving the JSONL schema is more useful for downstream tools than preserving the API name-space, and the endpoint column on each row already disambiguates.
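
The canonicalization in the last bullet could look roughly like the sketch below. It is not the PR's code: `to_jsonl_row` and the dict-based payload are hypothetical stand-ins, and only the names `max_output_tokens`, `max_completion_tokens`, and `endpoint` come from the description above.

```python
import json
from typing import Any


def to_jsonl_row(endpoint: str, payload: dict[str, Any]) -> str:
    """Serialize one recorded request as a JSONL row (hypothetical helper).

    Assumption: the recorder keeps a single OSL-cap column named
    ``max_completion_tokens`` plus an ``endpoint`` column on every row.
    """
    if endpoint == "/v1/responses":
        # Responses names the cap ``max_output_tokens``; fold it into the
        # existing column so downstream tools keep reading one schema.
        cap = payload.get("max_output_tokens")
    else:
        cap = payload.get("max_completion_tokens")
    return json.dumps({"endpoint": endpoint, "max_completion_tokens": cap})


chat_row = to_jsonl_row("/v1/chat/completions", {"max_completion_tokens": 128})
resp_row = to_jsonl_row("/v1/responses", {"max_output_tokens": 128})
```

Both rows carry the cap in the same column, and a consumer that needs to know which API produced a row can still read `endpoint`.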

A subsequent commit will wire make_ctx to accept a record-time override so handlers can pass the real payload to the recorder while still driving simulation off the synthetic chat.

Test Plan

  • uv run pytest tests/unit/ -n auto (12881 passed; one unrelated MLflow flake passes in isolation)
  • Unit coverage for prompt_text flattening across all four input shapes
  • Extras (tools, instructions) pass through via BaseModel extra="allow"
  • _extract_request_content / _extract_osl_fingerprint dispatch for Responses
  • _create_request_id prefix (resp-)
  • _encode_request_prompt_ids for both string and content-block input
  • tokenize_request handles Responses on the generation path

Reported by reviewer dynamo-ops.

Summary by CodeRabbit

  • New Features

    • Extended request handling support to a new request type with automatic input normalization across multiple input formats and enhanced token processing capabilities
  • Tests

    • Comprehensive test coverage added across multiple modules to validate the new request type functionality, including request identification generation, token extraction and processing, content normalization from diverse input shapes, field preservation, and proper handling of default values and edge cases

…bing

Introduce `ResponsesRequest` as a first-class member of `RequestT` so the
recorder can capture Responses-specific fields (`max_output_tokens`,
`reasoning_effort`, `stream`) instead of the synthetic
`ChatCompletionRequest` the `/v1/responses` handler currently builds for
the latency simulator. Subsequent commit will wire `make_ctx` to accept
a record-time override so handlers can pass the real payload to the
recorder while still driving simulation off the synthetic chat.

The model exposes a `prompt_text` property that flattens the Responses
`input` shape (str | list[str|dict] | list[content-block]) into a single
string. The flattener logic moved verbatim from `app._extract_responses_prompt`
into `models._flatten_responses_input` so the recorder and tokenizer
dispatch sites share one source of truth; the handler call site now uses
`req.prompt_text`.

JSONL schema decision: Responses' `max_output_tokens` is canonicalized
into the existing `max_completion_tokens` column rather than introducing
a new field. Both name the same semantic (the OSL cap); preserving the
JSONL schema is more useful for downstream tools than preserving the
API name-space, and the `endpoint` column on each row already disambiguates.

Dispatch wired in:
- `models.RequestT` union
- `tokens._extract_request_content` (text + cap)
- `tokens._extract_osl_fingerprint` (canonicalized fields)
- `utils._create_request_id` (`resp-{uuid}` prefix)
- `request_recorder._encode_request_prompt_ids` (tokenize via prompt_text)
- `app.responses` handler signature (`req: dict` -> `req: ResponsesRequest`)

Tests cover:
- prompt_text flattening across all four input shapes
- extras (`tools`, `instructions`) pass through via BaseModel extra="allow"
- _extract_request_content and _extract_osl_fingerprint dispatch
- _create_request_id prefix
- _encode_request_prompt_ids for string and content-block input
- tokenize_request handles Responses on the generation path

Reported by reviewer dynamo-ops.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Frank Di Natale <[email protected]>

copy-pr-bot Bot commented May 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions Bot commented May 27, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3d656d2622e35b6a9eaaa9635be090366bf6fbba

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3d656d2622e35b6a9eaaa9635be090366bf6fbba

Last updated for commit: 3d656d2

@github-actions github-actions Bot added the feat label May 27, 2026

coderabbitai Bot commented May 27, 2026

Walkthrough

This PR adds ResponsesRequest Pydantic model support for OpenAI's /v1/responses API. The model flattens heterogeneous input shapes into a single prompt_text property, is registered in the RequestT union, and the request processing pipeline (endpoint handler, tokenization, token metrics, request ID generation) now dispatches on this new type with comprehensive test coverage.

Changes

ResponsesRequest Support

| Layer / File(s) | Summary |
|---|---|
| **ResponsesRequest model definition with input flattening**<br>`tests/aiperf_mock_server/models.py` | `ResponsesRequest` Pydantic model captures response-api fields and provides a `prompt_text` property that normalizes string/list/content-block inputs via the `_flatten_responses_input` helper. Model is registered in the `RequestT` union. |
| **Request type registration and imports**<br>`tests/aiperf_mock_server/app.py`, `tests/aiperf_mock_server/request_recorder.py`, `tests/aiperf_mock_server/tokens.py`, `tests/aiperf_mock_server/utils.py` | `ResponsesRequest` is imported into each request-processing module so dispatch logic can recognize and handle this new request type. |
| **Responses endpoint handler implementation**<br>`tests/aiperf_mock_server/app.py` | Endpoint signature changes from `req: dict[str, Any]` to `req: ResponsesRequest`. Removes the `_extract_responses_prompt` helper and builds chat completion messages directly from `req.model` and `req.prompt_text`. |
| **Request processing pipeline support**<br>`tests/aiperf_mock_server/request_recorder.py`, `tests/aiperf_mock_server/tokens.py`, `tests/aiperf_mock_server/utils.py` | Request recorder tokenizes `prompt_text` via tokenizer call mode; tokens module extracts prompt and output-token cap, and maps `max_output_tokens` to `max_completion_tokens` in the OSL fingerprint; utils generates `resp-`-prefixed request IDs. |
| **ResponsesRequest model and behavior tests**<br>`tests/unit/server/test_models.py` | `TestResponsesRequest` verifies prompt-text flattening across multiple input shapes, unmodeled field preservation via Pydantic extras, and safe field defaults. |
| **Request processing dispatch tests**<br>`tests/unit/aiperf_mock_server/test_request_recorder.py`, `tests/unit/server/test_tokens.py` | `TestResponsesRequestRecorderDispatch` tests request ID prefixing and tokenization with flattened content blocks; `TestResponsesRequestDispatch` validates content extraction from nested inputs, fingerprint field canonicalization, and token count constraints. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hops through responses with glee,
Input shapes flattened with care—
Prompt text joins threads like a spree,
Tokenization, fingerprints fair!
New request type, tested with flair. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and concisely describes the main change: introducing a new ResponsesRequest model with complete integration throughout the mock-server codebase. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/aiperf_mock_server/models.py (1)

242-242: 💤 Low value

Consider more specific type hint for input field.

The input field is typed as str | list[Any], but based on _flatten_responses_input logic (lines 265-283), the actual expected shapes are more specific: strings, lists of strings, or lists of dicts with content fields. Consider narrowing to str | list[str | dict[str, Any]] for better type safety.

♻️ Proposed type refinement
-    input: str | list[Any] = ""
+    input: str | list[str | dict[str, Any]] = ""
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/aiperf_mock_server/models.py` at line 242, The current model field
named input is too broad (str | list[Any]); update its type to a more specific
union to match _flatten_responses_input expectations: use str | list[str |
dict[str, Any]] (ensure Any is imported from typing or typing_extensions
depending on project) so the field accepts plain strings, lists of strings, or
lists of dicts with content keys; update the type annotation for the input field
and run type checks to verify compatibility with the _flatten_responses_input
function and any serializers/deserializers that consume this model.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ca885e47-196d-495a-a0ae-934516df7bb5

📥 Commits

Reviewing files that changed from the base of the PR and between 03167d0 and f26c562.

📒 Files selected for processing (8)
  • tests/aiperf_mock_server/app.py
  • tests/aiperf_mock_server/models.py
  • tests/aiperf_mock_server/request_recorder.py
  • tests/aiperf_mock_server/tokens.py
  • tests/aiperf_mock_server/utils.py
  • tests/unit/aiperf_mock_server/test_request_recorder.py
  • tests/unit/server/test_models.py
  • tests/unit/server/test_tokens.py

Comment on lines +231 to +250
class ResponsesRequest(BaseModel):
    """Request model for OpenAI's /v1/responses endpoint.

    The Responses API takes its prompt under `input` (which may be a string,
    a list of strings, or a list of content-block dicts) and caps generation
    via `max_output_tokens` rather than the chat API's `max_completion_tokens`.
    Modeled here so the recorder can capture the real payload instead of the
    synthetic ChatCompletionRequest the latency simulator drives off of.
    """

    model: str
    input: str | list[Any] = ""
    max_output_tokens: int | None = None
    stream: bool = False
    reasoning_effort: Literal["low", "medium", "high"] | None = None

    # Mirrors BaseCompletionRequest so recorder/simulator share field semantics
    # when the client supplies them via extras.
    min_tokens: int | None = None
    ignore_eos: bool = False

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add Field descriptions to all Pydantic fields.

All fields in ResponsesRequest lack Field(description="...") annotations. As per coding guidelines, every Pydantic field must include a description.

📝 Proposed fix to add Field descriptions
+from pydantic import Field
+
 class ResponsesRequest(BaseModel):
     """Request model for OpenAI's /v1/responses endpoint.
 
     The Responses API takes its prompt under `input` (which may be a string,
     a list of strings, or a list of content-block dicts) and caps generation
     via `max_output_tokens` rather than the chat API's `max_completion_tokens`.
     Modeled here so the recorder can capture the real payload instead of the
     synthetic ChatCompletionRequest the latency simulator drives off of.
     """
 
-    model: str
-    input: str | list[Any] = ""
-    max_output_tokens: int | None = None
-    stream: bool = False
-    reasoning_effort: Literal["low", "medium", "high"] | None = None
+    model: str = Field(description="Model identifier for the Responses API endpoint")
+    input: str | list[Any] = Field(
+        default="",
+        description="Prompt input: string, list of strings, or list of content-block dicts",
+    )
+    max_output_tokens: int | None = Field(
+        default=None,
+        description="Maximum number of tokens to generate in the completion",
+    )
+    stream: bool = Field(
+        default=False,
+        description="Whether to stream the response as server-sent events",
+    )
+    reasoning_effort: Literal["low", "medium", "high"] | None = Field(
+        default=None,
+        description="Reasoning effort level for extended thinking models",
+    )
 
     # Mirrors BaseCompletionRequest so recorder/simulator share field semantics
     # when the client supplies them via extras.
-    min_tokens: int | None = None
-    ignore_eos: bool = False
+    min_tokens: int | None = Field(
+        default=None,
+        description="Minimum number of tokens to generate before allowing EOS",
+    )
+    ignore_eos: bool = Field(
+        default=False,
+        description="Whether to ignore end-of-sequence tokens during generation",
+    )

As per coding guidelines: "Add Field(description="...") on EVERY Pydantic field".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/aiperf_mock_server/models.py` around lines 231 - 250,
ResponsesRequest's Pydantic fields lack Field(...) descriptions; update each
field in the ResponsesRequest class (model, input, max_output_tokens, stream,
reasoning_effort, min_tokens, ignore_eos) to use Field(..., description="...")
with concise descriptions matching their purpose, e.g. Field(default="",
description="prompt input as string or list...") for input and appropriate
defaults for others, and ensure Field is imported from pydantic if not already.

 mock_req = ChatCompletionRequest(
     model=model,
-    messages=[{"role": "user", "content": _extract_responses_prompt(req)}],
+    messages=[{"role": "user", "content": req.prompt_text}],

The Responses handler still builds the request context from a synthetic ChatCompletionRequest, so real /v1/responses calls never exercise the new ResponsesRequest recorder/tokenizer/request-id dispatch and drop max_output_tokens, min_tokens, ignore_eos, and reasoning_effort. Fix: build the context from the parsed ResponsesRequest instead.

🤖 AI Fix

In tests/aiperf_mock_server/app.py, update responses() to call make_ctx(req, endpoint, request.state.start_time) and remove the mock_req ChatCompletionRequest construction so ResponsesRequest drives tokenization, request IDs, and request recording.


codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@FrankD412 (Contributor, Author) commented:

Request distribution (1000 requests)
──────────────────────────────────────────────
  Definitions
    ISL/OSL: input/requested output sequence length in tokens; OSL is the request cap, not generated output.
    Vocab used: unique token IDs observed / tokenizer vocab size.
    top-10 cover: share of prompt tokens from the 10 most common token IDs.
    entropy: token-id diversity; higher means broader prompt vocabulary use.
    top decoded tokens: most frequent token IDs decoded for sanity checks; tokens are not words.
    vocab shape: log-scaled 80-bucket view across token-id space.
    vocab shape stats: mean/percentiles of prompt-token counts per bucket, including empty buckets.

  /v1/chat/completions  n=1000
    ISL            mean  1121.1   min    37   max  2527   p50  1138   p99  2201
    Requested OSL  mean   128.0   min   128   max   128   p50   128   p99   128

    ISL histogram (25 bins, n=1000, 769 unique)
        37-  137   10 ██░░░░░░░░░░░░░░░░░░
       137-  236   15 ███░░░░░░░░░░░░░░░░░
       236-  336   23 █████░░░░░░░░░░░░░░░
       336-  435   39 ████████░░░░░░░░░░░░
       435-  535   40 ████████░░░░░░░░░░░░
       535-  635   43 █████████░░░░░░░░░░░
       635-  734   51 ██████████░░░░░░░░░░
       734-  834   61 ████████████░░░░░░░░
       834-  933   71 ██████████████░░░░░░
       933- 1033   68 ██████████████░░░░░░
      1033- 1133   73 ███████████████░░░░░
      1133- 1232  100 ████████████████████
      1232- 1332   86 █████████████████░░░
      1332- 1431   68 ██████████████░░░░░░
      1431- 1531   57 ███████████░░░░░░░░░
      1531- 1631   48 ██████████░░░░░░░░░░
      1631- 1730   39 ████████░░░░░░░░░░░░
      1730- 1830   30 ██████░░░░░░░░░░░░░░
      1830- 1929   32 ██████░░░░░░░░░░░░░░
      1929- 2029   17 ███░░░░░░░░░░░░░░░░░
      2029- 2129    9 ██░░░░░░░░░░░░░░░░░░
      2129- 2228   13 ███░░░░░░░░░░░░░░░░░
      2228- 2328    3 █░░░░░░░░░░░░░░░░░░░
      2328- 2427    3 █░░░░░░░░░░░░░░░░░░░
      2427- 2527    1 ░░░░░░░░░░░░░░░░░░░░

    Requested OSL histogram (1 bins, n=1000, 1 unique)
      128- 128  1000 ████████████████████


    Vocab  used 18223/128000 (14.2%)  top-10 cover 13%  entropy 10.7/17.0 bits
      top decoded tokens: " the" 24285, " I" 20910, " and" 18940, " to" 16151, " of" 16115

    vocab shape  (80 buckets over id 0..127999, log-y)

      bucket tokens mean 13900.7   p50  2380   p90 15189   p95 34966   p99 212801

    ██▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▆▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▄▁▃▁▃▂▂▂▂▂▁▂▂▃▃▂▂▁
    0                   32K                 64K                 96K             128K

This confirms that the change has not disturbed the ISL/OSL distribution.
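
For readers checking the vocab line in the report, the sketch below shows one plausible way to compute those figures. It assumes "entropy" means Shannon entropy in bits over token-ID frequencies and that the denominator shown is `log2(vocab_size)` (for 128000 that is about 16.97, which matches the 17.0 in the report). The function name `vocab_stats` is hypothetical and the report's actual code is not shown in this PR.

```python
import math
from collections import Counter


def vocab_stats(token_ids: list[int], vocab_size: int) -> dict[str, float]:
    """Summarize prompt-token diversity (illustrative, not the report's code)."""
    counts = Counter(token_ids)
    total = len(token_ids)
    # Shannon entropy in bits over the empirical token-ID distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Share of all prompt tokens covered by the 10 most common token IDs.
    top10 = sum(c for _, c in counts.most_common(10)) / total
    return {
        "vocab_used": len(counts) / vocab_size,
        "top10_cover": top10,
        "entropy_bits": entropy,
        "max_entropy_bits": math.log2(vocab_size),
    }


stats = vocab_stats([1, 1, 2, 3], vocab_size=8)
```

In this toy input, three of eight IDs are used and the entropy is 1.5 bits out of a possible 3.0.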
