feat(mock-server): add ResponsesRequest model with full dispatch plumbing#1000
feat(mock-server): add ResponsesRequest model with full dispatch plumbing#1000FrankD412 wants to merge 2 commits into
Conversation
…bing
Introduce `ResponsesRequest` as a first-class member of `RequestT` so the
recorder can capture Responses-specific fields (`max_output_tokens`,
`reasoning_effort`, `stream`) instead of the synthetic
`ChatCompletionRequest` the `/v1/responses` handler currently builds for
the latency simulator. Subsequent commit will wire `make_ctx` to accept
a record-time override so handlers can pass the real payload to the
recorder while still driving simulation off the synthetic chat.
The model exposes a `prompt_text` property that flattens the Responses
`input` shape (str | list[str|dict] | list[content-block]) into a single
string. The flattener logic moved verbatim from `app._extract_responses_prompt`
into `models._flatten_responses_input` so the recorder and tokenizer
dispatch sites share one source of truth; the handler call site now uses
`req.prompt_text`.
JSONL schema decision: Responses' `max_output_tokens` is canonicalized
into the existing `max_completion_tokens` column rather than introducing
a new field. Both name the same semantic (the OSL cap); preserving the
JSONL schema is more useful for downstream tools than preserving the
API name-space, and the `endpoint` column on each row already disambiguates.
Dispatch wired in:
- `models.RequestT` union
- `tokens._extract_request_content` (text + cap)
- `tokens._extract_osl_fingerprint` (canonicalized fields)
- `utils._create_request_id` (`resp-{uuid}` prefix)
- `request_recorder._encode_request_prompt_ids` (tokenize via prompt_text)
- `app.responses` handler signature (`req: dict` -> `req: ResponsesRequest`)
Tests cover:
- prompt_text flattening across all four input shapes
- extras (`tools`, `instructions`) pass through via BaseModel extra="allow"
- _extract_request_content and _extract_osl_fingerprint dispatch
- _create_request_id prefix
- _encode_request_prompt_ids for string and content-block input
- tokenize_request handles Responses on the generation path
Reported by reviewer dynamo-ops.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Frank Di Natale <[email protected]>
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3d656d2622e35b6a9eaaa9635be090366bf6fbbaRecommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3d656d2622e35b6a9eaaa9635be090366bf6fbbaLast updated for commit: |
WalkthroughThis PR adds ChangesResponsesRequest Support
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/aiperf_mock_server/models.py (1)
242-242: 💤 Low valueConsider more specific type hint for input field.
The
inputfield is typed asstr | list[Any], but based on_flatten_responses_inputlogic (lines 265-283), the actual expected shapes are more specific: strings, lists of strings, or lists of dicts with content fields. Consider narrowing tostr | list[str | dict[str, Any]]for better type safety.♻️ Proposed type refinement
- input: str | list[Any] = "" + input: str | list[str | dict[str, Any]] = ""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/aiperf_mock_server/models.py` at line 242, The current model field named input is too broad (str | list[Any]); update its type to a more specific union to match _flatten_responses_input expectations: use str | list[str | dict[str, Any]] (ensure Any is imported from typing or typing_extensions depending on project) so the field accepts plain strings, lists of strings, or lists of dicts with content keys; update the type annotation for the input field and run type checks to verify compatibility with the _flatten_responses_input function and any serializers/deserializers that consume this model.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/aiperf_mock_server/models.py`:
- Around line 231-250: ResponsesRequest's Pydantic fields lack Field(...)
descriptions; update each field in the ResponsesRequest class (model, input,
max_output_tokens, stream, reasoning_effort, min_tokens, ignore_eos) to use
Field(..., description="...") with concise descriptions matching their purpose,
e.g. Field(default="", description="prompt input as string or list...") for
input and appropriate defaults for others, and ensure Field is imported from
pydantic if not already.
---
Nitpick comments:
In `@tests/aiperf_mock_server/models.py`:
- Line 242: The current model field named input is too broad (str | list[Any]);
update its type to a more specific union to match _flatten_responses_input
expectations: use str | list[str | dict[str, Any]] (ensure Any is imported from
typing or typing_extensions depending on project) so the field accepts plain
strings, lists of strings, or lists of dicts with content keys; update the type
annotation for the input field and run type checks to verify compatibility with
the _flatten_responses_input function and any serializers/deserializers that
consume this model.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ca885e47-196d-495a-a0ae-934516df7bb5
📒 Files selected for processing (8)
tests/aiperf_mock_server/app.pytests/aiperf_mock_server/models.pytests/aiperf_mock_server/request_recorder.pytests/aiperf_mock_server/tokens.pytests/aiperf_mock_server/utils.pytests/unit/aiperf_mock_server/test_request_recorder.pytests/unit/server/test_models.pytests/unit/server/test_tokens.py
| class ResponsesRequest(BaseModel): | ||
| """Request model for OpenAI's /v1/responses endpoint. | ||
|
|
||
| The Responses API takes its prompt under `input` (which may be a string, | ||
| a list of strings, or a list of content-block dicts) and caps generation | ||
| via `max_output_tokens` rather than the chat API's `max_completion_tokens`. | ||
| Modeled here so the recorder can capture the real payload instead of the | ||
| synthetic ChatCompletionRequest the latency simulator drives off of. | ||
| """ | ||
|
|
||
| model: str | ||
| input: str | list[Any] = "" | ||
| max_output_tokens: int | None = None | ||
| stream: bool = False | ||
| reasoning_effort: Literal["low", "medium", "high"] | None = None | ||
|
|
||
| # Mirrors BaseCompletionRequest so recorder/simulator share field semantics | ||
| # when the client supplies them via extras. | ||
| min_tokens: int | None = None | ||
| ignore_eos: bool = False |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add Field descriptions to all Pydantic fields.
All fields in ResponsesRequest lack Field(description="...") annotations. As per coding guidelines, every Pydantic field must include a description.
📝 Proposed fix to add Field descriptions
+from pydantic import Field
+
class ResponsesRequest(BaseModel):
"""Request model for OpenAI's /v1/responses endpoint.
The Responses API takes its prompt under `input` (which may be a string,
a list of strings, or a list of content-block dicts) and caps generation
via `max_output_tokens` rather than the chat API's `max_completion_tokens`.
Modeled here so the recorder can capture the real payload instead of the
synthetic ChatCompletionRequest the latency simulator drives off of.
"""
- model: str
- input: str | list[Any] = ""
- max_output_tokens: int | None = None
- stream: bool = False
- reasoning_effort: Literal["low", "medium", "high"] | None = None
+ model: str = Field(description="Model identifier for the Responses API endpoint")
+ input: str | list[Any] = Field(
+ default="",
+ description="Prompt input: string, list of strings, or list of content-block dicts",
+ )
+ max_output_tokens: int | None = Field(
+ default=None,
+ description="Maximum number of tokens to generate in the completion",
+ )
+ stream: bool = Field(
+ default=False,
+ description="Whether to stream the response as server-sent events",
+ )
+ reasoning_effort: Literal["low", "medium", "high"] | None = Field(
+ default=None,
+ description="Reasoning effort level for extended thinking models",
+ )
# Mirrors BaseCompletionRequest so recorder/simulator share field semantics
# when the client supplies them via extras.
- min_tokens: int | None = None
- ignore_eos: bool = False
+ min_tokens: int | None = Field(
+ default=None,
+ description="Minimum number of tokens to generate before allowing EOS",
+ )
+ ignore_eos: bool = Field(
+ default=False,
+ description="Whether to ignore end-of-sequence tokens during generation",
+ )As per coding guidelines: "Add Field(description="...") on EVERY Pydantic field".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/aiperf_mock_server/models.py` around lines 231 - 250,
ResponsesRequest's Pydantic fields lack Field(...) descriptions; update each
field in the ResponsesRequest class (model, input, max_output_tokens, stream,
reasoning_effort, min_tokens, ignore_eos) to use Field(..., description="...")
with concise descriptions matching their purpose, e.g. Field(default="",
description="prompt input as string or list...") for input and appropriate
defaults for others, and ensure Field is imported from pydantic if not already.
| mock_req = ChatCompletionRequest( | ||
| model=model, | ||
| messages=[{"role": "user", "content": _extract_responses_prompt(req)}], | ||
| messages=[{"role": "user", "content": req.prompt_text}], |
There was a problem hiding this comment.
The Responses handler still builds the request context from a synthetic ChatCompletionRequest, so real /v1/responses calls never exercise the new ResponsesRequest recorder/tokenizer/request-id dispatch and drop max_output_tokens, min_tokens, ignore_eos, and reasoning_effort. Fix: build the context from the parsed ResponsesRequest instead.
🤖 AI Fix
In tests/aiperf_mock_server/app.py, update responses() to call make_ctx(req, endpoint, request.state.start_time) and remove the mock_req ChatCompletionRequest construction so ResponsesRequest drives tokenization, request IDs, and request recording.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
This is confirming that we haven't messed up the ISL/OSL distribution. |
Summary
Follow-up to #962. Introduces
ResponsesRequestas a first-class member ofRequestTso the mock-server request recorder can capture Responses-specific fields (max_output_tokens,reasoning_effort,stream) instead of the syntheticChatCompletionRequestthe/v1/responseshandler builds for the latency simulator.ResponsesRequestmodel with aprompt_textproperty that flattens the Responsesinputshape (str | list[str|dict] | list[content-block]) into a single string. Flattener logic moved verbatim fromapp._extract_responses_promptintomodels._flatten_responses_inputso recorder, tokenizer, and handler share one source of truth.models.RequestT,tokens._extract_request_content,tokens._extract_osl_fingerprint,utils._create_request_id(resp-{uuid}prefix),request_recorder._encode_request_prompt_ids, and theapp.responseshandler signature (req: dict->req: ResponsesRequest).max_output_tokensis canonicalized into the existingmax_completion_tokenscolumn rather than introducing a new field. Both name the same semantic (the OSL cap); preserving the JSONL schema is more useful for downstream tools than preserving the API name-space, and theendpointcolumn on each row already disambiguates.A subsequent commit will wire
make_ctxto accept a record-time override so handlers can pass the real payload to the recorder while still driving simulation off the synthetic chat.Test Plan
uv run pytest tests/unit/ -n auto(12881 passed; one unrelated MLflow flake passes in isolation)prompt_textflattening across all four input shapestools,instructions) pass through via BaseModelextra="allow"_extract_request_content/_extract_osl_fingerprintdispatch for Responses_create_request_idprefix (resp-)_encode_request_prompt_idsfor both string and content-block inputtokenize_requesthandles Responses on the generation pathReported by reviewer dynamo-ops.
Summary by CodeRabbit
New Features
Tests