feat(memory): add STM condensation scheduling (STMO) #1400
spalimpaaces-star wants to merge 19 commits into main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds per-request short-term memory (STM) gating, STM condenser handling, a configurable conversation-history cap, STMO scheduling/enqueueing during persistence, consolidation of persistence handles, and threading of memory execution state plus conversation-turn metadata through routing, history loading, and persistence flows.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Client as Client
participant Router as GrpcRouter
participant Context as ResponsesContext
participant Persistence as PersistenceHandles
participant Queue as ConversationMemoryQueue
Client->>Router: request + headers
Router->>Context: build per-request MemoryExecutionContext (stm_enabled)
Context->>Context: load history (limit = max_conversation_history_items [+1 if STM])
Context->>Persistence: persist_response_if_needed(..., memory_execution_context, conversation_turn_info)
Persistence->>Persistence: persist conversation items
alt STM enabled & STMO boundary met
Persistence->>Queue: enqueue NewConversationMemory { Stmo, Ready, condenser_model? }
Queue-->>Persistence: ack / error (errors ignored)
end
Persistence-->>Router: persist result
Router-->>Client: response
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Hi @spalimpaaces-star, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
Code Review
This pull request introduces Short-Term Memory (STM) condensation (STMO) functionality, enabling the gateway to trigger background condensation jobs at specific conversation boundaries (turns 4, 7, 10, etc.). It adds a configurable max_conversation_history_items setting, refactors the persistence layer to use a bundled PersistenceHandles struct, and propagates a per-request MemoryExecutionContext through the gRPC and OpenAI router pipelines. Feedback identifies critical logic issues in the STMO trigger implementation: the turn and item counts are currently calculated relative to the current request's input rather than the absolute conversation state. This will prevent STMO from firing correctly for managed conversations and provide incorrect indices to the condensation worker.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 859aa814b9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Replace hardcoded MAX_CONVERSATION_HISTORY_ITEMS=100 with a configurable max_conversation_history_items field in RouterConfig. Defaults to 100 so existing deployments are unaffected. Validated at startup to reject zero. Threaded through GrpcRouter, ResponsesContext, and OpenAI history path. Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
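As a rough illustration of what this commit describes, here is a minimal sketch of the new config field with its default and startup validation; the trimmed struct, the serde attribute, and the validation function shape are assumptions, not the gateway's exact code.

```rust
use serde::Deserialize;

// Sketch only: the real RouterConfig carries many more fields.
#[derive(Debug, Deserialize)]
pub struct RouterConfig {
    /// Upper bound on prior conversation items loaded as history (default 100).
    #[serde(default = "default_max_conversation_history_items")]
    pub max_conversation_history_items: usize,
}

fn default_max_conversation_history_items() -> usize {
    100
}

/// Startup validation: a zero cap would make history loading meaningless, so reject it.
pub fn validate_router_config(config: &RouterConfig) -> Result<(), String> {
    if config.max_conversation_history_items == 0 {
        return Err("max_conversation_history_items must be greater than 0".to_string());
    }
    Ok(())
}
```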
… after persistence

- Add stm_enabled + stm_condenser_model_id to MemoryHeaderView and MemoryExecutionContext; gate stm_enabled by runtime.enabled so STM and LTM honour the same runtime memory switch
- Add memory_execution_context and max_conversation_history_items to ResponsesContext; build a fresh per-request context from headers in route_responses_impl for both Harmony and regular gRPC paths
- Store memory_runtime_config on GrpcRouter for per-request header parsing without re-reading router config on every call
- Thread conversation_memory_writer + memory_execution_context through persist_response_if_needed and persist_conversation_items across all gRPC and OpenAI response paths (streaming + non-streaming)
- After link_items_to_conversation succeeds, count user_turns from in-memory input_items (role eq_ignore_ascii_case "user") and total_items = input + output; enqueue STMO when user_turns >= 4 && (user_turns - 1) % 3 == 0
- Enqueue runs best-effort in a tokio::spawn with tracing span propagation so it never blocks or fails the response path

Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
- stmo_fires_at_boundary_turns: verifies trigger at 4, 7, 10 and absence between boundaries
- count_user_turns_case_insensitive: role matching is case-insensitive and items without a role field are excluded
- total_items_exceeds_user_turns: asserts target_item_end reflects all items (input + output), not just user turns

Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
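Read together, the two commits above amount to a small boundary predicate. The sketch below writes it out as a free function plus one of the tests named in the commit message, assuming a plain usize input rather than the gateway's real types.

```rust
/// STMO boundary rule described above: fire at user turns 4, 7, 10, 13, ...
/// (every third user turn starting at 4). Sketch only; the real check runs
/// inside persistence_utils.rs against the assembled conversation turn info.
fn should_enqueue_stmo(user_turns: usize) -> bool {
    user_turns >= 4 && (user_turns - 1) % 3 == 0
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn stmo_fires_at_boundary_turns() {
        for turns in [4, 7, 10] {
            assert!(should_enqueue_stmo(turns));
        }
        for turns in [1, 2, 3, 5, 6, 8, 9] {
            assert!(!should_enqueue_stmo(turns));
        }
    }
}
```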
Bundle the four storage backends (response, conversation, conversation_item, conversation_memory_writer) into a PersistenceHandles struct on ResponsesContext, mirroring the pattern from PR lightseekorg#1357 so both code paths stay consistent and future merges remain clean.

- Add PersistenceHandles (Clone) to grpc/common/responses/context.rs
- Replace four individual fields on ResponsesContext with persistence: PersistenceHandles
- Reduce ResponsesContext::new() from 10 to 7 args (drops the too_many_arguments suppress)
- Update persist_response_if_needed to take (&PersistenceHandles, MemoryExecutionContext) instead of four individual storage handles — drops from 8 to 5 args
- Update all gRPC call sites: harmony/non_streaming, harmony/streaming (x2), regular/non_streaming, regular/streaming
- Update direct storage field accesses in handlers.rs, harmony/common.rs, regular/common.rs, and router.rs

Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
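A minimal sketch of the PersistenceHandles bundle this commit introduces; the four field names follow the commit message, while the placeholder trait definitions and Arc wrapping are assumptions about the real context.rs types.

```rust
use std::sync::Arc;

// Placeholder traits standing in for the gateway's storage backends.
pub trait ResponseStorage: Send + Sync {}
pub trait ConversationStorage: Send + Sync {}
pub trait ConversationItemStorage: Send + Sync {}
pub trait ConversationMemoryWriter: Send + Sync {}

/// Bundles the four storage handles that ResponsesContext previously carried
/// as separate fields, so one clone passes them all through the call chain.
#[derive(Clone)]
pub struct PersistenceHandles {
    pub response_storage: Arc<dyn ResponseStorage>,
    pub conversation_storage: Arc<dyn ConversationStorage>,
    pub conversation_item_storage: Arc<dyn ConversationItemStorage>,
    pub conversation_memory_writer: Arc<dyn ConversationMemoryWriter>,
}
```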
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/routers/common/persistence_utils.rs`:
- Around line 385-394: Replace the request-local counting logic (the
count_user_turns helper and any uses that compute target_item_end from
input_items/output_items) with a conversation-level count obtained from
ConversationItemStorage::count_items_and_user_turns(...) after
link_items_to_conversation completes; specifically, call the storage method to
get total items and user_turns and derive target_item_end from those
conversation-level counts, and preserve the current best-effort skip/fallback
behavior when the storage reports STMO as unsupported (handle the Result/Option
the storage returns and only enqueue when the storage indicates support). Update
both the place where count_user_turns is used and the other similar block
(currently computing target_item_end) to use the storage API and remove reliance
on request-local input/output slices.
The DB write for the STMO job is lightweight (single INSERT) and should be awaited inline rather than spawned fire-and-forget. Removes the tokio::spawn + .instrument() wrapper and calls handle_stmo_after_persist directly within persist_conversation_items_inner. Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
Force-pushed 859aa81 to d6d6960 (Compare)
Introduce ConversationTurnInfo { user_turns, total_items } computed
from the fully-assembled input (history + new turn) at load time, and
thread it through every persistence call site across OpenAI and gRPC
paths (harmony + regular, streaming + non-streaming).
Previously, handle_stmo_after_persist derived user_turns and
target_item_end from original_body.input, which for managed
conversations only contains the latest user message. This caused
should_enqueue_stmo to always see user_turns=1, so the 4/7/10
boundaries were never reached and STMO jobs were never enqueued.
count_conversation_turn_info is now called on modified_request.input
after load_conversation_history / load_previous_messages /
load_input_history assembles the full conversation, giving both
values their correct absolute position within the conversation.
Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
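A simplified sketch of the ConversationTurnInfo struct and counting helper this commit describes; the real count_conversation_turn_info takes the gateway's ResponseInput, so a bare role-only item shape is assumed here.

```rust
#[derive(Debug, Clone, Copy)]
pub struct ConversationTurnInfo {
    pub user_turns: usize,
    pub total_items: usize,
}

/// Stand-in for one assembled input/output item; only the role matters here.
pub struct Item {
    pub role: Option<String>,
}

/// Count user turns (role == "user", case-insensitive) and total items from
/// the fully assembled input (history + new turn), as the commit describes.
pub fn count_conversation_turn_info(items: &[Item]) -> ConversationTurnInfo {
    let user_turns = items
        .iter()
        .filter(|item| {
            item.role
                .as_deref()
                .is_some_and(|role| role.eq_ignore_ascii_case("user"))
        })
        .count();
    ConversationTurnInfo {
        user_turns,
        total_items: items.len(),
    }
}
```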
Mirrors the pattern introduced in PR lightseekorg#1357 — clones the context while swapping in a request-scoped StorageRequestContext, removing the need for duplicated constructor wiring at call sites that attach per-request storage hook context. Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
…context

- Remove with_request_context() from ResponsesContext — no current call site exists since per-request builds also swap memory_execution_context; will be re-added after PR lightseekorg#1357 rebase if still needed
- Add #[expect(clippy::too_many_arguments)] to persist_conversation_items_inner and execute_mcp_tool_loop_streaming to satisfy -D warnings

Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
Replace stm_enabled: bool in MemoryExecutionContext with MemoryExecutionState, matching the pattern used for store_ltm and recall. This lets call sites distinguish three states:

- NotRequested — client did not send the STM flag
- GatedOff — client requested STM but runtime.enabled = false
- Active — client requested STM and runtime allows it

Previously the bool collapsed NotRequested and GatedOff to false, making it impossible to observe when STM was silently disabled at the server level. All call sites updated to use .active().

Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
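A sketch of the three-state enum described in this commit; the variant names and the active() helper follow the commit text, while the derives and exact method set are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryExecutionState {
    /// Client did not send the STM flag.
    NotRequested,
    /// Client requested STM but runtime.enabled = false.
    GatedOff,
    /// Client requested STM and the runtime allows it.
    Active,
}

impl MemoryExecutionState {
    /// True only when STM was both requested and permitted by the runtime,
    /// which is what call sites check via .active().
    pub fn active(self) -> bool {
        matches!(self, MemoryExecutionState::Active)
    }
}
```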
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 79aca7fb8f
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
model_gateway/src/routers/grpc/regular/responses/streaming.rs (1)
430-437: ⚠️ Potential issue | 🟠 Major

conversation_turn_info is dropped in MCP streaming, blocking STMO scheduling on that path. At line 436 the parameter is renamed to _conversation_turn_info and then not forwarded at lines 451-458. This means MCP streaming requests cannot carry turn metadata into persistence/STMO logic, even though callers now compute and pass it.

Suggested direction

 pub(super) fn execute_tool_loop_streaming(
     ctx: &ResponsesContext,
     current_request: ResponsesRequest,
     original_request: &ResponsesRequest,
     params: ResponsesCallContext,
     mcp_servers: Vec<McpServerBinding>,
-    _conversation_turn_info: Option<ConversationTurnInfo>,
+    conversation_turn_info: Option<ConversationTurnInfo>,
 ) -> Response {
@@
     let result = execute_tool_loop_streaming_internal(
         &ctx_clone,
         current_request,
         &original_request_clone,
         params,
         mcp_servers,
+        conversation_turn_info,
         tx.clone(),
     )
     .await;

Then thread conversation_turn_info into the internal MCP completion/persistence point the same way non-MCP streaming already does.

Also applies to: 451-458
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/responses/streaming.rs` around lines 430 - 437, The parameter conversation_turn_info is being dropped in execute_tool_loop_streaming (renamed to _conversation_turn_info) and not forwarded into the MCP streaming completion/persistence path; restore and thread the ConversationTurnInfo through the MCP-specific completion/persistence call the same way non-MCP streaming does by removing the underscore from the parameter name and passing that variable into the internal MCP completion/persistence invocation (the function/method where MCP responses are finalized/persisted) so STMO scheduling and persistence receive the turn metadata.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/routers/grpc/regular/responses/common.rs`:
- Around line 372-381: The current turn_info is computed from
modified_request.input which only contains rehydrated "message" rows and misses
function_call, function_call_output, reasoning, and MCP items; update the logic
so count_conversation_turn_info is run against the fully assembled ResponseInput
(i.e., after the conversation history/replay has been loaded and merged) or
alternatively supply a separate total_items value that includes the skipped
rows; specifically adjust the flow around LoadedRequest construction
(modified_request and turn_info) so ConversationTurnInfo { user_turns,
total_items } comes from the complete replayed input rather than the partial
modified_request.input to avoid undersized target_item_end in STMO.
In `@model_gateway/src/routers/openai/responses/history.rs`:
- Around line 240-244: The STMO turn counts are being computed from the filtered
request_body.input (which omits persisted reasoning/tool rows), so call
count_conversation_turn_info on the fully assembled replay/response input after
conversation history is loaded (not on request_body.input), or alternatively
build ConversationTurnInfo using the absolute persisted total_items (including
skipped rows) plus user_turns from the assembled input; update the stm_enabled
branch that currently computes conversation_turn_info to use that assembled
ResponseInput (or the persisted total) so ConversationTurnInfo.total_items
reflects the true conversation size.
---
Outside diff comments:
In `@model_gateway/src/routers/grpc/regular/responses/streaming.rs`:
- Around line 430-437: The parameter conversation_turn_info is being dropped in
execute_tool_loop_streaming (renamed to _conversation_turn_info) and not
forwarded into the MCP streaming completion/persistence path; restore and thread
the ConversationTurnInfo through the MCP-specific completion/persistence call
the same way non-MCP streaming does by removing the underscore from the
parameter name and passing that variable into the internal MCP
completion/persistence invocation (the function/method where MCP responses are
finalized/persisted) so STMO scheduling and persistence receive the turn
metadata.
…on path

The gRPC and OpenAI conversation loaders filter stored items before assembling the inference window (gRPC keeps only messages; OpenAI also keeps function_call/function_call_output but drops reasoning). count_conversation_turn_info on the filtered input undercounts total_items for any conversation with prior tool or reasoning turns, causing target_item_end to point at the wrong range. Fix: capture stored_items.len() from list_items before filtering and use raw_count + current_input_count as total_items, giving the correct absolute position in the conversation regardless of which item types the inference window excludes.

Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
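The correction described in this commit reduces to a small rule; the sketch below is illustrative only, with hypothetical parameter names rather than the loaders' actual locals.

```rust
/// When history was loaded from conversation storage, total_items should be
/// the raw stored-item count plus the current request's item count, not the
/// size of the filtered inference window.
fn corrected_total_items(
    raw_stored_item_count: Option<usize>,
    current_input_count: usize,
    filtered_total: usize,
) -> usize {
    match raw_stored_item_count {
        Some(raw_count) => raw_count + current_input_count,
        None => filtered_total,
    }
}
```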
When stm_enabled is active, request max_conversation_history_items + 1 from list_items. If cap + 1 items are returned the conversation has grown past the configured ceiling, which would corrupt STMO turn-count math. Both the gRPC regular and OpenAI Responses paths now return a 400 conversation_too_large error in that case, with no overhead for deployments not using STMO. Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
model_gateway/src/routers/openai/responses/history.rs (1)
259-261: ⚠️ Potential issue | 🟠 Major

Skip STMO scheduling when conversation history falls back.

If list_items fails here, the request intentionally proceeds without loaded history, but the stm_enabled block still builds ConversationTurnInfo from the unhydrated request_body.input. That undercounts user_turns and total_items, so persistence can enqueue STMO on the wrong boundary or with the wrong target_item_end after a transient history-store failure. Return None for conversation_turn_info when a conversation-backed history load failed, and let this request skip STMO instead.

Suggested fix

-    let conversation_turn_info = if stm_enabled {
+    let conversation_history_loaded = conversation.is_none() || raw_stored_item_count.is_some();
+
+    let conversation_turn_info = if stm_enabled && conversation_history_loaded {
         let mut info = count_conversation_turn_info(&request_body.input);
         // Correct total_items when loaded from conversation storage: reasoning
         // items are skipped during load but are stored in the DB, so the
         // assembled input underestimates the true conversation size. Use the
         // raw DB count + current input item count for an accurate target_item_end.
         if let Some(raw_count) = raw_stored_item_count {
             info.total_items = raw_count + current_input_count;
         }
         Some(info)
     } else {
         None
     };

Based on learnings: load_input_history intentionally uses graceful degradation when storage operations fail, and count_conversation_turn_info(input: &ResponseInput) is designed for callers that pass the fully assembled absolute conversation input.

Also applies to: 275-287
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/openai/responses/history.rs` around lines 259 - 261, the error branch that just warns on failed history load should also null out the computed ConversationTurnInfo so STMO logic doesn't use unhydrated request input; in the Err(e) arm where you log "Failed to load conversation history", set conversation_turn_info = None (or otherwise return None from the load_input_history caller) instead of leaving a stale value, and ensure the stm_enabled block checks that conversation_turn_info is Some before calling count_conversation_turn_info or using user_turns/total_items; reference load_input_history, conversation_turn_info, ConversationTurnInfo, count_conversation_turn_info, and stm_enabled when making this change so callers skip STMO when history load falls back.

model_gateway/src/routers/grpc/regular/responses/common.rs (1)

358-360: ⚠️ Potential issue | 🟡 Minor

Skip STMO turn-info generation when conversation history load degrades.

When list_items fails (Line 358), the code still computes turn_info from modified_request.input (Line 398), which may only contain request-local items. That can produce non-absolute counts for STMO scheduling in degraded mode.

🔧 Proposed fix

 pub(super) async fn load_conversation_history(
     ctx: &ResponsesContext,
     request: &ResponsesRequest,
     stm_enabled: bool,
 ) -> Result<LoadedRequest, Response> {
     let mut modified_request = request.clone();
     let mut conversation_items: Option<Vec<ResponseInputOutputItem>> = None;
+    let mut conversation_history_loaded = false;
     // Tracks the raw DB item count (all types) for the conversation path so
     // that total_items in ConversationTurnInfo is not undercounted when
     // function_call/function_call_output/MCP items are present in storage
     // but filtered out of the inference window.
     let mut raw_stored_item_count: Option<usize> = None;
@@
     match ctx
         .persistence
         .conversation_item_storage
         .list_items(&conv_id, params)
         .await
     {
         Ok(stored_items) => {
+            conversation_history_loaded = true;
             if stm_enabled && stored_items.len() > cap {
                 return Err(error::bad_request(
                     "conversation_too_large",
@@
     let turn_info = if stm_enabled {
+        if request.conversation.is_some() && !conversation_history_loaded {
+            None
+        } else {
         let mut info = count_conversation_turn_info(&modified_request.input);
         // If we loaded from conversation storage, total_items from the
         // assembled (message-only) input undercounts — function_call,
         // function_call_output, and MCP items are in the DB but filtered
         // out of the inference window. Use the raw DB count instead so
         // target_item_end points at the correct absolute position.
         if let Some(raw_count) = raw_stored_item_count {
             let current_input_count = match &request.input {
                 ResponseInput::Text(_) => 1,
                 ResponseInput::Items(items) => items.len(),
             };
             info.total_items = raw_count + current_input_count;
         }
         Some(info)
+        }
     } else {
         None
     };

Based on learnings: count_conversation_turn_info(input: &ResponseInput) / ConversationTurnInfo are intended to run on a fully assembled ResponseInput so counts stay absolute.

Also applies to: 398-413
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/responses/common.rs` around lines 358 - 360, When list_items fails while loading conversation history (the Err(e) branch), avoid computing turn_info from the partially assembled modified_request.input; instead record that history load degraded and skip/leave ConversationTurnInfo unset. Specifically, in the block handling list_items errors, set a flag or option (e.g., history_load_ok = false or turn_info = None) and only call count_conversation_turn_info(&modified_request.input) and populate ConversationTurnInfo when history_load_ok is true (i.e., when list_items succeeded). Ensure any downstream uses expect an absent/None turn_info when history load degraded so STMO scheduling uses only absolute counts computed from full history.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@model_gateway/src/routers/grpc/regular/responses/common.rs`:
- Around line 358-360: When list_items fails while loading conversation history
(the Err(e) branch), avoid computing turn_info from the partially assembled
modified_request.input; instead record that history load degraded and skip/leave
ConversationTurnInfo unset. Specifically, in the block handling list_items
errors, set a flag or option (e.g., history_load_ok = false or turn_info = None)
and only call count_conversation_turn_info(&modified_request.input) and populate
ConversationTurnInfo when history_load_ok is true (i.e., when list_items
succeeded). Ensure any downstream uses expect an absent/None turn_info when
history load degraded so STMO scheduling uses only absolute counts computed from
full history.
In `@model_gateway/src/routers/openai/responses/history.rs`:
- Around line 259-261: The error branch that just warns on failed history load
should also null out the computed ConversationTurnInfo so STMO logic doesn't use
unhydrated request input; in the Err(e) arm where you log "Failed to load
conversation history", set conversation_turn_info = None (or otherwise return
None from the load_input_history caller) instead of leaving a stale value, and
ensure the stm_enabled block checks that conversation_turn_info is Some before
calling count_conversation_turn_info or using user_turns/total_items; reference
load_input_history, conversation_turn_info, ConversationTurnInfo,
count_conversation_turn_info, and stm_enabled when making this change so callers
skip STMO when history load falls back.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 09c5213c1c
When list_items returns an error the assembled input only contains the current request, so turn counts would be wrong. Guard both the OpenAI and gRPC regular conversation paths: if a conversation was requested but raw_stored_item_count is still None after the match, return None for conversation_turn_info so persistence skips STMO entirely instead of enqueueing a job with a drifted last_index/target_item_end. Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
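A minimal sketch of the guard this commit describes; in the PR the check is inlined in the OpenAI and gRPC loaders rather than in a standalone helper like this.

```rust
/// Skip STMO when a conversation was requested but its history could not be
/// loaded (no raw item count was captured), so persistence never enqueues a
/// job with drifted counts.
fn turn_info_for_stmo<T>(
    conversation_requested: bool,
    raw_stored_item_count: Option<usize>,
    assembled_info: T,
) -> Option<T> {
    if conversation_requested && raw_stored_item_count.is_none() {
        None
    } else {
        Some(assembled_info)
    }
}
```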
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/routers/grpc/regular/responses/common.rs`:
- Around line 398-423: The override of info.total_items using
raw_stored_item_count + current_input_count can drop replayed response-chain
items that are already present in modified_request.input (when
previous_response_id was loaded); change the logic in the turn_info block so you
only replace info.total_items with raw_count + current_input_count when
request.previous_response_id is None (i.e., no replayed response-chain was
added), otherwise preserve the total_items computed by
count_conversation_turn_info(&modified_request.input); refer to turn_info,
stm_enabled, request, raw_stored_item_count, modified_request.input,
count_conversation_turn_info, and total_items to locate and update the
conditional.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f1dec4da8d
…loaded

When previous_response_id is absent and request.conversation is set, load_previous_messages returns early without loading conversation history. Computing turn_info from request.input alone in that case gives wrong user_turns and total_items. Guard the early-return branch so turn_info is None whenever conversation is set but history was not loaded.

Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ea0e87ef00
If a request somehow carries both previous_response_id and conversation, the chain-item merge runs last and overwrites modified_request.input with the full replayed history. Applying raw_stored_item_count + current_input on top of that would drop the chain items from total_items. Guard the correction with request.previous_response_id.is_none() in both the gRPC regular and OpenAI paths. The two fields are mutually exclusive per API contract, but the guard makes the function correct regardless. Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7925e8f69a
The cap overflow check and the extra DB fetch (cap+1) were gated only on stm_enabled, so requests with store=false could fail with 400 even though persist_response_if_needed short-circuits before any STMO enqueue. Add request.store.unwrap_or(true) to both the fetch_limit calculation and the rejection condition in the gRPC regular and OpenAI history loaders so read-only calls are never affected by the STMO conversation size limit. Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
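The gate this commit adds boils down to one condition; it is shown here as an illustrative helper rather than the in-place expression the PR actually uses.

```rust
/// The conversation_too_large rejection only applies when STM is enabled and
/// the response will actually be stored; store defaults to true when absent.
fn cap_rejection_applies(stm_enabled: bool, store: Option<bool>) -> bool {
    stm_enabled && store.unwrap_or(true)
}
```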
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e7d765252
The MCP streaming tool loop (execute_tool_loop_streaming) never calls persist_response_if_needed so STMO is never enqueued there. Applying the conversation_too_large cap check in that path rejects valid requests for no reason. Fix by moving ensure_mcp_connection before load_conversation_history in the streaming handler and passing stm_enabled && !has_mcp_tools to the loader, so the cap check and rejection are skipped for the MCP streaming path. Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
Fetching cap+1 items from list_items is harmless — it is just a SQL LIMIT. Remove the conditional fetch_limit calculation and always pass cap.saturating_add(1) directly. The rejection check already carries the correct stm_enabled && store=true gate, so no behaviour changes. Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5410bb19c6
…p_streaming

The MCP streaming path never calls persist_response_if_needed so conversation_turn_info was accepted but immediately discarded (_prefix). Remove the parameter from execute_tool_loop_streaming and its call site in the streaming handler.

Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
Fetching cap+1 and checking > cap is equivalent to fetching cap and checking >= cap, but the latter is simpler: the SQL LIMIT already bounds the inference window so no post-fetch truncation is needed. Applied to both load_input_history (OpenAI) and load_conversation_history (gRPC regular). Signed-off-by: spalimpaaces-star <saikiran.palimpati@oracle.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ff8c4c25d9
This pull request has been automatically marked as stale because it has not had any activity within 14 days. It will be automatically closed if no further activity occurs within 16 days. Leave a comment if you feel this pull request should remain open. Thank you!
Description
Problem
When STM condensation (STMO) is enabled, the Responses API does not write a job entry to the queue after persisting a conversation turn. As a result, STM condensation never gets triggered regardless of the configuration.
A second class of problems was identified during review: even with the scheduling wired up, ConversationTurnInfo — the struct carrying user_turns and total_items used to decide when and where to schedule the job — was being computed from the wrong input in several paths. This caused jobs to be enqueued at the wrong turn boundary or with an incorrect target_item_end, making STMO fire too early, too late, or point at the wrong position in the conversation.
Note: The actual DB write for the ConversationMemory job row is a no-op in this PR — the ConversationMemoryWriter implementation is currently a stub and will be replaced with a real Postgres/Oracle write in a follow-up PR.
Solution
Core feature

- Parse stm_enabled and stm_condenser_model_id from the x-conversation-memory-config request header, gated by memory.runtime.enabled in server config
- Boundary check: user_turns >= 4 && (user_turns - 1) % 3 == 0 — fires at turns 4, 7, 10, 13, …
- When stm_enabled is active, write a ConversationMemory row with memory_type = stmo and status = ready

ConversationTurnInfo correctness

ConversationTurnInfo carries user_turns and total_items. Both values must be accurate — wrong values cause STMO to fire on the wrong turn or write a job pointing at the wrong position in the conversation. The following decisions were made to guarantee correctness across all paths (a consolidating sketch follows the list):

1. Compute from assembled input, not raw request
user_turns and total_items are counted after history has been fully loaded and prepended to request.input. Counting from the raw request alone gives user_turns = 1 regardless of how long the real conversation is.

2. Use raw DB item count for total_items
The conversation loaders filter stored items before building the inference window — gRPC regular keeps only messages, OpenAI drops reasoning items. Counting from the filtered window undercounts total_items. We capture stored_items.len() before any filtering and use raw_count + current_input_count for total_items so target_item_end points at the real end of the conversation.

3. Skip STMO when history load fails
If list_items returns an error, the loaders warn and continue with only the current message in request.input. Rather than compute a wrong ConversationTurnInfo, we return None so persistence skips STMO entirely for that request. The signal is conversation.is_some() && raw_stored_item_count.is_none().

4. Cap overflow detection
list_items is bounded by max_conversation_history_items. If the conversation has grown past the cap, even the raw count is wrong. We always fetch cap + 1 items — harmless since it is just a SQL LIMIT. If we get cap + 1 rows back, we reject with 400 conversation_too_large. This rejection only fires when stm_enabled && store.unwrap_or(true) — read-only calls (store=false) are never affected since persist_response_if_needed already short-circuits before any STMO enqueue.

5. Harmony path — request.conversation without history load
load_previous_messages only loads from previous_response_id. If request.conversation is set but no history is loaded for it, we return None for turn_info rather than count from partial input.

6. Skip raw-count correction when response chain is also loaded
conversation and previous_response_id are mutually exclusive by API contract but not enforced inside the loaders. If both paths ran, the chain merge overwrites modified_request.input last — count_conversation_turn_info already saw every item, so applying the raw-count override would drop the chain items from total_items. Guard: only apply the correction when request.previous_response_id.is_none().

7. MCP streaming path never persists
execute_tool_loop_streaming does not call persist_response_if_needed, so STMO is never enqueued there. To prevent the cap check from rejecting those requests, ensure_mcp_connection is moved before load_conversation_history in the streaming handler and stm_enabled = false is passed to the loader for the MCP streaming path.
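As the consolidating sketch referenced above, the boundary decision and the job row can be pictured as follows; the writer interface is omitted, and the NewConversationMemory shape here is inferred from the walkthrough rather than taken from the gateway's actual type.

```rust
// Minimal sketch under simplified types; not the real handle_stmo_after_persist.
pub struct ConversationTurnInfo {
    pub user_turns: usize,
    pub total_items: usize,
}

pub struct NewConversationMemory {
    pub memory_type: &'static str,   // "stmo"
    pub status: &'static str,        // "ready"
    pub target_item_end: usize,      // absolute end of the conversation
    pub condenser_model: Option<String>,
}

/// Decide whether this persisted turn is an STMO boundary and, if so, build
/// the job row to enqueue. Returns None between boundaries or when no turn
/// info was computed (e.g. history load degraded).
pub fn stmo_job_if_due(
    turn_info: Option<&ConversationTurnInfo>,
    condenser_model: Option<String>,
) -> Option<NewConversationMemory> {
    let info = turn_info?;
    let at_boundary = info.user_turns >= 4 && (info.user_turns - 1) % 3 == 0;
    at_boundary.then(|| NewConversationMemory {
        memory_type: "stmo",
        status: "ready",
        target_item_end: info.total_items,
        condenser_model,
    })
}
```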
Changes

Section 1 — Config
- config/types.rs — added max_conversation_history_items: usize to router config (default 100)
- config/builder.rs — wires the value from YAML config into the struct
- config/validation.rs — rejects 0 at startup

Section 2 — Header parsing → MemoryExecutionContext
- routers/common/header_utils.rs — parses stm_enabled and stm_condenser_model_id from x-conversation-memory-config
- memory/context.rs — stm_enabled promoted to MemoryExecutionState (3-state enum: NotRequested / GatedOff / Active) matching the LTM pattern; stm_condenser_model_id added
- routers/grpc/router.rs — builds a fresh MemoryExecutionContext per request from headers
- routers/grpc/common/responses/context.rs — adds PersistenceHandles struct grouping the four storage backends; adds memory_execution_context to ResponsesContext
- routers/openai/context.rs — same wiring for the OpenAI Responses path

Section 3 — History loading with correctness guards
- routers/openai/responses/history.rs — load_input_history: cap overflow detection, skip-on-failure guard, raw-count correction, chain-overlap guard, store=false gate
- routers/grpc/regular/responses/common.rs — load_conversation_history: same guards as above
- routers/grpc/regular/responses/handlers.rs — streaming handler: ensure_mcp_connection moved before history loading; passes stm_enabled && !has_mcp_tools to loader
- routers/grpc/regular/responses/streaming.rs — removed unused _conversation_turn_info parameter from execute_tool_loop_streaming
- routers/grpc/harmony/responses/common.rs — load_previous_messages: early-return guard on request.conversation.is_none()

Section 4 — ConversationTurnInfo computation
- routers/common/persistence_utils.rs — count_conversation_turn_info counts user_turns (role = "user") and total_items from the fully assembled input

Section 5 — Execute and persist (plumbing)
- routers/grpc/common/responses/handlers.rs — gRPC persist call site
- routers/grpc/common/responses/mod.rs — export update
- routers/grpc/common/responses/utils.rs — persist helper utilities
- routers/grpc/harmony/responses/non_streaming.rs — passes conversation_turn_info to persist
- routers/grpc/harmony/responses/streaming.rs — same for streaming path
- routers/grpc/regular/responses/non_streaming.rs — same for regular non-streaming path
- routers/openai/responses/non_streaming.rs — OpenAI persist call site
- routers/openai/responses/streaming.rs — same for streaming path
- routers/openai/responses/route.rs — OpenAI entry point

Section 6 — STMO boundary check and job write
- routers/common/persistence_utils.rs — handle_stmo_after_persist: evaluates the boundary, writes a ConversationMemory row with memory_type = stmo, status = ready, and target_item_end from total_items

Test Plan
Unit tests (persistence_utils.rs):
Checklist
- cargo +nightly fmt passes
- cargo clippy --all-targets --all-features -- -D warnings passes

Summary by CodeRabbit
New Features
Bug Fixes / Validation