fix: 修复 Codex WebSocket 会话在额度耗尽后不切换账号#2256
fix: 修复 Codex WebSocket 会话在额度耗尽后不切换账号#2256jacob-ayang wants to merge 8 commits intorouter-for-me:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! 此拉取请求旨在增强 Codex WebSocket 服务的健壮性,解决了一个关键问题:当某个账号的额度耗尽时,WebSocket 会话会持续尝试使用该账号,导致服务中断。通过引入在限额错误后解除账号绑定并关闭旧会话的机制,以及在认证信息变更时强制重建连接的逻辑,确保了服务能够动态切换到可用的账号,从而提高了服务的可靠性和用户体验。 Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc5cb31186
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if shouldResetResponsesWebsocketAuthPin(terminalStatus) { | ||
| // 限额错误后解除 pin 让后续请求重新选可用账号 | ||
| pinnedAuthID = "" | ||
| if h != nil && h.AuthManager != nil { |
There was a problem hiding this comment.
Reset incremental state when unpinning after quota errors
After a quota-class terminal error this block clears pinnedAuthID and closes the execution session, but it leaves request state unchanged; on the next turn, allowIncrementalInputWithPreviousResponseID can still be true (model-level), so normalizeResponseSubsequentRequest may forward a client previous_response_id that belongs to the old account to the newly selected account. In websocket v2 flows that send previous_response_id, this causes the post-switch request to fail with upstream 4xx instead of recovering, so the auth-rotation fix is incomplete unless incremental mode is disabled/reset for at least the next turn.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
本次 PR 修复了 Codex WebSocket 会话在额度耗尽后不切换账号的问题。主要通过两个机制实现:1. 当会话的认证信息变化时,强制重新建立上游连接。2. 收到额度相关的错误时,解除会话的认证信息锁定并关闭旧的连接。这些改动实现良好,并包含了相应的回归测试。审查中发现一处新测试代码可以改进,以增强其健壮性。
I am having trouble creating individual review comments. Click here to see my feedback.
internal/runtime/executor/codex_websockets_executor_test.go (269-278)
为了提高测试的稳定性和效率,建议使用 channel 来同步测试中的并发操作,而不是 time.Sleep 和轮询。通过在服务器处理程序中发送信号,测试可以确定地等待握手完成,从而避免潜在的竞态条件和不必要的延迟。
例如,您可以创建一个 channel,并在服务器每次收到授权时向其发送一个值。然后,测试可以等待从该 channel 接收到预期数量的值。
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 32e8d1dff7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| sess.connMu.Unlock() | ||
| if conn != nil && currentAuthID != authID { | ||
| // 账号切换时先断开旧连接避免继续复用旧账号 | ||
| e.invalidateUpstreamConn(sess, conn, "auth_switched", nil) |
There was a problem hiding this comment.
Avoid tearing down active reader channel on auth switch
Calling invalidateUpstreamConn during auth changes closes the old socket immediately, but the old readUpstreamLoop can still wake later and run its error path, which unconditionally clearActive/closes whatever channel is currently active for the session. If a new request has already installed its own activeCh, that channel gets closed and readCodexWebsocketMessage returns session read channel closed, aborting a healthy post-switch request. This race is specific to in-session auth switching and can break the new quota-recovery flow intermittently.
Useful? React with 👍 / 👎.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review
Overall: Solid fix for a real production issue — quota-exhausted accounts getting stuck in WebSocket session pins. The approach of unpinning on 429/402/403 and invalidating stale auth connections is correct.
Observations
1. shouldResetResponsesWebsocketAuthPin — good, but consider 401
The function handles 429 (rate limit), 403 (forbidden), and 402 (payment required). Consider whether 401 (unauthorized) should also trigger an unpin — a revoked or expired key would otherwise keep the session stuck on a dead account.
2. Race condition window in ensureUpstreamConn
Between reading sess.conn / sess.authID and calling invalidateUpstreamConn, another goroutine could establish a new connection. The lock is released before the comparison happens:
sess.connMu.Lock()
conn := sess.conn
currentAuthID := strings.TrimSpace(sess.authID)
sess.connMu.Unlock()
if conn != nil && currentAuthID != authID {
e.invalidateUpstreamConn(sess, conn, "auth_switched", nil)If invalidateUpstreamConn does its own locking internally this is probably fine, but worth verifying it handles the "conn already replaced" case gracefully.
3. CloseExecutionSession nil check
if h != nil && h.AuthManager != nil {
h.AuthManager.CloseExecutionSession(passthroughSessionID)
}The h != nil guard is redundant — you are inside a method on h. Minor, but removing it makes the intent clearer.
4. Test coverage
The test for forwardResponsesWebsocket now validates statusCode return — good. Would also be worth adding a test case that sends a 429 error through the channel and asserts statusCode == 429 to directly validate the new return value propagation.
5. Chinese comments
The inline comments (e.g. // 账号切换时先断开旧连接避免继续复用旧账号) are clear and accurate. Consider adding English translations for broader contributor accessibility, or at minimum keeping them bilingual.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9b9f77ed8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // 限额错误后解除 pin 让后续请求重新选可用账号 | ||
| pinnedAuthID = "" | ||
| // 切号恢复阶段先禁用增量模式避免沿用旧账号 response id | ||
| forceDisableIncrementalAfterAuthReset = true |
There was a problem hiding this comment.
Preserve transcript when forcing non-incremental recovery
Setting forceDisableIncrementalAfterAuthReset = true after quota-class errors makes the next turn fall back to normalizeResponseSubsequentRequest's non-incremental merge path, but the in-memory state is still incremental-form state (lastRequest was already replaced by the previous incremental payload and failed turns leave lastResponseOutput as []). In websocket v2 sessions that send only delta input with previous_response_id, this causes the post-switch request to be built from a truncated history (often just the last user delta), so recovery succeeds but silently drops earlier conversation context.
Useful? React with 👍 / 👎.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review: WebSocket Account Switching on Quota Exhaustion
This tackles a real production pain point: when a Codex account hits its rate limit mid-session, the WebSocket connection needs to seamlessly switch to another account. The approach is methodical, but there are a few areas worth scrutinizing.
✅ Strengths
isCurrentConn()guard pattern is clean: prevents stale read loops from poisoning new connections- The
forceDisableIncrementalAfterAuthResetflag correctly prevents staleprevious_response_idreferences from leaking across accounts - Shadow snapshot logic maintains a full transcript for recovery while sending only deltas on the wire
- Good test coverage update for the new return signature of
forwardResponsesWebsocket
⚠️ Concerns
-
Race condition in
readUpstreamLoop: TheisCurrentConn(conn)check and subsequentactiveMu.Lock()are not atomic. Between the check returning true and acquiring the lock,invalidateUpstreamConncould run from another goroutine, swapping the connection. The window is narrow but under high load with fast quota exhaustion, it could lead to a message being dispatched to a channel that's about to be closed. Consider holdingconnMuthrough the dispatch, or documenting why the race is benign. -
Shadow snapshot error handling: When
normalizeResponsesWebsocketRequestWithModefails for the shadow snapshot, the code falls back tolastRequestwith a warning log. But this means after the fallback, the next request's incremental diff will be computed against a potentially stale base, which could cause a desync. Is there a mechanism to force a full re-send after this failure path? -
shouldResetResponsesWebsocketAuthPinscope: The function triggers on 429/403/402. 403 can also mean "invalid API key" or permission issues unrelated to quota. Resetting the auth pin on a permanent 403 could cause an infinite loop of account switches if the issue is systemic. Consider checking the response body for quota-specific error codes rather than just HTTP status. -
Missing test for the auth-switch path: The test update validates the new return signature, but there's no test for the actual account-switching flow (e.g., simulating a 429, verifying the pin is cleared, verifying incremental mode is disabled, and verifying it re-enables after success). This is the most complex new behavior and would benefit from dedicated test coverage.
-
authID = strings.TrimSpace(authID)placement: This is done after the nil check onsess, which means the first code path (sess == nil) uses the untrimmed authID when callingdialCodexWebsocket. If upstream ever sends whitespace-padded auth IDs, the initial dial and subsequent reconnects would use different IDs. Minor, but worth aligning.
Overall
The core design is sound. The main risk is the auth-switch path lacking dedicated test coverage for what is effectively a state machine transition (normal → quota-exhausted → pin-cleared → incremental-disabled → new-account-succeeds → incremental-re-enabled). A table-driven test walking through these states would significantly reduce regression risk.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review
Overall: Solid fix for WebSocket session account switching on quota exhaustion. The approach of detecting auth ID changes, invalidating stale connections, maintaining full transcript snapshots for recovery, and disabling incremental mode during transitions is well thought through.
Strengths
-
isCurrentConnguard — Nice defensive pattern. The read loop correctly checks whether the connection it is reading from is still the active one before dispatching messages. This prevents stale connection data from corrupting new sessions. -
Auth reset flow — The
shouldResetResponsesWebsocketAuthPinfunction cleanly maps HTTP 429/403/402 to auth pin resets. TheforceDisableIncrementalAfterAuthResetflag prevents using staleprevious_response_idfrom the old account. Re-enabling incremental mode after the first successful response on the new account is correct. -
Shadow snapshot —
shouldBuildResponsesWebsocketFullSnapshot+ the shadow normalization ensureslastRequestalways holds the full conversation transcript, even when only deltas are sent upstream. This is critical for recovery after account switching.
Concerns
-
Race condition in
ensureUpstreamConn— The auth ID comparison and invalidation happen outside theconnMulock:sess.connMu.Lock() conn := sess.conn readerConn := sess.readerConn currentAuthID := strings.TrimSpace(sess.authID) sess.connMu.Unlock() if conn != nil && currentAuthID != authID { e.invalidateUpstreamConn(sess, conn, "auth_switched", nil)
Between releasing the lock and calling
invalidateUpstreamConn, another goroutine could modifysess.conn. IfinvalidateUpstreamConnchecks identity internally, this may be fine, but worth verifying there is no window for a double-invalidation or use-after-close. -
forwardResponsesWebsocketreturn signature change — The function now returns([]byte, int, error)instead of([]byte, error). Every callsite needs updating. The test is updated, but if there are any other callers in the codebase not touched by this PR, they would break at compile time (Go would catch this). Just confirming all callers are covered. -
Shadow snapshot error handling — When the shadow snapshot fails, the code keeps
lastRequestas the previous snapshot and logs a warning. This is the right fallback, but the log line usesshadowErr.StatusCodeandshadowErr.Error— are these always populated? IfshadowErris a custom error type, make sure the zero values are meaningful. -
No test for the auth-switch path — The existing test validates that
forwardResponsesWebsocketreturns the new status code. But there is no test that exercises the full account-switch flow (quota exhaustion → auth pin reset → incremental disable → successful retry → incremental re-enable). Consider adding an integration/unit test for this lifecycle.
Minor
- Chinese comments are fine for internal teams but may want English translations if this is an open-source PR.
Good work on a tricky concurrency problem. The defensive isCurrentConn checks throughout the read loop are the right approach.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review
Overall: Solid fix for a real problem — stale auth pins after quota exhaustion causing repeated 429s. The approach of detecting rate-limit/forbidden/payment-required status codes and resetting the auth pin is correct.
Positives
- The
isCurrentConn()guard in the read loop is well-placed — prevents stale goroutines from poisoning active channels after auth switch. - Shadow snapshot approach (
nextSessionRequestSnapshot) maintains full transcript for recovery while still sending deltas in incremental mode. Good separation. - Test coverage for the
forwardResponsesWebsocketsignature change.
Suggestions
-
Race window in
isCurrentConn: The check-then-act pattern (if !sess.isCurrentConn(conn) { return }→ then proceed to usesess.activeCh) has a small window where the conn could be swapped between the check and the lock acquisition onactiveMu. In practice the stale goroutine will just silently return on the next iteration, so this is likely fine — but worth a comment noting the benign race. -
authIDtrimming inconsistency:authID = strings.TrimSpace(authID)is done after the nil-session early return, but the storedsess.authIDis trimmed at comparison time. Ifsess.authIDis set elsewhere without trimming, there could be a mismatch. Consider trimming at write time instead (single point of truth). -
Missing test for
shouldResetResponsesWebsocketAuthPin: The new helper checks 429, 403, 402 — but there are no unit tests for it. A small table-driven test would prevent future regressions if someone adds/removes status codes. -
forceDisableIncrementalAfterAuthResetreset condition: It only resets whenterminalStatus == 0. What happens if the next request succeeds with a 200 but the stream closes normally (status 0)? Seems correct, but the variable name suggests it should reset on any success, which could be confusing. A comment clarifying "status 0 means successful completion" would help.
No blocking issues. Nice fix.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review: Solid fix for WebSocket session reuse after auth quota exhaustion. Key observations:
-
isCurrentConn guard - Good pattern for preventing stale reader loops from interfering with new connections. The multiple check points in readUpstreamLoop (before read, on error, on binary, before dispatch) provide defense in depth.
-
Auth switch detection in ensureUpstreamConn - Comparing trimmed authIDs and invalidating the old connection on mismatch is correct. The TrimSpace normalization prevents subtle whitespace mismatches.
-
forceDisableIncrementalAfterAuthReset flag - Smart approach to prevent the new account from trying to reference a previous_response_id that belongs to the old account's session. Re-enabling after a successful round-trip (terminalStatus == 0) is the right recovery strategy.
-
Shadow snapshot for full transcript - The dual-path normalization (incremental for the wire, full snapshot for recovery) is well thought out. The fallback to lastRequest on shadow error prevents state corruption.
-
shouldResetResponsesWebsocketAuthPin - Covering 429, 403, and 402 is appropriate. Consider whether 401 (Unauthorized) should also trigger a reset, since expired tokens could present similarly to quota exhaustion.
-
Test coverage gap - The forwardResponsesWebsocket signature change is tested for the success path but there are no tests for the auth-reset flow (verifying pinnedAuthID gets cleared, incremental mode disabled/re-enabled). These state transitions are the core of the fix and would benefit from dedicated test cases.
-
Race consideration - isCurrentConn acquires connMu to compare, but between the check and the subsequent activeMu acquisition in readUpstreamLoop, the connection could theoretically change. The window is tiny and the worst case is a single stale message delivery, which seems acceptable.
Clean, well-commented PR. The test gap is the main item worth addressing.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42acfa729d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if !current { | ||
| // 旧连接读错时不触碰当前活跃通道 | ||
| return | ||
| } |
There was a problem hiding this comment.
Notify active reader when old conn is force-closed
Do not return immediately on !current here, because closeExecutionSession sets sess.conn = nil before closing the socket; when that close unblocks ReadMessage, this branch exits without sending an error to the active channel. In that case readCodexWebsocketMessage keeps waiting on readCh for the in-flight request until its context is canceled, so explicit session shutdown (for example CloseExecutionSession/executor replacement paths) can hang requests instead of failing fast.
Useful? React with 👍 / 👎.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review
Overall: Solid fix for a real race condition when Codex WebSocket sessions switch accounts after quota exhaustion. The approach is sound.
Positives
activeSnapshotForCurrentConnandclearActiveForCurrentConncorrectly scope channel operations to the owning connection, preventing stale read loops from poisoning new sessions.- The
forceDisableIncrementalAfterAuthResetflag is a good defensive measure to avoid sendingprevious_response_idfrom an account that no longer owns the upstream session. - Shadow snapshot logic (
shouldBuildResponsesWebsocketFullSnapshot) preserves full transcript for recovery while still sending deltas upstream.
Issues
-
Lock ordering concern in
activeSnapshotForCurrentConn(line ~136): You acquireconnMuthenactiveMu. InclearActiveForCurrentConnthe same order is used, which is correct. However, the existingclearActiveandsetActivemethods only useactiveMu. If any code path callsclearActivewhile another goroutine holdsconnMu, you could get a subtle deadlock. Recommend adding a comment documenting the lock ordering invariant (connMu→activeMu, never reversed), and auditing existing callers to ensure the originalclearActive/setActiveare never called underconnMu. -
authIDtrimming inconsistency:authID = strings.TrimSpace(authID)is done inline inensureUpstreamConn, butsess.authIDis trimmed viacurrentAuthID := strings.TrimSpace(sess.authID). Ifsess.authIDwas set untrimmed duringdialCodexWebsocket, subsequent comparisons work only because you trim both sides. Cleaner to normalize at write time (when settingsess.authID) so all downstream reads are safe. -
Missing test coverage for account switching path: The existing test only validates
forwardResponsesWebsocketreturn values. The core behavior change (auth switch → connection invalidation → incremental mode disable → re-enable after success) has no integration test. Even a table-driven unit test that sequences request/response events through the WebSocket handler would catch regressions here. -
forceDisableIncrementalAfterAuthResetreset condition (line ~237): It only resets whenterminalStatus == 0. If the new account returns a non-zero but non-auth-related error (e.g., 500), incremental mode stays disabled permanently for the session. Consider resetting on any non-auth-error status, or after a successful round-trip regardless of intermediate failures.
Minor
- Chinese comments are fine for the team, but consider adding English summaries for the key invariants (lock ordering, auth-pin lifecycle) since external contributors also touch this file.
Good work on a tricky concurrency fix. The points above are mostly hardening suggestions.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e9a98a3838
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // lastRequest 始终保存完整 transcript 快照 | ||
| lastRequest = nextSessionRequestSnapshot |
There was a problem hiding this comment.
Defer snapshot advance until turn succeeds
Updating lastRequest before the upstream turn finishes means failed turns are treated as committed history. After a quota-class failure, this code forces the next request into non-incremental merge mode; if the client retries the same previous_response_id+input, normalizeResponseSubsequentRequest will merge that input on top of an already-advanced snapshot, duplicating the retried user/tool items in the reconstructed transcript. This is specific to the new auth-reset recovery path and can silently alter model behavior after retries.
Useful? React with 👍 / 👎.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review: Codex WebSocket Account Switching Fix
Good fix for a real race condition issue. The core problem — WebSocket sessions continuing to use exhausted account connections — is correctly addressed. A few observations:
Positives
- Lock ordering discipline (
connMu → activeMu) is consistently applied via the newactiveSnapshotForCurrentConn/clearActiveForCurrentConnhelpers. This eliminates the previous race between stale connection readers and new requests. trySendCodexWebsocketRead/tryCloseCodexWebsocketReadwithrecover()guards are a sound defensive pattern for concurrent channel operations.failActiveForSessionCloseproperly unblocks hanging reads when a session is explicitly closed during account switch.- Shadow snapshot approach in the WebSocket handler is clever — maintaining a full request snapshot alongside incremental deltas means account switches can rebuild context without losing conversation history.
- Test coverage is solid:
TestCloseExecutionSessionUnblocksActiveReadandTestReadCodexWebsocketMessageReturnsWhenReadChannelCloseddirectly exercise the failure paths, andTestShouldResetResponsesWebsocketAuthPincovers the status code boundary.
Issues to Consider
-
authIDtrimming inconsistency (line ~1173):ensureUpstreamConntrimsauthIDon entry, butsess.authIDis stored un-trimmed elsewhere. If any caller stores an authID with whitespace, thecurrentAuthID != authIDcomparison will trigger unnecessary reconnections even without an actual account switch. Consider trimming at storage time rather than comparison time. -
shouldResetResponsesWebsocketAuthPinscope: The function resets on 429, 403, and 402. A 403 could be a legitimate permission error on a specific resource rather than quota exhaustion. This could cause unnecessary account rotation for non-quota 403s. Consider logging the specific error reason to distinguish. -
forceDisableIncrementalAfterAuthResetstate machine: The flag is reset after one successful round (terminalStatus == 0). If the new account also fails (say a transient 500),forceDisableIncrementalAfterAuthResetstays true, which is correct. But if the new account hits a 403 for a different reason, it triggers another reset cycle. This could cascade through accounts quickly. Worth adding a cycle counter or backoff? -
Minor: Comments are in Chinese, which is fine for the contributor's team, but could be a maintenance concern if the broader team reads this code. Consider bilingual comments or English-only per project convention.
Overall this is a well-structured fix for a tricky concurrency problem. The test coverage for the failure paths is the strongest part.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review: Codex WebSocket Account Switching Fix
Overall: Solid fix for a real race condition. The lock ordering discipline (connMu then activeMu) and connection-scoped channel operations are the right approach.
Bugs / Race Conditions:
-
failActiveForSessionClose does not verify connection ownership. Unlike clearActiveForCurrentConn and activeSnapshotForCurrentConn, this method only takes activeMu and does not check if conn still matches s.conn. If a new connection is established between the session close being initiated and failActiveForSessionClose running, it could send an error to the new connection active channel. Consider adding the same connMu then activeMu lock ordering and conn check here.
-
trySendCodexWebsocketRead silently swallows panics via recover(). The default case in the select means if ch is full, the event is dropped silently. Combined with recover(), a closed-channel write is also silently eaten. A log line on recover() would help debug production issues where events mysteriously vanish.
Suggestions:
-
Shadow snapshot fallback logging in openai_responses_websocket.go: The log.Warnf on shadow snapshot failure is good, but consider log.Errorf since a failed snapshot means the session will send a stale full context on next account switch.
-
forceDisableIncrementalAfterAuthReset resets to false only on terminalStatus == 0. If the new account also hits a non-zero terminal status, incremental mode stays disabled indefinitely. Worth a comment explaining why this is acceptable.
Tests: Good coverage of CloseExecutionSession unblocking and closed-channel read. Consider adding a test for the auth-switch reconnection path in ensureUpstreamConn.
xkonjin
left a comment
There was a problem hiding this comment.
Code Review — Codex WebSocket Account Switching Fix
Good fix for a real race condition. The consistent lock ordering (connMu → activeMu) and the activeSnapshotForCurrentConn pattern are solid improvements. A few observations:
🐛 Potential Issues
-
trySendCodexWebsocketRead/tryCloseCodexWebsocketReadsilently swallow panics viarecover()— This masks bugs. Ifclose(ch)panics because the channel was already closed, that indicates a double-close race that should be fixed at the call site, not silenced. Consider logging the recovered panic at minimum, or restructuring so double-close cannot happen. -
failActiveForSessionClosedoes not holdconnMu— Unlike the other new methods that enforceconnMu → activeMuordering, this one only locksactiveMu. If another goroutine is doingensureUpstreamConnconcurrently (which holdsconnMu), the session could have its active channel fail-notified and then immediately re-assigned. This seems intentional (session is closing), but worth a comment explaining why the ordering is safe here. -
shouldResetResponsesWebsocketAuthPindoes not include 401 Unauthorized — The function resets on 429, 403, and 402, but not 401. If an auth token is revoked mid-session, the pin would stick to a dead credential. Was 401 intentionally excluded?
🔒 Thread Safety
isCurrentConntakesconnMubut returns a bool that can become stale — BetweenisCurrentConnreturning true and the caller acting on it, the connection could change. TheactiveSnapshotForCurrentConnmethod handles this correctly by keeping both locks, but the standaloneisCurrentConnin the read loop top could race. Low risk since the worst case is one extra loop iteration before exit, but worth noting.
✅ What looks good
- Shadow snapshot (
nextSessionRequestSnapshot) only commits on success — prevents failed rounds from corrupting session state forceDisableIncrementalAfterAuthResetprevents staleprevious_response_idafter account switch- Tests cover the channel-close and session-close fast-fail paths
- Consistent Chinese comments documenting lock ordering intent
📝 Test coverage gap
- No test for the auth-switch path in
ensureUpstreamConn(wherecurrentAuthID != authIDtriggers invalidation). This is the core scenario the PR fixes — would be good to add a test that verifies a connection is torn down when the auth ID changes.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7e7cc1a4cc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if ch != nil { | ||
| // 会话关闭时允许主动 fail active 唤醒在途 readCodexWebsocketMessage | ||
| trySendCodexWebsocketRead(ch, done, codexWebsocketRead{conn: conn, err: fmt.Errorf("codex websockets executor: execution session closed")}) | ||
| tryCloseCodexWebsocketRead(ch) |
There was a problem hiding this comment.
Avoid closing active read channel during session teardown
closeExecutionSession now closes activeCh directly, but readUpstreamLoop still forwards text frames with a plain ch <- ... send. If CloseExecutionSession/closeAllExecutionSessions runs while a stream is in flight, that concurrent close(ch) can race with the sender and trigger a send on closed channel panic, crashing the process instead of failing the request cleanly.
Useful? React with 👍 / 👎.
问题
修复
验证
关联