Skip to content

fix: 修复 Codex WebSocket 会话在额度耗尽后不切换账号#2256

Open
jacob-ayang wants to merge 8 commits intorouter-for-me:mainfrom
jacob-ayang:fix/issue-2230-codex-ws-auth-switch
Open

fix: 修复 Codex WebSocket 会话在额度耗尽后不切换账号#2256
jacob-ayang wants to merge 8 commits intorouter-for-me:mainfrom
jacob-ayang:fix/issue-2230-codex-ws-auth-switch

Conversation

@jacob-ayang
Copy link

问题

  • Codex WebSocket 会话中账号被固定
  • 账号额度耗尽后后续请求仍反复命中同一账号

修复

  • 限额类错误后解除会话 pin 并关闭旧 execution session
  • execution session 内 auth 变化时强制重建上游 websocket 连接

验证

  • 新增会话内 429 后切号回归测试
  • 新增 auth 切换触发重连回归测试
  • 相关包测试通过

关联

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

此拉取请求旨在增强 Codex WebSocket 服务的健壮性,解决了一个关键问题:当某个账号的额度耗尽时,WebSocket 会话会持续尝试使用该账号,导致服务中断。通过引入在限额错误后解除账号绑定并关闭旧会话的机制,以及在认证信息变更时强制重建连接的逻辑,确保了服务能够动态切换到可用的账号,从而提高了服务的可靠性和用户体验。

Highlights

  • 修复 Codex WebSocket 会话账号固定问题: 解决了 Codex WebSocket 会话中账号被固定,导致额度耗尽后仍反复命中同一账号的问题。
  • 限额错误后解除会话 Pin: 在遇到限额类错误(如 429 Too Many Requests)后,解除会话的账号 Pin 并关闭旧的执行会话,以便后续请求可以重新选择可用账号。
  • Auth 变化时强制重建上游连接: 当执行会话内的认证信息发生变化时,强制断开并重建上游 WebSocket 连接,确保使用最新的认证信息。
  • 新增回归测试: 增加了新的回归测试用例,验证在会话内遇到 429 错误后能够正确切换账号,以及在认证信息切换时能够触发重连。

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc5cb31186

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +202 to +205
if shouldResetResponsesWebsocketAuthPin(terminalStatus) {
// 限额错误后解除 pin 让后续请求重新选可用账号
pinnedAuthID = ""
if h != nil && h.AuthManager != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reset incremental state when unpinning after quota errors

After a quota-class terminal error this block clears pinnedAuthID and closes the execution session, but it leaves request state unchanged; on the next turn, allowIncrementalInputWithPreviousResponseID can still be true (model-level), so normalizeResponseSubsequentRequest may forward a client previous_response_id that belongs to the old account to the newly selected account. In websocket v2 flows that send previous_response_id, this causes the post-switch request to fail with upstream 4xx instead of recovering, so the auth-rotation fix is incomplete unless incremental mode is disabled/reset for at least the next turn.

Useful? React with 👍 / 👎.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

本次 PR 修复了 Codex WebSocket 会话在额度耗尽后不切换账号的问题。主要通过两个机制实现:1. 当会话的认证信息变化时,强制重新建立上游连接。2. 收到额度相关的错误时,解除会话的认证信息锁定并关闭旧的连接。这些改动实现良好,并包含了相应的回归测试。审查中发现一处新测试代码可以改进,以增强其健壮性。

I am having trouble creating individual review comments. Click here to see my feedback.

internal/runtime/executor/codex_websockets_executor_test.go (269-278)

medium

为了提高测试的稳定性和效率,建议使用 channel 来同步测试中的并发操作,而不是 time.Sleep 和轮询。通过在服务器处理程序中发送信号,测试可以确定地等待握手完成,从而避免潜在的竞态条件和不必要的延迟。

例如,您可以创建一个 channel,并在服务器每次收到授权时向其发送一个值。然后,测试可以等待从该 channel 接收到预期数量的值。

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32e8d1dff7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

sess.connMu.Unlock()
if conn != nil && currentAuthID != authID {
// 账号切换时先断开旧连接避免继续复用旧账号
e.invalidateUpstreamConn(sess, conn, "auth_switched", nil)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid tearing down active reader channel on auth switch

Calling invalidateUpstreamConn during auth changes closes the old socket immediately, but the old readUpstreamLoop can still wake later and run its error path, which unconditionally clearActive/closes whatever channel is currently active for the session. If a new request has already installed its own activeCh, that channel gets closed and readCodexWebsocketMessage returns session read channel closed, aborting a healthy post-switch request. This race is specific to in-session auth switching and can break the new quota-recovery flow intermittently.

Useful? React with 👍 / 👎.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overall: Solid fix for a real production issue — quota-exhausted accounts getting stuck in WebSocket session pins. The approach of unpinning on 429/402/403 and invalidating stale auth connections is correct.

Observations

1. shouldResetResponsesWebsocketAuthPin — good, but consider 401
The function handles 429 (rate limit), 403 (forbidden), and 402 (payment required). Consider whether 401 (unauthorized) should also trigger an unpin — a revoked or expired key would otherwise keep the session stuck on a dead account.

2. Race condition window in ensureUpstreamConn
Between reading sess.conn / sess.authID and calling invalidateUpstreamConn, another goroutine could establish a new connection. The lock is released before the comparison happens:

sess.connMu.Lock()
conn := sess.conn
currentAuthID := strings.TrimSpace(sess.authID)
sess.connMu.Unlock()
if conn != nil && currentAuthID != authID {
    e.invalidateUpstreamConn(sess, conn, "auth_switched", nil)

If invalidateUpstreamConn does its own locking internally this is probably fine, but worth verifying it handles the "conn already replaced" case gracefully.

3. CloseExecutionSession nil check

if h != nil && h.AuthManager != nil {
    h.AuthManager.CloseExecutionSession(passthroughSessionID)
}

The h != nil guard is redundant — you are inside a method on h. Minor, but removing it makes the intent clearer.

4. Test coverage
The test for forwardResponsesWebsocket now validates statusCode return — good. Would also be worth adding a test case that sends a 429 error through the channel and asserts statusCode == 429 to directly validate the new return value propagation.

5. Chinese comments
The inline comments (e.g. // 账号切换时先断开旧连接避免继续复用旧账号) are clear and accurate. Consider adding English translations for broader contributor accessibility, or at minimum keeping them bilingual.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9b9f77ed8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// 限额错误后解除 pin 让后续请求重新选可用账号
pinnedAuthID = ""
// 切号恢复阶段先禁用增量模式避免沿用旧账号 response id
forceDisableIncrementalAfterAuthReset = true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve transcript when forcing non-incremental recovery

Setting forceDisableIncrementalAfterAuthReset = true after quota-class errors makes the next turn fall back to normalizeResponseSubsequentRequest's non-incremental merge path, but the in-memory state is still incremental-form state (lastRequest was already replaced by the previous incremental payload and failed turns leave lastResponseOutput as []). In websocket v2 sessions that send only delta input with previous_response_id, this causes the post-switch request to be built from a truncated history (often just the last user delta), so recovery succeeds but silently drops earlier conversation context.

Useful? React with 👍 / 👎.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: WebSocket Account Switching on Quota Exhaustion

This tackles a real production pain point: when a Codex account hits its rate limit mid-session, the WebSocket connection needs to seamlessly switch to another account. The approach is methodical, but there are a few areas worth scrutinizing.

✅ Strengths

  • isCurrentConn() guard pattern is clean: prevents stale read loops from poisoning new connections
  • The forceDisableIncrementalAfterAuthReset flag correctly prevents stale previous_response_id references from leaking across accounts
  • Shadow snapshot logic maintains a full transcript for recovery while sending only deltas on the wire
  • Good test coverage update for the new return signature of forwardResponsesWebsocket

⚠️ Concerns

  1. Race condition in readUpstreamLoop: The isCurrentConn(conn) check and subsequent activeMu.Lock() are not atomic. Between the check returning true and acquiring the lock, invalidateUpstreamConn could run from another goroutine, swapping the connection. The window is narrow but under high load with fast quota exhaustion, it could lead to a message being dispatched to a channel that's about to be closed. Consider holding connMu through the dispatch, or documenting why the race is benign.

  2. Shadow snapshot error handling: When normalizeResponsesWebsocketRequestWithMode fails for the shadow snapshot, the code falls back to lastRequest with a warning log. But this means after the fallback, the next request's incremental diff will be computed against a potentially stale base, which could cause a desync. Is there a mechanism to force a full re-send after this failure path?

  3. shouldResetResponsesWebsocketAuthPin scope: The function triggers on 429/403/402. 403 can also mean "invalid API key" or permission issues unrelated to quota. Resetting the auth pin on a permanent 403 could cause an infinite loop of account switches if the issue is systemic. Consider checking the response body for quota-specific error codes rather than just HTTP status.

  4. Missing test for the auth-switch path: The test update validates the new return signature, but there's no test for the actual account-switching flow (e.g., simulating a 429, verifying the pin is cleared, verifying incremental mode is disabled, and verifying it re-enables after success). This is the most complex new behavior and would benefit from dedicated test coverage.

  5. authID = strings.TrimSpace(authID) placement: This is done after the nil check on sess, which means the first code path (sess == nil) uses the untrimmed authID when calling dialCodexWebsocket. If upstream ever sends whitespace-padded auth IDs, the initial dial and subsequent reconnects would use different IDs. Minor, but worth aligning.

Overall

The core design is sound. The main risk is the auth-switch path lacking dedicated test coverage for what is effectively a state machine transition (normal → quota-exhausted → pin-cleared → incremental-disabled → new-account-succeeds → incremental-re-enabled). A table-driven test walking through these states would significantly reduce regression risk.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overall: Solid fix for WebSocket session account switching on quota exhaustion. The approach of detecting auth ID changes, invalidating stale connections, maintaining full transcript snapshots for recovery, and disabling incremental mode during transitions is well thought through.

Strengths

  1. isCurrentConn guard — Nice defensive pattern. The read loop correctly checks whether the connection it is reading from is still the active one before dispatching messages. This prevents stale connection data from corrupting new sessions.

  2. Auth reset flow — The shouldResetResponsesWebsocketAuthPin function cleanly maps HTTP 429/403/402 to auth pin resets. The forceDisableIncrementalAfterAuthReset flag prevents using stale previous_response_id from the old account. Re-enabling incremental mode after the first successful response on the new account is correct.

  3. Shadow snapshotshouldBuildResponsesWebsocketFullSnapshot + the shadow normalization ensures lastRequest always holds the full conversation transcript, even when only deltas are sent upstream. This is critical for recovery after account switching.

Concerns

  1. Race condition in ensureUpstreamConn — The auth ID comparison and invalidation happen outside the connMu lock:

    sess.connMu.Lock()
    conn := sess.conn
    readerConn := sess.readerConn
    currentAuthID := strings.TrimSpace(sess.authID)
    sess.connMu.Unlock()
    if conn != nil && currentAuthID != authID {
        e.invalidateUpstreamConn(sess, conn, "auth_switched", nil)

    Between releasing the lock and calling invalidateUpstreamConn, another goroutine could modify sess.conn. If invalidateUpstreamConn checks identity internally, this may be fine, but worth verifying there is no window for a double-invalidation or use-after-close.

  2. forwardResponsesWebsocket return signature change — The function now returns ([]byte, int, error) instead of ([]byte, error). Every callsite needs updating. The test is updated, but if there are any other callers in the codebase not touched by this PR, they would break at compile time (Go would catch this). Just confirming all callers are covered.

  3. Shadow snapshot error handling — When the shadow snapshot fails, the code keeps lastRequest as the previous snapshot and logs a warning. This is the right fallback, but the log line uses shadowErr.StatusCode and shadowErr.Error — are these always populated? If shadowErr is a custom error type, make sure the zero values are meaningful.

  4. No test for the auth-switch path — The existing test validates that forwardResponsesWebsocket returns the new status code. But there is no test that exercises the full account-switch flow (quota exhaustion → auth pin reset → incremental disable → successful retry → incremental re-enable). Consider adding an integration/unit test for this lifecycle.

Minor

  • Chinese comments are fine for internal teams but may want English translations if this is an open-source PR.

Good work on a tricky concurrency problem. The defensive isCurrentConn checks throughout the read loop are the right approach.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overall: Solid fix for a real problem — stale auth pins after quota exhaustion causing repeated 429s. The approach of detecting rate-limit/forbidden/payment-required status codes and resetting the auth pin is correct.

Positives

  • The isCurrentConn() guard in the read loop is well-placed — prevents stale goroutines from poisoning active channels after auth switch.
  • Shadow snapshot approach (nextSessionRequestSnapshot) maintains full transcript for recovery while still sending deltas in incremental mode. Good separation.
  • Test coverage for the forwardResponsesWebsocket signature change.

Suggestions

  1. Race window in isCurrentConn: The check-then-act pattern (if !sess.isCurrentConn(conn) { return } → then proceed to use sess.activeCh) has a small window where the conn could be swapped between the check and the lock acquisition on activeMu. In practice the stale goroutine will just silently return on the next iteration, so this is likely fine — but worth a comment noting the benign race.

  2. authID trimming inconsistency: authID = strings.TrimSpace(authID) is done after the nil-session early return, but the stored sess.authID is trimmed at comparison time. If sess.authID is set elsewhere without trimming, there could be a mismatch. Consider trimming at write time instead (single point of truth).

  3. Missing test for shouldResetResponsesWebsocketAuthPin: The new helper checks 429, 403, 402 — but there are no unit tests for it. A small table-driven test would prevent future regressions if someone adds/removes status codes.

  4. forceDisableIncrementalAfterAuthReset reset condition: It only resets when terminalStatus == 0. What happens if the next request succeeds with a 200 but the stream closes normally (status 0)? Seems correct, but the variable name suggests it should reset on any success, which could be confusing. A comment clarifying "status 0 means successful completion" would help.

No blocking issues. Nice fix.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Solid fix for WebSocket session reuse after auth quota exhaustion. Key observations:

  1. isCurrentConn guard - Good pattern for preventing stale reader loops from interfering with new connections. The multiple check points in readUpstreamLoop (before read, on error, on binary, before dispatch) provide defense in depth.

  2. Auth switch detection in ensureUpstreamConn - Comparing trimmed authIDs and invalidating the old connection on mismatch is correct. The TrimSpace normalization prevents subtle whitespace mismatches.

  3. forceDisableIncrementalAfterAuthReset flag - Smart approach to prevent the new account from trying to reference a previous_response_id that belongs to the old account's session. Re-enabling after a successful round-trip (terminalStatus == 0) is the right recovery strategy.

  4. Shadow snapshot for full transcript - The dual-path normalization (incremental for the wire, full snapshot for recovery) is well thought out. The fallback to lastRequest on shadow error prevents state corruption.

  5. shouldResetResponsesWebsocketAuthPin - Covering 429, 403, and 402 is appropriate. Consider whether 401 (Unauthorized) should also trigger a reset, since expired tokens could present similarly to quota exhaustion.

  6. Test coverage gap - The forwardResponsesWebsocket signature change is tested for the success path but there are no tests for the auth-reset flow (verifying pinnedAuthID gets cleared, incremental mode disabled/re-enabled). These state transitions are the core of the fix and would benefit from dedicated test cases.

  7. Race consideration - isCurrentConn acquires connMu to compare, but between the check and the subsequent activeMu acquisition in readUpstreamLoop, the connection could theoretically change. The window is tiny and the worst case is a single stale message delivery, which seems acceptable.

Clean, well-commented PR. The test gap is the main item worth addressing.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42acfa729d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1188 to +1191
if !current {
// 旧连接读错时不触碰当前活跃通道
return
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Notify active reader when old conn is force-closed

Do not return immediately on !current here, because closeExecutionSession sets sess.conn = nil before closing the socket; when that close unblocks ReadMessage, this branch exits without sending an error to the active channel. In that case readCodexWebsocketMessage keeps waiting on readCh for the in-flight request until its context is canceled, so explicit session shutdown (for example CloseExecutionSession/executor replacement paths) can hang requests instead of failing fast.

Useful? React with 👍 / 👎.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overall: Solid fix for a real race condition when Codex WebSocket sessions switch accounts after quota exhaustion. The approach is sound.

Positives

  • activeSnapshotForCurrentConn and clearActiveForCurrentConn correctly scope channel operations to the owning connection, preventing stale read loops from poisoning new sessions.
  • The forceDisableIncrementalAfterAuthReset flag is a good defensive measure to avoid sending previous_response_id from an account that no longer owns the upstream session.
  • Shadow snapshot logic (shouldBuildResponsesWebsocketFullSnapshot) preserves full transcript for recovery while still sending deltas upstream.

Issues

  1. Lock ordering concern in activeSnapshotForCurrentConn (line ~136): You acquire connMu then activeMu. In clearActiveForCurrentConn the same order is used, which is correct. However, the existing clearActive and setActive methods only use activeMu. If any code path calls clearActive while another goroutine holds connMu, you could get a subtle deadlock. Recommend adding a comment documenting the lock ordering invariant (connMuactiveMu, never reversed), and auditing existing callers to ensure the original clearActive/setActive are never called under connMu.

  2. authID trimming inconsistency: authID = strings.TrimSpace(authID) is done inline in ensureUpstreamConn, but sess.authID is trimmed via currentAuthID := strings.TrimSpace(sess.authID). If sess.authID was set untrimmed during dialCodexWebsocket, subsequent comparisons work only because you trim both sides. Cleaner to normalize at write time (when setting sess.authID) so all downstream reads are safe.

  3. Missing test coverage for account switching path: The existing test only validates forwardResponsesWebsocket return values. The core behavior change (auth switch → connection invalidation → incremental mode disable → re-enable after success) has no integration test. Even a table-driven unit test that sequences request/response events through the WebSocket handler would catch regressions here.

  4. forceDisableIncrementalAfterAuthReset reset condition (line ~237): It only resets when terminalStatus == 0. If the new account returns a non-zero but non-auth-related error (e.g., 500), incremental mode stays disabled permanently for the session. Consider resetting on any non-auth-error status, or after a successful round-trip regardless of intermediate failures.

Minor

  • Chinese comments are fine for the team, but consider adding English summaries for the key invariants (lock ordering, auth-pin lifecycle) since external contributors also touch this file.

Good work on a tricky concurrency fix. The points above are mostly hardening suggestions.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9a98a3838

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +195 to +196
// lastRequest 始终保存完整 transcript 快照
lastRequest = nextSessionRequestSnapshot

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Defer snapshot advance until turn succeeds

Updating lastRequest before the upstream turn finishes means failed turns are treated as committed history. After a quota-class failure, this code forces the next request into non-incremental merge mode; if the client retries the same previous_response_id+input, normalizeResponseSubsequentRequest will merge that input on top of an already-advanced snapshot, duplicating the retried user/tool items in the reconstructed transcript. This is specific to the new auth-reset recovery path and can silently alter model behavior after retries.

Useful? React with 👍 / 👎.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Codex WebSocket Account Switching Fix

Good fix for a real race condition issue. The core problem — WebSocket sessions continuing to use exhausted account connections — is correctly addressed. A few observations:

Positives

  • Lock ordering discipline (connMu → activeMu) is consistently applied via the new activeSnapshotForCurrentConn / clearActiveForCurrentConn helpers. This eliminates the previous race between stale connection readers and new requests.
  • trySendCodexWebsocketRead / tryCloseCodexWebsocketRead with recover() guards are a sound defensive pattern for concurrent channel operations.
  • failActiveForSessionClose properly unblocks hanging reads when a session is explicitly closed during account switch.
  • Shadow snapshot approach in the WebSocket handler is clever — maintaining a full request snapshot alongside incremental deltas means account switches can rebuild context without losing conversation history.
  • Test coverage is solid: TestCloseExecutionSessionUnblocksActiveRead and TestReadCodexWebsocketMessageReturnsWhenReadChannelClosed directly exercise the failure paths, and TestShouldResetResponsesWebsocketAuthPin covers the status code boundary.

Issues to Consider

  1. authID trimming inconsistency (line ~1173): ensureUpstreamConn trims authID on entry, but sess.authID is stored un-trimmed elsewhere. If any caller stores an authID with whitespace, the currentAuthID != authID comparison will trigger unnecessary reconnections even without an actual account switch. Consider trimming at storage time rather than comparison time.

  2. shouldResetResponsesWebsocketAuthPin scope: The function resets on 429, 403, and 402. A 403 could be a legitimate permission error on a specific resource rather than quota exhaustion. This could cause unnecessary account rotation for non-quota 403s. Consider logging the specific error reason to distinguish.

  3. forceDisableIncrementalAfterAuthReset state machine: The flag is reset after one successful round (terminalStatus == 0). If the new account also fails (say a transient 500), forceDisableIncrementalAfterAuthReset stays true, which is correct. But if the new account hits a 403 for a different reason, it triggers another reset cycle. This could cascade through accounts quickly. Worth adding a cycle counter or backoff?

  4. Minor: Comments are in Chinese, which is fine for the contributor's team, but could be a maintenance concern if the broader team reads this code. Consider bilingual comments or English-only per project convention.

Overall this is a well-structured fix for a tricky concurrency problem. The test coverage for the failure paths is the strongest part.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Codex WebSocket Account Switching Fix

Overall: Solid fix for a real race condition. The lock ordering discipline (connMu then activeMu) and connection-scoped channel operations are the right approach.

Bugs / Race Conditions:

  1. failActiveForSessionClose does not verify connection ownership. Unlike clearActiveForCurrentConn and activeSnapshotForCurrentConn, this method only takes activeMu and does not check if conn still matches s.conn. If a new connection is established between the session close being initiated and failActiveForSessionClose running, it could send an error to the new connection active channel. Consider adding the same connMu then activeMu lock ordering and conn check here.

  2. trySendCodexWebsocketRead silently swallows panics via recover(). The default case in the select means if ch is full, the event is dropped silently. Combined with recover(), a closed-channel write is also silently eaten. A log line on recover() would help debug production issues where events mysteriously vanish.

Suggestions:

  1. Shadow snapshot fallback logging in openai_responses_websocket.go: The log.Warnf on shadow snapshot failure is good, but consider log.Errorf since a failed snapshot means the session will send a stale full context on next account switch.

  2. forceDisableIncrementalAfterAuthReset resets to false only on terminalStatus == 0. If the new account also hits a non-zero terminal status, incremental mode stays disabled indefinitely. Worth a comment explaining why this is acceptable.

Tests: Good coverage of CloseExecutionSession unblocking and closed-channel read. Consider adding a test for the auth-switch reconnection path in ensureUpstreamConn.

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — Codex WebSocket Account Switching Fix

Good fix for a real race condition. The consistent lock ordering (connMu → activeMu) and the activeSnapshotForCurrentConn pattern are solid improvements. A few observations:

🐛 Potential Issues

  1. trySendCodexWebsocketRead / tryCloseCodexWebsocketRead silently swallow panics via recover() — This masks bugs. If close(ch) panics because the channel was already closed, that indicates a double-close race that should be fixed at the call site, not silenced. Consider logging the recovered panic at minimum, or restructuring so double-close cannot happen.

  2. failActiveForSessionClose does not hold connMu — Unlike the other new methods that enforce connMu → activeMu ordering, this one only locks activeMu. If another goroutine is doing ensureUpstreamConn concurrently (which holds connMu), the session could have its active channel fail-notified and then immediately re-assigned. This seems intentional (session is closing), but worth a comment explaining why the ordering is safe here.

  3. shouldResetResponsesWebsocketAuthPin does not include 401 Unauthorized — The function resets on 429, 403, and 402, but not 401. If an auth token is revoked mid-session, the pin would stick to a dead credential. Was 401 intentionally excluded?

🔒 Thread Safety

  1. isCurrentConn takes connMu but returns a bool that can become stale — Between isCurrentConn returning true and the caller acting on it, the connection could change. The activeSnapshotForCurrentConn method handles this correctly by keeping both locks, but the standalone isCurrentConn in the read loop top could race. Low risk since the worst case is one extra loop iteration before exit, but worth noting.

✅ What looks good

  • Shadow snapshot (nextSessionRequestSnapshot) only commits on success — prevents failed rounds from corrupting session state
  • forceDisableIncrementalAfterAuthReset prevents stale previous_response_id after account switch
  • Tests cover the channel-close and session-close fast-fail paths
  • Consistent Chinese comments documenting lock ordering intent

📝 Test coverage gap

  • No test for the auth-switch path in ensureUpstreamConn (where currentAuthID != authID triggers invalidation). This is the core scenario the PR fixes — would be good to add a test that verifies a connection is torn down when the auth ID changes.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e7cc1a4cc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

if ch != nil {
// 会话关闭时允许主动 fail active 唤醒在途 readCodexWebsocketMessage
trySendCodexWebsocketRead(ch, done, codexWebsocketRead{conn: conn, err: fmt.Errorf("codex websockets executor: execution session closed")})
tryCloseCodexWebsocketRead(ch)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid closing active read channel during session teardown

closeExecutionSession now closes activeCh directly, but readUpstreamLoop still forwards text frames with a plain ch <- ... send. If CloseExecutionSession/closeAllExecutionSessions runs while a stream is in flight, that concurrent close(ch) can race with the sender and trigger a send on closed channel panic, crashing the process instead of failing the request cleanly.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Codex WebSocket 会话在账号额度耗尽后不会切换到其他可用账号

2 participants