Skip to content

fix: prevent nil pointer crash in OpenAI SSE tool call accumulation#182

Open
duhd-vnpay wants to merge 18 commits intonextlevelbuilder:mainfrom
duhd-vnpay:main
Open

fix: prevent nil pointer crash in OpenAI SSE tool call accumulation#182
duhd-vnpay wants to merge 18 commits intonextlevelbuilder:mainfrom
duhd-vnpay:main

Conversation

@duhd-vnpay
Copy link

Summary

  • Root cause fix: openai.goaccumulators map (keyed by SSE tool_call index) was iterated with for i := 0; i < len(map); i++, assuming contiguous keys 0,1,2... SSE indices from extended thinking models can be non-contiguous (e.g. {0, 2}), causing accumulators[i] → nil → panic on acc.rawArgs
  • Defense: lanes.go — add defer recover() in scheduler lane goroutine so panics log error + return semaphore token instead of crashing the process
  • Cleanup: tracing.go + gateway.go — add SweepOrphanTraces() to mark stuck running traces as error on startup (traces older than 1h with status running = orphan from crash)

Changed files

File Change
internal/providers/openai.go Iterate map by sorted keys instead of sequential 0..len-1
internal/scheduler/lanes.go Add recover() in goroutine defer
internal/store/tracing_store.go Add SweepOrphanTraces to interface
internal/store/pg/tracing.go Implement SweepOrphanTraces (UPDATE running→error)
cmd/gateway.go Call sweep on startup in background goroutine

Test results (18 tests, all PASS)

providers (9/9 PASS):

  • ContiguousToolCalls — indices 0,1 (happy path)
  • NonContiguousToolCallIndices — indices 0,2 skip 1 (the crash scenario)
  • LargeGapToolCallIndices — indices 0,5,10
  • SingleToolCallAtHighIndex — single call at index 3
  • ToolCallWithThoughtSignature — metadata preserved
  • TextOnly / EmptyToolCalls — no tool calls
  • HTTPError — 429 → proper HTTPError
  • CancelledContext — context cancel → error

scheduler (5/5 PASS):

  • NoPanic — normal operation
  • PanicRecovery — panic → lane recovers, accepts new work
  • MultiplePanics — 3 panics → all semaphore tokens returned
  • CancelledContext — cancelled ctx with full lane → error
  • StatsAfterPanic — active counter decremented after panic

store/pg (4/4 PASS, integration with PostgreSQL):

  • SweepsOldRunning — running trace 2h old → swept
  • IgnoresRecentRunning — running trace 5min old → NOT swept
  • IgnoresCompleted — completed trace → NOT swept
  • NoOrphans — no running traces → swept=0

🤖 Generated with Claude Code

duhd-vnpay and others added 18 commits March 10, 2026 15:44
…ing, photo handling

- Fix zaloBotInfo to use account_name/display_name (not name)
- Add Label() method for bot display name resolution
- Handle 3 response formats in getUpdates: array, single object, wrapped
- Add photo_url field to zaloMessage for Zalo CDN image URLs
- Add display_name/is_bot to zaloFrom, chat_type to zaloChat
- Use PhotoURL with fallback to Photo in handleImageMessage

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ssing

Zalo CDN URLs are auth-restricted and expire quickly, causing read_image
tool failures. Now downloads photos to temp files (like Telegram channel)
so the agent pipeline can base64-encode and process them normally.

Falls back to passing the URL directly if download fails.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
… instance updates

When encryption key is empty, credentials stayed as map[string]any from
JSON unmarshal, causing pgx driver to fail encoding into bytea. Now
credentials are always marshaled to []byte regardless of encryption.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add Party Mode to GoClaw: structured multi-persona AI discussions with
Standard (single LLM call), Deep (parallel thinking + cross-talk), and
Token-Ring (sequential turns) modes.

Backend: PartyStore + PG implementation, party engine with parallel
goroutines, 7 RPC methods (party.start/round/question/add_context/
summary/exit/list), 10 WebSocket events, migration 000014.

Frontend: React dashboard page with session list, chat view, persona
sidebar, mode controls, start dialog with 6 team presets, i18n (en/vi/zh).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Previously, sanitizeHistory() only cleaned the in-memory copy for each
LLM request but never persisted the fix — causing the same "dropping
orphaned tool message" WARN to repeat on every single request forever.

Changes:
- sanitizeHistory() now returns drop count alongside cleaned messages
- When orphans are detected, cleaned history is persisted back to the
  session store via new SetHistory() method, then saved to DB
- Per-message WARN logs downgraded to DEBUG (cleanup is logged once
  at INFO level with total count)
- Added SetHistory() to SessionStore interface + both implementations

Co-Authored-By: Claude Opus 4.6 <[email protected]>
When a delegated agent (e.g. ui-ux-design-agent) spawns subagents, the
announce session key uses the format delegate:{uuid8}:{agentKey}:{id}.
The scheduler's RunFunc only handled agent:{agentId}:{rest} format,
falling back to the hardcoded "default" agent — which doesn't exist in
managed-mode deployments where the default agent has a custom key.

Add delegate: prefix parsing to extract the target agent key from
position 2 of the session key parts.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add sender_id and channel to team task metadata for audit trail
- Remove assistant prefill in team task reminder (thinking models reject it)
- Add unit tests for team access control and sender_id tracking

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…, mobile UX, budget, traces)

Resolved 13 file conflicts:
- zalo.go: kept local struct tags + upstream's n==0 safety check
- channel_instances.go: used upstream's credential merging (supersedes local bytea fix)
- factory.go, stores.go: merged both Party + Contacts/Activity/Snapshots stores
- loop_history.go: kept upstream skill inlining constants + local session persistence
- session_store.go, sessions/manager.go, sessions_ops.go: deduplicated SetHistory()
- sidebar.tsx, routes.tsx: added Party to upstream's restructured sidebar/routes
- protocol.ts, i18n/index.ts, sidebar.json (×3): merged Party + upstream events/translations
- version.go: bumped to schema version 18

Migration collision fix: renamed 000014_party_sessions → 000018_party_sessions

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add nullish coalescing for PERSONA_COLORS array index
- Remove unused destructured `round` variable from PartyPage

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…otocol

Backend:
- gateway_providers: read default_model from provider settings JSONB
- party.go: getEngine() prefers providers with DefaultModel, alphabetical
  fallback for determinism (fixes random Go map iteration)
- party.go: add slog.Error for round failures (was silently swallowed)

Frontend:
- use-party.ts: align all RPC params/events with snake_case wire format,
  add transformSession/mapStatus/selectSession helpers
- party-start-dialog.tsx: use actual DB persona keys (morpheus-persona etc.)
- party-page.tsx: use selectSession() for proper state hydration
- connection-status.tsx: fix status text alignment

Co-Authored-By: Claude Opus 4.6 <[email protected]>
transformSession() was dropping history/summary fields from backend
response, and selectSession() reset messages to empty array. Old
sessions appeared blank when clicked in sidebar.

- transformSession: preserve _history and _summary from backend
- hydrateMessages: new helper to convert RoundResult[] + SummaryResult
  into PartyMessage[] (round headers, persona messages, summary)
- selectSession: call hydrateMessages() instead of setMessages([])

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Root cause: `accumulators` map iterated with sequential `for i := 0; i < len(map)`
but SSE tool_call indices can be non-contiguous (e.g. {0, 2}), causing nil dereference
on `accumulators[i]` when key `i` doesn't exist.

Fixes:
1. openai.go: iterate map by sorted keys instead of sequential 0..len-1
2. lanes.go: add defer recover() in scheduler goroutine to prevent panics from
   crashing the entire process — logs error and returns semaphore token
3. tracing: add SweepOrphanTraces() to mark stuck running traces as error on
   gateway startup (running > 1h = orphan from previous crash)

Test results (18 tests):
- providers: 9/9 PASS (contiguous, non-contiguous, large gap, high index,
  thought_signature, text-only, empty, HTTP error, cancelled context)
- scheduler: 5/5 PASS (no panic, panic recovery, multiple panics, cancelled
  context, stats after panic)
- store/pg: 4/4 PASS (sweeps old running, ignores recent, ignores completed,
  no orphans)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
… loops

When LLM hits max_tokens, tool call arguments may be incomplete/malformed JSON.
- Add truncation guard: skip tool execution when finish_reason=length, ask LLM to retry smaller
- Wire per-agent max_tokens from other_config JSONB (default 8192)
- Log warning on JSON parse failure for non-empty tool call arguments (truncation indicator)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant