Bug Description
The graph-memory plugin's periodic maintenance (community summarization) blocks the entire OpenClaw gateway indefinitely when an LLM call hangs or takes too long, because the `fetch()` in `engine/llm.ts` has no timeout mechanism (no `AbortController`, no `AbortSignal.timeout()`).
Once maintenance is stuck, the embedded run never completes, which blocks gateway restart/reload. OpenClaw then enters a restart loop, or hangs for 10+ minutes until systemd hard-kills it.
Environment
- OpenClaw v2026.3.23-2
- graph-memory v1.5.5
- LLM via LiteLLM proxy (glm-5 / minimax-m2.7, both thinking models ~8-15s per call)
- 178-185 communities detected
Root Cause
In `src/engine/llm.ts`, the `createCompleteFn` function calls `fetch()` without any timeout:

```ts
const res = await fetch(`${baseURL}/chat/completions`, {
  method: "POST",
  headers: { ... },
  body: JSON.stringify({ ... }),
  // ❌ No AbortSignal.timeout() or AbortController
});
```
In `src/graph/community.ts`, `summarizeCommunities()` calls `llm()` serially for every community (185 communities × 1 call each). If any single call hangs, the entire maintenance pass blocks forever.
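The serial pattern can be illustrated with a minimal sketch (function and variable names here are hypothetical, not taken from the plugin source): one `await` per community means a single promise that never settles stalls every remaining iteration.

```typescript
// Hypothetical sketch of the serial pattern in summarizeCommunities():
// each iteration awaits the previous LLM call, so one hung promise
// blocks all communities that come after it.
async function summarizeAll(
  communities: string[],
  llm: (prompt: string) => Promise<string>,
): Promise<string[]> {
  const summaries: string[] = [];
  for (const community of communities) {
    // If llm() never settles, this await never returns and the
    // maintenance pass (and the embedded run holding it) hangs forever.
    summaries.push(await llm(`Summarize: ${community}`));
  }
  return summaries;
}
```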
Reproduction
- Configure graph-memory with an LLM via LiteLLM proxy
- Let the session accumulate enough messages to trigger periodic maintenance
- Maintenance starts → `summarizeCommunities` loops through 178+ communities
- If any LLM call takes >60s (common with thinking models like glm-5), the gateway hangs
Impact
- Gateway stops responding to new messages
- Gateway restart is blocked ("waiting for 1 embedded run")
- After 5min restart timeout → 90s drain timeout → hard exit
- Cycle repeats with systemd restart loop
Suggested Fixes
- Add `AbortSignal.timeout()` to the fetch calls in `engine/llm.ts`:

  ```ts
  const res = await fetch(url, {
    ...options,
    signal: AbortSignal.timeout(30_000), // 30s timeout
  });
  ```

- Add a per-LLM-call timeout option to the plugin config schema
- Optionally: batch community summarization instead of issuing 185 serial calls
- Optionally: add a `maintenanceInterval` config option to control how often maintenance runs
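The first and third suggestions can be sketched together. This is a hedged sketch, not the plugin's actual code: `timeoutSignal` and `mapWithLimit` are hypothetical helpers, and the only platform APIs assumed are `AbortController`, `setTimeout`, and (where available) `AbortSignal.timeout()`.

```typescript
// Hypothetical helper: abort a fetch after timeoutMs, with a manual
// AbortController fallback for runtimes lacking AbortSignal.timeout().
function timeoutSignal(timeoutMs: number): AbortSignal {
  if (typeof AbortSignal.timeout === "function") {
    return AbortSignal.timeout(timeoutMs);
  }
  const controller = new AbortController();
  // .unref?.() keeps a Node timer from holding the process open;
  // it is a no-op on runtimes where setTimeout returns a number.
  setTimeout(() => controller.abort(), timeoutMs).unref?.();
  return controller.signal;
}

// Hypothetical bounded-concurrency runner for community summarization:
// at most `limit` calls in flight, and a failed or timed-out call only
// loses that one summary instead of blocking the whole maintenance pass.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: no await between read and increment
      try {
        results[i] = { status: "fulfilled", value: await fn(items[i]) };
      } catch (reason) {
        results[i] = { status: "rejected", reason };
      }
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

With this shape, a timed-out summary surfaces as a rejected slot that maintenance can log and skip, e.g. `mapWithLimit(communities, 4, (c) => summarizeOne(c, timeoutSignal(30_000)))` (where `summarizeOne` is likewise hypothetical).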
Logs
```
10:19:26 [graph-memory] periodic maintenance (turn 7): communities=185
10:20:04 [graph-memory] extract interrupted after LLM call, discarding
10:20:04 → 10:26:00+ (no logs, gateway frozen)
```
Full restart loop trace:
```
09:31:18 [reload] config change requires gateway restart — deferring until 1 embedded run(s) complete
09:36:19 [reload] restart timeout after 300336ms with 1 embedded run(s) still active; restarting anyway
09:36:19 [gateway] draining 1 active embedded run(s) before restart (timeout 90000ms)
09:37:49 [diagnostic] wait for active embedded runs timed out: activeRuns=1 timeoutMs=90000
09:38:14 [gateway] shutdown timed out; exiting without full cleanup
```