periodic maintenance blocks gateway indefinitely — no LLM timeout in fetch calls #49

@liizfq

Bug Description

The graph-memory plugin's periodic maintenance (community summarization) blocks the entire OpenClaw gateway indefinitely when an LLM call hangs or takes too long. The cause is that the fetch() in src/engine/llm.ts has no timeout mechanism (no AbortController, no AbortSignal.timeout).

Once maintenance is stuck, the embedded run never completes, which blocks gateway restart/reload. OpenClaw then enters a restart loop or hangs for 10+ minutes until systemd hard-kills it.

Environment

  • OpenClaw v2026.3.23-2
  • graph-memory v1.5.5
  • LLM via LiteLLM proxy (glm-5 / minimax-m2.7, both thinking models ~8-15s per call)
  • 178-185 communities detected

Root Cause

In src/engine/llm.ts, the createCompleteFn function calls fetch() without any timeout:

const res = await fetch(`${baseURL}/chat/completions`, {
  method: "POST",
  headers: { ... },
  body: JSON.stringify({ ... }),
  // ❌ No AbortSignal.timeout() or AbortController
});

In src/graph/community.ts, summarizeCommunities() calls llm() serially for every community (185 communities × 1 call each). If any single call hangs, the entire maintenance process blocks forever.
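
For a sense of scale, here is a rough estimate of how long one serial maintenance pass takes even when nothing hangs. It assumes every call lands in the 8-15 s range reported under Environment and that there is exactly one call per community; the helper name is hypothetical and not part of graph-memory.

```typescript
// Hypothetical helper: total wall-clock time for a fully serial pass.
function serialMaintenanceSeconds(communities: number, secondsPerCall: number): number {
  // One LLM call per community, executed back to back.
  return communities * secondsPerCall;
}

const best = serialMaintenanceSeconds(185, 8);   // 1480 s, about 24.7 min
const worst = serialMaintenanceSeconds(185, 15); // 2775 s, about 46.3 min
const restartTimeoutSeconds = 300;               // the 5 min restart timeout seen in the logs

console.log({ best, worst, exceedsRestartTimeout: best > restartTimeoutSeconds });
```

Under these assumptions, a healthy pass already runs several times longer than the 5 min restart timeout, so a single hung call only makes an existing problem unbounded.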

Reproduction

  1. Configure graph-memory with an LLM via LiteLLM proxy
  2. Let session accumulate enough messages to trigger periodic maintenance
  3. Maintenance starts → summarizeCommunities loops through 178+ communities
  4. If any LLM call takes >60s (common with thinking models like glm-5), gateway hangs

Impact

  • Gateway stops responding to new messages
  • Gateway restart is blocked ("waiting for 1 embedded run")
  • After 5min restart timeout → 90s drain timeout → hard exit
  • Cycle repeats with systemd restart loop

Suggested Fixes

  1. Add AbortSignal.timeout() to fetch calls in src/engine/llm.ts:

    const res = await fetch(url, {
      ...options,
      signal: AbortSignal.timeout(30_000), // 30s timeout
    });

  2. Add a per-LLM-call timeout setting to the plugin config schema

  3. Optionally: batch community summarization instead of 185 serial calls

  4. Optionally: add maintenanceInterval config to control how often maintenance runs
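
Fixes 1-3 could be combined roughly as follows. This is only a sketch, not the plugin's actual code: `LlmFn`, `summarizeAll`, `timeoutMs`, and `concurrency` are hypothetical names, and it assumes the LLM function accepts an AbortSignal and forwards it to fetch(). Each call gets its own deadline, a timed-out or failed call is recorded as `null` instead of stalling the loop, and a small worker pool replaces the fully serial loop.

```typescript
// Hypothetical signature: the LLM function must honor the signal,
// e.g. by passing it as `signal` in the fetch() options.
type LlmFn = (prompt: string, signal: AbortSignal) => Promise<string>;

interface SummarizeOptions {
  timeoutMs: number;   // per-call deadline (would come from plugin config)
  concurrency: number; // maximum number of in-flight LLM calls
}

async function summarizeAll(
  prompts: string[],
  llm: LlmFn,
  { timeoutMs, concurrency }: SummarizeOptions,
): Promise<(string | null)[]> {
  const results = new Array<string | null>(prompts.length).fill(null);
  let next = 0; // index of the next prompt to claim

  // Each worker pulls prompts until none are left.
  async function worker(): Promise<void> {
    while (next < prompts.length) {
      const i = next++;
      try {
        // A fresh timeout signal per call, so one slow call cannot
        // consume the budget of the others.
        results[i] = await llm(prompts[i], AbortSignal.timeout(timeoutMs));
      } catch {
        // Timed out or failed: leave null and move on.
        results[i] = null;
      }
    }
  }

  const poolSize = Math.max(1, Math.min(concurrency, prompts.length));
  await Promise.all(Array.from({ length: poolSize }, () => worker()));
  return results;
}
```

With this shape, the worst-case duration of a pass is bounded by roughly `ceil(communities / concurrency) × timeoutMs`, and communities that returned `null` can be retried on the next maintenance run.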

Logs

10:19:26 [graph-memory] periodic maintenance (turn 7): communities=185
10:20:04 [graph-memory] extract interrupted after LLM call, discarding
10:20:04 → 10:26:00+ (no logs, gateway frozen)

Full restart loop trace:

09:31:18 [reload] config change requires gateway restart — deferring until 1 embedded run(s) complete
09:36:19 [reload] restart timeout after 300336ms with 1 embedded run(s) still active; restarting anyway
09:36:19 [gateway] draining 1 active embedded run(s) before restart (timeout 90000ms)
09:37:49 [diagnostic] wait for active embedded runs timed out: activeRuns=1 timeoutMs=90000
09:38:14 [gateway] shutdown timed out; exiting without full cleanup
