Skip to content

AmdClaudeModel hangs indefinitely on stalled gateway requests (no wall-clock timeout) #270

Description

@peyron-amd

Summary

AmdClaudeModel (src/minisweagent/models/amd_claude.py) can hang indefinitely when the AMD LLM gateway stalls a request (connection accepted but no response bytes ever sent). The anthropic.Anthropic client is constructed without a timeout, so the socket read blocks forever. The tenacity @retry on _query_api never fires because no exception is ever raised.
This path is used by all GEAK pipelines — translation, optimization, unit-test, select-patch all set model_class: amd_llm, and the amd_llm router delegates every claude-* model to AmdClaudeModel.

Impact

  • A single stalled request hangs the entire preprocess/agent loop with no recovery.
  • The process stays alive but makes no progress, so it consumes its full run budget and must be killed externally, often yielding only partial results.

Evidence

Observed repeatedly with claude-opus-4.8:

  • Worker process alive but at 0.0% CPU for 45m–3h, frozen immediately after printing Round 1/3....
  • No rate-limit / 429 / traceback in logs — a clean silent stall.
  • Hung on different kernels each time (14_Gemm_Divide_Sum_Scaling, 43_MinGPTCausalAttention, even the trivial 19_ReLU) → random gateway stall, not workload-specific.

Root cause

self.client = anthropic.Anthropic(
    api_key="dummy",
    base_url=base_url,
    default_headers={...},   # no timeout, no max_retries
)

Suggested solution here: #271

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions