Summary
AmdClaudeModel (src/minisweagent/models/amd_claude.py) can hang indefinitely when the AMD LLM gateway stalls a request (connection accepted but no response bytes ever sent). The anthropic.Anthropic client is constructed without a timeout, so the socket read blocks forever. The tenacity @retry on _query_api never fires because no exception is ever raised.
This path is used by all GEAK pipelines — translation, optimization, unit-test, select-patch all set model_class: amd_llm, and the amd_llm router delegates every claude-* model to AmdClaudeModel.
Impact
- A single stalled request hangs the entire preprocess/agent loop with no recovery.
- The process stays alive but makes no progress, so it consumes its full run budget and must be killed externally, often yielding only partial results.
Evidence
Observed repeatedly with claude-opus-4.8:
- Worker process alive but at 0.0% CPU for 45m–3h, frozen immediately after printing
Round 1/3....
- No rate-limit / 429 / traceback in logs — a clean silent stall.
- Hung on different kernels each time (
14_Gemm_Divide_Sum_Scaling, 43_MinGPTCausalAttention, even the trivial 19_ReLU) → random gateway stall, not workload-specific.
Root cause
self.client = anthropic.Anthropic(
api_key="dummy",
base_url=base_url,
default_headers={...}, # no timeout, no max_retries
)
Suggested solution here: #271
Summary
AmdClaudeModel(src/minisweagent/models/amd_claude.py) can hang indefinitely when the AMD LLM gateway stalls a request (connection accepted but no response bytes ever sent). Theanthropic.Anthropicclient is constructed without atimeout, so the socket read blocks forever. The tenacity@retryon_query_apinever fires because no exception is ever raised.This path is used by all GEAK pipelines — translation, optimization, unit-test, select-patch all set
model_class: amd_llm, and theamd_llmrouter delegates everyclaude-*model toAmdClaudeModel.Impact
Evidence
Observed repeatedly with
claude-opus-4.8:Round 1/3....14_Gemm_Divide_Sum_Scaling,43_MinGPTCausalAttention, even the trivial19_ReLU) → random gateway stall, not workload-specific.Root cause
Suggested solution here: #271