octos-org · ymote · May 29, 2026 · May 29, 2026 · May 29, 2026 · May 29, 2026
diff --git a/README.md b/README.md
@@ -142,6 +142,108 @@ happy path.
 Total estimated spend: **$8.37 / $10.00 cap** (DeepSeek Flash $0.03 +
 DeepSeek Pro $0.20 + Sonnet 4.6 $1.48 + Opus 4.7 $6.65).
 
+## Results: 2026-05-28 — Kimi via autodl/wisemodel relay
+
+This run was provoked by a live production incident on the octos fleet
+(`mini3`): a long-running multi-turn session hit a loop where the LLM called
+the same tool 5 times with identical args before the runtime broke it. The
+runtime emitted a generic warning at the time —
+`high tool count may cause empty responses with some models; tools=44` —
+and that warning got attributed as the root cause. This run measures
+whether the production-routed kimi models actually degrade at the tool
+counts in question (10 → 500, bracketing the observed 44), routed through
+the same path mini3 uses (`autodl.art` / wisemodel relay in front of
+Moonshot).
+
+| Model | N=10 | N=30 | **N=44** | N=50 | N=100 | N=200 | N=500 |
+|---|---|---|---|---|---|---|---|
+| `kimi-k2.5` | 1321 / 1522 ms<br>rel=100% | 1331 / 1614 ms<br>rel=100% | **1572 / 1970 ms<br>rel=100%** | 1543 / 1930 ms<br>rel=100% | 1547 / 1897 ms<br>rel=100% | 2065 / 2329 ms<br>rel=100% | 2752 / 3476 ms<br>rel=100% |
+| `kimi-k2.6` | 1216 / 1424 ms<br>rel=100% | 1361 / 1711 ms<br>rel=100% | **1566 / 1905 ms<br>rel=100%** | 1693 / 2218 ms<br>rel=100% | 1592 / 2065 ms<br>rel=100% | 2074 / 2346 ms<br>rel=100% | 2954 / 3266 ms<br>rel=100% |
+
+Trial-level CSV at [`results/2026-05-28/kimi-all.csv`](results/2026-05-28/kimi-all.csv).
+JSONL + per-cell summaries in `results/2026-05-28/`.
+
+### Headline insight
+
+**Tool count was not the cause.** Both production-routed kimi models hit
+100% relevant-tool selection at every N up to 500, with median total
+latency under 3.5s even at N=500 (~23k input tokens). The exact N=44
+measurement is clean (rel=100%, total ~2s). The runtime warning was a
+generic heuristic; this benchmark refutes it for the regime that
+matters.
+
+What this **does not** measure:
+
+- Multi-turn behaviour (the production incident had `messages=139`,
+  `input_tokens=53k`, conversational history of repeated tool calls and
+  results). This benchmark is single-turn.
+- Cumulative tool-result feedback (the failing turn had already seen
+  the same tool's output earlier in history — the degradation may come
+  from the prompt-shape of repeated stale evidence, not from the tool
+  catalog).
+- autodl/wisemodel proxy buffering at very long context. The largest
+  context measured here is ~23k input tokens at N=500; the production
+  failure was at ~53k input tokens.
+
+If the loop recurs, the next experiment to add to this repo is a
+multi-turn mode that replays the failing conversation shape, not more
+N-padding on single-turn requests. Tool count is not the variable.
+
+### Follow-up: replay of the actual failing conversation (2026-05-28)
+
+After the single-turn sweep above ruled out tool-count, we ran two
+follow-up experiments to find the real cause:
+
+1. **Synthetic single-turn termination-signal test** (`scripts/termination_signal_repro.py`):
+   build a 5-message conversation where the user asks "is the deck
+   done?" and the assistant has just received a `check_workspace_contract`
+   result, varying whether the result carries the explicit `all_ready`
+   flag. Kimi and Opus both responded with text 5/5 in every arm. The
+   "missing termination signal" hypothesis we'd started with was
+   refuted.
+
+2. **Full-history replay** (`scripts/termination_signal_replay.py`):
+   load the actual 89-message session JSONL from the failing mini3
+   session up to the user's trigger message ("是你的 manifest 文件
+   生成问题，图片都存在" / "the manifest file is the problem, the
+   images exist"), then ask kimi-k2.5 and claude-opus-4.7 what they do
+   next. Two arms each: only `check_workspace_contract` available vs
+   that tool plus `read_file` and `list_dir`.
+
+| Arm | Result |
+|---|---|
+| A: kimi + `check_only` | **LOOP=3/3** (reproduced production failure) |
+| B: kimi + `check + read_file + list_dir` | LOOP=2/3, OTHER_TOOL=1/3 |
+| C: opus + `check_only` | **LOOP=3/3** (Opus loops too) |
+| D: opus + `check + read_file + list_dir` | LOOP=0/3, OTHER_TOOL=3/3 |
+
+The loop is reproducible at 100% in arms A and C (`check_workspace_contract` is the only relevant tool). The frontier Claude Opus 4.7 model loops just as hard as kimi in that condition — this is not a model-quality issue at the tool-selection layer when no tool can actually answer the user's question.
+
+What broke production was **tool-catalog mismatch**: the user asked
+about manifest content; the only directly-relevant tool answers
+artifact presence. The LLM kept re-querying because no available tool
+could answer what was actually asked. Opus, when given `list_dir` as
+an alternative, switches every time (arm D). Kimi is stickier on its
+first tool choice — it switches 1/3 (arm B).
+
+**The 4 KB tree without `all_ready` was not the cause.** The original
+`check_workspace_contract` answered a different question than the
+user asked. Production already had `read_file` available (tools=44);
+kimi's tool-stickiness kept it locked on the first tool it picked.
+
+The implication for tool design (logged on the octos side for
+follow-up): document what each tool does NOT answer, not just what it
+does, so the LLM has a clearer prior on when to switch.
+
+Trial-level JSONL at [`results/2026-05-28/termination-repro.jsonl`](results/2026-05-28/termination-repro.jsonl)
+and [`results/2026-05-28/termination-replay.jsonl`](results/2026-05-28/termination-replay.jsonl).
+
+### Cost
+
+Total estimated spend on this run: **$0.00** (autodl's own usage frame
+returns no per-trial cost; spend recorded as zero by `bench.py`'s
+estimator).
+
 ## License
 
 Apache-2.0. See [LICENSE](LICENSE).
diff --git a/bench.py b/bench.py
@@ -362,7 +362,7 @@ def main() -> int:
     parser.add_argument(
         "--provider",
         default="openrouter",
-        choices=["openrouter", "deepseek", "anthropic", "zhipu", "zai"],
+        choices=["openrouter", "deepseek", "anthropic", "zhipu", "zai", "autodl", "moonshot"],
     )
     parser.add_argument("--model", action="append", default=None,
                         help="Pass multiple times for multiple models.")

diff --git a/configs/run-2026-05-28-kimi.yaml b/configs/run-2026-05-28-kimi.yaml
@@ -0,0 +1,30 @@
+# Sweep config for the 2026-05-28 kimi-k2.5 production-routing run.
+#
+# Motivation: live mini3 soak (session slides-1780013669236-8w2ime) hit a
+# 5-iteration loop where the runtime warning fired:
+#   "high tool count may cause empty responses with some models; tools=44"
+# We need to know whether kimi-k2.5 — as routed in production via
+# autodl.art — actually degrades at 44+ tools, or whether the warning was
+# noise and the degradation came from a different variable (long context,
+# wisemodel proxy buffering, etc.).
+#
+# N values are chosen to bracket 44 so we get a clear before/after signal
+# around the production tool count. We skip N=1000 to keep spend bounded
+# on the relay endpoint.
+#
+# Trials: 3 per (model, N).
+# Spend cap: $5 hard (autodl pricing is unfamiliar; cap small).
+defaults:
+  timeout: 120.0
+  relevant_position: first
+
+runs:
+  - provider: autodl
+    model: kimi-k2.5
+    n: [10, 30, 44, 50, 100, 200, 500]
+    trials: 3
+
+  - provider: autodl
+    model: kimi-k2.6
+    n: [10, 30, 44, 50, 100, 200, 500]
+    trials: 3
diff --git a/providers/base.py b/providers/base.py
@@ -377,6 +377,36 @@ def make_provider(provider: str) -> ProviderConfig:
                 "Content-Type": "application/json",
             },
         )
+    if provider == "autodl":
+        # Production routing for moonshot/kimi-k2.5 on the octos dspfac fleet.
+        # The endpoint speaks OpenAI protocol but is operated by autodl.art /
+        # wisemodel as a commercial relay in front of Moonshot. Behaviour
+        # diverges from api.moonshot.ai direct (different rate limit, different
+        # streaming buffering); this provider entry exists to measure the
+        # production regime, not the model in isolation.
+        key = _require_env("AUTODL_API_KEY")
+        return ProviderConfig(
+            name="autodl",
+            url="https://www.autodl.art/api/v1/chat/completions",
+            protocol="openai",
+            headers_fn=lambda _model: {
+                "Authorization": f"Bearer {key}",
+                "Content-Type": "application/json",
+            },
+        )
+    if provider == "moonshot":
+        # Moonshot direct (api.moonshot.ai). Lets us separate "model itself" from
+        # "production relay (autodl) behaviour" when both keys are available.
+        key = _require_env("MOONSHOT_API_KEY")
+        return ProviderConfig(
+            name="moonshot",
+            url="https://api.moonshot.ai/v1/chat/completions",
+            protocol="openai",
+            headers_fn=lambda _model: {
+                "Authorization": f"Bearer {key}",
+                "Content-Type": "application/json",
+            },
+        )
     if provider == "zai":
         key = _require_env("ZAI_API_KEY")
         return ProviderConfig(

diff --git a/results/2026-05-28/kimi-all.csv b/results/2026-05-28/kimi-all.csv
@@ -0,0 +1,43 @@
+model,n_tools,trial,ttft_ms,total_ms,emitted_tool_call,tool_call_name,selected_relevant,finish_reason,response_started,http_status,error,input_tokens,output_tokens
+kimi-k2.5,10,0,1401.475999969989,1592.0816250145435,True,get_weather,True,tool_calls,True,200,,567,19
+kimi-k2.5,10,1,1258.3429580554366,1510.1907080970705,True,get_weather,True,tool_calls,True,200,,567,19
+kimi-k2.5,10,2,1321.4867911301553,1521.6505001299083,True,get_weather,True,tool_calls,True,200,,567,19
+kimi-k2.5,30,0,1440.3653340414166,1613.7961251661181,True,get_weather,True,tool_calls,True,200,,1634,19
+kimi-k2.5,30,1,1331.3715420663357,1685.344542376697,True,get_weather,True,tool_calls,True,200,,1634,25
+kimi-k2.5,30,2,1247.8924999013543,1527.93433284387,True,get_weather,True,tool_calls,True,200,,1634,19
+kimi-k2.5,44,0,1469.7794998064637,1902.680415660143,True,get_weather,True,tool_calls,True,200,,2278,25
+kimi-k2.5,44,1,1699.4920000433922,1980.4046670906246,True,get_weather,True,tool_calls,True,200,,2278,19
+kimi-k2.5,44,2,1571.8187079764903,1969.8556666262448,True,get_weather,True,tool_calls,True,200,,2278,25
+kimi-k2.5,50,0,1482.4839583598077,1929.6910003758967,True,get_weather,True,tool_calls,True,200,,2527,25
+kimi-k2.5,50,1,1624.3552910163999,1898.1102909892797,True,get_weather,True,tool_calls,True,200,,2527,19
+kimi-k2.5,50,2,1542.7984166890383,1936.8925830349326,True,get_weather,True,tool_calls,True,200,,2527,25
+kimi-k2.5,100,0,1701.4469159767032,1899.6567083522677,True,get_weather,True,tool_calls,True,200,,4790,19
+kimi-k2.5,100,1,1444.016749970615,1659.4132906757295,True,get_weather,True,tool_calls,True,200,,4790,19
+kimi-k2.5,100,2,1546.6624172404408,1897.4306671880186,True,get_weather,True,tool_calls,True,200,,4790,19
+kimi-k2.5,200,0,2064.858333207667,2329.3212922289968,True,get_weather,True,tool_calls,True,200,,9461,19
+kimi-k2.5,200,1,2027.2286250256002,2169.1031251102686,True,get_weather,True,tool_calls,True,200,,9461,19
+kimi-k2.5,200,2,2229.511708021164,2622.2758749499917,True,get_weather,True,tool_calls,True,200,,9461,25
+kimi-k2.5,500,0,2751.9294172525406,3476.1431673541665,True,get_weather,True,tool_calls,True,200,,23424,29
+kimi-k2.5,500,1,2230.998042039573,2962.730874773115,True,get_weather,True,tool_calls,True,200,,23424,29
+kimi-k2.5,500,2,5107.52824973315,5601.727457717061,True,get_weather,True,tool_calls,True,200,,23424,25
+kimi-k2.6,10,0,1986.986625008285,2247.668833937496,True,get_weather,True,tool_calls,True,200,,567,19
+kimi-k2.6,10,1,1146.7030001804233,1405.7195419445634,True,get_weather,True,tool_calls,True,200,,567,19
+kimi-k2.6,10,2,1216.0448753274977,1423.7766251899302,True,get_weather,True,tool_calls,True,200,,567,18
+kimi-k2.6,30,0,1365.493499673903,1710.8979169279337,True,get_weather,True,tool_calls,True,200,,1634,19
+kimi-k2.6,30,1,1361.1227083019912,1816.4757913909853,True,get_weather,True,tool_calls,True,200,,1634,25
+kimi-k2.6,30,2,1298.269959166646,1702.3757919669151,True,get_weather,True,tool_calls,True,200,,1634,25
+kimi-k2.6,44,0,1562.399832997471,1810.7880419120193,True,get_weather,True,tool_calls,True,200,,2278,19
+kimi-k2.6,44,1,1565.6340001150966,1977.5530002079904,True,get_weather,True,tool_calls,True,200,,2278,25
+kimi-k2.6,44,2,1577.0149580202997,1905.4455826990306,True,get_weather,True,tool_calls,True,200,,2278,19
+kimi-k2.6,50,0,2526.763124857098,2902.31362497434,True,get_weather,True,tool_calls,True,200,,2527,25
+kimi-k2.6,50,1,1692.777959164232,2217.6276249811053,True,get_weather,True,tool_calls,True,200,,2527,25
+kimi-k2.6,50,2,1439.3659168854356,1845.2294575981796,True,get_weather,True,tool_calls,True,200,,2527,25
+kimi-k2.6,100,0,2434.119542129338,2628.9280001074076,True,get_weather,True,tool_calls,True,200,,4790,25
+kimi-k2.6,100,1,1591.6625000536442,2065.4458748176694,True,get_weather,True,tool_calls,True,200,,4790,25
+kimi-k2.6,100,2,1393.5351655818522,1607.8764577396214,True,get_weather,True,tool_calls,True,200,,4790,19
+kimi-k2.6,200,0,2074.4173750281334,2345.9046250209212,True,get_weather,True,tool_calls,True,200,,9461,19
+kimi-k2.6,200,1,1850.7448751479387,2253.8724592886865,True,get_weather,True,tool_calls,True,200,,9461,25
+kimi-k2.6,200,2,2208.9466666802764,2624.054624699056,True,get_weather,True,tool_calls,True,200,,9461,25
+kimi-k2.6,500,0,3145.9837090224028,3679.9117093905807,True,get_weather,True,tool_calls,True,200,,23424,25
+kimi-k2.6,500,1,2440.7394169829786,2966.8034580536187,True,get_weather,True,tool_calls,True,200,,23424,25
+kimi-k2.6,500,2,2953.866207972169,3266.4410420693457,True,get_weather,True,tool_calls,True,200,,23424,19
diff --git a/results/2026-05-28/kimi-summary.md b/results/2026-05-28/kimi-summary.md
@@ -0,0 +1,4 @@
+| Model | N=10 | N=30 | N=44 | N=50 | N=100 | N=200 | N=500 |
+|---|---|---|---|---|---|---|---|
+| kimi-k2.5 | 1321/1522ms (rel=100%) | 1331/1614ms (rel=100%) | 1572/1970ms (rel=100%) | 1543/1930ms (rel=100%) | 1547/1897ms (rel=100%) | 2065/2329ms (rel=100%) | 2752/3476ms (rel=100%) |
+| kimi-k2.6 | 1216/1424ms (rel=100%) | 1361/1711ms (rel=100%) | 1566/1905ms (rel=100%) | 1693/2218ms (rel=100%) | 1592/2065ms (rel=100%) | 2074/2346ms (rel=100%) | 2954/3266ms (rel=100%) |