102 changes: 102 additions & 0 deletions README.md
@@ -142,6 +142,108 @@ happy path.
Total estimated spend: **$8.37 / $10.00 cap** (DeepSeek Flash $0.03 +
DeepSeek Pro $0.20 + Sonnet 4.6 $1.48 + Opus 4.7 $6.65).

## Results: 2026-05-28 — Kimi via autodl/wisemodel relay

This run was prompted by a live production incident on the octos fleet
(`mini3`): a long-running multi-turn session entered a loop in which the
LLM called the same tool 5 times with identical arguments before the
runtime broke it. At the time the runtime emitted a generic warning,
`high tool count may cause empty responses with some models; tools=44`,
and that warning was taken to be the root cause. This run measures
whether the production-routed kimi models actually degrade at the tool
counts in question (N = 10 to 500, bracketing the observed 44) when routed
through the same path mini3 uses (the `autodl.art` / wisemodel relay in
front of Moonshot).
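
To make the sweep variable concrete, here is a minimal sketch of how a single-turn request's tool catalog could be padded to N entries. The relevant tool name `get_weather` and the `relevant_position: first` setting come from this run's CSV and config; the filler tool names, the parameter schema, and the helper names `make_tool` and `build_catalog` are assumptions for illustration, not the benchmark's actual code.

```python
def make_tool(name: str, description: str) -> dict:
    """Return one tool definition in the OpenAI function-calling shape."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            # Assumed schema: a single string argument.
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }


def build_catalog(n_tools: int, relevant_position: str = "first") -> list:
    """Build a catalog of n_tools entries containing exactly one relevant tool."""
    if n_tools < 1:
        raise ValueError("n_tools must be at least 1")
    relevant = make_tool("get_weather", "Get the current weather for a city.")
    # Hypothetical filler tools that are unrelated to the user's question.
    fillers = [
        make_tool(f"filler_tool_{i}", f"Hypothetical unrelated tool #{i}.")
        for i in range(n_tools - 1)
    ]
    if relevant_position == "first":
        return [relevant] + fillers
    if relevant_position == "last":
        return fillers + [relevant]
    raise ValueError(f"unsupported relevant_position: {relevant_position}")


catalog = build_catalog(44)
print(len(catalog), catalog[0]["function"]["name"])  # 44 get_weather
```

With a catalog built this way, "relevant-tool selection" means checking whether the model's tool call names `get_weather` rather than one of the fillers.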

| Model | N=10 | N=30 | **N=44** | N=50 | N=100 | N=200 | N=500 |
|---|---|---|---|---|---|---|---|
| `kimi-k2.5` | 1321 / 1522 ms<br>rel=100% | 1331 / 1614 ms<br>rel=100% | **1572 / 1970 ms<br>rel=100%** | 1543 / 1930 ms<br>rel=100% | 1547 / 1897 ms<br>rel=100% | 2065 / 2329 ms<br>rel=100% | 2752 / 3476 ms<br>rel=100% |
| `kimi-k2.6` | 1216 / 1424 ms<br>rel=100% | 1361 / 1711 ms<br>rel=100% | **1566 / 1905 ms<br>rel=100%** | 1693 / 2218 ms<br>rel=100% | 1592 / 2065 ms<br>rel=100% | 2074 / 2346 ms<br>rel=100% | 2954 / 3266 ms<br>rel=100% |

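Each cell shows median TTFT / median total latency in ms across 3 trials;
`rel` is the share of trials in which the model selected the relevant tool.

Trial-level CSV at [`results/2026-05-28/kimi-all.csv`](results/2026-05-28/kimi-all.csv).
JSONL + per-cell summaries in `results/2026-05-28/`.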
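
The table cells can be recomputed from the trial-level CSV. The sketch below assumes each cell is the median `ttft_ms` / median `total_ms` over that cell's trials, with `rel` as the share of trials where `selected_relevant` is `True`. The column names are those in the CSV header; the function name `summarize` is an assumption, and the sample rows are the three `kimi-k2.5` N=44 trials.

```python
import csv
import io
import statistics
from collections import defaultdict

# Header plus the three kimi-k2.5 N=44 rows, copied from kimi-all.csv.
SAMPLE = """\
model,n_tools,trial,ttft_ms,total_ms,emitted_tool_call,tool_call_name,selected_relevant,finish_reason,response_started,http_status,error,input_tokens,output_tokens
kimi-k2.5,44,0,1469.7794998064637,1902.680415660143,True,get_weather,True,tool_calls,True,200,,2278,25
kimi-k2.5,44,1,1699.4920000433922,1980.4046670906246,True,get_weather,True,tool_calls,True,200,,2278,19
kimi-k2.5,44,2,1571.8187079764903,1969.8556666262448,True,get_weather,True,tool_calls,True,200,,2278,25
"""


def summarize(csv_text: str) -> dict:
    """Map (model, n_tools) to (median TTFT ms, median total ms, rel %)."""
    cells = defaultdict(lambda: {"ttft": [], "total": [], "rel": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = cells[(row["model"], int(row["n_tools"]))]
        cell["ttft"].append(float(row["ttft_ms"]))
        cell["total"].append(float(row["total_ms"]))
        cell["rel"].append(row["selected_relevant"] == "True")
    return {
        key: (
            round(statistics.median(cell["ttft"])),
            round(statistics.median(cell["total"])),
            round(100 * sum(cell["rel"]) / len(cell["rel"])),
        )
        for key, cell in cells.items()
    }


print(summarize(SAMPLE))  # {('kimi-k2.5', 44): (1572, 1970, 100)}
```

To run it on the full file, pass the contents of `results/2026-05-28/kimi-all.csv` instead of `SAMPLE`.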

### Headline insight

**Tool count was not the cause.** Both production-routed kimi models
selected the relevant tool in 100% of trials (3 per cell) at every N up to
500, and median total latency stayed under 3.5 s even at N=500 (~23k input
tokens), although one kimi-k2.5 trial at N=500 took 5.6 s. The N=44
measurement itself is clean (rel=100%, total ~2 s). The runtime warning was
a generic heuristic; this benchmark refutes it for the regime that
matters.

What this **does not** measure:

- Multi-turn behaviour (the production incident had `messages=139`,
`input_tokens=53k`, conversational history of repeated tool calls and
results). This benchmark is single-turn.
- Cumulative tool-result feedback (the failing turn had already seen
the same tool's output earlier in history — the degradation may come
from the prompt-shape of repeated stale evidence, not from the tool
catalog).
- autodl/wisemodel proxy buffering at very long context. The largest
context measured here is ~23k input tokens at N=500; the production
failure was at ~53k input tokens.

If the loop recurs, the next experiment to add to this repo is a
multi-turn mode that replays the failing conversation shape, not more
N-padding on single-turn requests. Tool count is not the variable.

### Follow-up: replay of the actual failing conversation (2026-05-28)

After the single-turn sweep above ruled out tool-count, we ran two
follow-up experiments to find the real cause:

1. **Synthetic single-turn termination-signal test** (`scripts/termination_signal_repro.py`):
build a 5-message conversation where the user asks "is the deck
done?" and the assistant has just received a `check_workspace_contract`
result, varying whether the result carries the explicit `all_ready`
flag. Kimi and Opus both responded with text 5/5 in every arm. The
"missing termination signal" hypothesis we'd started with was
refuted.

2. **Full-history replay** (`scripts/termination_signal_replay.py`):
load the actual 89-message session JSONL from the failing mini3
session up to the user's trigger message ("是你的 manifest 文件
生成问题,图片都存在" / "the manifest file is the problem, the
images exist"), then ask kimi-k2.5 and claude-opus-4.7 what they do
next. Two arms each: only `check_workspace_contract` available vs
that tool plus `read_file` and `list_dir`.

| Arm | Result |
|---|---|
| A: kimi + `check_only` | **LOOP=3/3** (reproduced production failure) |
| B: kimi + `check + read_file + list_dir` | LOOP=2/3, OTHER_TOOL=1/3 |
| C: opus + `check_only` | **LOOP=3/3** (Opus loops too) |
| D: opus + `check + read_file + list_dir` | LOOP=0/3, OTHER_TOOL=3/3 |

The loop reproduces in 3/3 trials in arms A and C, where `check_workspace_contract` is the only tool available. Claude Opus 4.7, a frontier model, loops just as consistently as kimi in that condition, so this is not a model-quality issue at the tool-selection layer: no available tool can actually answer the user's question.
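
The outcome labels in the table can be read as a classification of each replayed response. The sketch below is one plausible rule, not the logic of `scripts/termination_signal_replay.py`: it assumes responses are OpenAI-style assistant messages and that any repeat call to `check_workspace_contract` counts as `LOOP`. The `TEXT` label, the helper names, and the sample messages are hypothetical and are not the recorded trial data.

```python
from collections import Counter

# Assumption: the tool the production session kept re-calling.
LOOP_TOOL = "check_workspace_contract"


def classify(message: dict, loop_tool: str = LOOP_TOOL) -> str:
    """Label one assistant message as LOOP, OTHER_TOOL, or TEXT."""
    calls = message.get("tool_calls") or []
    if not calls:
        return "TEXT"  # the model answered in prose instead of calling a tool
    names = [call["function"]["name"] for call in calls]
    return "LOOP" if loop_tool in names else "OTHER_TOOL"


def tally(messages: list) -> Counter:
    """Count outcome labels across the trials of one arm."""
    return Counter(classify(m) for m in messages)


# Hypothetical responses with the same outcome shape as arm B.
sample_arm = [
    {"tool_calls": [{"function": {"name": "check_workspace_contract", "arguments": "{}"}}]},
    {"tool_calls": [{"function": {"name": "check_workspace_contract", "arguments": "{}"}}]},
    {"tool_calls": [{"function": {"name": "list_dir", "arguments": "{}"}}]},
]
print(dict(tally(sample_arm)))  # {'LOOP': 2, 'OTHER_TOOL': 1}
```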

What broke production was **tool-catalog mismatch**: the user asked
about manifest content; the only directly-relevant tool answers
artifact presence. The LLM kept re-querying because no available tool
could answer what was actually asked. Opus, when given `list_dir` as
an alternative, switches every time (arm D). Kimi is stickier on its
first tool choice — it switches 1/3 (arm B).

**The 4 KB tree without `all_ready` was not the cause.** The original
`check_workspace_contract` answered a different question than the
user asked. Production already had `read_file` available (tools=44);
kimi's tool-stickiness kept it locked on the first tool it picked.

The implication for tool design (logged on the octos side for
follow-up): document what each tool does NOT answer, not just what it
does, so the LLM has a clearer prior on when to switch.
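
As an illustration of that guideline, a tool definition could state its negative scope directly in the description. This is a sketch only: the description wording, the empty parameter schema, and the `has_negative_scope` lint helper are assumptions, not the actual octos tool definitions.

```python
# Hypothetical definition of the presence-check tool with explicit negative scope.
CHECK_WORKSPACE_CONTRACT = {
    "type": "function",
    "function": {
        "name": "check_workspace_contract",
        "description": (
            "Reports whether the artifacts required by the workspace contract "
            "are present. Does NOT inspect file contents (for example, whether "
            "a manifest is well-formed); use read_file or list_dir for that."
        ),
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

# Hypothetical definition without a negative-scope clause, for contrast.
READ_FILE = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def has_negative_scope(tool: dict) -> bool:
    """Crude lint: does the description say what the tool does not answer?"""
    return "does not" in tool["function"]["description"].lower()


for tool in (CHECK_WORKSPACE_CONTRACT, READ_FILE):
    print(tool["function"]["name"], has_negative_scope(tool))
```

A check like this could run over the whole catalog to flag tools whose descriptions give the model no cue for when to switch.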

Trial-level JSONL at [`results/2026-05-28/termination-repro.jsonl`](results/2026-05-28/termination-repro.jsonl)
and [`results/2026-05-28/termination-replay.jsonl`](results/2026-05-28/termination-replay.jsonl).

### Cost

Total estimated spend recorded for this run: **$0.00**. autodl's usage
frame returns no per-trial cost, so `bench.py`'s estimator records the
spend as zero; the figure is not a measurement of the actual relay cost.

## License

Apache-2.0. See [LICENSE](LICENSE).
2 changes: 1 addition & 1 deletion bench.py
@@ -362,7 +362,7 @@ def main() -> int:
     parser.add_argument(
         "--provider",
         default="openrouter",
-        choices=["openrouter", "deepseek", "anthropic", "zhipu", "zai"],
+        choices=["openrouter", "deepseek", "anthropic", "zhipu", "zai", "autodl", "moonshot"],
     )
     parser.add_argument("--model", action="append", default=None,
                         help="Pass multiple times for multiple models.")
30 changes: 30 additions & 0 deletions configs/run-2026-05-28-kimi.yaml
@@ -0,0 +1,30 @@
# Sweep config for the 2026-05-28 kimi (k2.5 and k2.6) production-routing run.
#
# Motivation: live mini3 soak (session slides-1780013669236-8w2ime) hit a
# 5-iteration loop where the runtime warning fired:
# "high tool count may cause empty responses with some models; tools=44"
# We need to know whether kimi-k2.5 — as routed in production via
# autodl.art — actually degrades at 44+ tools, or whether the warning was
# noise and the degradation came from a different variable (long context,
# wisemodel proxy buffering, etc.).
#
# N values are chosen to bracket 44 so we get a clear before/after signal
# around the production tool count. We skip N=1000 to keep spend bounded
# on the relay endpoint.
#
# Trials: 3 per (model, N).
# Spend cap: $5 hard (autodl pricing is unfamiliar; cap small).
defaults:
  timeout: 120.0
  relevant_position: first

runs:
  - provider: autodl
    model: kimi-k2.5
    n: [10, 30, 44, 50, 100, 200, 500]
    trials: 3

  - provider: autodl
    model: kimi-k2.6
    n: [10, 30, 44, 50, 100, 200, 500]
    trials: 3
30 changes: 30 additions & 0 deletions providers/base.py
@@ -377,6 +377,36 @@ def make_provider(provider: str) -> ProviderConfig:
                "Content-Type": "application/json",
            },
        )
    if provider == "autodl":
        # Production routing for moonshot/kimi-k2.5 on the octos dspfac fleet.
        # The endpoint speaks OpenAI protocol but is operated by autodl.art /
        # wisemodel as a commercial relay in front of Moonshot. Behaviour
        # diverges from api.moonshot.ai direct (different rate limit, different
        # streaming buffering); this provider entry exists to measure the
        # production regime, not the model in isolation.
        key = _require_env("AUTODL_API_KEY")
        return ProviderConfig(
            name="autodl",
            url="https://www.autodl.art/api/v1/chat/completions",
            protocol="openai",
            headers_fn=lambda _model: {
                "Authorization": f"Bearer {key}",
                "Content-Type": "application/json",
            },
        )
    if provider == "moonshot":
        # Moonshot direct (api.moonshot.ai). Lets us separate "model itself" from
        # "production relay (autodl) behaviour" when both keys are available.
        key = _require_env("MOONSHOT_API_KEY")
        return ProviderConfig(
            name="moonshot",
            url="https://api.moonshot.ai/v1/chat/completions",
            protocol="openai",
            headers_fn=lambda _model: {
                "Authorization": f"Bearer {key}",
                "Content-Type": "application/json",
            },
        )
    if provider == "zai":
        key = _require_env("ZAI_API_KEY")
        return ProviderConfig(
43 changes: 43 additions & 0 deletions results/2026-05-28/kimi-all.csv
@@ -0,0 +1,43 @@
model,n_tools,trial,ttft_ms,total_ms,emitted_tool_call,tool_call_name,selected_relevant,finish_reason,response_started,http_status,error,input_tokens,output_tokens
kimi-k2.5,10,0,1401.475999969989,1592.0816250145435,True,get_weather,True,tool_calls,True,200,,567,19
kimi-k2.5,10,1,1258.3429580554366,1510.1907080970705,True,get_weather,True,tool_calls,True,200,,567,19
kimi-k2.5,10,2,1321.4867911301553,1521.6505001299083,True,get_weather,True,tool_calls,True,200,,567,19
kimi-k2.5,30,0,1440.3653340414166,1613.7961251661181,True,get_weather,True,tool_calls,True,200,,1634,19
kimi-k2.5,30,1,1331.3715420663357,1685.344542376697,True,get_weather,True,tool_calls,True,200,,1634,25
kimi-k2.5,30,2,1247.8924999013543,1527.93433284387,True,get_weather,True,tool_calls,True,200,,1634,19
kimi-k2.5,44,0,1469.7794998064637,1902.680415660143,True,get_weather,True,tool_calls,True,200,,2278,25
kimi-k2.5,44,1,1699.4920000433922,1980.4046670906246,True,get_weather,True,tool_calls,True,200,,2278,19
kimi-k2.5,44,2,1571.8187079764903,1969.8556666262448,True,get_weather,True,tool_calls,True,200,,2278,25
kimi-k2.5,50,0,1482.4839583598077,1929.6910003758967,True,get_weather,True,tool_calls,True,200,,2527,25
kimi-k2.5,50,1,1624.3552910163999,1898.1102909892797,True,get_weather,True,tool_calls,True,200,,2527,19
kimi-k2.5,50,2,1542.7984166890383,1936.8925830349326,True,get_weather,True,tool_calls,True,200,,2527,25
kimi-k2.5,100,0,1701.4469159767032,1899.6567083522677,True,get_weather,True,tool_calls,True,200,,4790,19
kimi-k2.5,100,1,1444.016749970615,1659.4132906757295,True,get_weather,True,tool_calls,True,200,,4790,19
kimi-k2.5,100,2,1546.6624172404408,1897.4306671880186,True,get_weather,True,tool_calls,True,200,,4790,19
kimi-k2.5,200,0,2064.858333207667,2329.3212922289968,True,get_weather,True,tool_calls,True,200,,9461,19
kimi-k2.5,200,1,2027.2286250256002,2169.1031251102686,True,get_weather,True,tool_calls,True,200,,9461,19
kimi-k2.5,200,2,2229.511708021164,2622.2758749499917,True,get_weather,True,tool_calls,True,200,,9461,25
kimi-k2.5,500,0,2751.9294172525406,3476.1431673541665,True,get_weather,True,tool_calls,True,200,,23424,29
kimi-k2.5,500,1,2230.998042039573,2962.730874773115,True,get_weather,True,tool_calls,True,200,,23424,29
kimi-k2.5,500,2,5107.52824973315,5601.727457717061,True,get_weather,True,tool_calls,True,200,,23424,25
kimi-k2.6,10,0,1986.986625008285,2247.668833937496,True,get_weather,True,tool_calls,True,200,,567,19
kimi-k2.6,10,1,1146.7030001804233,1405.7195419445634,True,get_weather,True,tool_calls,True,200,,567,19
kimi-k2.6,10,2,1216.0448753274977,1423.7766251899302,True,get_weather,True,tool_calls,True,200,,567,18
kimi-k2.6,30,0,1365.493499673903,1710.8979169279337,True,get_weather,True,tool_calls,True,200,,1634,19
kimi-k2.6,30,1,1361.1227083019912,1816.4757913909853,True,get_weather,True,tool_calls,True,200,,1634,25
kimi-k2.6,30,2,1298.269959166646,1702.3757919669151,True,get_weather,True,tool_calls,True,200,,1634,25
kimi-k2.6,44,0,1562.399832997471,1810.7880419120193,True,get_weather,True,tool_calls,True,200,,2278,19
kimi-k2.6,44,1,1565.6340001150966,1977.5530002079904,True,get_weather,True,tool_calls,True,200,,2278,25
kimi-k2.6,44,2,1577.0149580202997,1905.4455826990306,True,get_weather,True,tool_calls,True,200,,2278,19
kimi-k2.6,50,0,2526.763124857098,2902.31362497434,True,get_weather,True,tool_calls,True,200,,2527,25
kimi-k2.6,50,1,1692.777959164232,2217.6276249811053,True,get_weather,True,tool_calls,True,200,,2527,25
kimi-k2.6,50,2,1439.3659168854356,1845.2294575981796,True,get_weather,True,tool_calls,True,200,,2527,25
kimi-k2.6,100,0,2434.119542129338,2628.9280001074076,True,get_weather,True,tool_calls,True,200,,4790,25
kimi-k2.6,100,1,1591.6625000536442,2065.4458748176694,True,get_weather,True,tool_calls,True,200,,4790,25
kimi-k2.6,100,2,1393.5351655818522,1607.8764577396214,True,get_weather,True,tool_calls,True,200,,4790,19
kimi-k2.6,200,0,2074.4173750281334,2345.9046250209212,True,get_weather,True,tool_calls,True,200,,9461,19
kimi-k2.6,200,1,1850.7448751479387,2253.8724592886865,True,get_weather,True,tool_calls,True,200,,9461,25
kimi-k2.6,200,2,2208.9466666802764,2624.054624699056,True,get_weather,True,tool_calls,True,200,,9461,25
kimi-k2.6,500,0,3145.9837090224028,3679.9117093905807,True,get_weather,True,tool_calls,True,200,,23424,25
kimi-k2.6,500,1,2440.7394169829786,2966.8034580536187,True,get_weather,True,tool_calls,True,200,,23424,25
kimi-k2.6,500,2,2953.866207972169,3266.4410420693457,True,get_weather,True,tool_calls,True,200,,23424,19
4 changes: 4 additions & 0 deletions results/2026-05-28/kimi-summary.md
@@ -0,0 +1,4 @@
| Model | N=10 | N=30 | N=44 | N=50 | N=100 | N=200 | N=500 |
|---|---|---|---|---|---|---|---|
| kimi-k2.5 | 1321/1522ms (rel=100%) | 1331/1614ms (rel=100%) | 1572/1970ms (rel=100%) | 1543/1930ms (rel=100%) | 1547/1897ms (rel=100%) | 2065/2329ms (rel=100%) | 2752/3476ms (rel=100%) |
| kimi-k2.6 | 1216/1424ms (rel=100%) | 1361/1711ms (rel=100%) | 1566/1905ms (rel=100%) | 1693/2218ms (rel=100%) | 1592/2065ms (rel=100%) | 2074/2346ms (rel=100%) | 2954/3266ms (rel=100%) |