add kimi-k2.5 / k2.6 via autodl (production-routed kimi benchmark) by ymote · Pull Request #1 · octos-org/llm-benchmark

ymote · 2026-05-29T03:08:16Z

See commit message — kimi-k2.5/k2.6 measured via the same autodl/wisemodel relay used by octos production. Both hit 100% relevant-tool selection at every N from 10 to 500, including N=44. Tool count was not the cause of the production loop. Single-turn caveat documented in README.

Live production incident on octos mini3 (session 8w2ime, kimi-k2.5): a multi-turn slides workflow hit a 5-iteration loop on check_workspace_contract before the runtime broke it with "LOOP DETECTED". The runtime emitted `WARN high tool count may cause empty responses with some models`; tools=44 at the time. We attributed this to "44 tools is too many for kimi". This run measures whether that attribution holds.

Setup:

- Add `autodl` provider (https://www.autodl.art/api/v1): production relay used by the dspfac fleet for moonshot/kimi models.
- Add `moonshot` provider (https://api.moonshot.ai/v1): direct endpoint, lets future runs separate "model behaviour" from "relay behaviour" when both keys are available.
- Add `autodl`/`moonshot` to the CLI choices list.
- New sweep config configs/run-2026-05-28-kimi.yaml: N bracketing 44 to get a clean before/after signal: [10, 30, 44, 50, 100, 200, 500], 3 trials per (model, N), $5 cap.

Results (median TTFT / total ms, with relevant-tool selection rate):

| Model     | N=10           | N=44           | N=200          | N=500          |
|-----------|----------------|----------------|----------------|----------------|
| kimi-k2.5 | 1321/1522 100% | 1572/1970 100% | 2065/2329 100% | 2752/3476 100% |
| kimi-k2.6 | 1216/1424 100% | 1566/1905 100% | 2074/2346 100% | 2954/3266 100% |

Both kimi models hit 100% relevant-tool selection at every N. Tool count is not the cause of the production loop.

What this run does NOT measure (deliberately documented in the README):

- Multi-turn behaviour. The production session had messages=139, input_tokens=53k of conversational history. This benchmark is single-turn.
- Cumulative tool-result feedback. The failing turn had repeatedly seen the same tool's output earlier in history.
- autodl proxy buffering at very long context. N=500 here is only ~23k input tokens; the production failure was at ~53k.

If the loop recurs, the right next experiment is a multi-turn replay of the failing conversation shape, not more N-padding on single-turn prompts.
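The per-cell figures in the results table (median TTFT, median total latency, relevant-tool selection rate) could be derived from raw trials roughly as follows. This is a minimal sketch, not the benchmark's actual code: the record fields `ttft_ms`, `total_ms` and `tool_called`, the function `summarize_trials`, and the tool name `get_weather` are all hypothetical, and the sample values are illustrative rather than data from this run.

```python
from statistics import median


def summarize_trials(trials, relevant_tool):
    """Return (median TTFT ms, median total ms, relevant-tool selection rate).

    `trials` is a list of dicts with hypothetical keys:
    ttft_ms, total_ms, tool_called.
    """
    ttft = median(t["ttft_ms"] for t in trials)
    total = median(t["total_ms"] for t in trials)
    # A trial counts as a hit when the model called the one relevant tool.
    hits = sum(1 for t in trials if t["tool_called"] == relevant_tool)
    return ttft, total, hits / len(trials)


# Illustrative values only; NOT measurements from this PR.
example_trials = [
    {"ttft_ms": 1300, "total_ms": 1500, "tool_called": "get_weather"},
    {"ttft_ms": 1350, "total_ms": 1600, "tool_called": "get_weather"},
    {"ttft_ms": 1250, "total_ms": 1550, "tool_called": "get_weather"},
]
summary = summarize_trials(example_trials, "get_weather")
print(summary)
```

With 3 trials per (model, N), the median is simply the middle value, so a single slow request does not move the reported latency.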

Follow-up to the kimi-k2.5 sweep that ruled out tool count. Two more experiments to pin down the actual cause of the mini3 production loop:

1. scripts/termination_signal_repro.py: synthetic single-turn test. The user says "is the deck done?" after the assistant has just received a check_workspace_contract result. Vary whether the result has the new all_ready flag. Run against kimi-k2.5 and claude-opus-4.7 via OpenRouter, 5 trials each. Result: all 4 arms TEXT 5/5. The "missing termination signal" hypothesis is refuted.
2. scripts/termination_signal_replay.py: replay the actual failing conversation. Load 89 messages from the mini3 session JSONL up to the trigger user message ("是你的 manifest 文件生成问题，图片都存在", roughly "the problem is your manifest file generation; the images all exist"), then ask each model what it does next. Two arms per model: only check_workspace_contract available vs that tool + read_file + list_dir.

Result:

- A: kimi + check_only: LOOP=3/3
- B: kimi + check + read_file + list_dir: LOOP=2/3, OTHER=1/3
- C: opus + check_only: LOOP=3/3
- D: opus + check + read_file + list_dir: LOOP=0/3, OTHER=3/3

Findings:

- The loop IS reproducible at 100% when only check_workspace_contract is available, in both kimi and Opus. This is NOT a kimi degradation.
- Adding tools that CAN answer the user's question (read_file / list_dir for "is the manifest file wrong?") snaps Opus out in 3/3 trials and Kimi in 1/3. Kimi is stickier on its first tool choice.
- The real root cause is tool-question mismatch: the user asked about manifest content; the only directly relevant tool answers artifact presence. The LLM kept re-querying because no available tool could answer what was actually asked.
- The all_ready flag we added in octos PR #1360 is not the fix for this. The fix is tool-description discipline (document what each tool does NOT answer) and a tool-switch nudge after iter-2 of the same tool with the same args.

README updated with the full finding so future readers don't fall into the same chain of wrong hypotheses we did.
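The "tool-switch nudge after iter-2 of the same tool with same args" could be triggered by a check along these lines. This is a sketch under stated assumptions, not the octos runtime's implementation: the function `should_nudge`, the `(name, args)` call representation, and the `workspace` argument are hypothetical.

```python
import json


def should_nudge(tool_calls, threshold=2):
    """True when the last `threshold` calls hit the same tool with identical args.

    `tool_calls` is a list of (tool_name, args_dict) tuples, oldest first
    (a hypothetical representation of the turn's call history).
    """
    if len(tool_calls) < threshold:
        return False
    recent = tool_calls[-threshold:]
    # Serialize args with sorted keys so dict ordering does not matter.
    keys = {(name, json.dumps(args, sort_keys=True)) for name, args in recent}
    return len(keys) == 1


history = [
    ("check_workspace_contract", {"workspace": "slides"}),
    ("check_workspace_contract", {"workspace": "slides"}),
]
print(should_nudge(history))
```

Comparing arguments as well as the tool name keeps legitimate repeated calls (same tool, different inputs) from being flagged.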

After the first replay confirmed tool-question mismatch as root cause, update the check_workspace_contract description in the replay script to mirror the octos PR that tightens it (adds an explicit "DOES NOT answer: file content, manifest layout, image quality — use read_file / list_dir / view_image instead" clause). Re-run 5 trials per arm.

Comparison vs the v1 run (3 trials):

| arm                                | v1 (old desc)       | v2 (new desc)       |
|------------------------------------|---------------------|---------------------|
| A: kimi + check_only               | LOOP 3/3            | LOOP 5/5            |
| B: kimi + check + read_file + list | LOOP 2/3, OTHER 1/3 | LOOP 2/5, TEXT 3/5  |
| C: opus + check_only               | LOOP 3/3            | LOOP 2/5, OTHER 3/5 |
| D: opus + check + read_file + list | LOOP 0/3, OTHER 3/3 | LOOP 0/5, OTHER 5/5 |

Net effect of the description tightening:

- Arm A (no alternative tool): unchanged. A description alone can't conjure a tool that doesn't exist.
- Arm B (kimi w/ alternatives): loop rate down from 67% to 40%, and the non-looping trials shift from "calls a different tool" to "responds with text". Mixed; not as decisive as we'd want.
- Arm C (opus w/ only check): most informative result. With the tightened description, opus hallucinates a `list_dir` call in 3/5 trials even though only check_workspace_contract is in the schema. It's an invalid tool call, but the model is signalling "this tool can't answer the question, give me a different one". Worth noting that the description is doing its job at the comprehension layer even when the catalog is wrong.
- Arm D (opus w/ alternatives): already perfect, stays perfect.

Conclusion: description-tightening is a real improvement, not a magic bullet. The full fix on the octos side is description + ensuring the slides agent always has read_file/list_dir/view_image in catalog + possibly a kimi-specific system-prompt nudge ("if a tool returns the same answer twice, switch tools").
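The LOOP / TEXT / OTHER labels, and the arm C case where the model calls a tool that is not in the schema, can be separated with a small classifier. The label meanings here are inferred from how this commit message uses them, and `classify_trial` and its return shape are hypothetical; the replay script may score trials differently.

```python
LOOP_TOOL = "check_workspace_contract"


def classify_trial(tool_name, catalog):
    """Return (label, in_catalog) for one replayed turn.

    tool_name is None when the model answered with plain text.
    in_catalog is None for text replies, otherwise whether the called
    tool was actually offered in the schema for that arm.
    """
    if tool_name is None:
        return "TEXT", None
    label = "LOOP" if tool_name == LOOP_TOOL else "OTHER"
    return label, tool_name in catalog


check_only = ["check_workspace_contract"]
# Arm C shape: only check_workspace_contract offered, model asks for list_dir.
print(classify_trial("list_dir", check_only))
```

Tracking `in_catalog` separately from the label makes a hallucinated call (OTHER, not in catalog) distinguishable from a valid switch to an offered tool.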

Three more replay sets:

1. termination-repro-rerun: re-ran the synthetic single-turn test to confirm 5-trial repeatability (all 4 arms TEXT 5/5, no loops).
2. termination-replay-rerun: re-ran the full-history replay with the tightened tool description (from llm-benchmark PR #1 v2). Arm A stayed LOOP 5/5; arm C improved 2/5 -> 0/5 between runs (Opus increasingly trusts the new "DOES NOT answer" clause).
3. termination-replay-generic-prompt: same replay but with a generic "Tool use discipline" block added to SYSTEM_PROMPT. This is the block that landed in octos PR #1362.

Arm A summary across all runs:

| run                                          | arm A result |
|----------------------------------------------|--------------|
| v1 (3 trials, old desc, no block)            | LOOP 3/3     |
| v2 (5 trials, new desc, no block)            | LOOP 5/5     |
| rerun (5 trials, new desc, no block)         | LOOP 5/5     |
| generic-prompt (5 trials, new desc + block)  | LOOP 3/5     |

The block lowers the worst-case kimi loop rate from 100% to 60%. It adds ~100 prompt tokens. Strong models are unaffected (Opus already at 0/5 in C and D both before and after).
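The percentages quoted for arm A follow directly from the loop counts in the summary. A short sketch of the arithmetic, using the counts reported above; the names `arm_a_runs` and `loop_rate_percent` are only for illustration.

```python
# (loops, trials) per run for arm A, as reported in the summary above.
arm_a_runs = {
    "v1": (3, 3),
    "v2": (5, 5),
    "rerun": (5, 5),
    "generic-prompt": (3, 5),
}


def loop_rate_percent(loops, trials):
    """Loop rate as a whole-number percentage."""
    return round(100 * loops / trials)


rates = {name: loop_rate_percent(*counts) for name, counts in arm_a_runs.items()}
print(rates)
```

The same helper reproduces the arm B figures from the earlier comparison: 2/3 rounds to 67% and 2/5 is 40%.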

ymote added 4 commits May 28, 2026 20:07

ymote wants to merge 4 commits into main from feat/kimi-k2-via-autodl


1 participant
