add kimi-k2.5 / k2.6 via autodl (production-routed kimi benchmark)#1
Open
ymote wants to merge 4 commits into
Live production incident on octos mini3 (session 8w2ime, kimi-k2.5):
multi-turn slides workflow hit a 5-iteration loop on
check_workspace_contract before the runtime broke it with
"LOOP DETECTED". The runtime emitted
WARN high tool count may cause empty responses with some models; tools=44
at the time. We attributed this to "44 tools is too many for kimi".
This run measures whether that attribution holds.
Setup:
- Add `autodl` provider (https://www.autodl.art/api/v1) — production
relay used by the dspfac fleet for moonshot/kimi models.
- Add `moonshot` provider (https://api.moonshot.ai/v1) — direct
endpoint, lets future runs separate "model behaviour" from "relay
behaviour" when both keys are available.
- Add `autodl`/`moonshot` to the CLI choices list.
- New sweep config configs/run-2026-05-28-kimi.yaml — N bracketing 44
to get clean before/after signal: [10, 30, 44, 50, 100, 200, 500],
3 trials per (model, N), $5 cap.
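As a rough illustration of the sweep size, the grid described above can be enumerated in a few lines. This is only a sketch: the model names, N values and trial count come from this commit message, while the variable and function names are hypothetical and do not reflect the actual layout of configs/run-2026-05-28-kimi.yaml.

```python
from itertools import product

# Values from the sweep description; names are illustrative only.
MODELS = ["kimi-k2.5", "kimi-k2.6"]
TOOL_COUNTS = [10, 30, 44, 50, 100, 200, 500]
TRIALS_PER_CELL = 3


def sweep_cells(models, tool_counts, trials):
    """Return every (model, n_tools, trial_index) combination in the sweep."""
    return [
        (model, n_tools, trial)
        for model, n_tools in product(models, tool_counts)
        for trial in range(trials)
    ]


cells = sweep_cells(MODELS, TOOL_COUNTS, TRIALS_PER_CELL)
# 2 models * 7 tool counts * 3 trials = 42 requests in total.
print(len(cells))
```

Keeping the grid this small is what makes a $5 cap plausible; adding another N value costs only six more requests.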
Results (median TTFT / total ms, with relevant-tool selection rate):
| Model | N=10 | N=44 | N=200 | N=500 |
|-----------|-------------|-------------|-------------|-------------|
| kimi-k2.5 | 1321/1522 100% | 1572/1970 100% | 2065/2329 100% | 2752/3476 100% |
| kimi-k2.6 | 1216/1424 100% | 1566/1905 100% | 2074/2346 100% | 2954/3266 100% |
Both kimi models hit 100% relevant-tool selection at every N. Tool count
is not the cause of the production loop.
What this run does NOT measure (deliberately documented in the README):
- Multi-turn behaviour. Production session had messages=139,
input_tokens=53k of conversational history. This benchmark is
single-turn.
- Cumulative tool-result feedback. The failing turn had repeatedly
seen the same tool's output earlier in history.
- autodl proxy buffering at very long context. N=500 here is only
~23k input tokens; production failure was at ~53k.
If the loop recurs, the right next experiment is a multi-turn replay
of the failing conversation shape, not more N-padding on single-turn
prompts.
Follow-up to the kimi-k2.5 sweep that ruled out tool count. Two more
experiments to pin down the actual cause of the mini3 production loop:
1. scripts/termination_signal_repro.py — synthetic single-turn test:
user says "is the deck done?" after the assistant has just received
a check_workspace_contract result. Vary whether the result has
the new all_ready flag. Run vs kimi-k2.5 and claude-opus-4.7 via
OpenRouter, 5 trials each. Result: all 4 arms TEXT 5/5. The
"missing termination signal" hypothesis is refuted.
2. scripts/termination_signal_replay.py — replay the actual failing
conversation: load 89 messages from the mini3 session JSONL up to
the trigger user message ("是你的 manifest 文件生成问题,图片都存在",
i.e. "the problem is in your manifest file generation; the images all
exist"), then ask each model what it does next. Two arms per model: only
check_workspace_contract available vs that tool + read_file +
list_dir. Result:
A: kimi + check_only LOOP=3/3
B: kimi + check + read_file + list_dir LOOP=2/3 OTHER=1/3
C: opus + check_only LOOP=3/3
D: opus + check + read_file + list_dir LOOP=0/3 OTHER=3/3
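The LOOP / OTHER / TEXT labels above can be produced by a small classifier over the tool calls in the model's next reply. The sketch below assumes LOOP means "called check_workspace_contract again", OTHER means "called only other tools", and TEXT means "replied without any tool call"; the function names and the sample input are hypothetical and are not taken from the replay script or from real logs.

```python
def classify_outcome(tool_calls, looped_tool="check_workspace_contract"):
    """Label one trial from the tool names in the model's next reply.

    tool_calls: list of tool names the model called; empty if it
    answered with plain text.
    """
    if not tool_calls:
        return "TEXT"
    if looped_tool in tool_calls:
        return "LOOP"
    return "OTHER"


def tally(outcomes):
    """Count how many trials fell into each label."""
    counts = {"LOOP": 0, "OTHER": 0, "TEXT": 0}
    for outcome in outcomes:
        counts[outcome] += 1
    return counts


# Illustrative input shaped like arm B (two loops, one switch to another tool).
sample_trials = [
    ["check_workspace_contract"],
    ["check_workspace_contract"],
    ["read_file"],
]
summary = tally(classify_outcome(calls) for calls in sample_trials)
print(summary)  # {'LOOP': 2, 'OTHER': 1, 'TEXT': 0}
```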
Findings:
- The loop IS reproducible at 100% when only check_workspace_contract
is available, in both kimi and Opus. This is NOT a kimi degradation.
- Adding tools that CAN answer the user's question (read_file /
list_dir for "is the manifest file wrong?") breaks the loop for Opus in
3/3 trials but for Kimi in only 1/3. Kimi is stickier on its first tool
choice.
- The real root cause is tool-question mismatch: the user asked about
manifest content; the only directly-relevant tool answers artifact
presence. The LLM kept re-querying because no available tool could
answer what was actually asked.
- The all_ready flag we added in octos PR #1360 is not the fix for
this. The fix is tool-description discipline (document what each
tool does NOT answer) and a tool-switch nudge after iter-2 of the
same tool with same args.
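One way the tool-switch nudge could be triggered is sketched below: compare the most recent tool calls and fire once the same tool has been called twice in a row with identical arguments. Everything here is an assumption for illustration (the function names, the threshold default, and the nudge wording); it is not the octos runtime's loop detector.

```python
import json

# Hypothetical nudge wording; the real text would live in the runtime.
NUDGE = (
    "You already called this tool with the same arguments and got the same "
    "answer. Use a different tool or reply to the user."
)


def call_key(name, args):
    """Canonical key for a tool call, independent of argument order."""
    return (name, json.dumps(args, sort_keys=True))


def needs_nudge(history, threshold=2):
    """Return True when the last `threshold` calls are the same tool + args.

    history: list of (tool_name, args_dict) in call order.
    """
    if len(history) < threshold:
        return False
    recent = {call_key(name, args) for name, args in history[-threshold:]}
    return len(recent) == 1


history = [
    ("check_workspace_contract", {"workspace": "deck", "strict": True}),
    ("check_workspace_contract", {"strict": True, "workspace": "deck"}),
]
if needs_nudge(history):
    print(NUDGE)
```

Serialising the arguments with sorted keys means two calls that differ only in key order still count as a repeat.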
README updated with the full finding so future readers don't fall
into the same chain of wrong hypotheses we did.
After the first replay confirmed tool-question mismatch as root cause,
update the check_workspace_contract description in the replay script
to mirror the octos PR that tightens it (adds an explicit
"DOES NOT answer: file content, manifest layout, image quality —
use read_file / list_dir / view_image instead" clause).
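For readers who have not seen the replay script, a tightened tool definition might look roughly like the following in the OpenAI-style function-calling format. The "DOES NOT answer" clause is adapted from the text above; the first sentence of the description, the empty parameter schema, and the constant name are assumptions made for this sketch.

```python
# Sketch of a tool definition with an explicit negative-scope clause.
CHECK_WORKSPACE_CONTRACT_TOOL = {
    "type": "function",
    "function": {
        "name": "check_workspace_contract",
        "description": (
            # Assumed wording for what the tool does answer.
            "Reports whether the expected workspace artifacts are present. "
            # Clause adapted from the tightened description.
            "DOES NOT answer: file content, manifest layout, image quality; "
            "use read_file / list_dir / view_image instead."
        ),
        # Assumed: no arguments. The real schema may differ.
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

description = CHECK_WORKSPACE_CONTRACT_TOOL["function"]["description"]
print(description)
```

Stating the negative scope inside the description keeps it in front of the model on every turn, regardless of how long the conversation history has grown.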
Re-run 5 trials per arm. Comparison vs the v1 run (3 trials):
| arm | v1 (old desc) | v2 (new desc) |
|--------------------------------------|-----------------------|-----------------------|
| A: kimi + check_only | LOOP 3/3 | LOOP 5/5 |
| B: kimi + check + read_file + list | LOOP 2/3, OTHER 1/3 | LOOP 2/5, TEXT 3/5 |
| C: opus + check_only | LOOP 3/3 | LOOP 2/5, OTHER 3/5 |
| D: opus + check + read_file + list | LOOP 0/3, OTHER 3/3 | LOOP 0/5, OTHER 5/5 |
Net effect of the description tightening:
- Arm A (no alternative tool) unchanged — description alone can't
conjure a tool that doesn't exist.
- Arm B (kimi w/ alternatives) — loop rate down from 67% to 40%, and
the non-looping trials shift from "calls a different tool" to
"responds with text". Mixed; not as decisive as we'd want.
- Arm C (opus w/ only check) — most informative result. With the
tightened description, opus hallucinates a `list_dir` call in 3/5
trials even though only check_workspace_contract is in the schema. It's
an invalid tool call, but the model is signalling "this tool can't
answer the question, give me a different one". The description is doing
its job at the comprehension layer even when the catalog lacks the
right tool.
- Arm D (opus w/ alternatives) — already perfect, stays perfect.
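The hallucinated `list_dir` calls in arm C can be separated from valid calls by checking each called name against the catalog that was actually sent. The helper below is a minimal sketch with hypothetical names; it does not reflect how the replay script or octos records invalid calls.

```python
def split_tool_calls(called_names, catalog_names):
    """Partition called tool names into (valid, hallucinated).

    A call is "hallucinated" when its name is absent from the catalog
    that was sent with the request.
    """
    catalog = set(catalog_names)
    valid = [name for name in called_names if name in catalog]
    hallucinated = [name for name in called_names if name not in catalog]
    return valid, hallucinated


# Arm C shape: only check_workspace_contract offered, model asks for list_dir.
valid, hallucinated = split_tool_calls(
    ["list_dir"], ["check_workspace_contract"]
)
print(valid, hallucinated)  # [] ['list_dir']
```

Logging the hallucinated names rather than discarding them gives a cheap signal about which tools the model expected to find in the catalog.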
Conclusion: description-tightening is a real improvement, not a magic
bullet. The full fix on the octos side is description + ensuring the
slides agent always has read_file/list_dir/view_image in catalog +
possibly a kimi-specific system-prompt nudge ("if a tool returns the
same answer twice, switch tools").
Three more replay sets:
1. termination-repro-rerun: re-ran the synthetic single-turn test to
confirm 5-trial repeatability (all 4 arms TEXT 5/5, no loops).
2. termination-replay-rerun: re-ran the full-history replay with the
tightened tool description (from llm-benchmark PR #1 v2). Arm A stayed
LOOP 5/5; arm C improved 2/5 -> 0/5 between runs (Opus increasingly
trusts the new "DOES NOT answer" clause).
3. termination-replay-generic-prompt: same replay but with a generic
"Tool use discipline" block added to SYSTEM_PROMPT. This is the block
that landed in octos PR #1362.
Arm A summary across all runs:
v1 (3 trials, old desc, no block) LOOP 3/3
v2 (5 trials, new desc, no block) LOOP 5/5
rerun (5 trials, new desc, no block) LOOP 5/5
generic-prompt (5 trials, new desc + block) LOOP 3/5
Block lowers worst-case kimi loop rate from 100% to 60%. Adds ~100
prompt tokens. Strong models unaffected (Opus already at 0/5 in C and D
both before and after).
See commit message — kimi-k2.5/k2.6 measured via the same autodl/wisemodel relay used by octos production. Both hit 100% relevant-tool selection at every N from 10 to 500, including N=44. Tool count was not the cause of the production loop. Single-turn caveat documented in README.