fix(tokenizer): support Kimi K2/K2.5/K2.6 tiktoken models #1482

CatherineSue wants to merge 4 commits into
Conversation
Kimi K2/K2.5/K2.6 ship tiktoken-style vocabs with a 256-slot reserved
special-token range starting at `len(mergeable_ranks)` and a Han-aware
pre-tokenizer regex. The generic tiktoken loader knew nothing about
either, so:
- IDs in the reserved range that `tokenizer_config.json` did not enumerate
(e.g. anything other than `<|im_assistant|>` etc.) panicked on decode
via tiktoken-rs's `_decode_native_and_split` lossy-fallback path.
- The cl100k regex was used for pre-tokenization, ignoring Kimi's
`[\p{Han}]+` leading alternation.
Detection and specialization now live inside `TiktokenTokenizer::load_from_path`,
so `factory.rs` stays untouched and Kimi support is invisible at the type
level — there is no separate public `KimiK2Tokenizer`. The Kimi module is
crate-private and exposes only three items to its sibling: the regex
constant, a `matches_dir` detector, and a reserved-special-token fill helper.
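As a rough illustration of what the fill helper does, here is a Python sketch mirroring upstream `tokenization_kimi.py` (the function name and signature are assumptions for illustration, not the crate's Rust API):

```python
def fill_reserved_special_tokens(special_tokens: dict[str, int],
                                 num_base_tokens: int) -> dict[str, int]:
    """Synthesize names for the 256 reserved slots starting at the BPE
    vocab boundary, keeping any IDs tokenizer_config.json already named."""
    occupied = set(special_tokens.values())
    filled = dict(special_tokens)
    for token_id in range(num_base_tokens, num_base_tokens + 256):
        if token_id not in occupied:
            filled[f"<|reserved_token_{token_id}|>"] = token_id
    return filled

# Example: one explicit name at 163588 survives; the other 255 slots
# in [163584, 163840) get synthesized <|reserved_token_N|> names.
specials = fill_reserved_special_tokens({"<|im_assistant|>": 163588}, 163584)
```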
Kimi detection matches all three known variants:
- primary: `tokenizer_config.json` mentions `tokenization_kimi`
- fallback: `config.json::model_type` is `kimi_k2` or `kimi_k25`
(K2-Thinking reports `kimi_k2`; K2.5 and K2.6 both report `kimi_k25`)
A separate decode guard returns a clean error on truly unknown token IDs
instead of letting the lossy-decode fallback panic. The guard matches
tiktoken-rs's `DecodeKeyError` Display message; the typed error is not
re-exported by the crate as of 0.9.1.
Fixes #1475
Signed-off-by: Chang Su <[email protected]>
📝 Walkthrough

Adds Kimi K2/K2.5/K2.6 tokenization support with a Han-aware regex, synthesizes 256 reserved special tokens at the vocab boundary, refactors tiktoken loading to detect/apply Kimi patterns, and surfaces a clean error for unknown token IDs to avoid decode panics.

Changes: Kimi K2 tokenizer integration
Clean, well-tested addition of Kimi K2/K2.5/K2.6 tokenizer support. The Kimi-specific logic is nicely isolated in its own module, the refactoring of load_from_path into from_encoder_and_config is a good seam for future model-specific specialization, and the unknown-token decode error fix prevents a panic in the lossy-decode fallback. One nit flagged inline about the K2.6 model_type fallback gap.
Summary: 0 🔴 Important · 1 🟡 Nit · 0 🟣 Pre-existing
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/tokenizer/src/kimi_k2_tokenizer.rs`:
- Around line 84-89: The current value_mentions_kimi_tokenizer function uses
s.contains("tokenization_kimi") which produces false positives; change the
string check in Value::String(s) to a stricter match (e.g., exact equality or a
word-boundary/regex match) so only true identifiers like "tokenization_kimi" or
token boundaries are accepted (preserve recursion for Value::Array and
Value::Object); update the check in value_mentions_kimi_tokenizer (the
Value::String arm) to use equality or a whole-word/regex test instead of
contains.
📒 Files selected for processing (3)

- crates/tokenizer/src/kimi_k2_tokenizer.rs
- crates/tokenizer/src/lib.rs
- crates/tokenizer/src/tiktoken.rs
…atches

`load_from_path` no longer delegates to a separate `from_encoder_and_config` seam; that seam only existed to let the old external `KimiK2Tokenizer` type build the same struct. With Kimi specialization living inside `load_from_path` itself, the helper has no second caller and is folded back.

Revert the matching `pub(crate)` visibility on `CL100K_BASE_PATTERN`, `Rank`, `TiktokenConfig` (and its fields), `load_tiktoken_config`, `load_tiktoken_config_from_dir`, `load_tiktoken_bpe`, and `find_tiktoken_file`. Nothing outside `tiktoken.rs` references any of them now, so they go back to private.

The original step-numbered comments (1-5) in `load_from_path` are preserved. No behavior change. `cargo test -p llm-tokenizer --lib` still 132/132; clippy clean.

Signed-off-by: Chang Su <[email protected]>
Code Review
This pull request introduces specialized support for Kimi-K2, K2.5, and K2.6 models by implementing a Han-aware tokenization regex and handling a 256-slot reserved special-token range. The TiktokenTokenizer was refactored to support these specializations through automatic directory detection and configuration loading, and error handling was improved for unknown token IDs during decoding. Review feedback suggested using checked_add to prevent potential overflows when generating reserved token IDs and recommended refactoring the configuration loading process to eliminate redundant I/O during model detection.
```rust
let mut config = load_tiktoken_config_from_dir(dir)?;  // was: load_tiktoken_config(&config_path)?
// ...
let pattern = if kimi_k2_tokenizer::matches_dir(dir) {
```
This call to matches_dir leads to redundant I/O because it re-reads tokenizer_config.json, which was already read and parsed by load_tiktoken_config_from_dir on line 234. Consider refactoring load_tiktoken_config_from_dir to return the raw serde_json::Value alongside the TiktokenConfig, and update matches_dir to accept an optional pre-parsed config value to avoid unnecessary disk access and improve maintainability.
References
- Extract duplicated logic into a shared helper function to improve maintainability and reduce redundancy.
Fixed in 9c48829. load_tiktoken_config_from_dir now returns (TiktokenConfig, Option<Value>) and kimi_k2_tokenizer::matches takes the pre-parsed value, so tokenizer_config.json is read once. config.json fallback still does its own read (different file).
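The single-read shape described here can be sketched in Python (hypothetical stand-ins for the Rust `(TiktokenConfig, Option<Value>)` return; config parsing is elided, and a plain dict stands in for both types):

```python
import json
import os

def load_tiktoken_config_from_dir(directory):
    """Read tokenizer_config.json once, returning (config, raw_value).

    raw_value is None when the file is absent, so detection can skip it
    without touching the disk again.
    """
    path = os.path.join(directory, "tokenizer_config.json")
    if not os.path.exists(path):
        return {}, None  # stand-in for (TiktokenConfig::default(), None)
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return raw, raw  # real code maps raw -> TiktokenConfig; elided here

def matches(tokenizer_config, directory):
    """Kimi entry point takes the pre-parsed value: no second read of
    tokenizer_config.json. The config.json fallback (a different file)
    is elided in this sketch."""
    if tokenizer_config is not None:
        return "tokenization_kimi" in json.dumps(tokenizer_config)
    return False
```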
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Inline comments:
In `@crates/tokenizer/src/tiktoken.rs`:
- Around line 62-69: Add a unit test that verifies load_tiktoken_config_from_dir
returns TiktokenConfig::default() when the directory lacks a
tokenizer_config.json: create a temporary directory, do not write
tokenizer_config.json, call load_tiktoken_config_from_dir(temp_dir.path()), and
assert the Result is Ok and equals TiktokenConfig::default(). Place the test
near other tokenizer tests (e.g., in the same module/tests area) and name it
something like load_tiktoken_config_from_dir_missing_file_returns_default to
make the intent clear.
- Around line 518-520: The helper is_unknown_tiktoken_decode_error currently
relies on the tiktoken-rs error string prefix "Invalid token for decoding:"
which is an internal formatting detail; update the code to either add an
explicit INVARIANT comment above is_unknown_tiktoken_decode_error documenting
that this prefix is a relied-upon upstream contract, or add a regression test
that constructs the upstream decode error (via vendor_tiktoken::DecodeKeyError
or by triggering the decode path) and asserts the error message format to detect
upstream changes; reference the function name is_unknown_tiktoken_decode_error
when adding the comment or test so reviewers can easily find and verify the
stability assumption.
📒 Files selected for processing (1)
crates/tokenizer/src/tiktoken.rs
Address two unresolved review threads on PR 1482:
1. (coderabbit) `value_mentions_kimi_tokenizer` matched any string
containing `tokenization_kimi` as a substring, which would
misclassify identifiers like `tokenization_kimi_v2` or
`my_tokenization_kimi_helper`. Replace the `.contains(...)` test
with a dotted-segment match via `split('.').any(...)`, so only
real Python module identifiers (`tokenization_kimi`,
`tokenization_kimi.TikTokenTokenizer`,
`pkg.tokenization_kimi.X`) match. Add a regression test.
2. (gemini) `load_from_path` read `tokenizer_config.json` twice —
once via `load_tiktoken_config_from_dir`, then again inside
`kimi_k2_tokenizer::matches_dir`. Collapse to a single read:
`load_tiktoken_config_from_dir` now returns the parsed JSON
value alongside the `TiktokenConfig`, and the Kimi entry point
is renamed `matches(tokenizer_config: Option<&Value>, dir)`
so callers pass the parsed value in. The `config.json`
fallback still does its own read (different file).
`cargo test -p llm-tokenizer --lib` → 133/133 (was 132, gained the
false-positive guard). Clippy clean.
Signed-off-by: Chang Su <[email protected]>
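The dotted-segment match from item 1 can be sketched like this (a Python stand-in for the Rust `split('.').any(...)` test; the function name is assumed):

```python
def mentions_kimi_module(value: str) -> bool:
    """Match only whole dotted segments, so identifiers such as
    'tokenization_kimi_v2' or 'my_tokenization_kimi_helper' no longer
    false-positive, while real Python module paths still match."""
    return any(segment == "tokenization_kimi" for segment in value.split("."))
```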
♻️ Duplicate comments (1)
crates/tokenizer/src/tiktoken.rs (1)
511-520: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Clarify the stability assumption with an INVARIANT marker.

The comment claims the error message format is "stable" based on internal `vendor_tiktoken` code, but this isn't a public API contract. Per repository conventions, consider using an `INVARIANT:` marker to document this relied-upon assumption.

Suggested documentation improvement:

```diff
 /// Detect tiktoken's "unknown token id" error so we can surface a clean error
 /// instead of letting the lossy-decode fallback panic on a missing key.
 ///
-/// We match on the `Display` string because tiktoken-rs's `DecodeKeyError` lives
-/// in a private `vendor_tiktoken` module and isn't re-exported (as of 0.9.1),
-/// so a typed `downcast_ref` is not available. The message format is stable —
-/// see `vendor_tiktoken::DecodeKeyError::fmt` upstream.
+/// We match on the `Display` string because tiktoken-rs's `DecodeKeyError` lives
+/// in a private `vendor_tiktoken` module and isn't re-exported (as of 0.9.1),
+/// so a typed `downcast_ref` is not available.
+///
+/// INVARIANT: The error prefix "Invalid token for decoding:" is assumed stable
+/// based on tiktoken-rs 0.9.1's `vendor_tiktoken::DecodeKeyError::fmt`. If the
+/// upstream format changes, this check will fail to match and the error will
+/// fall through to lossy decoding (which may panic on truly unknown IDs).
 fn is_unknown_tiktoken_decode_error(err: &Error) -> bool {
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/tokenizer/src/tiktoken.rs` around lines 511 - 520, Update the doc comment above is_unknown_tiktoken_decode_error to mark the stability assumption explicitly with an INVARIANT: tag; keep the explanation that we match on Display because vendor_tiktoken::DecodeKeyError is private, but add a line like "INVARIANT: the upstream vendor_tiktoken::DecodeKeyError fmt string 'Invalid token for decoding:' is relied upon here and must be updated if that implementation changes" so future readers know this is an explicit contract tied to the upstream implementation.
📒 Files selected for processing (2)

- crates/tokenizer/src/kimi_k2_tokenizer.rs
- crates/tokenizer/src/tiktoken.rs
`load_tiktoken_config_from_dir` returns `(default, None)` when the file is absent; every other test writes one, so the branch was uncovered. Signed-off-by: Chang Su <[email protected]>
♻️ Duplicate comments (1)
crates/tokenizer/src/tiktoken.rs (1)
514-517: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't present the tiktoken error prefix as a stable contract.

This helper still depends on an implementation-detail `Display` string from tiktoken-rs. If that formatting changes in a dependency bump, unknown IDs will miss this guard and drop back into the panic-prone fallback path. Please either document this as an `INVARIANT:` and pin it with a regression test, or soften the comment so it reads as a best-effort heuristic rather than a guaranteed contract.

In tiktoken-rs 0.9.1, is `DecodeKeyError` publicly re-exported, and is the `Display` prefix `Invalid token for decoding:` documented as a stable/public API contract?

Based on learnings, use `INVARIANT:` to document assumptions in safe Rust code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/tokenizer/src/tiktoken.rs` around lines 514 - 517, The code relies on the tiktoken-rs DecodeKeyError Display prefix ("Invalid token for decoding:") as a stable contract; either document this as an explicit INVARIANT and add a regression test that asserts the current Display string (via the helper that matches the Display output or by constructing vendor_tiktoken::DecodeKeyError and calling fmt/display) so CI will catch future formatting changes, or relax the comment to state this is a best-effort heuristic (not a guarantee) and remove wording that treats the Display value as a stable API; update the comment around the matching logic and add the unit test that fails if the Display prefix changes (or remove hard equality checks in the helper to fall back more safely).
📒 Files selected for processing (1)
crates/tokenizer/src/tiktoken.rs
Description
Problem
Kimi K2 / K2.5 / K2.6 ship tiktoken-style vocabs alongside their HF repos. Their `tokenizer_config.json` enumerates a handful of named special tokens (e.g. `<|im_assistant|>` at ID 163588), but does not enumerate every slot in the 256-slot reserved-special-token range that upstream `tokenization_kimi.py` synthesizes via:

```python
{special_tokens_mapping.get(i, f"<|reserved_token_{i}|>"): i
 for i in range(num_base_tokens, num_base_tokens + 256)}
```

The generic `TiktokenTokenizer` only registered IDs present in `tiktoken.model` or `added_tokens_decoder`, so unnamed reserved IDs were unknown to `CoreBPE`. Decoding a sequence that contained one fell through into `_decode_native_and_split`, which indexes `special_tokens_decoder[&token]` directly: panic on the missing key.

There is also a pre-tokenizer mismatch: Kimi's regex begins with `[\p{Han}]+` and uses character-class intersection (`[X&&[^\p{Han}]]`) so Han chars are split from the letter/contraction branches. The generic path used the cl100k regex, mangling Han-heavy inputs.

Fixes #1475.
Solution
Detect Kimi inside `TiktokenTokenizer::load_from_path` and specialize two things:

- Pre-tokenization: use Kimi's Han-aware regex as `pat_str` (verified against K2-Thinking, K2.5, and K2.6; all three `tokenization_kimi.py` files are identical on the BPE-relevant fields).
- Reserved specials: fill `[len(mergeable_ranks), len(mergeable_ranks)+256)` with `<|reserved_token_{id}|>`, skipping any IDs already occupied by explicit `added_tokens_decoder` entries.

Detection lives in a crate-private `kimi_k2_tokenizer` helper module and exposes only three items to `tiktoken.rs`: the regex constant, `matches_dir`, and `apply_reserved_special_tokens`. There is no separate public `KimiK2Tokenizer` type and no change to `factory.rs`; Kimi support is transparent at the call surface, so future tiktoken variants (LLaMA-3-tiktoken, etc.) can follow the same pattern without growing a dispatch table.

Detection matches all three Kimi variants:

- Primary: `tokenizer_config.json` mentions `tokenization_kimi` (via `auto_map`, `tokenizer_class`, etc.).
- Fallback: `config.json::model_type` ∈ `{kimi_k2, kimi_k25}`. K2-Thinking reports `kimi_k2`; K2.5 and K2.6 both report `kimi_k25`. The architecture string is intentionally not checked: K2-Thinking shares `DeepseekV3ForCausalLM` with vanilla DeepSeek and would false-positive.

Separately, the generic tiktoken decode path now returns a clean `Err` on truly unknown token IDs instead of falling into the lossy-byte path that panics. The guard matches tiktoken-rs's `DecodeKeyError` Display message; the typed error lives in a private module (`vendor_tiktoken`) and isn't re-exported as of tiktoken-rs 0.9.1, so a typed downcast isn't available.

Changes
- `crates/tokenizer/src/kimi_k2_tokenizer.rs` (new, crate-private): `KIMI_K2_PATTERN`, `matches_dir`, `apply_reserved_special_tokens`, and helpers for the two-tier detection.
- `crates/tokenizer/src/tiktoken.rs`: `load_from_path` now consults the Kimi module, picks the pattern, and pre-fills reserved slots before constructing `CoreBPE`; `from_encoder_and_config` is now private. `decode` returns a structured error on unknown IDs instead of letting `_decode_native_and_split` panic.
- `crates/tokenizer/src/lib.rs`: `mod kimi_k2_tokenizer` (was `pub mod`); the helper module has no public surface.

Test Plan
- `cargo test -p llm-tokenizer --lib`: 132/132 pass, including five new Kimi tests:
  - `reserved_special_tokens_are_synthesized`: 256 slots filled, explicit names preserved, vocab_size correct, decode/encode of synthetic reserved IDs round-trips.
  - `matches_via_model_type_kimi_k2`: `model_type: "kimi_k2"` (K2-Thinking) triggers the Kimi load path.
  - `matches_via_model_type_kimi_k25`: `model_type: "kimi_k25"` (K2.5 + K2.6) triggers the Kimi load path.
  - `uses_kimi_pattern_for_contractions`: the Kimi regex matches `hello's` as one chunk (cl100k would split before `'s`).
  - `han_input_round_trips_through_kimi_pattern`: Han + ASCII mixed input encodes without panic and decodes byte-equal. Guards the fancy-regex character-class-intersection compatibility risk.
- `cargo clippy -p llm-tokenizer --lib --tests --no-deps` clean.
- `cargo +nightly fmt -p llm-tokenizer -- --check` clean.

Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes