fix(tokenizer): support Kimi K2/K2.5/K2.6 tiktoken models#1482

Open
CatherineSue wants to merge 4 commits into main from claude/kimi-k2-tiktoken-support

Conversation

CatherineSue (Member) commented May 13, 2026

Description

Problem

Kimi K2 / K2.5 / K2.6 ship tiktoken-style vocabs alongside their HF repos. Their tokenizer_config.json enumerates a handful of named special tokens (e.g. <|im_assistant|> at ID 163588), but does not enumerate every slot in the 256-slot reserved-special-token range that upstream tokenization_kimi.py synthesizes via:

{special_tokens_mapping.get(i, f"<|reserved_token_{i}|>"): i
 for i in range(num_base_tokens, num_base_tokens + 256)}

The generic TiktokenTokenizer only registered IDs present in tiktoken.model or added_tokens_decoder, so unnamed reserved IDs were unknown to CoreBPE. Decoding a sequence that contained one fell through into _decode_native_and_split, which indexes special_tokens_decoder[&token] directly — panic on the missing key.

There is also a pre-tokenizer mismatch: Kimi's regex begins with [\p{Han}]+ and uses character-class intersection ([X&&[^\p{Han}]]) so Han chars are split from the letter/contraction branches. The generic path used the cl100k regex, mangling Han-heavy inputs.
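For illustration, the class-intersection construct looks like this in a toy fancy-regex pattern (this is not Kimi's actual pat_str, just the syntax it relies on):

use fancy_regex::Regex;

fn main() {
    // Han runs match the first branch; other letters match the second,
    // which subtracts Han via character-class intersection.
    let re = Regex::new(r"[\p{Han}]+|[[\p{L}]&&[^\p{Han}]]+").unwrap();
    let chunks: Vec<&str> = re
        .find_iter("汉字hello")
        .map(|m| m.unwrap().as_str())
        .collect();
    assert_eq!(chunks, vec!["汉字", "hello"]);
}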

Fixes #1475.

Solution

Detect Kimi inside TiktokenTokenizer::load_from_path and specialize two things:

  1. Pattern — use the Han-aware regex (byte-identical to upstream pat_str, verified against K2-Thinking, K2.5, and K2.6 — all three tokenization_kimi.py files are identical on the BPE-relevant fields).
  2. Reserved slots — fill 256 entries at [len(mergeable_ranks), len(mergeable_ranks)+256) with <|reserved_token_{id}|>, skipping any IDs already occupied by explicit added_tokens_decoder entries; a sketch of this fill follows below.
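A minimal sketch of that fill, assuming a name-to-ID map for specials and u32 IDs (the real helper lives in kimi_k2_tokenizer.rs; its exact signature may differ):

use std::collections::{HashMap, HashSet};

/// Fill the 256-slot reserved range with synthetic names, preserving any
/// explicitly named tokens already present (e.g. <|im_assistant|>).
fn apply_reserved_special_tokens(special_tokens: &mut HashMap<String, u32>, base: u32) {
    let occupied: HashSet<u32> = special_tokens.values().copied().collect();
    for id in base..base + 256 {
        if !occupied.contains(&id) {
            // Mirrors upstream tokenization_kimi.py's synthetic naming.
            special_tokens.insert(format!("<|reserved_token_{id}|>"), id);
        }
    }
}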

Detection lives in a crate-private kimi_k2_tokenizer helper module and exposes only three items to tiktoken.rs: the regex constant, matches_dir, and apply_reserved_special_tokens. There is no separate public KimiK2Tokenizer type and no change to factory.rs — Kimi support is transparent at the call surface, so future tiktoken variants (LLaMA-3-tiktoken, etc.) can follow the same pattern without growing a dispatch table.

Detection matches all three Kimi variants:

  • Primary: tokenizer_config.json mentions tokenization_kimi (via auto_map, tokenizer_class, etc.).
  • Fallback: config.json::model_type is kimi_k2 or kimi_k25. K2-Thinking reports kimi_k2; K2.5 and K2.6 both report kimi_k25. The architecture string is intentionally not checked — K2-Thinking shares DeepseekV3ForCausalLM with vanilla DeepSeek and would false-positive. Both tiers are sketched below.
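A minimal sketch of the two-tier check, assuming serde_json; the function names (matches, value_mentions_kimi_tokenizer) come from the review threads below, everything else is illustrative:

use serde_json::Value;
use std::path::Path;

/// Tier 1: any string in tokenizer_config.json that names the
/// tokenization_kimi module as a dotted-path segment.
fn value_mentions_kimi_tokenizer(v: &Value) -> bool {
    match v {
        Value::String(s) => s.split('.').any(|seg| seg == "tokenization_kimi"),
        Value::Array(a) => a.iter().any(value_mentions_kimi_tokenizer),
        Value::Object(o) => o.values().any(value_mentions_kimi_tokenizer),
        _ => false,
    }
}

/// Tier 2: config.json's model_type; the architecture string is
/// deliberately ignored to avoid the DeepseekV3ForCausalLM false positive.
fn matches(tokenizer_config: Option<&Value>, dir: &Path) -> bool {
    if tokenizer_config.is_some_and(value_mentions_kimi_tokenizer) {
        return true;
    }
    std::fs::read_to_string(dir.join("config.json"))
        .ok()
        .and_then(|s| serde_json::from_str::<Value>(&s).ok())
        .and_then(|cfg| cfg.get("model_type").and_then(Value::as_str).map(str::to_owned))
        .is_some_and(|mt| mt == "kimi_k2" || mt == "kimi_k25")
}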

Separately, the generic tiktoken decode path now returns a clean Err on truly unknown token IDs instead of falling into the lossy-byte path that panics. The guard matches tiktoken-rs's DecodeKeyError Display message; the typed error lives in a private module (vendor_tiktoken) and isn't re-exported as of tiktoken-rs 0.9.1, so a typed downcast isn't available.
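A sketch of that guard (assuming an anyhow-style error, which Error::msg elsewhere in this PR suggests; the prefix string is tied to tiktoken-rs 0.9.1):

use anyhow::Error;

/// Best-effort match on the Display text of tiktoken-rs's private
/// DecodeKeyError; a typed downcast is unavailable as of 0.9.1.
fn is_unknown_tiktoken_decode_error(err: &Error) -> bool {
    err.to_string().starts_with("Invalid token for decoding:")
}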

Changes

  • crates/tokenizer/src/kimi_k2_tokenizer.rs (new, crate-private): KIMI_K2_PATTERN, matches_dir, apply_reserved_special_tokens, and helpers for the two-tier detection.
  • crates/tokenizer/src/tiktoken.rs: load_from_path now consults the Kimi module, picks the pattern, pre-fills reserved slots before constructing CoreBPE; from_encoder_and_config is now private. decode returns a structured error on unknown IDs instead of letting _decode_native_and_split panic.
  • crates/tokenizer/src/lib.rs: mod kimi_k2_tokenizer (was pub mod); the helper module has no public surface.

Test Plan

cargo test -p llm-tokenizer --lib — 132/132 pass, including five new Kimi tests:

  • reserved_special_tokens_are_synthesized — 256 slots filled, explicit names preserved, vocab_size correct, decode/encode of synthetic reserved IDs round-trips.
  • matches_via_model_type_kimi_k2 — model_type: "kimi_k2" (K2-Thinking) triggers the Kimi load path.
  • matches_via_model_type_kimi_k25 — model_type: "kimi_k25" (K2.5 and K2.6) triggers the Kimi load path.
  • uses_kimi_pattern_for_contractions — the Kimi regex matches hello's as one chunk (cl100k would split before 's).
  • han_input_round_trips_through_kimi_pattern — Han + ASCII mixed input encodes without panic and decodes byte-equal. Guards the fancy-regex character-class-intersection compatibility risk.

cargo clippy -p llm-tokenizer --lib --tests --no-deps clean.
cargo +nightly fmt -p llm-tokenizer -- --check clean.

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • Automatic detection and support for Kimi K2/K2.5/K2.6 tokenizers, including Han-aware tokenization and improved contraction handling.
    • Synthesized reserved special-token ranges to avoid ID conflicts.
  • Bug Fixes

    • Clearer user-facing error when decoding unknown token IDs.
    • More robust loading when tokenizer config file is absent.
  • Tests

    • Added tests for tokenizer detection, reserved-token synthesis, missing-config handling, and decode error behavior.

Review Change Stack

Kimi K2/K2.5/K2.6 ship tiktoken-style vocabs with a 256-slot reserved
special-token range starting at `len(mergeable_ranks)` and a Han-aware
pre-tokenizer regex. The generic tiktoken loader knew nothing about
either, so:

- IDs in the reserved range that `tokenizer_config.json` did not enumerate
  (e.g. anything other than `<|im_assistant|>` etc.) panicked on decode
  via tiktoken-rs's `_decode_native_and_split` lossy-fallback path.
- The cl100k regex was used for pre-tokenization, ignoring Kimi's
  `[\p{Han}]+` leading alternation.

Detection and specialization now live inside `TiktokenTokenizer::load_from_path`,
so `factory.rs` stays untouched and Kimi support is invisible at the type
level — there is no separate public `KimiK2Tokenizer`. The Kimi module is
crate-private and exposes only three items to its sibling: the regex
constant, a `matches_dir` detector, and a reserved-special-token fill helper.

Kimi detection matches all three known variants:
  - primary: `tokenizer_config.json` mentions `tokenization_kimi`
  - fallback: `config.json::model_type` is `kimi_k2` or `kimi_k25`
    (K2-Thinking reports `kimi_k2`; K2.5 and K2.6 both report `kimi_k25`)

A separate decode guard returns a clean error on truly unknown token IDs
instead of letting the lossy-decode fallback panic. The guard matches
tiktoken-rs's `DecodeKeyError` Display message; the typed error is not
re-exported by the crate as of 0.9.1.

Fixes #1475

Signed-off-by: Chang Su <[email protected]>
CatherineSue requested a review from slin1237 as a code owner May 13, 2026 01:38
github-actions Bot added the tokenizer (Tokenizer related changes) label May 13, 2026
coderabbitai Bot commented May 13, 2026

📝 Walkthrough

Adds Kimi K2/K2.5/K2.6 tokenization support with a Han-aware regex, synthesizes 256 reserved special tokens at the vocab boundary, refactors tiktoken loading to detect/apply Kimi patterns, and surfaces a clean error for unknown token IDs to avoid decode panics.

Changes

Kimi K2 tokenizer integration

  • Kimi K2 tokenizer detection & reserved-token synthesis (crates/tokenizer/src/kimi_k2_tokenizer.rs, crates/tokenizer/src/lib.rs): defines KIMI_K2_PATTERN, matches(tokenizer_config, dir) (searches tokenizer_config.json for tokenization_kimi or config.json::model_type for kimi_k2/kimi_k25), and apply_reserved_special_tokens, which reserves 256 IDs starting at base_vocab_size by synthesizing `<|reserved_token_{i}|>` names.
  • Refactored tiktoken loading with Kimi pattern & reserved-token handling (crates/tokenizer/src/tiktoken.rs): adds a directory-based config loader returning a parsed JSON value or TiktokenConfig::default() when missing, refactors TiktokenTokenizer::load_from_path to load config via the helper, detect Kimi directories, mutate config.added_tokens for Kimi models, and select KIMI_K2_PATTERN or CL100K_BASE_PATTERN. CoreBPE::new now uses the chosen pattern. Includes a test for missing tokenizer_config.json.
  • Decode error detection to prevent panic on unknown token IDs (crates/tokenizer/src/tiktoken.rs): Decoder::decode now recognizes tiktoken's unknown-token decode error and returns a clean Error::msg (via is_unknown_tiktoken_decode_error) instead of falling into a path that can panic. Adds test utilities (write_minimal_tiktoken_dir) and a test asserting the new error message for unknown token IDs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

  • lightseekorg/smg#439: Related tokenizer-config parsing and directory-based loading changes that this PR builds on.
  • lightseekorg/smg#769: Related changes to Decoder::decode error handling and the lossy-decode path.

Suggested labels

tests

Suggested reviewers

  • slin1237
  • key4ng

Poem

🐰 A Kimi K2 song I wrote in a hop,
Han characters dance and never stop,
Reserved tokens lined in a neat little row,
Decoding now gentle — no panic, just flow,
Hooray for the tokenizer — nibble and go!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title 'fix(tokenizer): support Kimi K2/K2.5/K2.6 tiktoken models' accurately and concisely describes the main change: adding support for Kimi K2 series tokenizers to the tokenizer module.
  • Linked Issues Check (✅ Passed): the PR meets all coding objectives from issue #1475: detects Kimi tokenizers via model_type or tokenization_kimi patterns, synthesizes 256 reserved-token slots with proper name preservation, and hardens decode error handling to prevent panics on unknown token IDs.
  • Out of Scope Changes Check (✅ Passed): all changes are directly scoped to supporting Kimi K2 tokenizers; the new kimi_k2_tokenizer module, the tiktoken.rs refactoring for Kimi detection and reserved-token synthesis, and the lib.rs module registration are all aligned with issue #1475 requirements.
  • Docstring Coverage (✅ Passed): docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



Comment thread crates/tokenizer/src/kimi_k2_tokenizer.rs

claude Bot left a comment

Clean, well-tested addition of Kimi K2/K2.5/K2.6 tokenizer support. The Kimi-specific logic is nicely isolated in its own module, the refactoring of load_from_path into from_encoder_and_config is a good seam for future model-specific specialization, and the unknown-token decode error fix prevents a panic in the lossy-decode fallback. One nit flagged inline about the K2.6 model_type fallback gap.

Summary: 0 🔴 Important · 1 🟡 Nit · 0 🟣 Pre-existing

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/tokenizer/src/kimi_k2_tokenizer.rs`:
- Around line 84-89: The current value_mentions_kimi_tokenizer function uses
s.contains("tokenization_kimi") which produces false positives; change the
string check in Value::String(s) to a stricter match (e.g., exact equality or a
word-boundary/regex match) so only true identifiers like "tokenization_kimi" or
token boundaries are accepted (preserve recursion for Value::Array and
Value::Object); update the check in value_mentions_kimi_tokenizer (the
Value::String arm) to use equality or a whole-word/regex test instead of
contains.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 104fa769-894c-4321-bea9-37e64f20a83b

📥 Commits

Reviewing files that changed from the base of the PR and between 76d9c18 and c2df5be.

📒 Files selected for processing (3)
  • crates/tokenizer/src/kimi_k2_tokenizer.rs
  • crates/tokenizer/src/lib.rs
  • crates/tokenizer/src/tiktoken.rs

Comment thread crates/tokenizer/src/kimi_k2_tokenizer.rs
…atches

`load_from_path` no longer delegates to a separate `from_encoder_and_config`
seam — that seam only existed to let the old external `KimiK2Tokenizer`
type build the same struct. With Kimi specialization living inside
`load_from_path` itself, the helper has no second caller and is folded back.

Revert the matching `pub(crate)` visibility on `CL100K_BASE_PATTERN`,
`Rank`, `TiktokenConfig` (and its fields), `load_tiktoken_config`,
`load_tiktoken_config_from_dir`, `load_tiktoken_bpe`, and
`find_tiktoken_file`. Nothing outside `tiktoken.rs` references any of
them now, so they go back to private. The original step-numbered comments
(1–5) in `load_from_path` are preserved.

No behavior change. `cargo test -p llm-tokenizer --lib` still 132/132;
clippy clean.

Signed-off-by: Chang Su <[email protected]>
gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces specialized support for Kimi-K2, K2.5, and K2.6 models by implementing a Han-aware tokenization regex and handling a 256-slot reserved special-token range. The TiktokenTokenizer was refactored to support these specializations through automatic directory detection and configuration loading, and error handling was improved for unknown token IDs during decoding. Review feedback suggested using checked_add to prevent potential overflows when generating reserved token IDs and recommended refactoring the configuration loading process to eliminate redundant I/O during model detection.
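As a quick illustration of the checked_add suggestion (a sketch; the crate's actual ID type and range construction are assumptions):

/// Overflow-safe reserved-ID range: None instead of silent wrapping
/// when base is within 256 of u32::MAX.
fn reserved_id_range(base: u32) -> Option<std::ops::Range<u32>> {
    Some(base..base.checked_add(256)?)
}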

Comment thread crates/tokenizer/src/kimi_k2_tokenizer.rs
Comment thread crates/tokenizer/src/tiktoken.rs Outdated
load_tiktoken_config(&config_path)?
let mut config = load_tiktoken_config_from_dir(dir)?;

let pattern = if kimi_k2_tokenizer::matches_dir(dir) {
gemini-code-assist Bot commented (severity: medium):

This call to matches_dir leads to redundant I/O because it re-reads tokenizer_config.json, which was already read and parsed by load_tiktoken_config_from_dir on line 234. Consider refactoring load_tiktoken_config_from_dir to return the raw serde_json::Value alongside the TiktokenConfig, and update matches_dir to accept an optional pre-parsed config value to avoid unnecessary disk access and improve maintainability.

References
  1. Extract duplicated logic into a shared helper function to improve maintainability and reduce redundancy.

CatherineSue (Member, Author) replied:
Fixed in 9c48829. load_tiktoken_config_from_dir now returns (TiktokenConfig, Option<Value>) and kimi_k2_tokenizer::matches takes the pre-parsed value, so tokenizer_config.json is read once. config.json fallback still does its own read (different file).
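A sketch of the single-read shape described in that reply (TiktokenConfig is stubbed here; the real struct, its fields, and the parsing are assumptions):

use std::path::Path;

use anyhow::Result;
use serde_json::Value;

/// Stand-in for the crate's real config struct (fields trimmed).
#[derive(Default)]
struct TiktokenConfig {
    added_tokens: Vec<(u32, String)>,
}

/// Parse tokenizer_config.json once, returning the typed config together
/// with the raw JSON so Kimi detection can reuse it without a second read.
/// Returns (default, None) when the file is absent.
fn load_tiktoken_config_from_dir(dir: &Path) -> Result<(TiktokenConfig, Option<Value>)> {
    let path = dir.join("tokenizer_config.json");
    if !path.exists() {
        return Ok((TiktokenConfig::default(), None));
    }
    let raw: Value = serde_json::from_str(&std::fs::read_to_string(&path)?)?;
    let mut config = TiktokenConfig::default();
    if let Some(decoder) = raw.get("added_tokens_decoder").and_then(Value::as_object) {
        for (id, entry) in decoder {
            if let (Ok(id), Some(content)) =
                (id.parse::<u32>(), entry.get("content").and_then(Value::as_str))
            {
                config.added_tokens.push((id, content.to_string()));
            }
        }
    }
    Ok((config, Some(raw)))
}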

coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/tokenizer/src/tiktoken.rs`:
- Around line 62-69: Add a unit test that verifies load_tiktoken_config_from_dir
returns TiktokenConfig::default() when the directory lacks a
tokenizer_config.json: create a temporary directory, do not write
tokenizer_config.json, call load_tiktoken_config_from_dir(temp_dir.path()), and
assert the Result is Ok and equals TiktokenConfig::default(). Place the test
near other tokenizer tests (e.g., in the same module/tests area) and name it
something like load_tiktoken_config_from_dir_missing_file_returns_default to
make the intent clear.
- Around line 518-520: The helper is_unknown_tiktoken_decode_error currently
relies on the tiktoken-rs error string prefix "Invalid token for decoding:"
which is an internal formatting detail; update the code to either add an
explicit INVARIANT comment above is_unknown_tiktoken_decode_error documenting
that this prefix is a relied-upon upstream contract, or add a regression test
that constructs the upstream decode error (via vendor_tiktoken::DecodeKeyError
or by triggering the decode path) and asserts the error message format to detect
upstream changes; reference the function name is_unknown_tiktoken_decode_error
when adding the comment or test so reviewers can easily find and verify the
stability assumption.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6dda397c-12f4-453f-8bf5-2c7b755aff7f

📥 Commits

Reviewing files that changed from the base of the PR and between c2df5be and 9eb42a1.

📒 Files selected for processing (1)
  • crates/tokenizer/src/tiktoken.rs

Comment thread crates/tokenizer/src/tiktoken.rs Outdated
Comment thread crates/tokenizer/src/tiktoken.rs
Address two unresolved review threads on PR 1482:

1. (coderabbit) `value_mentions_kimi_tokenizer` matched any string
   containing `tokenization_kimi` as a substring, which would
   misclassify identifiers like `tokenization_kimi_v2` or
   `my_tokenization_kimi_helper`. Replace the `.contains(...)` test
   with a dotted-segment match via `split('.').any(...)`, so only
   real Python module identifiers (`tokenization_kimi`,
   `tokenization_kimi.TikTokenTokenizer`,
   `pkg.tokenization_kimi.X`) match. Add a regression test.

2. (gemini) `load_from_path` read `tokenizer_config.json` twice —
   once via `load_tiktoken_config_from_dir`, then again inside
   `kimi_k2_tokenizer::matches_dir`. Collapse to a single read:
   `load_tiktoken_config_from_dir` now returns the parsed JSON
   value alongside the `TiktokenConfig`, and the Kimi entry point
   is renamed `matches(tokenizer_config: Option<&Value>, dir)`
   so callers pass the parsed value in. The `config.json`
   fallback still does its own read (different file).

`cargo test -p llm-tokenizer --lib` → 133/133 (was 132, gained the
false-positive guard). Clippy clean.

Signed-off-by: Chang Su <[email protected]>
coderabbitai Bot left a comment

♻️ Duplicate comments (1)
crates/tokenizer/src/tiktoken.rs (1)

511-520: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Clarify the stability assumption with an INVARIANT marker.

The comment claims the error message format is "stable" based on internal vendor_tiktoken code, but this isn't a public API contract. Per repository conventions, consider using an INVARIANT: marker to document this relied-upon assumption.

Suggested documentation improvement
 /// Detect tiktoken's "unknown token id" error so we can surface a clean error
 /// instead of letting the lossy-decode fallback panic on a missing key.
 ///
-/// We match on the `Display` string because tiktoken-rs's `DecodeKeyError` lives
-/// in a private `vendor_tiktoken` module and isn't re-exported (as of 0.9.1),
-/// so a typed `downcast_ref` is not available. The message format is stable —
-/// see `vendor_tiktoken::DecodeKeyError::fmt` upstream.
+/// We match on the `Display` string because tiktoken-rs's `DecodeKeyError` lives
+/// in a private `vendor_tiktoken` module and isn't re-exported (as of 0.9.1),
+/// so a typed `downcast_ref` is not available.
+///
+/// INVARIANT: The error prefix "Invalid token for decoding:" is assumed stable
+/// based on tiktoken-rs 0.9.1's `vendor_tiktoken::DecodeKeyError::fmt`. If the
+/// upstream format changes, this check will fail to match and the error will
+/// fall through to lossy decoding (which may panic on truly unknown IDs).
 fn is_unknown_tiktoken_decode_error(err: &Error) -> bool {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/tokenizer/src/tiktoken.rs` around lines 511 - 520, Update the doc
comment above is_unknown_tiktoken_decode_error to mark the stability assumption
explicitly with an INVARIANT: tag; keep the explanation that we match on Display
because vendor_tiktoken::DecodeKeyError is private, but add a line like
"INVARIANT: the upstream vendor_tiktoken::DecodeKeyError fmt string 'Invalid
token for decoding:' is relied upon here and must be updated if that
implementation changes" so future readers know this is an explicit contract tied
to the upstream implementation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 34abce70-ea5d-44f5-b69b-ac14c33fc48e

📥 Commits

Reviewing files that changed from the base of the PR and between 9eb42a1 and 9c48829.

📒 Files selected for processing (2)
  • crates/tokenizer/src/kimi_k2_tokenizer.rs
  • crates/tokenizer/src/tiktoken.rs

`load_tiktoken_config_from_dir` returns `(default, None)` when the file is
absent; every other test writes one, so the branch was uncovered.

Signed-off-by: Chang Su <[email protected]>
coderabbitai Bot left a comment

♻️ Duplicate comments (1)
crates/tokenizer/src/tiktoken.rs (1)

514-517: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't present the tiktoken error prefix as a stable contract.

This helper still depends on an implementation-detail Display string from tiktoken-rs. If that formatting changes in a dependency bump, unknown IDs will miss this guard and drop back into the panic-prone fallback path. Please either document this as an INVARIANT: and pin it with a regression test, or soften the comment so it reads as a best-effort heuristic rather than a guaranteed contract.

In tiktoken-rs 0.9.1, is `DecodeKeyError` publicly re-exported, and is the `Display` prefix `Invalid token for decoding:` documented as a stable/public API contract?

Based on learnings, use INVARIANT: to document assumptions in safe Rust code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/tokenizer/src/tiktoken.rs` around lines 514 - 517, The code relies on
the tiktoken-rs DecodeKeyError Display prefix ("Invalid token for decoding:") as
a stable contract; either document this as an explicit INVARIANT and add a
regression test that asserts the current Display string (via the helper that
matches the Display output or by constructing vendor_tiktoken::DecodeKeyError
and calling fmt/display) so CI will catch future formatting changes, or relax
the comment to state this is a best-effort heuristic (not a guarantee) and
remove wording that treats the Display value as a stable API; update the comment
around the matching logic and add the unit test that fails if the Display prefix
changes (or remove hard equality checks in the helper to fall back more safely).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4aec6834-3459-41f5-8834-04dd1729e9a5

📥 Commits

Reviewing files that changed from the base of the PR and between 9c48829 and 7000219.

📒 Files selected for processing (1)
  • crates/tokenizer/src/tiktoken.rs


Labels

tokenizer Tokenizer related changes


Development

Successfully merging this pull request may close these issues.

tiktoken decode panics on unmapped Kimi K2.5 reserved token IDs

1 participant