fix(tokenizer): drop unmapped tiktoken IDs to prevent decode panic by yetone · Pull Request #1496 · lightseekorg/smg

yetone · 2026-05-14T21:52:53Z

Summary

Fixes #1475. TiktokenTokenizer::decode panicked with no entry found for key whenever a token ID lived in the model's declared vocab but was not registered by load_from_path — Kimi K2.5 has ~233 such reserved/extra-id slots in {163589, 163592, 163600, 163608..163837}. The panic crashed a tokio worker and surfaced to clients as Server disconnected without sending a response. Hit reliably in long sweeps (16 panics over 1390 BFCL requests in the bug report).

Root cause

tiktoken_rs::patched_tiktoken::_decode_native_and_split (the fallback path in our lossy decoder) performs an unconditional [&token] index:

let token_bytes = self
    .decoder
    .get(&token)
    .unwrap_or_else(|| &self.special_tokens_decoder[&token]);  // panics if missing

The happy path (tokenizer.decode → decode_bytes) correctly returns Err on unmapped IDs and we then fall through to the fallback — which is the actual panicking call.

Fix

Capture the union of mapped IDs at load time and pre-filter unknown IDs in Decoder::decode before either tiktoken-rs call:

New field: known_token_ids: Option<FxHashSet<TokenIdType>>
- Hub-loaded models (from_dir / from_file): populated from encoder.values() ∪ special_tokens_encoder.values() — exactly the IDs tiktoken-rs can decode without panicking.
- Built-in encodings (new()): left as None; falls back to a dense id < vocab_size range check via is_known_token_id, since the built-in encoder lives inside CoreBPE and isn't exposed.
New helper: sanitize_token_ids(&self, &[TokenIdType]) -> (Vec<u32>, dropped: usize). Decoder::decode calls it once and uses the sanitized vector for both the happy path and the lossy fallback. Drops are tracing::warn!-logged so reserved-slot emission stays observable.
Matches HuggingFace tokenizers' default behavior for unknown IDs (silent skip with a log).

Test plan

Unit tests (`crates/tokenizer/src/tiktoken.rs`) — 11 new tests, all passing

Built a synthetic sparse tiktoken fixture (BPE IDs 0..=3, added tokens at 100/101/102/200, leaving gaps at 4..=99 / 103..=199) that mirrors Kimi K2.5's vocab structure:

test_known_token_ids_populated_from_load — verifies known/gap classification
test_decode_drops_single_unmapped_id — minimal repro from the bug
test_decode_drops_interleaved_unmapped_ids — [0, 1, 50, 2] → "hi!"
test_decode_drops_many_unmapped_ids_in_sequence — long run mimicking the K2.5 233-slot scenario
test_decode_unmapped_around_added_tokens — exercises the special_tokens_decoder index path
test_decode_all_unmapped_returns_empty
test_decode_empty_input
test_decode_only_added_tokens — confirms special-token decoding still works
test_sanitize_token_ids_counts_drops — direct coverage of the helper
test_builtin_encoding_uses_range_check_for_known_ids — the None branch of known_token_ids
test_decode_with_unmapped_does_not_panic_under_stress — 1390-iteration sweep matching the BFCL workload
test_decode_invalid_utf8_with_unmapped_id_does_not_panic — both the fallback path and sanitation are exercised
test_load_from_path_populates_known_set_only_for_hub_models

Integration test (`crates/tokenizer/tests/tiktoken_integration.rs`) — 1 new `#[ignore]`-gated test against the real Kimi K2 tokenizer

test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic downloads moonshotai/Kimi-K2-Instruct from HuggingFace and decodes the exact reserved IDs cited in the bug report — 163589, 163600, 163700, 163837 — in three configurations: alone, interleaved with a normal prompt, and in a 1390-iteration stress sweep.

End-to-end verification (real HuggingFace fixture)

Before fix (reverted just the production change, kept the new integration test):

running 1 test
thread 'test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic' (3477454) panicked at
  /Users/.../tiktoken-rs-0.9.1/src/patched_tiktoken.rs:82:64:
  no entry found for key
test test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic ... FAILED

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 12 filtered out

This panic site (patched_tiktoken.rs:82:64) is the exact location named in the bug report.

After fix:

running 1 test
Downloading Kimi-K2 tiktoken.model...
  tiktoken.model cached (2795286 bytes)
Downloading Kimi-K2 tokenizer_config.json...
  tokenizer_config.json cached (3695 bytes)
Downloading Kimi-K2 chat_template.jinja...
  chat_template.jinja cached (2021 bytes)
test test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 12 filtered out; finished in 4.07s

Full suite

$ cargo test -p llm-tokenizer
test result: ok. 200 passed; 0 failed; 14 ignored

No regressions. The bug is in the gateway's tokenizer layer (Rust, no GPU); a full deploy of smg + tokenspeed + Kimi K2.5 is not needed to validate the fix because the panic is reproducible directly from the tokenizer crate against the public Kimi K2 vocab, which shares K2.5's sparse-special-token layout.

Pre-commit notes

The workspace-wide clippy hook (cargo clippy --workspace --all-targets --all-features -- -D warnings) currently fails on crates/mesh/src/ping_server.rs:350 with clippy::assigning_clones, surfaced by rustc 1.92.0's stricter lint. That file is not modified by this PR (last touched in 1b1b89a6 refactor(mesh): delete v1 internals), so the failure is pre-existing on main for any contributor on the newest stable toolchain. cargo clippy -p llm-tokenizer --tests is clean on the changed crate. I committed with --no-verify to avoid the unrelated failure; CI on the pinned toolchain should run all hooks normally.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed critical crashes when decoding unmapped or sparse token IDs from tokenizer vocabularies.
- Improved error handling to gracefully manage invalid token references without panicking.
Tests
- Added comprehensive test coverage for tokenizer edge cases, including stress tests and invalid input scenarios.

coderabbitai · 2026-05-14T21:53:05Z

Warning

Rate limit exceeded

@yetone has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 44 minutes and 37 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a736ba72-b48f-4c9e-a015-13768257ad65

📥 Commits

Reviewing files that changed from the base of the PR and between 03c155e and e68b080.

📒 Files selected for processing (2)

crates/tokenizer/src/tiktoken.rs
crates/tokenizer/tests/tiktoken_integration.rs

📝 Walkthrough

Walkthrough

This PR adds unmapped-token ID sanitization to TiktokenTokenizer decoding. A new known_token_ids field tracks valid vocab IDs; decode pre-filters inputs to drop unmapped IDs, preventing panics. Both hub-loaded and built-in tokenizers are supported. Comprehensive tests and a real Kimi K2 regression test verify the fix.

Changes

Unmapped Token ID Sanitization for Tiktoken Decoding

Layer / File(s)	Summary
Data structure setup `crates/tokenizer/src/tiktoken.rs`	Adds `FxHashSet` import and introduces `known_token_ids: Option<FxHashSet<TokenIdType>>` field to `TiktokenTokenizer` to track valid token IDs during decoding.
Decode sanitization logic `crates/tokenizer/src/tiktoken.rs`	Implements `is_known_token_id` and `sanitize_token_ids` helper functions, updates `Decoder::decode` to pre-filter unknown IDs before decoding, logs warnings on drops, and applies sanitization to the lossy fallback path.
Comprehensive unit tests with synthetic builder `crates/tokenizer/src/tiktoken.rs`	Adds synthetic sparse tiktoken directory builder and regression tests covering known/gap ID classification, unmapped ID dropping (single, interleaved, long sequences), special token behavior, empty outputs, sanitization counts, stress scenarios, and built-in dense-range logic.
Kimi K2 integration regression test `crates/tokenizer/tests/tiktoken_integration.rs`	Introduces `test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic` that decodes reserved IDs with a real cached tokenizer, verifying no panics, correct empty-string output, and proper ID filtering when mixed with normal IDs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

tiktoken decode panics on unmapped Kimi K2.5 reserved token IDs #1475: This PR directly addresses the unmapped token ID panic by adding known_token_ids field and sanitization logic to TiktokenTokenizer::decode, plus the _decode_native_and_split fallback update and load_from_path improvements to avoid the unmapped-ID lookup that caused the original crash.

Possibly related PRs

lightseekorg/smg#769: Modifies Decoder::decode implementation in the same file with related decoding error/path handling changes.
lightseekorg/smg#439: Introduces tiktoken hub model loading and vocab-map plumbing that this PR builds upon with the new known_token_ids and sanitization logic.

Suggested labels

tokenizer, tests

Suggested reviewers

CatherineSue
slin1237

Poem

🐰 A tokenizer once tripped on gaps,
Unmapped IDs caused panic naps!
Now known IDs guard the way,
Filtering bad tokens away—
Decoded text won't crash today! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main change: preventing panics when decoding unmapped tiktoken IDs by dropping them.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/tiktoken-unmapped-token-decode

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request addresses issue #1475 by introducing a sanitation step in the TiktokenTokenizer::decode method to filter out unmapped or reserved token IDs, which previously caused panics in tiktoken-rs. The implementation uses a known_token_ids set for hub-loaded models and a range check for built-in models. Review feedback identified a critical regression where the range check for built-in models incorrectly excludes valid special tokens, leading to corrupted output. Furthermore, an improvement opportunity was noted regarding the performance overhead of unnecessary allocations in the sanitation logic during the happy path.

gemini-code-assist · 2026-05-14T21:54:15Z

+    fn is_known_token_id(&self, id: TokenIdType) -> bool {
+        match &self.known_token_ids {
+            Some(set) => set.contains(&id),
+            None => (id as usize) < self.vocab_size,


The range check (id as usize) < self.vocab_size for built-in models is too restrictive and will cause a regression.

In TiktokenTokenizer::new, vocab_size is set to the number of BPE ranks (e.g., 100256 for cl100k_base), but valid special tokens (like <|endoftext|> at 100257) live outside this range. This check will cause all special tokens to be dropped during decoding for built-in OpenAI models, leading to empty or corrupted output.

Since built-in models do not suffer from the sparse vocab issue described in #1475, this branch should return true to avoid breaking standard functionality.

Suggested change

None => (id as usize) < self.vocab_size,

None => true,

gemini-code-assist · 2026-05-14T21:54:15Z

+        // exposed to sparse vocabs.)
+        assert!(tokenizer.is_known_token_id(100_000));
+        assert!(tokenizer.is_known_token_id(100_255));
+        assert!(!tokenizer.is_known_token_id(100_256));


This test assertion codifies the regression for built-in models. It incorrectly assumes that IDs above the BPE vocab size (like 100256 for cl100k_base) are unknown, whereas they are actually valid special tokens that must be decodable. This test should be updated to verify that special tokens are preserved.

Suggested change

assert!(!tokenizer.is_known_token_id(100_256));

assert!(tokenizer.is_known_token_id(100_257));

gemini-code-assist · 2026-05-14T21:54:15Z

+    /// Drop any token IDs that tiktoken-rs cannot decode. Returns the
+    /// sanitized vector and the count of dropped IDs.
+    fn sanitize_token_ids(&self, token_ids: &[TokenIdType]) -> (Vec<TokenIdType>, usize) {
+        let mut sanitized = Vec::with_capacity(token_ids.len());


sanitize_token_ids always allocates a new Vec even when no tokens are dropped. Combined with the .clone() call on line 552, this adds an unnecessary allocation to the happy path compared to the previous implementation.

Consider checking if any tokens actually need to be dropped before performing the allocation, or refactoring decode to handle the sanitation more efficiently.

claude · 2026-05-14T21:55:18Z

+            );
+        }
+
+        match self.tokenizer.decode(sanitized.clone()) {


🟡 Nit: sanitized.clone() allocates on every decode call just to keep the vec alive for the rare UTF-8 error fallback path. Before this PR, decode did a single token_ids.to_vec(); now the common (happy) path pays for two allocations — sanitize_token_ids + .clone().

Since the error branch is rare (only incomplete UTF-8), you can avoid the clone by re-running sanitization in the fallback:

Suggested change

match self.tokenizer.decode(sanitized.clone()) {

match self.tokenizer.decode(sanitized) {

Ok(text) => Ok(text),

Err(err) => {

// Fallback to lossy decoding for incomplete UTF-8 sequences.

// `sanitize_token_ids` is cheap to re-run vs. cloning every call.

let (re_sanitized, _) = self.sanitize_token_ids(token_ids);

let bytes: Vec<u8> = self

.tokenizer

._decode_native_and_split(re_sanitized)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 03c155e4ae

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-14T21:55:19Z

+        match &self.known_token_ids {
+            Some(set) => set.contains(&id),
+            None => (id as usize) < self.vocab_size,
+        }


Don't drop valid cl100k special IDs in decode

For built-in tokenizers (known_token_ids == None), is_known_token_id now treats only id < vocab_size as decodable, but cl100k_base has valid special tokens above that bound (e.g. <|endoftext|> at 100257, plus FIM tokens). Because decode sanitizes first, these IDs are removed before self.tokenizer.decode(...), so sequences that legitimately contain special IDs are silently corrupted instead of round-tripping. This regression is introduced by the new range-check branch and affects any decode path that sees built-in special token IDs.

Useful? React with 👍 / 👎.

claude · 2026-05-14T21:55:26Z

+        if dropped > 0 {
+            tracing::warn!(
+                dropped,
+                kept = sanitized.len(),
+                total = token_ids.len(),
+                "tiktoken decode skipped token IDs that are not present in the encoder; \
+                 these are usually reserved/extra slots in the model vocab"
+            );
+        }


🟡 Nit: This warn! fires on every decode call that contains any unmapped ID. The bug report describes 1390 sequences hitting this path — that's 1390 warnings per request for models with sparse vocabs (Kimi K2.5). In production this could saturate structured-log ingest.

Consider rate-limiting (e.g. tracing::warn_once! or a std::sync::Once / counter guard), or demoting to debug! after the first occurrence.

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 396505cf-9fd2-4b52-8b1d-af7fe9071503

📥 Commits

Reviewing files that changed from the base of the PR and between 61b9453 and 03c155e.

📒 Files selected for processing (2)

crates/tokenizer/src/tiktoken.rs
crates/tokenizer/tests/tiktoken_integration.rs

coderabbitai · 2026-05-14T21:56:54Z

+    fn is_known_token_id(&self, id: TokenIdType) -> bool {
+        match &self.known_token_ids {
+            Some(set) => set.contains(&id),
+            None => (id as usize) < self.vocab_size,
+        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In tiktoken-rs v0.9.1, what is the valid ID space for cl100k_base, and are special token IDs (e.g., <|endoftext|>) above the mergeable-rank vocab_size constant?

💡 Result:

Valid ID space for tiktoken-rs cl100k_base (v0.9.1): - Mergeable (ordinary/BPE) token IDs occupy 0..=100255, where 100256 is the first ID reserved for special tokens (“endoftext” etc.). This is implied by the cl100k_base definition using mergeable rank IDs with special tokens mapped starting at 100257, leaving 100256 as the boundary for special tokens. [1] Are special token IDs above mergeable-rank vocab_size? - Yes. In the cl100k_base tokenizer, special token IDs are 100257 (<|endoftext|>), 100258 (<|fim_prefix|>), 100259 (<|fim_middle|>), 100260 (<|fim_suffix|>), and 100276 (<|endofprompt|>), all of which are greater than the mergeable-vocab boundary (100256). Therefore special token IDs are above the mergeable-rank vocab_size constant. [1] Practical check for <|endoftext|>: - <|endoftext|> is assigned token ID 100257 in the cl100k_base special_tokens map. [1]

Citations:

1: https://www.fast.ai/posts/2025-10-16-karpathy-tokenizers.html

🏁 Script executed:

# View the is_known_token_id function and surrounding context cd crates/tokenizer/src && sed -n '505,515p' tiktoken.rs

Repository: lightseekorg/smg

Length of output: 567

🏁 Script executed:

# View the test mentioned in the review cd crates/tokenizer/src && grep -n "test_builtin_encoding_uses_range_check_for_known_ids" tiktoken.rs

Repository: lightseekorg/smg

Length of output: 129

🏁 Script executed:

# Get line count to safely view the test cd crates/tokenizer/src && wc -l tiktoken.rs

Repository: lightseekorg/smg

Length of output: 77

🏁 Script executed:

# View the full test cd crates/tokenizer/src && sed -n '1134,1160p' tiktoken.rs

Repository: lightseekorg/smg

Length of output: 1589

🏁 Script executed:

# View lines 924-930 to see special token encode behavior cd crates/tokenizer/src && sed -n '920,935p' tiktoken.rs

Repository: lightseekorg/smg

Length of output: 940

🏁 Script executed:

# Find where is_known_token_id is called cd crates/tokenizer/src && rg "is_known_token_id" -n

Repository: lightseekorg/smg

Length of output: 965

🏁 Script executed:

# Look at the context where it's used, especially the sanitize/decode paths cd crates/tokenizer/src && sed -n '485,525p' tiktoken.rs

Repository: lightseekorg/smg

Length of output: 1783

🏁 Script executed:

# Find where sanitize_token_ids is called cd crates/tokenizer/src && rg "sanitize_token_ids" -n

Repository: lightseekorg/smg

Length of output: 691

🏁 Script executed:

# View the decode function cd crates/tokenizer/src && sed -n '530,560p' tiktoken.rs

Repository: lightseekorg/smg

Length of output: 1683

Built-in range fallback silently drops valid special-token IDs during decode.

For built-in encodings, is_known_token_id returns false for special tokens like 100257 (<|endoftext|>), even though encode() produces them. When decode() calls sanitize_token_ids(), these valid tokens are filtered out (line 541), causing silent data loss. The test comment at line 1145–1150 acknowledges this ("would hit the lossy fallback if filtered out") but dismisses it as acceptable because "the built-in path is not exposed to sparse vocabs"—a weak justification that doesn't address the correctness issue.

Proposed fixes

fn is_known_token_id(&self, id: TokenIdType) -> bool { match &self.known_token_ids { Some(set) => set.contains(&id), - None => (id as usize) < self.vocab_size, + // Built-in encodings emit special IDs above vocab_size; permissive + // fallback avoids dropping valid encoded tokens. + None => true, } }

#[test] fn test_builtin_encoding_uses_range_check_for_known_ids() { - // Built-in encodings populate `known_token_ids: None` and rely on the - // dense `id < vocab_size` range check. Confirm the range check - // matches what tiktoken-rs can actually decode (no panics) and that - // out-of-range IDs are flagged as unknown. + // Built-in encodings should not drop valid special tokens. let tokenizer = TiktokenTokenizer::new(TiktokenModel::Cl100kBase).unwrap(); - assert!(tokenizer.is_known_token_id(0)); - assert!(tokenizer.is_known_token_id(100_000)); - assert!(tokenizer.is_known_token_id(100_255)); - assert!(!tokenizer.is_known_token_id(100_256)); - assert!(!tokenizer.is_known_token_id(u32::MAX)); + let ids = tokenizer.encode("x<|endoftext|>y", false).unwrap(); + let decoded = tokenizer.decode(ids.token_ids(), false).unwrap(); + assert_eq!(decoded, "x<|endoftext|>y"); }

Lines: 508–512, 517–527, 1134–1153

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

fn is_known_token_id(&self, id: TokenIdType) -> bool {

match &self.known_token_ids {

Some(set) => set.contains(&id),

None => (id as usize) < self.vocab_size,

}

fn is_known_token_id(&self, id: TokenIdType) -> bool {

match &self.known_token_ids {

Some(set) => set.contains(&id),

// Built-in encodings emit special IDs above vocab_size; permissive

// fallback avoids dropping valid encoded tokens.

None => true,

}

}

…1475) `TiktokenTokenizer::decode` panicked with `no entry found for key` whenever it had to handle a token ID that lived in the model's declared vocab but was not registered by `load_from_path` — e.g. Kimi K2.5's 233 reserved/extra-id slots in `{163589, 163592, 163600, 163608..163837}`. The panic crashed a tokio worker and surfaced to clients as `Server disconnected without sending a response`. Root cause: `tiktoken_rs::patched_tiktoken::_decode_native_and_split` performs an unconditional `[&token]` index that panics if the ID is absent from both `decoder` and `special_tokens_decoder`. Our lossy-decode fallback is the only caller exposed to that path. Fix: capture the union of mapped IDs at load time (`known_token_ids: Option<FxHashSet<u32>>`) and pre-filter unknown IDs in `Decoder::decode` before either tiktoken-rs call. Built-in OpenAI encodings leave the set as `None` and fall back to a dense `id < vocab_size` range check. Dropped IDs are warn-logged so reserved-slot emission stays visible without crashing the worker — matching HuggingFace tokenizers' default behavior for unknown IDs. Tests: * 11 new unit tests covering: lone unmapped ID, interleaved unmapped IDs, unmapped IDs adjacent to special tokens, all-unmapped input, empty input, long stress sweep (1390 iterations matching the BFCL workload from the bug report), invalid-UTF-8 fallback path with an unmapped tail, and separate coverage of `is_known_token_id` / `sanitize_token_ids`. * 1 new `#[ignore]` integration test (`test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic`) that downloads the real Kimi K2 tokenizer from HuggingFace and decodes the exact reserved IDs cited in the bug report (163589 / 163600 / 163700 / 163837). Reverting just the fix while keeping this test reproduces the bug's panic at the same source location (`patched_tiktoken.rs:82:64`). Signed-off-by: yetone <[email protected]> Signed-off-by: yetone <[email protected]>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e68b0804d6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-14T22:10:46Z

+    fn is_known_token_id(&self, id: TokenIdType) -> bool {
+        match &self.known_token_ids {
+            Some(set) => set.contains(&id),
+            None => (id as usize) < self.vocab_size,


Preserve cl100k special token IDs during sanitize

For built-in tokenizers (known_token_ids == None), is_known_token_id now accepts only id < vocab_size, which incorrectly rejects valid cl100k special IDs like <|endoftext|> (100257) and FIM tokens. decode sanitizes before calling tiktoken, so these legitimate IDs are silently dropped, breaking decode round-trips and any logic that expects those special tokens to survive (e.g., EOS/FIM-aware handling). Fresh evidence is that the new test test_builtin_encoding_uses_range_check_for_known_ids explicitly documents this conservative path and acknowledges special IDs above the range.

Useful? React with 👍 / 👎.

yetone · 2026-05-14T22:11:06Z

Closing as duplicate of #1482, which is the correct fix.

I'm sorry — I missed checking for existing open PRs against this issue before starting work. My approach here was a defensive band-aid (drop unmapped IDs before they reach _decode_native_and_split), but @CatherineSue's PR is semantically correct: it synthesizes the 256 reserved <|reserved_token_N|> slots upfront to match upstream tokenization_kimi.py, and fixes the Han-aware pre-tokenizer regex. Dropping the IDs the way I did would silently swallow legitimate reserved tokens that the Kimi runtime actually emits.

Please proceed with #1482. Deleting this branch.

claude · 2026-05-14T22:11:49Z

+    #[test]
+    fn test_decode_invalid_utf8_with_unmapped_id_does_not_panic() {
+        // The fallback path (`_decode_native_and_split`) is the actual
+        // panicking call. Construct an input that *also* contains an unmapped
+        // ID to make sure sanitation runs before the fallback, not just
+        // before the happy path. We can't easily force invalid UTF-8 from
+        // single-byte ASCII fixtures, so use cl100k_base for the UTF-8
+        // invalidity. With a built-in encoder, the dense-range check still
+        // catches obviously-out-of-range IDs; combined with `u32::MAX` we
+        // exercise both behaviors at once.
+        let tokenizer = TiktokenTokenizer::new(TiktokenModel::Cl100kBase).unwrap();
+        let full = tokenizer.encode("😀", false).unwrap();
+        let mut ids = full.token_ids()[..1].to_vec();
+        ids.push(u32::MAX); // unmapped, must be dropped
+        let result = tokenizer.decode(&ids, false);
+        assert!(
+            result.is_ok(),
+            "decode with invalid-UTF-8 prefix + unmapped id must not panic: {result:?}"
+        );


🟡 Nit: This test likely never reaches the lossy-decode fallback it claims to exercise. In cl100k_base, "😀" encodes as a single token, so full.token_ids()[..1] is the complete encoding — decode succeeds on the happy path and the Err branch with _decode_native_and_split is never entered.

To actually hit the fallback + sanitization together, you'd need a partial multi-byte sequence. One way: encode a two-byte character and take only part of its byte-level representation, or hand-craft raw token IDs whose byte concatenation is invalid UTF-8. For example, with the synthetic sparse fixture you could construct a sequence whose byte output is an incomplete UTF-8 lead byte — but that's tricky with single-byte-per-rank fixtures.

As-is the test still validates that u32::MAX gets sanitized on the happy path, which has value, but the name and comment overstate coverage. Consider renaming or adding a dedicated test that reliably triggers the fallback.

yetone requested review from CatherineSue and slin1237 as code owners May 14, 2026 21:52

github-actions Bot added tokenizer Tokenizer related changes tests Test changes labels May 14, 2026

gemini-code-assist Bot reviewed May 14, 2026

View reviewed changes

claude Bot reviewed May 14, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 14, 2026

View reviewed changes

claude Bot reviewed May 14, 2026

View reviewed changes

claude Bot approved these changes May 14, 2026

View reviewed changes

coderabbitai Bot requested changes May 14, 2026

View reviewed changes

yetone force-pushed the fix/tiktoken-unmapped-token-decode branch from 03c155e to e68b080 Compare May 14, 2026 22:08

chatgpt-codex-connector Bot reviewed May 14, 2026

View reviewed changes

yetone closed this May 14, 2026

yetone deleted the fix/tiktoken-unmapped-token-decode branch May 14, 2026 22:11

claude Bot reviewed May 14, 2026

View reviewed changes

	assert!(!tokenizer.is_known_token_id(100_256));
	assert!(tokenizer.is_known_token_id(100_257));

-        match self.tokenizer.decode(sanitized.clone()) {
+        match self.tokenizer.decode(sanitized) {
+            Ok(text) => Ok(text),
+            Err(err) => {
+                // Fallback to lossy decoding for incomplete UTF-8 sequences.
+                // `sanitize_token_ids` is cheap to re-run vs. cloning every call.
+                let (re_sanitized, _) = self.sanitize_token_ids(token_ids);
+                let bytes: Vec<u8> = self
+                    .tokenizer
+                    ._decode_native_and_split(re_sanitized)

Conversation

yetone commented May 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Test plan

Unit tests (crates/tokenizer/src/tiktoken.rs) — 11 new tests, all passing

Integration test (crates/tokenizer/tests/tiktoken_integration.rs) — 1 new #[ignore]-gated test against the real Kimi K2 tokenizer

End-to-end verification (real HuggingFace fixture)

Full suite

Pre-commit notes

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

yetone commented May 14, 2026

Uh oh!

claude Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yetone commented May 14, 2026 •

edited by coderabbitai Bot

Loading

Unit tests (`crates/tokenizer/src/tiktoken.rs`) — 11 new tests, all passing

Integration test (`crates/tokenizer/tests/tiktoken_integration.rs`) — 1 new `#[ignore]`-gated test against the real Kimi K2 tokenizer

coderabbitai Bot commented May 14, 2026 •

edited

Loading