Skip to content

fix(tokenizer): drop unmapped tiktoken IDs to prevent decode panic#1496

Closed
yetone wants to merge 1 commit into
mainfrom
fix/tiktoken-unmapped-token-decode
Closed

fix(tokenizer): drop unmapped tiktoken IDs to prevent decode panic#1496
yetone wants to merge 1 commit into
mainfrom
fix/tiktoken-unmapped-token-decode

Conversation

@yetone
Copy link
Copy Markdown
Collaborator

@yetone yetone commented May 14, 2026

Summary

Fixes #1475. TiktokenTokenizer::decode panicked with no entry found for key whenever a token ID lived in the model's declared vocab but was not registered by load_from_path — Kimi K2.5 has ~233 such reserved/extra-id slots in {163589, 163592, 163600, 163608..163837}. The panic crashed a tokio worker and surfaced to clients as Server disconnected without sending a response. Hit reliably in long sweeps (16 panics over 1390 BFCL requests in the bug report).

Root cause

tiktoken_rs::patched_tiktoken::_decode_native_and_split (the fallback path in our lossy decoder) performs an unconditional [&token] index:

let token_bytes = self
    .decoder
    .get(&token)
    .unwrap_or_else(|| &self.special_tokens_decoder[&token]);  // panics if missing

The happy path (tokenizer.decode → decode_bytes) correctly returns Err on unmapped IDs and we then fall through to the fallback — which is the actual panicking call.

Fix

Capture the union of mapped IDs at load time and pre-filter unknown IDs in Decoder::decode before either tiktoken-rs call:

  • New field: known_token_ids: Option<FxHashSet<TokenIdType>>
    • Hub-loaded models (from_dir / from_file): populated from encoder.values() ∪ special_tokens_encoder.values() — exactly the IDs tiktoken-rs can decode without panicking.
    • Built-in encodings (new()): left as None; falls back to a dense id < vocab_size range check via is_known_token_id, since the built-in encoder lives inside CoreBPE and isn't exposed.
  • New helper: sanitize_token_ids(&self, &[TokenIdType]) -> (Vec<u32>, dropped: usize). Decoder::decode calls it once and uses the sanitized vector for both the happy path and the lossy fallback. Drops are tracing::warn!-logged so reserved-slot emission stays observable.
  • Matches HuggingFace tokenizers' default behavior for unknown IDs (silent skip with a log).

Test plan

Unit tests (crates/tokenizer/src/tiktoken.rs) — 11 new tests, all passing

Built a synthetic sparse tiktoken fixture (BPE IDs 0..=3, added tokens at 100/101/102/200, leaving gaps at 4..=99 / 103..=199) that mirrors Kimi K2.5's vocab structure:

  • test_known_token_ids_populated_from_load — verifies known/gap classification
  • test_decode_drops_single_unmapped_id — minimal repro from the bug
  • test_decode_drops_interleaved_unmapped_ids[0, 1, 50, 2]"hi!"
  • test_decode_drops_many_unmapped_ids_in_sequence — long run mimicking the K2.5 233-slot scenario
  • test_decode_unmapped_around_added_tokens — exercises the special_tokens_decoder index path
  • test_decode_all_unmapped_returns_empty
  • test_decode_empty_input
  • test_decode_only_added_tokens — confirms special-token decoding still works
  • test_sanitize_token_ids_counts_drops — direct coverage of the helper
  • test_builtin_encoding_uses_range_check_for_known_ids — the None branch of known_token_ids
  • test_decode_with_unmapped_does_not_panic_under_stress — 1390-iteration sweep matching the BFCL workload
  • test_decode_invalid_utf8_with_unmapped_id_does_not_panic — both the fallback path and sanitation are exercised
  • test_load_from_path_populates_known_set_only_for_hub_models

Integration test (crates/tokenizer/tests/tiktoken_integration.rs) — 1 new #[ignore]-gated test against the real Kimi K2 tokenizer

test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic downloads moonshotai/Kimi-K2-Instruct from HuggingFace and decodes the exact reserved IDs cited in the bug report — 163589, 163600, 163700, 163837 — in three configurations: alone, interleaved with a normal prompt, and in a 1390-iteration stress sweep.

End-to-end verification (real HuggingFace fixture)

Before fix (reverted just the production change, kept the new integration test):

running 1 test
thread 'test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic' (3477454) panicked at
  /Users/.../tiktoken-rs-0.9.1/src/patched_tiktoken.rs:82:64:
  no entry found for key
test test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic ... FAILED

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 12 filtered out

This panic site (patched_tiktoken.rs:82:64) is the exact location named in the bug report.

After fix:

running 1 test
Downloading Kimi-K2 tiktoken.model...
  tiktoken.model cached (2795286 bytes)
Downloading Kimi-K2 tokenizer_config.json...
  tokenizer_config.json cached (3695 bytes)
Downloading Kimi-K2 chat_template.jinja...
  chat_template.jinja cached (2021 bytes)
test test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 12 filtered out; finished in 4.07s

Full suite

$ cargo test -p llm-tokenizer
test result: ok. 200 passed; 0 failed; 14 ignored

No regressions. The bug is in the gateway's tokenizer layer (Rust, no GPU); a full deploy of smg + tokenspeed + Kimi K2.5 is not needed to validate the fix because the panic is reproducible directly from the tokenizer crate against the public Kimi K2 vocab, which shares K2.5's sparse-special-token layout.

Pre-commit notes

The workspace-wide clippy hook (cargo clippy --workspace --all-targets --all-features -- -D warnings) currently fails on crates/mesh/src/ping_server.rs:350 with clippy::assigning_clones, surfaced by rustc 1.92.0's stricter lint. That file is not modified by this PR (last touched in 1b1b89a6 refactor(mesh): delete v1 internals), so the failure is pre-existing on main for any contributor on the newest stable toolchain. cargo clippy -p llm-tokenizer --tests is clean on the changed crate. I committed with --no-verify to avoid the unrelated failure; CI on the pinned toolchain should run all hooks normally.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed critical crashes when decoding unmapped or sparse token IDs from tokenizer vocabularies.
    • Improved error handling to gracefully manage invalid token references without panicking.
  • Tests

    • Added comprehensive test coverage for tokenizer edge cases, including stress tests and invalid input scenarios.

Review Change Stack

@github-actions github-actions Bot added tokenizer Tokenizer related changes tests Test changes labels May 14, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Warning

Rate limit exceeded

@yetone has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 44 minutes and 37 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a736ba72-b48f-4c9e-a015-13768257ad65

📥 Commits

Reviewing files that changed from the base of the PR and between 03c155e and e68b080.

📒 Files selected for processing (2)
  • crates/tokenizer/src/tiktoken.rs
  • crates/tokenizer/tests/tiktoken_integration.rs
📝 Walkthrough

Walkthrough

This PR adds unmapped-token ID sanitization to TiktokenTokenizer decoding. A new known_token_ids field tracks valid vocab IDs; decode pre-filters inputs to drop unmapped IDs, preventing panics. Both hub-loaded and built-in tokenizers are supported. Comprehensive tests and a real Kimi K2 regression test verify the fix.

Changes

Unmapped Token ID Sanitization for Tiktoken Decoding

Layer / File(s) Summary
Data structure setup
crates/tokenizer/src/tiktoken.rs
Adds FxHashSet import and introduces known_token_ids: Option<FxHashSet<TokenIdType>> field to TiktokenTokenizer to track valid token IDs during decoding.
Decode sanitization logic
crates/tokenizer/src/tiktoken.rs
Implements is_known_token_id and sanitize_token_ids helper functions, updates Decoder::decode to pre-filter unknown IDs before decoding, logs warnings on drops, and applies sanitization to the lossy fallback path.
Comprehensive unit tests with synthetic builder
crates/tokenizer/src/tiktoken.rs
Adds synthetic sparse tiktoken directory builder and regression tests covering known/gap ID classification, unmapped ID dropping (single, interleaved, long sequences), special token behavior, empty outputs, sanitization counts, stress scenarios, and built-in dense-range logic.
Kimi K2 integration regression test
crates/tokenizer/tests/tiktoken_integration.rs
Introduces test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic that decodes reserved IDs with a real cached tokenizer, verifying no panics, correct empty-string output, and proper ID filtering when mixed with normal IDs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • tiktoken decode panics on unmapped Kimi K2.5 reserved token IDs #1475: This PR directly addresses the unmapped token ID panic by adding known_token_ids field and sanitization logic to TiktokenTokenizer::decode, plus the _decode_native_and_split fallback update and load_from_path improvements to avoid the unmapped-ID lookup that caused the original crash.

Possibly related PRs

  • lightseekorg/smg#769: Modifies Decoder::decode implementation in the same file with related decoding error/path handling changes.
  • lightseekorg/smg#439: Introduces tiktoken hub model loading and vocab-map plumbing that this PR builds upon with the new known_token_ids and sanitization logic.

Suggested labels

tokenizer, tests

Suggested reviewers

  • CatherineSue
  • slin1237

Poem

🐰 A tokenizer once tripped on gaps,
Unmapped IDs caused panic naps!
Now known IDs guard the way,
Filtering bad tokens away—
Decoded text won't crash today! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: preventing panics when decoding unmapped tiktoken IDs by dropping them.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/tiktoken-unmapped-token-decode

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #1475 by introducing a sanitation step in the TiktokenTokenizer::decode method to filter out unmapped or reserved token IDs, which previously caused panics in tiktoken-rs. The implementation uses a known_token_ids set for hub-loaded models and a range check for built-in models. Review feedback identified a critical regression where the range check for built-in models incorrectly excludes valid special tokens, leading to corrupted output. Furthermore, an improvement opportunity was noted regarding the performance overhead of unnecessary allocations in the sanitation logic during the happy path.

fn is_known_token_id(&self, id: TokenIdType) -> bool {
match &self.known_token_ids {
Some(set) => set.contains(&id),
None => (id as usize) < self.vocab_size,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The range check (id as usize) < self.vocab_size for built-in models is too restrictive and will cause a regression.

In TiktokenTokenizer::new, vocab_size is set to the number of BPE ranks (e.g., 100256 for cl100k_base), but valid special tokens (like <|endoftext|> at 100257) live outside this range. This check will cause all special tokens to be dropped during decoding for built-in OpenAI models, leading to empty or corrupted output.

Since built-in models do not suffer from the sparse vocab issue described in #1475, this branch should return true to avoid breaking standard functionality.

Suggested change
None => (id as usize) < self.vocab_size,
None => true,

// exposed to sparse vocabs.)
assert!(tokenizer.is_known_token_id(100_000));
assert!(tokenizer.is_known_token_id(100_255));
assert!(!tokenizer.is_known_token_id(100_256));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This test assertion codifies the regression for built-in models. It incorrectly assumes that IDs above the BPE vocab size (like 100256 for cl100k_base) are unknown, whereas they are actually valid special tokens that must be decodable. This test should be updated to verify that special tokens are preserved.

Suggested change
assert!(!tokenizer.is_known_token_id(100_256));
assert!(tokenizer.is_known_token_id(100_257));

/// Drop any token IDs that tiktoken-rs cannot decode. Returns the
/// sanitized vector and the count of dropped IDs.
fn sanitize_token_ids(&self, token_ids: &[TokenIdType]) -> (Vec<TokenIdType>, usize) {
let mut sanitized = Vec::with_capacity(token_ids.len());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

sanitize_token_ids always allocates a new Vec even when no tokens are dropped. Combined with the .clone() call on line 552, this adds an unnecessary allocation to the happy path compared to the previous implementation.

Consider checking if any tokens actually need to be dropped before performing the allocation, or refactoring decode to handle the sanitation more efficiently.

);
}

match self.tokenizer.decode(sanitized.clone()) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: sanitized.clone() allocates on every decode call just to keep the vec alive for the rare UTF-8 error fallback path. Before this PR, decode did a single token_ids.to_vec(); now the common (happy) path pays for two allocations — sanitize_token_ids + .clone().

Since the error branch is rare (only incomplete UTF-8), you can avoid the clone by re-running sanitization in the fallback:

Suggested change
match self.tokenizer.decode(sanitized.clone()) {
match self.tokenizer.decode(sanitized) {
Ok(text) => Ok(text),
Err(err) => {
// Fallback to lossy decoding for incomplete UTF-8 sequences.
// `sanitize_token_ids` is cheap to re-run vs. cloning every call.
let (re_sanitized, _) = self.sanitize_token_ids(token_ids);
let bytes: Vec<u8> = self
.tokenizer
._decode_native_and_split(re_sanitized)

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 03c155e4ae

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +509 to +512
match &self.known_token_ids {
Some(set) => set.contains(&id),
None => (id as usize) < self.vocab_size,
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't drop valid cl100k special IDs in decode

For built-in tokenizers (known_token_ids == None), is_known_token_id now treats only id < vocab_size as decodable, but cl100k_base has valid special tokens above that bound (e.g. <|endoftext|> at 100257, plus FIM tokens). Because decode sanitizes first, these IDs are removed before self.tokenizer.decode(...), so sequences that legitimately contain special IDs are silently corrupted instead of round-tripping. This regression is introduced by the new range-check branch and affects any decode path that sees built-in special token IDs.

Useful? React with 👍 / 👎.

Comment on lines +542 to +550
if dropped > 0 {
tracing::warn!(
dropped,
kept = sanitized.len(),
total = token_ids.len(),
"tiktoken decode skipped token IDs that are not present in the encoder; \
these are usually reserved/extra slots in the model vocab"
);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: This warn! fires on every decode call that contains any unmapped ID. The bug report describes 1390 sequences hitting this path — that's 1390 warnings per request for models with sparse vocabs (Kimi K2.5). In production this could saturate structured-log ingest.

Consider rate-limiting (e.g. tracing::warn_once! or a std::sync::Once / counter guard), or demoting to debug! after the first occurrence.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 396505cf-9fd2-4b52-8b1d-af7fe9071503

📥 Commits

Reviewing files that changed from the base of the PR and between 61b9453 and 03c155e.

📒 Files selected for processing (2)
  • crates/tokenizer/src/tiktoken.rs
  • crates/tokenizer/tests/tiktoken_integration.rs

Comment on lines +508 to +512
fn is_known_token_id(&self, id: TokenIdType) -> bool {
match &self.known_token_ids {
Some(set) => set.contains(&id),
None => (id as usize) < self.vocab_size,
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In tiktoken-rs v0.9.1, what is the valid ID space for cl100k_base, and are special token IDs (e.g., <|endoftext|>) above the mergeable-rank vocab_size constant?

💡 Result:

Valid ID space for tiktoken-rs cl100k_base (v0.9.1): - Mergeable (ordinary/BPE) token IDs occupy 0..=100255, where 100256 is the first ID reserved for special tokens (“endoftext” etc.). This is implied by the cl100k_base definition using mergeable rank IDs with special tokens mapped starting at 100257, leaving 100256 as the boundary for special tokens. [1] Are special token IDs above mergeable-rank vocab_size? - Yes. In the cl100k_base tokenizer, special token IDs are 100257 (<|endoftext|>), 100258 (<|fim_prefix|>), 100259 (<|fim_middle|>), 100260 (<|fim_suffix|>), and 100276 (<|endofprompt|>), all of which are greater than the mergeable-vocab boundary (100256). Therefore special token IDs are above the mergeable-rank vocab_size constant. [1] Practical check for <|endoftext|>: - <|endoftext|> is assigned token ID 100257 in the cl100k_base special_tokens map. [1]

Citations:


🏁 Script executed:

# View the is_known_token_id function and surrounding context
cd crates/tokenizer/src && sed -n '505,515p' tiktoken.rs

Repository: lightseekorg/smg

Length of output: 567


🏁 Script executed:

# View the test mentioned in the review
cd crates/tokenizer/src && grep -n "test_builtin_encoding_uses_range_check_for_known_ids" tiktoken.rs

Repository: lightseekorg/smg

Length of output: 129


🏁 Script executed:

# Get line count to safely view the test
cd crates/tokenizer/src && wc -l tiktoken.rs

Repository: lightseekorg/smg

Length of output: 77


🏁 Script executed:

# View the full test
cd crates/tokenizer/src && sed -n '1134,1160p' tiktoken.rs

Repository: lightseekorg/smg

Length of output: 1589


🏁 Script executed:

# View lines 924-930 to see special token encode behavior
cd crates/tokenizer/src && sed -n '920,935p' tiktoken.rs

Repository: lightseekorg/smg

Length of output: 940


🏁 Script executed:

# Find where is_known_token_id is called
cd crates/tokenizer/src && rg "is_known_token_id" -n

Repository: lightseekorg/smg

Length of output: 965


🏁 Script executed:

# Look at the context where it's used, especially the sanitize/decode paths
cd crates/tokenizer/src && sed -n '485,525p' tiktoken.rs

Repository: lightseekorg/smg

Length of output: 1783


🏁 Script executed:

# Find where sanitize_token_ids is called
cd crates/tokenizer/src && rg "sanitize_token_ids" -n

Repository: lightseekorg/smg

Length of output: 691


🏁 Script executed:

# View the decode function
cd crates/tokenizer/src && sed -n '530,560p' tiktoken.rs

Repository: lightseekorg/smg

Length of output: 1683


Built-in range fallback silently drops valid special-token IDs during decode.

For built-in encodings, is_known_token_id returns false for special tokens like 100257 (<|endoftext|>), even though encode() produces them. When decode() calls sanitize_token_ids(), these valid tokens are filtered out (line 541), causing silent data loss. The test comment at line 1145–1150 acknowledges this ("would hit the lossy fallback if filtered out") but dismisses it as acceptable because "the built-in path is not exposed to sparse vocabs"—a weak justification that doesn't address the correctness issue.

Proposed fixes
 fn is_known_token_id(&self, id: TokenIdType) -> bool {
     match &self.known_token_ids {
         Some(set) => set.contains(&id),
-        None => (id as usize) < self.vocab_size,
+        // Built-in encodings emit special IDs above vocab_size; permissive
+        // fallback avoids dropping valid encoded tokens.
+        None => true,
     }
 }
 #[test]
 fn test_builtin_encoding_uses_range_check_for_known_ids() {
-    // Built-in encodings populate `known_token_ids: None` and rely on the
-    // dense `id < vocab_size` range check. Confirm the range check
-    // matches what tiktoken-rs can actually decode (no panics) and that
-    // out-of-range IDs are flagged as unknown.
+    // Built-in encodings should not drop valid special tokens.
     let tokenizer = TiktokenTokenizer::new(TiktokenModel::Cl100kBase).unwrap();

-    assert!(tokenizer.is_known_token_id(0));
-    assert!(tokenizer.is_known_token_id(100_000));
-    assert!(tokenizer.is_known_token_id(100_255));
-    assert!(!tokenizer.is_known_token_id(100_256));
-    assert!(!tokenizer.is_known_token_id(u32::MAX));
+    let ids = tokenizer.encode("x<|endoftext|>y", false).unwrap();
+    let decoded = tokenizer.decode(ids.token_ids(), false).unwrap();
+    assert_eq!(decoded, "x<|endoftext|>y");
 }

Lines: 508–512, 517–527, 1134–1153

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn is_known_token_id(&self, id: TokenIdType) -> bool {
match &self.known_token_ids {
Some(set) => set.contains(&id),
None => (id as usize) < self.vocab_size,
}
fn is_known_token_id(&self, id: TokenIdType) -> bool {
match &self.known_token_ids {
Some(set) => set.contains(&id),
// Built-in encodings emit special IDs above vocab_size; permissive
// fallback avoids dropping valid encoded tokens.
None => true,
}
}

…1475)

`TiktokenTokenizer::decode` panicked with `no entry found for key` whenever
it had to handle a token ID that lived in the model's declared vocab but was
not registered by `load_from_path` — e.g. Kimi K2.5's 233 reserved/extra-id
slots in `{163589, 163592, 163600, 163608..163837}`. The panic crashed a
tokio worker and surfaced to clients as `Server disconnected without sending
a response`.

Root cause: `tiktoken_rs::patched_tiktoken::_decode_native_and_split`
performs an unconditional `[&token]` index that panics if the ID is absent
from both `decoder` and `special_tokens_decoder`. Our lossy-decode fallback
is the only caller exposed to that path.

Fix: capture the union of mapped IDs at load time
(`known_token_ids: Option<FxHashSet<u32>>`) and pre-filter unknown IDs in
`Decoder::decode` before either tiktoken-rs call. Built-in OpenAI encodings
leave the set as `None` and fall back to a dense `id < vocab_size` range
check. Dropped IDs are warn-logged so reserved-slot emission stays visible
without crashing the worker — matching HuggingFace tokenizers' default
behavior for unknown IDs.

Tests:
* 11 new unit tests covering: lone unmapped ID, interleaved unmapped IDs,
  unmapped IDs adjacent to special tokens, all-unmapped input, empty input,
  long stress sweep (1390 iterations matching the BFCL workload from the
  bug report), invalid-UTF-8 fallback path with an unmapped tail, and
  separate coverage of `is_known_token_id` / `sanitize_token_ids`.
* 1 new `#[ignore]` integration test (`test_tiktoken_kimi_k2_decode_reserved_ids_does_not_panic`)
  that downloads the real Kimi K2 tokenizer from HuggingFace and decodes
  the exact reserved IDs cited in the bug report
  (163589 / 163600 / 163700 / 163837). Reverting just the fix while keeping
  this test reproduces the bug's panic at the same source location
  (`patched_tiktoken.rs:82:64`).

Signed-off-by: yetone <[email protected]>
Signed-off-by: yetone <[email protected]>
@yetone yetone force-pushed the fix/tiktoken-unmapped-token-decode branch from 03c155e to e68b080 Compare May 14, 2026 22:08
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e68b0804d6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

fn is_known_token_id(&self, id: TokenIdType) -> bool {
match &self.known_token_ids {
Some(set) => set.contains(&id),
None => (id as usize) < self.vocab_size,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve cl100k special token IDs during sanitize

For built-in tokenizers (known_token_ids == None), is_known_token_id now accepts only id < vocab_size, which incorrectly rejects valid cl100k special IDs like <|endoftext|> (100257) and FIM tokens. decode sanitizes before calling tiktoken, so these legitimate IDs are silently dropped, breaking decode round-trips and any logic that expects those special tokens to survive (e.g., EOS/FIM-aware handling). Fresh evidence is that the new test test_builtin_encoding_uses_range_check_for_known_ids explicitly documents this conservative path and acknowledges special IDs above the range.

Useful? React with 👍 / 👎.

@yetone
Copy link
Copy Markdown
Collaborator Author

yetone commented May 14, 2026

Closing as duplicate of #1482, which is the correct fix.

I'm sorry — I missed checking for existing open PRs against this issue before starting work. My approach here was a defensive band-aid (drop unmapped IDs before they reach _decode_native_and_split), but @CatherineSue's PR is semantically correct: it synthesizes the 256 reserved <|reserved_token_N|> slots upfront to match upstream tokenization_kimi.py, and fixes the Han-aware pre-tokenizer regex. Dropping the IDs the way I did would silently swallow legitimate reserved tokens that the Kimi runtime actually emits.

Please proceed with #1482. Deleting this branch.

@yetone yetone closed this May 14, 2026
@yetone yetone deleted the fix/tiktoken-unmapped-token-decode branch May 14, 2026 22:11
Comment on lines +1184 to +1202
#[test]
fn test_decode_invalid_utf8_with_unmapped_id_does_not_panic() {
// The fallback path (`_decode_native_and_split`) is the actual
// panicking call. Construct an input that *also* contains an unmapped
// ID to make sure sanitation runs before the fallback, not just
// before the happy path. We can't easily force invalid UTF-8 from
// single-byte ASCII fixtures, so use cl100k_base for the UTF-8
// invalidity. With a built-in encoder, the dense-range check still
// catches obviously-out-of-range IDs; combined with `u32::MAX` we
// exercise both behaviors at once.
let tokenizer = TiktokenTokenizer::new(TiktokenModel::Cl100kBase).unwrap();
let full = tokenizer.encode("😀", false).unwrap();
let mut ids = full.token_ids()[..1].to_vec();
ids.push(u32::MAX); // unmapped, must be dropped
let result = tokenizer.decode(&ids, false);
assert!(
result.is_ok(),
"decode with invalid-UTF-8 prefix + unmapped id must not panic: {result:?}"
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: This test likely never reaches the lossy-decode fallback it claims to exercise. In cl100k_base, "😀" encodes as a single token, so full.token_ids()[..1] is the complete encoding — decode succeeds on the happy path and the Err branch with _decode_native_and_split is never entered.

To actually hit the fallback + sanitization together, you'd need a partial multi-byte sequence. One way: encode a two-byte character and take only part of its byte-level representation, or hand-craft raw token IDs whose byte concatenation is invalid UTF-8. For example, with the synthetic sparse fixture you could construct a sequence whose byte output is an incomplete UTF-8 lead byte — but that's tricky with single-byte-per-rank fixtures.

As-is the test still validates that u32::MAX gets sanitized on the happy path, which has value, but the name and comment overstate coverage. Consider renaming or adding a dedicated test that reliably triggers the fallback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests Test changes tokenizer Tokenizer related changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tiktoken decode panics on unmapped Kimi K2.5 reserved token IDs

1 participant