
fix(tokenizer): inject tools_ts_str for Kimi-K2.5 chat templates#1448

Open
key4ng wants to merge 18 commits into main from keyang/kimi_tool_parser

Conversation

@key4ng
Collaborator

@key4ng key4ng commented May 5, 2026

Description

Problem

Kimi-K2.5's chat_template.jinja branches on a tools_ts_str variable that the model's custom Python tokenizer (tokenization_kimi.py + tool_declaration_ts.py) injects via apply_chat_template. SMG's Rust path renders the template directly with minijinja and never executes the Python preprocessor, so tools_ts_str is undefined and the template falls into the JSON fallback. Tools are emitted as raw JSON instead of the trained TypeScript-namespace format → ~1.7 pp BFCL drop on simple_python.

Solution

Hand-port the Python TS encoder to Rust and wire it into the Tiktoken loader behind a new renderer variant. Detection is driven by config.json::architectures == "KimiK25ForConditionalGeneration", so other models (including Kimi-K2-Instruct and Kimi-K2-Thinking) are unaffected — they don't ship tool_declaration_ts.py and don't reference tools_ts_str in their templates.

Changes

  • crates/tokenizer/src/encoders/kimi_k25_tools.rs: new TS-namespace encoder + apply_kimi_k25_tools wrapper that computes tools_ts_str and merges it into template_kwargs before delegating to the existing minijinja render.
  • crates/tokenizer/src/tiktoken.rs: new Renderer enum (Jinja / KimiK25Tools) with architecture-based detection, mirroring the pattern in huggingface.rs.
  • crates/tokenizer/tests/kimi_k25.rs: consolidated test suite — encoder coverage for nested objects + optional fields, enum/union encoding, and an end-to-end render that proves dispatch wiring (transitively covers detection).
  • crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja: vendored chat template for the e2e test.
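The architecture-based dispatch described above can be sketched as follows. This is a simplified illustration: the real loader reads the architectures array out of config.json (via serde_json) before matching, so the signature and names here are assumptions, not the PR's exact code.

```rust
#[derive(Debug, PartialEq)]
enum Renderer {
    Jinja,
    KimiK25Tools,
}

/// Pick the chat-template renderer from the `architectures` array in
/// config.json. Only the Kimi-K2.5 architecture string triggers the
/// TypeScript tool encoder; everything else keeps the plain Jinja path.
fn detect_renderer(architectures: &[&str]) -> Renderer {
    if architectures
        .iter()
        .any(|a| *a == "KimiK25ForConditionalGeneration")
    {
        Renderer::KimiK25Tools
    } else {
        Renderer::Jinja
    }
}
```

This keeps Kimi-K2-Instruct and Kimi-K2-Thinking (different architecture strings) on the unchanged Jinja path.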

Test Plan

  • cargo test -p llm-tokenizer --test kimi_k25 → 3 tests pass.
  • cargo clippy -p llm-tokenizer --tests --all-features -- -D warnings → clean.
  • cargo +nightly fmt -p llm-tokenizer -- --check → clean.
  • Smoke test against a live Kimi-K2.5 deployment with RUST_LOG=tokenizer=debug: logs show selected KimiK25Tools chat-template renderer at startup; rendered prompts contain # Tools\n\n## functions\nnamespace functions {.
  • BFCL eval on Kimi-K2.5: expect ~+1.7 pp recovery on simple_python vs. unfixed baseline. Numbers will be attached as a comment.
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • Added Kimi-K2.5 chat template rendering with injected tool declaration output.
    • Generates TypeScript definitions from tool/function schemas for use in templates.
  • Tests

    • Added end-to-end and unit tests covering TypeScript generation, schema shapes (objects, enums, arrays), and template rendering behavior.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 5, 2026

📝 Walkthrough

Walkthrough

Adds a Kimi-K2.5 TypeScript tool-declaration encoder, detects a renderer from tokenizer config, routes TiktokenTokenizer to the new encoder or Jinja template accordingly, and includes a chat-template fixture plus unit and end-to-end tests.

Changes

Kimi-K2.5 Tool Encoder Integration

  • Module Export & Entrypoints (crates/tokenizer/src/encoders/mod.rs, crates/tokenizer/src/encoders/kimi_k25_tools.rs): Adds pub mod kimi_k25_tools and public encoder entrypoints (encode_tools_to_typescript) plus crate-private apply_kimi_k25_tools.
  • Tokenizer Renderer Selection (crates/tokenizer/src/tiktoken.rs): Adds private Renderer enum, detect_renderer_from_config(), stores the renderer in TiktokenTokenizer, and branches apply_chat_template() to call either the Jinja flow or apply_kimi_k25_tools.
  • Schema Registry & Parser Core (crates/tokenizer/src/encoders/kimi_k25_tools.rs): Implements SchemaRegistry, $ref resolution (#/ and #/$defs/...), a recursion depth guard, and the main parse entry for parameter schemas.
  • Types, Properties & Helpers (crates/tokenizer/src/encoders/kimi_k25_tools.rs): Implements ParameterType variants, BaseType helpers, scalar mapping, object property rendering (required/optional/defaults), additionalProperties semantics, arrays, enums/anyOf/unions, and docstring/JSON-inline helpers.
  • Function-level TS Generation (crates/tokenizer/src/encoders/kimi_k25_tools.rs): Emits $defs interfaces, an optional root parameters interface for self-references, a function type alias, and function-level description/comment blocks.
  • Chat Template Fixture (crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja): New Jinja template with macros for content, role markers, tool-call serialization, history splitting, the tool_declare block, and generation prompt handling.
  • Tests (crates/tokenizer/tests/kimi_k25.rs): Unit tests for nested-object and enum TS output, and an end-to-end test loading a tokenizer with KimiK25ForConditionalGeneration and asserting TypeScript namespace output.

Sequence Diagram

sequenceDiagram
    participant Client
    participant TiktokenTokenizer
    participant ConfigDetection
    participant KimiEncoder as KimiK25Tools
    participant SchemaRegistry
    participant TemplateEngine

    Client->>TiktokenTokenizer: apply_chat_template(messages, params)
    TiktokenTokenizer->>ConfigDetection: detect_renderer_from_config(dir)
    ConfigDetection-->>TiktokenTokenizer: Renderer::KimiK25Tools
    TiktokenTokenizer->>KimiEncoder: apply_kimi_k25_tools(chat_template, messages, params)
    KimiEncoder->>KimiEncoder: encode_tools_to_typescript(tools)
    KimiEncoder->>SchemaRegistry: parse JSON schemas / $defs
    SchemaRegistry-->>KimiEncoder: ParameterType nodes
    KimiEncoder-->>TiktokenTokenizer: tools_ts_str
    TiktokenTokenizer->>TemplateEngine: apply(chat_template, messages, template_kwargs + tools_ts_str)
    TemplateEngine-->>TiktokenTokenizer: rendered output
    TiktokenTokenizer-->>Client: rendered chat with TS tools

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • lightseekorg/smg#918: Both PRs modify TiktokenTokenizer's apply_chat_template flow with renderer-based branching to call tool-specific encoders.
  • lightseekorg/smg#439: Related changes that centralized chat-template APIs and extend tokenizer/template interactions used by this flow.
  • lightseekorg/smg#914: Both PRs touch ChatTemplateParams integration and injection of template context used by the tokenizer.

Suggested reviewers

  • slin1237
  • CatherineSue
  • whybeyoung

"🐰 I hop through schemas, line by line,
Converting defs to TypeScript so fine,
Namespaces hum and templates sing,
Tools take shape—oh what joy they bring!
A tiny rabbit cheers this code's design."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 27.03%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'fix(tokenizer): inject tools_ts_str for Kimi-K2.5 chat templates' directly corresponds to the main change: implementing TypeScript tool encoding for Kimi-K2.5 templates in the Rust tokenizer and wiring it into the rendering path.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@github-actions github-actions Bot added documentation Improvements or additions to documentation tokenizer Tokenizer related changes tests Test changes labels May 5, 2026

@claude claude Bot left a comment


Clean, well-structured PR that follows the established Renderer enum pattern. Encoder is correct, dispatch wiring is sound, and golden-fixture tests ensure byte-equality with the Python reference. No issues found.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a specialized tool-declaration encoder for Kimi-K2.5 models, enabling the conversion of OpenAI-style tool definitions into the TypeScript-namespace format required by the model. The TiktokenTokenizer has been updated to support architecture-based renderer detection and dispatch. Review feedback identified a potential stack overflow vulnerability due to the lack of a recursion depth limit in the encoder and suggested an optimization to avoid unnecessary allocations during architecture string matching.

}
}

fn parse_parameter_type(schema: &Value, registry: &mut SchemaRegistry) -> ParameterType {
Contributor


security-high high

The tool-declaration encoder performs recursive descent parsing without a depth limit. Although the design specification (docs/superpowers/specs/2026-05-05-kimi-k25-tool-format-design.md, line 133) explicitly mentions a recursion budget of 32 levels, it is not implemented in the code. A deeply nested or malicious tool schema could cause a stack overflow during both the parsing and rendering phases. Please implement a recursion depth check in parse_parameter_type and its recursive callers, falling back to a safe type (e.g., any) when the limit is reached. Additionally, ensure that protocol-surface types are decorated with #[serde(deny_unknown_fields)] to surface spec divergences, and that tool names are JSON-escaped before interpolation.

References
  1. Protocol-surface types should be decorated with #[serde(deny_unknown_fields)] to ensure that spec divergences are surfaced as errors.
  2. Tool names should be JSON-escaped before interpolation to correctly handle special characters like quotes, backslashes, and unicode.

Member


Investigation: The depth-limit half of this comment is addressed in the current code:

const MAX_RECURSION_DEPTH: usize = 32;

fn parse_parameter_type(schema: &Value, registry: &mut SchemaRegistry) -> ParameterType {
    if registry.depth >= MAX_RECURSION_DEPTH {
        return ParameterType::Scalar(ParameterTypeScalar::any());
    }
    registry.depth += 1;
    let result = parse_parameter_type_inner(schema, registry);
    registry.depth -= 1;
    ...
}

MAX_RECURSION_DEPTH = 32 matches the spec budget. Adversarial schemas degrade to any instead of stack-overflowing. ✅

The other two items in this comment still apply:

  1. #[serde(deny_unknown_fields)] on protocol-surface types — I didn't see any of the schema-shape structs (ParameterTypeObject, ParameterTypeArray, etc.) tagged with this. They're parsed via serde_json::Value lookups rather than typed deserialization, so deny_unknown_fields doesn't directly apply, but if you ever switch to typed deserialization for the function-call surface, add it then.
  2. Tool names should be JSON-escaped before interpolation. I didn't spot escaping at the namespace-emit site. A function named e.g. say "hi" or one with a backslash would produce invalid TypeScript / Markdown. Worth a quick audit. cc @key4ng
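A minimal escaping helper of the kind this audit suggests might look like the sketch below. escape_ts_string is a hypothetical name, and the real emit site would decide which contexts (string literal vs. bare identifier) actually need it:

```rust
// Sketch: escape a tool name before interpolating it into generated
// TypeScript, handling backslashes, double quotes, and newlines.
// Illustrative only; not the PR's actual emit code.
fn escape_ts_string(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for c in name.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            _ => out.push(c),
        }
    }
    out
}
```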

Comment thread crates/tokenizer/src/tiktoken.rs Outdated
Comment on lines +598 to +601
let arch_strs: Vec<&str> = architectures
.map(|a| a.iter().filter_map(|v| v.as_str()).collect())
.unwrap_or_default();
if arch_strs.contains(&"KimiK25ForConditionalGeneration") {
Contributor


medium

This block allocates a temporary Vec<&str> just to check for the presence of a specific architecture string. You can avoid this allocation by using any directly on the iterator. This is more efficient and aligns with repository practices for refactoring logic to avoid duplication and unnecessary intermediate allocations.

Suggested change
let arch_strs: Vec<&str> = architectures
.map(|a| a.iter().filter_map(|v| v.as_str()).collect())
.unwrap_or_default();
if arch_strs.contains(&"KimiK25ForConditionalGeneration") {
let is_kimi = architectures.is_some_and(|a| {
a.iter().any(|v| v.as_str() == Some("KimiK25ForConditionalGeneration"))
});
if is_kimi {
References
  1. Refactor match statements to avoid duplication. When arms have common logic, use the match to return the differing value and perform the common logic once.
  2. When computing hashes from a slice of primitives (e.g., [u32]), avoid unnecessary intermediate allocations like collecting all bytes into a Vec. Instead, use the streaming interface of the hashing library (e.g., xxhash-rust's Xxh3::with_seed) to write bytes incrementally, improving performance.

@key4ng key4ng closed this May 5, 2026
@key4ng key4ng reopened this May 6, 2026
@key4ng key4ng marked this pull request as ready for review May 6, 2026 06:17
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

key4ng added 16 commits May 5, 2026 23:17
@key4ng key4ng force-pushed the keyang/kimi_tool_parser branch from bdb113c to 99c9c81 Compare May 6, 2026 06:17

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99c9c8158b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs`:
- Around line 174-183: The resolve_ref method currently only handles "#" and
"#/$defs/" and silently returns None for other $ref patterns; update resolve_ref
(and related state like has_self_ref and the definitions map) to (1) recognize
the legacy "#/definitions/" prefix by returning
self.definitions.get(name).cloned() for that prefix and (2) add a debug/trace
log when an unrecognized reference is encountered (include the raw reference
string) so callers (e.g., ParameterTypeRef emission) can trace why a ref fell
back to "any"; locate the function resolve_ref and add the logging path before
returning None, and reuse the existing logger/tracing facility or add a simple
trace!()/log::debug!() call consistent with the crate.
- Line 7: The file currently uses a module-level #![allow(clippy::unwrap_used)]
which suppresses unwrap warnings everywhere; narrow this by removing the
module-level attribute and instead add #[allow(clippy::unwrap_used)] only around
the safe, known-to-never-fail write! usages (the blocks calling write! into a
String) or annotate the specific functions that contain those write! calls, and
replace unsafe unwrap patterns used for schema parsing (notably the
as_str().unwrap_or("") usage) with explicit handling: either use
Option::map_or("", |s| s) or return/propagate a Result so parsing paths handle
missing values instead of unwrapping. Ensure annotations reference the exact
functions/blocks with write! and the code calling as_str() so linting remains
enabled elsewhere.
- Around line 54-66: The current match on (params.template_kwargs,
ts_str.as_ref()) drops caller-provided template kwargs when ts_str is None by
returning None; update the match so that when params.template_kwargs is
Some(existing) and ts_str is None you return Some(existing.clone()) instead of
None (keep the existing branches for (Some, Some) and (None, Some) and leave
(None, None) -> None), ensuring the variable owned preserves existing kwargs
when ts_str is absent.

In `@crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja`:
- Line 40: Fix the typo in the Jinja comment that reads "{# Find last
non-tool-call assisitant message #}" by changing "assisitant" to "assistant" so
the comment becomes "{# Find last non-tool-call assistant message #}"—update the
comment text in the chat_template.jinja file where that exact comment appears.
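The kwargs-merge fix described in the third inline comment above can be sketched as follows. Values are simplified to String here (the real code passes template values through minijinja), and merge_template_kwargs is an illustrative name:

```rust
use std::collections::HashMap;

// Sketch of the kwargs-merge behavior: inject tools_ts_str when present,
// and preserve caller-provided template kwargs even when it is absent.
fn merge_template_kwargs(
    existing: Option<&HashMap<String, String>>,
    ts_str: Option<&str>,
) -> Option<HashMap<String, String>> {
    match (existing, ts_str) {
        (Some(kwargs), Some(ts)) => {
            let mut merged = kwargs.clone();
            merged.insert("tools_ts_str".to_string(), ts.to_string());
            Some(merged)
        }
        (None, Some(ts)) => {
            let mut merged = HashMap::new();
            merged.insert("tools_ts_str".to_string(), ts.to_string());
            Some(merged)
        }
        // The reported bug: this branch previously returned None,
        // dropping the caller's kwargs when no tools were supplied.
        (Some(kwargs), None) => Some(kwargs.clone()),
        (None, None) => None,
    }
}
```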

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: eabf009d-5581-4f18-a421-3e9cc9be3898

📥 Commits

Reviewing files that changed from the base of the PR and between 0a91a7e and 99c9c81.

📒 Files selected for processing (5)
  • crates/tokenizer/src/encoders/kimi_k25_tools.rs
  • crates/tokenizer/src/encoders/mod.rs
  • crates/tokenizer/src/tiktoken.rs
  • crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja
  • crates/tokenizer/tests/kimi_k25.rs

Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs Outdated
Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs
Comment on lines +174 to +183
fn resolve_ref(&mut self, reference: &str) -> Option<Value> {
if reference == "#" {
self.has_self_ref = true;
return Some(serde_json::json!({"$self_ref": true}));
}
if let Some(name) = reference.strip_prefix("#/$defs/") {
return self.definitions.get(name).cloned();
}
None
}

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Limited $ref pattern support may silently degrade schemas.

resolve_ref only handles "#" and "#/$defs/" patterns. Other valid JSON Schema references (e.g., "#/definitions/" used by older OpenAPI specs, or external file refs) return None, causing ParameterTypeRef to emit "any". This is likely intentional to match the Python reference, but consider adding a debug/trace log for unrecognized refs to aid debugging during integration.

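The extension this nitpick proposes can be sketched as below. Schemas are simplified to plain strings here instead of serde_json::Value, and the eprintln! stands in for whatever tracing facility the crate uses; the "#/definitions/" branch and the log line are the proposed additions:

```rust
use std::collections::HashMap;

// Simplified registry mirroring the PR's SchemaRegistry shape.
struct SchemaRegistry {
    definitions: HashMap<String, String>,
    has_self_ref: bool,
}

impl SchemaRegistry {
    fn resolve_ref(&mut self, reference: &str) -> Option<String> {
        if reference == "#" {
            self.has_self_ref = true;
            return Some(String::from("$self_ref"));
        }
        if let Some(name) = reference.strip_prefix("#/$defs/") {
            return self.definitions.get(name).cloned();
        }
        // Proposed: legacy draft-07 / older OpenAPI location for shared schemas.
        if let Some(name) = reference.strip_prefix("#/definitions/") {
            return self.definitions.get(name).cloned();
        }
        // Proposed: log unrecognized refs so the fallback to `any` is traceable.
        eprintln!("unrecognized $ref, falling back to any: {reference}");
        None
    }
}
```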
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs` around lines 174 - 183, The
resolve_ref method currently only handles "#" and "#/$defs/" and silently
returns None for other $ref patterns; update resolve_ref (and related state like
has_self_ref and the definitions map) to (1) recognize the legacy
"#/definitions/" prefix by returning self.definitions.get(name).cloned() for
that prefix and (2) add a debug/trace log when an unrecognized reference is
encountered (include the raw reference string) so callers (e.g.,
ParameterTypeRef emission) can trace why a ref fell back to "any"; locate the
function resolve_ref and add the logging path before returning None, and reuse
the existing logger/tracing facility or add a simple trace!()/log::debug!() call
consistent with the crate.

Comment thread crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja Outdated
@github-actions github-actions Bot removed the documentation Improvements or additions to documentation label May 6, 2026
@@ -0,0 +1,658 @@
//! Kimi-K2.5 tool-declaration encoder. See module-level docs.
//!
//! Mirrors `tool_declaration_ts.py` from the Kimi-K2.5 model snapshot
Member


Can we add a URL link to this file?

Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs Outdated
Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs
- Add upstream Kimi-K2.5 reference URL to module docs.
- Replace module-level `#![allow(clippy::unwrap_used)]` with narrowly-
  scoped `#[expect(...)]` on infallible `write!`/`writeln!` sites.
- Bound JSON-schema recursion at 32 levels in `parse_parameter_type`;
  degrade to `any` on overflow to prevent stack overflow on adversarial
  tool schemas.
- Fix `assisitant` typo in chat_template.jinja fixture.

Signed-off-by: key4ng <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fcd8e9551


.iter()
.map(|(k, v)| (k.clone(), v.clone()))
.collect();
defs_clone.sort_by(|a, b| a.0.cmp(&b.0));


P2 Badge Preserve $defs order when emitting interfaces

When a Kimi tool schema contains multiple $defs whose declaration order is not alphabetical, this sorts the generated interface blocks by name instead of preserving the order registered from the schema. The upstream Kimi encoder iterates its copied definitions in insertion order, so these prompts stop being byte-equivalent for such schemas; please keep an ordered registry/iteration rather than sorting the cloned definitions.


Member


Investigation: Verified against upstream moonshotai/Kimi-K2.5/tool_declaration_ts.py — upstream preserves insertion order:

definitions_copy = dict(registry.definitions)
for def_name, def_schema in definitions_copy.items():
    ...

dict(...).items() is insertion-ordered in Python 3.7+. The PR's defs_clone.sort_by(|a, b| a.0.cmp(&b.0)) breaks byte-equivalence for any tool schema whose $defs keys aren't already alphabetical.

Suggested fix: swap SchemaRegistry::definitions: HashMap<String, Value> for IndexMap<String, Value> (or track a parallel Vec<String> of insertion order) and drop the sort.
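If pulling in indexmap is undesirable, the parallel-Vec approach mentioned above can be sketched like this (schemas simplified to strings for brevity; names illustrative):

```rust
use std::collections::HashMap;

// Sketch: track $defs insertion order with a parallel Vec so emitted
// interfaces match the upstream Python dict's insertion order,
// instead of sorting the cloned definitions alphabetically.
struct SchemaRegistry {
    definitions: HashMap<String, String>,
    order: Vec<String>,
}

impl SchemaRegistry {
    fn new() -> Self {
        Self { definitions: HashMap::new(), order: Vec::new() }
    }

    fn insert(&mut self, name: &str, schema: &str) {
        // Record order only on first insertion, mirroring Python dict semantics.
        if !self.definitions.contains_key(name) {
            self.order.push(name.to_string());
        }
        self.definitions.insert(name.to_string(), schema.to_string());
    }

    fn iter_ordered(&self) -> Vec<(&str, &str)> {
        self.order
            .iter()
            .map(|k| (k.as_str(), self.definitions[k].as_str()))
            .collect()
    }
}
```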


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs`:
- Around line 112-129: The loop that builds TypeScript definitions currently
always writes "interface {name} {body}", producing invalid TS for non-object
schemas; change the branch after calling parse_parameter_type(...) /
obj_type.to_typescript(...) to detect whether the parsed type is an object
(match on the ParameterType/Type enum variant used by parse_parameter_type,
e.g., ParameterType::Object or similar) and only emit "interface {name} {body}"
for that case; for all other variants emit a type alias like "type {name} =
{body};" (preserve the description handling and writing into def_str and pushing
into interfaces).
- Around line 133-143: The emitted TypeScript uses raw function and property
names which can produce invalid identifiers (e.g., hyphens); sanitize/normalize
names before interpolation: replace/normalize invalid identifier characters in
the computed function_name used in the type alias (variable function_name) to
produce a valid ECMAScript identifier (e.g., replace non-[A-Za-z0-9_$] with
underscore and ensure it doesn’t start with a digit, or prefix with `_`). For
object/parameter keys coming from parsed.to_typescript(...) (used when
root_interface_name is None and around the parameter emission at lines
~346-350), ensure property keys are emitted quoted if they are not valid
identifiers—either update parsed.to_typescript to quote keys or post-process its
output to wrap any non-identifier keys in quotes. Apply these changes where
function_name is created and where parsed.to_typescript is used so both type
alias names and object property keys are valid TypeScript.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bebd994c-8f2e-4119-8874-5c85c10e67f6

📥 Commits

Reviewing files that changed from the base of the PR and between 99c9c81 and 6fcd8e9.

📒 Files selected for processing (2)
  • crates/tokenizer/src/encoders/kimi_k25_tools.rs
  • crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja

Comment on lines +112 to +129
for (name, schema) in defs_clone {
let obj_type = parse_parameter_type(&schema, &mut registry);
let body = obj_type.to_typescript("", &registry);
let mut def_str = String::new();
if let Some(desc) = schema.get("description").and_then(Value::as_str) {
if !desc.is_empty() {
def_str.push_str(&format_description(desc, ""));
def_str.push('\n');
}
}
#[expect(
clippy::unwrap_used,
reason = "write!/writeln! into String cannot fail"
)]
{
write!(def_str, "interface {name} {body}").unwrap();
}
interfaces.push(def_str);

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Is `interface Color "red" | "blue"` valid TypeScript syntax, or must non-object schema definitions be emitted as `type Color = ...`?

💡 Result:

interface Color "red" | "blue" is not valid TypeScript syntax. You must use a type alias for a union of non-object types. In TypeScript, interface is for describing object shapes, while unions are expressed via type aliases (e.g. type Color = "red" | "blue"). Interfaces can include union types only as property types, not as the interface's own definition. Example:

  • Valid: type Color = "red" | "blue" [1][2]
  • Valid (union inside an interface property): interface Paint { color: "red" | "blue" } [2]
  • Not valid: interface Color = A | B (a top-level union requires type) [3]

So if your schema generation emits something like interface Color "red" | "blue", it should instead emit type Color = "red" | "blue". [4]

🏁 Script executed:

# Search for ParameterType definition
rg "enum ParameterType|struct ParameterType|type ParameterType" -A 10

# Also search for where Object variant is used
rg "ParameterType::Object" -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 7916


🏁 Script executed:

# Read the context around lines 112-129
sed -n '100,140p' crates/tokenizer/src/encoders/kimi_k25_tools.rs | cat -n

Repository: lightseekorg/smg

Length of output: 1750


Use type aliases for non-object $defs entries.

The code currently renders all schema definitions as interface {name} {body}, which produces invalid TypeScript for non-object types (enums, scalars, arrays, unions). For example, interface Color "red" | "blue" or interface Tags Array<string> are invalid syntax. TypeScript requires non-object types to use type aliases instead.

Update lines 112–129 to emit interface only for Object variants and type aliases for all others:

Suggested fix
     for (name, schema) in defs_clone {
         let obj_type = parse_parameter_type(&schema, &mut registry);
         let body = obj_type.to_typescript("", &registry);
         let mut def_str = String::new();
         if let Some(desc) = schema.get("description").and_then(Value::as_str) {
             if !desc.is_empty() {
                 def_str.push_str(&format_description(desc, ""));
                 def_str.push('\n');
             }
         }
         #[expect(
             clippy::unwrap_used,
             reason = "write!/writeln! into String cannot fail"
         )]
         {
-            write!(def_str, "interface {name} {body}").unwrap();
+            if matches!(&obj_type, ParameterType::Object(_)) {
+                write!(def_str, "interface {name} {body}").unwrap();
+            } else {
+                write!(def_str, "type {name} = {body};").unwrap();
+            }
         }
         interfaces.push(def_str);
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs` around lines 112 - 129, The
loop that builds TypeScript definitions currently always writes "interface
{name} {body}", producing invalid TS for non-object schemas; change the branch
after calling parse_parameter_type(...) / obj_type.to_typescript(...) to detect
whether the parsed type is an object (match on the ParameterType/Type enum
variant used by parse_parameter_type, e.g., ParameterType::Object or similar)
and only emit "interface {name} {body}" for that case; for all other variants
emit a type alias like "type {name} = {body};" (preserve the description
handling and writing into def_str and pushing into interfaces).

Comment on lines +133 to +143
let function_name = function
.get("name")
.and_then(Value::as_str)
.unwrap_or("function");
let type_def = match root_interface_name {
Some(n) => format!("type {function_name} = (_: {n}) => any;"),
None => format!(
"type {function_name} = (_: {}) => any;",
parsed.to_typescript("", &registry)
),
};
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In TypeScript, do object property names containing `-` or spaces need to be quoted, and can a type alias name like `get-user` be emitted without normalization?

💡 Result:

Yes, object property names containing - or spaces must be quoted in TypeScript object literals, as they are invalid JavaScript identifiers.[1][2][3][4][5] Unquoted, the parser interprets - as subtraction or spaces as invalid token separators, causing syntax errors.[6][4] No, a type alias name like get-user cannot be emitted without normalization (i.e., cannot be used as-is). Type alias names must be valid ECMAScript Identifiers, which cannot contain - or spaces; hyphens and spaces are not allowed in unquoted identifiers per ECMAScript grammar.[7][8][9][10] Attempting type get-user = string; results in a syntax error, as get-user is not a valid IdentifierName. TypeScript follows these rules for declaration parsing and emit (type aliases are erased in JS output but invalid syntax prevents compilation).[11][12][10][13]


🏁 Scripts executed against lightseekorg/smg (outputs collapsed):

wc -l crates/tokenizer/src/encoders/kimi_k25_tools.rs
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '125,155p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '340,360p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '1,50p'
grep -n "struct Parameter" crates/tokenizer/src/encoders/kimi_k25_tools.rs -A 5
grep -n "impl.*BaseType\|fn to_typescript" crates/tokenizer/src/encoders/kimi_k25_tools.rs | head -20
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '200,250p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '358,425p'
grep -n "fn from_json\|Parameter {" crates/tokenizer/src/encoders/kimi_k25_tools.rs | head -20
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '620,680p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '315,356p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '420,450p'
grep -rn "get-user\|foo-bar\|-" crates/tokenizer/tests/ | grep -i kimi | head -20
find . -name "*kimi*test*" -o -name "*test*kimi*" 2>/dev/null
cat -n crates/tokenizer/tests/kimi_k25.rs 2>/dev/null | head -50
cat -n crates/tokenizer/tests/kimi_k25.rs | sed -n '50,150p'
grep -n "hyphen\|-user\|foo-bar" crates/tokenizer/tests/kimi_k25.rs
tail -100 crates/tokenizer/tests/kimi_k25.rs

Sanitize function and property names before emitting TypeScript code.

Function names (line 138, 140) and parameter names (line 349) are interpolated without sanitization. Names like get-user or foo-bar produce invalid TypeScript (type get-user = ... or foo-bar: string). Type alias names must be valid ECMAScript identifiers (hyphens not allowed), and object property names with hyphens must be quoted. Normalize or alias function names, and quote property keys in the output.

Also applies to: 346-350
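One way to normalize names is a small identifier sanitizer. A hedged sketch of such a helper (hypothetical, not part of the PR; the alias approach would additionally need a mapping back to the original tool name when decoding calls):

```rust
/// Normalize an arbitrary tool name into a valid ECMAScript identifier:
/// replace any character outside [A-Za-z0-9_$] with `_`, and prefix `_`
/// if the result would start with a digit (or be empty).
fn sanitize_identifier(name: &str) -> String {
    let mut out: String = name
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == '$' { c } else { '_' })
        .collect();
    if out.chars().next().map_or(true, |c| c.is_ascii_digit()) {
        out.insert(0, '_');
    }
    out
}
```

Property keys are simpler: they can stay as-is if wrapped in quotes whenever they are not valid identifiers.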

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs` around lines 133–143: the
emitted TypeScript uses raw function and property names which can produce
invalid identifiers (e.g., hyphens); sanitize/normalize names before
interpolation: replace/normalize invalid identifier characters in the computed
function_name used in the type alias (variable function_name) to produce a valid
ECMAScript identifier (e.g., replace non-[A-Za-z0-9_$] with underscore and
ensure it doesn’t start with a digit, or prefix with `_`). For object/parameter
keys coming from parsed.to_typescript(...) (used when root_interface_name is
None and around the parameter emission at lines ~346-350), ensure property keys
are emitted quoted if they are not valid identifiers—either update
parsed.to_typescript to quote keys or post-process its output to wrap any
non-identifier keys in quotes. Apply these changes where function_name is
created and where parsed.to_typescript is used so both type alias names and
object property keys are valid TypeScript.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7171022e63

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"parameters".to_string()
}
Some(_) => reference.rsplit('/').next().unwrap_or("").to_string(),
None => "any".to_string(),

P1: Fail unresolved $ref instead of silently emitting any

When a tool schema contains an unresolved or unsupported $ref, this code converts it to any and still emits tools_ts_str, which keeps the template on the TypeScript branch with lossy declarations. In Kimi’s Python tokenizer, unresolved refs raise during conversion and apply_chat_template falls back to the JSON tool declaration path, so this behavior diverges in exactly the cases where schema fidelity is already fragile and can degrade tool-call correctness.



Investigation: Verified against upstream — resolve_ref raises:

def resolve_ref(self, ref: str) -> dict[str, Any]:
    if ref == "#":
        self.has_self_ref = True
        return {"$self_ref": True}
    elif ref.startswith("#/$defs/"):
        def_name = ref.split("/")[-1]
        if def_name not in self.definitions:
            raise ValueError(f"Reference not found: {ref}")
        return self.definitions[def_name]
    else:
        raise ValueError(f"Unsupported reference format: {ref}")

There is no fallback to None or "any". The Python apply_chat_template catches this and drops to the JSON tool-declaration path, so byte-equivalence with upstream is violated in exactly the cases where TS-namespace fidelity is already broken.

The current Rust path:

  • resolve_ref returns None for unknown / unsupported prefixes
  • ParameterTypeRef::parse maps that to "any" and stays on the TS branch with lossy types

Trade-off: raising in apply_chat_template requires the caller (server) to know how to fall back to JSON. Cheapest improvement that preserves observability: emit tracing::warn! when this happens so the silent degradation is visible. Better: propagate Err and let apply_chat_template decide the fallback policy. cc @key4ng
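The Err-propagating variant could look roughly like this. A minimal sketch under assumed types (`String`-valued definitions instead of the real schema values; the upstream `has_self_ref` bookkeeping is reduced to a placeholder return):

```rust
use std::collections::HashMap;

/// Fallible $ref resolution mirroring the upstream Python behavior:
/// "#" is a self-reference, "#/$defs/<name>" must resolve, and every
/// other format is an error instead of a silent `any`.
fn resolve_ref<'a>(
    reference: &str,
    definitions: &'a HashMap<String, String>,
) -> Result<&'a str, String> {
    if reference == "#" {
        // Upstream sets has_self_ref and returns a marker here.
        Ok("$self")
    } else if let Some(def_name) = reference.strip_prefix("#/$defs/") {
        definitions
            .get(def_name)
            .map(String::as_str)
            .ok_or_else(|| format!("Reference not found: {reference}"))
    } else {
        Err(format!("Unsupported reference format: {reference}"))
    }
}
```

With this shape, `apply_kimi_k25_tools` can catch the `Err` and skip `tools_ts_str`, which is exactly what pushes the template onto the JSON fallback branch.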

Comment on lines +662 to +664
// Fallthrough: schemas with no type/anyOf/enum/$ref. Degrade to `any`
// permissively rather than erroring — matches the Python reference.
ParameterType::Scalar(ParameterTypeScalar::any())

P1: Stop coercing unsupported schema objects to any

The fallback here turns any non-empty schema object without type/enum/anyOf/$ref into any (for example schemas using oneOf, allOf, or const), so conversion succeeds and the template stays on the TypeScript path with dropped constraints. Upstream Kimi logic raises on these invalid/unsupported shapes and then falls back to JSON tool declarations, so this permissive coercion produces a different prompt format in real schemas that should not be silently approximated.



Investigation: Verified — the in-code comment is misleading. Upstream raises:

def _parse_parameter_type(json_schema_object, registry=None):
    # ... checks for $ref, anyOf, enum, type ...
    elif json_schema_object == {}:
        return _ParameterTypeScalar(type="any")
    else:
        raise ValueError(f"Invalid JSON Schema object: {json_schema_object}")

So upstream only coerces an empty object {} to any. Schemas with oneOf / allOf / const / any other non-empty content raise, and apply_chat_template then falls back to JSON tool declarations.

The Rust code:

if obj.is_empty() {
    return ParameterType::Scalar(ParameterTypeScalar::any());
}
// Fallthrough: schemas with no type/anyOf/enum/$ref. Degrade to `any`
// permissively rather than erroring — matches the Python reference.
ParameterType::Scalar(ParameterTypeScalar::any())

The empty-object branch matches upstream. The fallthrough does not — the inline comment "matches the Python reference" is incorrect.

Suggested minimum: at least fix the misleading comment + tracing::warn! when this path fires. Better: return Err and let apply_chat_template decide JSON-fallback policy (same trade-off as the unresolved-$ref comment above). cc @key4ng
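The upstream-strict fallthrough is easy to state in isolation. A sketch using the schema object's keys as a stand-in for the `serde_json` map (hypothetical signature, for illustration only):

```rust
/// Upstream-strict fallthrough: only the empty object `{}` coerces to
/// `any`; any other unrecognized shape (oneOf, allOf, const, ...) is an
/// error so the caller can fall back to JSON tool declarations.
fn parse_fallthrough(schema_keys: &[&str]) -> Result<&'static str, String> {
    if schema_keys.is_empty() {
        Ok("any")
    } else {
        Err(format!("Invalid JSON Schema object with keys: {schema_keys:?}"))
    }
}
```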

Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs
@@ -0,0 +1,131 @@
//! Essential coverage for the Kimi-K2.5 tool renderer:

Let's rename this one to kimi_k25_renderer_detection.rs?

/// use. Missing / unreadable / malformed config falls back to `Renderer::Jinja`
/// silently with a debug log, mirroring `huggingface.rs::detect_renderer_from_config`.
fn detect_renderer_from_config(dir: &Path) -> Renderer {
let path = dir.join("config.json");

Maybe we extract a small helper architectures_from_config and have both detect_renderer_from_config implementations call it, to reduce the code duplication. Your call; it is not much duplication right now.
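For illustration, the shared helper could take the config text and return the architecture list. This sketch parses naively with string searches so it stays dependency-free; the real helper would deserialize config.json with serde_json, and the name comes from the comment above, not from the PR:

```rust
/// Naive sketch: extract the "architectures" array from a config.json
/// string. Real code should use serde_json; this only handles the flat,
/// well-formed case and returns an empty Vec otherwise.
fn architectures_from_config(config_json: &str) -> Vec<String> {
    let Some(start) = config_json.find("\"architectures\"") else {
        return Vec::new();
    };
    let rest = &config_json[start..];
    let Some(open) = rest.find('[') else { return Vec::new() };
    let Some(close) = rest[open..].find(']') else { return Vec::new() };
    rest[open + 1..open + close]
        .split(',')
        .filter_map(|entry| {
            let name = entry.trim().trim_matches('"');
            (!name.is_empty()).then(|| name.to_string())
        })
        .collect()
}
```

Both renderer detectors would then just match on the returned names (e.g. "KimiK25ForConditionalGeneration").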


Labels

tests Test changes tokenizer Tokenizer related changes
