
fix(tokenizer): inject tools_ts_str for Kimi-K2.5 chat templates#1448

Open
key4ng wants to merge 18 commits into main from keyang/kimi_tool_parser

Conversation

@key4ng
Collaborator

@key4ng key4ng commented May 5, 2026

Description

Problem

Kimi-K2.5's chat_template.jinja branches on a tools_ts_str variable that the model's custom Python tokenizer (tokenization_kimi.py + tool_declaration_ts.py) injects via apply_chat_template. SMG's Rust path renders the template directly with minijinja and never executes the Python preprocessor, so tools_ts_str is undefined and the template falls into the JSON fallback. Tools are emitted as raw JSON instead of the trained TypeScript-namespace format → ~1.7 pp BFCL drop on simple_python.

Solution

Hand-port the Python TS encoder to Rust and wire it into the Tiktoken loader behind a new renderer variant. Detection is driven by config.json::architectures == "KimiK25ForConditionalGeneration", so other models (including Kimi-K2-Instruct and Kimi-K2-Thinking) are unaffected — they don't ship tool_declaration_ts.py and don't reference tools_ts_str in their templates.

Changes

  • crates/tokenizer/src/encoders/kimi_k25_tools.rs: new TS-namespace encoder + apply_kimi_k25_tools wrapper that computes tools_ts_str and merges it into template_kwargs before delegating to the existing minijinja render.
  • crates/tokenizer/src/tiktoken.rs: new Renderer enum (Jinja / KimiK25Tools) with architecture-based detection, mirroring the pattern in huggingface.rs.
  • crates/tokenizer/tests/kimi_k25.rs: consolidated test suite — encoder coverage for nested objects + optional fields, enum/union encoding, and an end-to-end render that proves dispatch wiring (transitively covers detection).
  • crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja: vendored chat template for the e2e test.
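The architecture-based dispatch described above can be sketched as follows. This is a simplified illustration: the real loader reads the architectures array out of config.json (via serde_json) before matching, so the signature and names here are assumptions, not the PR's exact code.

```rust
#[derive(Debug, PartialEq)]
enum Renderer {
    Jinja,
    KimiK25Tools,
}

/// Pick the chat-template renderer from the `architectures` array in
/// config.json. Only the Kimi-K2.5 architecture string triggers the
/// TypeScript tool encoder; everything else keeps the plain Jinja path.
fn detect_renderer(architectures: &[&str]) -> Renderer {
    if architectures
        .iter()
        .any(|a| *a == "KimiK25ForConditionalGeneration")
    {
        Renderer::KimiK25Tools
    } else {
        Renderer::Jinja
    }
}
```

This keeps Kimi-K2-Instruct and Kimi-K2-Thinking (different architecture strings) on the unchanged Jinja path.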

Test Plan

  • cargo test -p llm-tokenizer --test kimi_k25 → 3 tests pass.
  • cargo clippy -p llm-tokenizer --tests --all-features -- -D warnings → clean.
  • cargo +nightly fmt -p llm-tokenizer -- --check → clean.
  • Smoke test against a live Kimi-K2.5 deployment with RUST_LOG=tokenizer=debug: logs show selected KimiK25Tools chat-template renderer at startup; rendered prompts contain # Tools\n\n## functions\nnamespace functions {.
  • BFCL eval on Kimi-K2.5: expect ~+1.7 pp recovery on simple_python vs. unfixed baseline. Numbers will be attached as a comment.
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • Added Kimi-K2.5 chat template rendering with injected tool declaration output.
    • Generates TypeScript definitions from tool/function schemas for use in templates.
  • Tests

    • Added end-to-end and unit tests covering TypeScript generation, schema shapes (objects, enums, arrays), and template rendering behavior.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 5, 2026

📝 Walkthrough

Walkthrough

Adds a Kimi-K2.5 TypeScript tool-declaration encoder, detects a renderer from tokenizer config, routes TiktokenTokenizer to the new encoder or Jinja template accordingly, and includes a chat-template fixture plus unit and end-to-end tests.

Changes

Kimi-K2.5 Tool Encoder Integration

  • Module Export & Entrypoints (crates/tokenizer/src/encoders/mod.rs, crates/tokenizer/src/encoders/kimi_k25_tools.rs): Adds pub mod kimi_k25_tools and public encoder entrypoints (encode_tools_to_typescript) plus crate-private apply_kimi_k25_tools.
  • Tokenizer Renderer Selection (crates/tokenizer/src/tiktoken.rs): Adds private Renderer enum, detect_renderer_from_config(), stores the renderer in TiktokenTokenizer, and branches apply_chat_template() to call either the Jinja flow or apply_kimi_k25_tools.
  • Schema Registry & Parser Core (crates/tokenizer/src/encoders/kimi_k25_tools.rs): Implements SchemaRegistry, $ref resolution (#/ and #/$defs/...), a recursion depth guard, and the main parse entry for parameter schemas.
  • Types, Properties & Helpers (crates/tokenizer/src/encoders/kimi_k25_tools.rs): Implements ParameterType variants, BaseType helpers, scalar mapping, object property rendering (required/optional/defaults), additionalProperties semantics, arrays, enums/anyOf/unions, and docstring/JSON-inline helpers.
  • Function-level TS Generation (crates/tokenizer/src/encoders/kimi_k25_tools.rs): Emits $defs interfaces, an optional root parameters interface for self-references, a function type alias, and function-level description/comment blocks.
  • Chat Template Fixture (crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja): New Jinja template with macros for content, role markers, tool-call serialization, history splitting, the tool_declare block, and generation prompt handling.
  • Tests (crates/tokenizer/tests/kimi_k25.rs): Unit tests for nested-object and enum TS output, and an end-to-end test loading a tokenizer with KimiK25ForConditionalGeneration and asserting TypeScript namespace output.

Sequence Diagram

sequenceDiagram
    participant Client
    participant TiktokenTokenizer
    participant ConfigDetection
    participant KimiEncoder as KimiK25Tools
    participant SchemaRegistry
    participant TemplateEngine

    Client->>TiktokenTokenizer: apply_chat_template(messages, params)
    TiktokenTokenizer->>ConfigDetection: detect_renderer_from_config(dir)
    ConfigDetection-->>TiktokenTokenizer: Renderer::KimiK25Tools
    TiktokenTokenizer->>KimiEncoder: apply_kimi_k25_tools(chat_template, messages, params)
    KimiEncoder->>KimiEncoder: encode_tools_to_typescript(tools)
    KimiEncoder->>SchemaRegistry: parse JSON schemas / $defs
    SchemaRegistry-->>KimiEncoder: ParameterType nodes
    KimiEncoder-->>TiktokenTokenizer: tools_ts_str
    TiktokenTokenizer->>TemplateEngine: apply(chat_template, messages, template_kwargs + tools_ts_str)
    TemplateEngine-->>TiktokenTokenizer: rendered output
    TiktokenTokenizer-->>Client: rendered chat with TS tools

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • lightseekorg/smg#918: Both PRs modify TiktokenTokenizer's apply_chat_template flow with renderer-based branching to call tool-specific encoders.
  • lightseekorg/smg#439: Related changes that centralized chat-template APIs and extend tokenizer/template interactions used by this flow.
  • lightseekorg/smg#914: Both PRs touch ChatTemplateParams integration and injection of template context used by the tokenizer.

Suggested reviewers

  • slin1237
  • CatherineSue
  • whybeyoung

"🐰 I hop through schemas, line by line,
Converting defs to TypeScript so fine,
Namespaces hum and templates sing,
Tools take shape—oh what joy they bring!
A tiny rabbit cheers this code's design."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 27.03%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'fix(tokenizer): inject tools_ts_str for Kimi-K2.5 chat templates' directly corresponds to the main change: implementing TypeScript tool encoding for Kimi-K2.5 templates in the Rust tokenizer and wiring it into the rendering path.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@github-actions github-actions Bot added documentation Improvements or additions to documentation tokenizer Tokenizer related changes tests Test changes labels May 5, 2026

@claude claude Bot left a comment


Clean, well-structured PR that follows the established Renderer enum pattern. Encoder is correct, dispatch wiring is sound, and golden-fixture tests ensure byte-equality with the Python reference. No issues found.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a specialized tool-declaration encoder for Kimi-K2.5 models, enabling the conversion of OpenAI-style tool definitions into the TypeScript-namespace format required by the model. The TiktokenTokenizer has been updated to support architecture-based renderer detection and dispatch. Review feedback identified a potential stack overflow vulnerability due to the lack of a recursion depth limit in the encoder and suggested an optimization to avoid unnecessary allocations during architecture string matching.

}
}

fn parse_parameter_type(schema: &Value, registry: &mut SchemaRegistry) -> ParameterType {
Contributor


security-high high

The tool-declaration encoder performs recursive descent parsing without a depth limit. Although the design specification (docs/superpowers/specs/2026-05-05-kimi-k25-tool-format-design.md, line 133) explicitly mentions a recursion budget of 32 levels, it is not implemented in the code. A deeply nested or malicious tool schema could cause a stack overflow during both the parsing and rendering phases. Please implement a recursion depth check in parse_parameter_type and its recursive callers, falling back to a safe type (e.g., any) when the limit is reached. Additionally, ensure that protocol-surface types are decorated with #[serde(deny_unknown_fields)] to surface spec divergences, and that tool names are JSON-escaped before interpolation.

References
  1. Protocol-surface types should be decorated with #[serde(deny_unknown_fields)] to ensure that spec divergences are surfaced as errors.
  2. Tool names should be JSON-escaped before interpolation to correctly handle special characters like quotes, backslashes, and unicode.

Member


Investigation: The depth-limit half of this comment is addressed in the current code:

const MAX_RECURSION_DEPTH: usize = 32;

fn parse_parameter_type(schema: &Value, registry: &mut SchemaRegistry) -> ParameterType {
    if registry.depth >= MAX_RECURSION_DEPTH {
        return ParameterType::Scalar(ParameterTypeScalar::any());
    }
    registry.depth += 1;
    let result = parse_parameter_type_inner(schema, registry);
    registry.depth -= 1;
    ...
}

MAX_RECURSION_DEPTH = 32 matches the spec budget. Adversarial schemas degrade to any instead of stack-overflowing. ✅

The other two items in this comment still apply:

  1. #[serde(deny_unknown_fields)] on protocol-surface types — I didn't see any of the schema-shape structs (ParameterTypeObject, ParameterTypeArray, etc.) tagged with this. They're parsed via serde_json::Value lookups rather than typed deserialization, so deny_unknown_fields doesn't directly apply, but if you ever switch to typed deserialization for the function-call surface, add it then.
  2. Tool names should be JSON-escaped before interpolation. I didn't spot escaping at the namespace-emit site. A function named e.g. say "hi" or one with a backslash would produce invalid TypeScript / Markdown. Worth a quick audit. cc @key4ng
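A minimal escaping helper of the kind this audit suggests might look like the sketch below. escape_ts_string is a hypothetical name, and the real emit site would decide which contexts (string literal vs. bare identifier) actually need it:

```rust
// Sketch: escape a tool name before interpolating it into generated
// TypeScript, handling backslashes, double quotes, and newlines.
// Illustrative only; not the PR's actual emit code.
fn escape_ts_string(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for c in name.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            _ => out.push(c),
        }
    }
    out
}
```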

Comment thread crates/tokenizer/src/tiktoken.rs Outdated
Comment on lines +598 to +601
let arch_strs: Vec<&str> = architectures
.map(|a| a.iter().filter_map(|v| v.as_str()).collect())
.unwrap_or_default();
if arch_strs.contains(&"KimiK25ForConditionalGeneration") {
Contributor


medium

This block allocates a temporary Vec<&str> just to check for the presence of a specific architecture string. You can avoid this allocation by using any directly on the iterator. This is more efficient and aligns with repository practices for refactoring logic to avoid duplication and unnecessary intermediate allocations.

Suggested change
let arch_strs: Vec<&str> = architectures
.map(|a| a.iter().filter_map(|v| v.as_str()).collect())
.unwrap_or_default();
if arch_strs.contains(&"KimiK25ForConditionalGeneration") {
let is_kimi = architectures.is_some_and(|a| {
a.iter().any(|v| v.as_str() == Some("KimiK25ForConditionalGeneration"))
});
if is_kimi {
References
  1. Refactor match statements to avoid duplication. When arms have common logic, use the match to return the differing value and perform the common logic once.
  2. When computing hashes from a slice of primitives (e.g., [u32]), avoid unnecessary intermediate allocations like collecting all bytes into a Vec. Instead, use the streaming interface of the hashing library (e.g., xxhash-rust's Xxh3::with_seed) to write bytes incrementally, improving performance.

@key4ng key4ng closed this May 5, 2026
@key4ng key4ng reopened this May 6, 2026
@key4ng key4ng marked this pull request as ready for review May 6, 2026 06:17
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

key4ng added 16 commits May 5, 2026 23:17
@key4ng key4ng force-pushed the keyang/kimi_tool_parser branch from bdb113c to 99c9c81 Compare May 6, 2026 06:17

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99c9c8158b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs`:
- Around line 174-183: The resolve_ref method currently only handles "#" and
"#/$defs/" and silently returns None for other $ref patterns; update resolve_ref
(and related state like has_self_ref and the definitions map) to (1) recognize
the legacy "#/definitions/" prefix by returning
self.definitions.get(name).cloned() for that prefix and (2) add a debug/trace
log when an unrecognized reference is encountered (include the raw reference
string) so callers (e.g., ParameterTypeRef emission) can trace why a ref fell
back to "any"; locate the function resolve_ref and add the logging path before
returning None, and reuse the existing logger/tracing facility or add a simple
trace!()/log::debug!() call consistent with the crate.
- Line 7: The file currently uses a module-level #![allow(clippy::unwrap_used)]
which suppresses unwrap warnings everywhere; narrow this by removing the
module-level attribute and instead add #[allow(clippy::unwrap_used)] only around
the safe, known-to-never-fail write! usages (the blocks calling write! into a
String) or annotate the specific functions that contain those write! calls, and
replace unsafe unwrap patterns used for schema parsing (notably the
as_str().unwrap_or("") usage) with explicit handling: either use
Option::map_or("", |s| s) or return/propagate a Result so parsing paths handle
missing values instead of unwrapping. Ensure annotations reference the exact
functions/blocks with write! and the code calling as_str() so linting remains
enabled elsewhere.
- Around line 54-66: The current match on (params.template_kwargs,
ts_str.as_ref()) drops caller-provided template kwargs when ts_str is None by
returning None; update the match so that when params.template_kwargs is
Some(existing) and ts_str is None you return Some(existing.clone()) instead of
None (keep the existing branches for (Some, Some) and (None, Some) and leave
(None, None) -> None), ensuring the variable owned preserves existing kwargs
when ts_str is absent.

In `@crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja`:
- Line 40: Fix the typo in the Jinja comment that reads "{# Find last
non-tool-call assisitant message #}" by changing "assisitant" to "assistant" so
the comment becomes "{# Find last non-tool-call assistant message #}"—update the
comment text in the chat_template.jinja file where that exact comment appears.
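The kwargs-merge fix described in the third inline comment above can be sketched as follows. Values are simplified to String here (the real code passes template values through minijinja), and merge_template_kwargs is an illustrative name:

```rust
use std::collections::HashMap;

// Sketch of the kwargs-merge behavior: inject tools_ts_str when present,
// and preserve caller-provided template kwargs even when it is absent.
fn merge_template_kwargs(
    existing: Option<&HashMap<String, String>>,
    ts_str: Option<&str>,
) -> Option<HashMap<String, String>> {
    match (existing, ts_str) {
        (Some(kwargs), Some(ts)) => {
            let mut merged = kwargs.clone();
            merged.insert("tools_ts_str".to_string(), ts.to_string());
            Some(merged)
        }
        (None, Some(ts)) => {
            let mut merged = HashMap::new();
            merged.insert("tools_ts_str".to_string(), ts.to_string());
            Some(merged)
        }
        // The reported bug: this branch previously returned None,
        // dropping the caller's kwargs when no tools were supplied.
        (Some(kwargs), None) => Some(kwargs.clone()),
        (None, None) => None,
    }
}
```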

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: eabf009d-5581-4f18-a421-3e9cc9be3898

📥 Commits

Reviewing files that changed from the base of the PR and between 0a91a7e and 99c9c81.

📒 Files selected for processing (5)
  • crates/tokenizer/src/encoders/kimi_k25_tools.rs
  • crates/tokenizer/src/encoders/mod.rs
  • crates/tokenizer/src/tiktoken.rs
  • crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja
  • crates/tokenizer/tests/kimi_k25.rs

Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs Outdated
Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs
Comment on lines +174 to +183
fn resolve_ref(&mut self, reference: &str) -> Option<Value> {
if reference == "#" {
self.has_self_ref = true;
return Some(serde_json::json!({"$self_ref": true}));
}
if let Some(name) = reference.strip_prefix("#/$defs/") {
return self.definitions.get(name).cloned();
}
None
}

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Limited $ref pattern support may silently degrade schemas.

resolve_ref only handles "#" and "#/$defs/" patterns. Other valid JSON Schema references (e.g., "#/definitions/" used by older OpenAPI specs, or external file refs) return None, causing ParameterTypeRef to emit "any". This is likely intentional to match the Python reference, but consider adding a debug/trace log for unrecognized refs to aid debugging during integration.

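The extension this nitpick proposes can be sketched as below. Schemas are simplified to plain strings here instead of serde_json::Value, and the eprintln! stands in for whatever tracing facility the crate uses; the "#/definitions/" branch and the log line are the proposed additions:

```rust
use std::collections::HashMap;

// Simplified registry mirroring the PR's SchemaRegistry shape.
struct SchemaRegistry {
    definitions: HashMap<String, String>,
    has_self_ref: bool,
}

impl SchemaRegistry {
    fn resolve_ref(&mut self, reference: &str) -> Option<String> {
        if reference == "#" {
            self.has_self_ref = true;
            return Some(String::from("$self_ref"));
        }
        if let Some(name) = reference.strip_prefix("#/$defs/") {
            return self.definitions.get(name).cloned();
        }
        // Proposed: legacy draft-07 / older OpenAPI location for shared schemas.
        if let Some(name) = reference.strip_prefix("#/definitions/") {
            return self.definitions.get(name).cloned();
        }
        // Proposed: log unrecognized refs so the fallback to `any` is traceable.
        eprintln!("unrecognized $ref, falling back to any: {reference}");
        None
    }
}
```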
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs` around lines 174 - 183, The
resolve_ref method currently only handles "#" and "#/$defs/" and silently
returns None for other $ref patterns; update resolve_ref (and related state like
has_self_ref and the definitions map) to (1) recognize the legacy
"#/definitions/" prefix by returning self.definitions.get(name).cloned() for
that prefix and (2) add a debug/trace log when an unrecognized reference is
encountered (include the raw reference string) so callers (e.g.,
ParameterTypeRef emission) can trace why a ref fell back to "any"; locate the
function resolve_ref and add the logging path before returning None, and reuse
the existing logger/tracing facility or add a simple trace!()/log::debug!() call
consistent with the crate.

Comment thread crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja Outdated
@github-actions github-actions Bot removed the documentation Improvements or additions to documentation label May 6, 2026
@@ -0,0 +1,658 @@
//! Kimi-K2.5 tool-declaration encoder. See module-level docs.
//!
//! Mirrors `tool_declaration_ts.py` from the Kimi-K2.5 model snapshot
Member


Can we add a URL link to this file?

Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs Outdated
Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs
- Add upstream Kimi-K2.5 reference URL to module docs.
- Replace module-level `#![allow(clippy::unwrap_used)]` with narrowly-
  scoped `#[expect(...)]` on infallible `write!`/`writeln!` sites.
- Bound JSON-schema recursion at 32 levels in `parse_parameter_type`;
  degrade to `any` on overflow to prevent stack overflow on adversarial
  tool schemas.
- Fix `assisitant` typo in chat_template.jinja fixture.

Signed-off-by: key4ng <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fcd8e9551


.iter()
.map(|(k, v)| (k.clone(), v.clone()))
.collect();
defs_clone.sort_by(|a, b| a.0.cmp(&b.0));


P2 Badge Preserve $defs order when emitting interfaces

When a Kimi tool schema contains multiple $defs whose declaration order is not alphabetical, this sorts the generated interface blocks by name instead of preserving the order registered from the schema. The upstream Kimi encoder iterates its copied definitions in insertion order, so these prompts stop being byte-equivalent for such schemas; please keep an ordered registry/iteration rather than sorting the cloned definitions.


Member


Investigation: Verified against upstream moonshotai/Kimi-K2.5/tool_declaration_ts.py — upstream preserves insertion order:

definitions_copy = dict(registry.definitions)
for def_name, def_schema in definitions_copy.items():
    ...

dict(...).items() is insertion-ordered in Python 3.7+. The PR's defs_clone.sort_by(|a, b| a.0.cmp(&b.0)) breaks byte-equivalence for any tool schema whose $defs keys aren't already alphabetical.

Suggested fix: swap SchemaRegistry::definitions: HashMap<String, Value> for IndexMap<String, Value> (or track a parallel Vec<String> of insertion order) and drop the sort.
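If pulling in indexmap is undesirable, the parallel-Vec approach mentioned above can be sketched like this (schemas simplified to strings for brevity; names illustrative):

```rust
use std::collections::HashMap;

// Sketch: track $defs insertion order with a parallel Vec so emitted
// interfaces match the upstream Python dict's insertion order,
// instead of sorting the cloned definitions alphabetically.
struct SchemaRegistry {
    definitions: HashMap<String, String>,
    order: Vec<String>,
}

impl SchemaRegistry {
    fn new() -> Self {
        Self { definitions: HashMap::new(), order: Vec::new() }
    }

    fn insert(&mut self, name: &str, schema: &str) {
        // Record order only on first insertion, mirroring Python dict semantics.
        if !self.definitions.contains_key(name) {
            self.order.push(name.to_string());
        }
        self.definitions.insert(name.to_string(), schema.to_string());
    }

    fn iter_ordered(&self) -> Vec<(&str, &str)> {
        self.order
            .iter()
            .map(|k| (k.as_str(), self.definitions[k].as_str()))
            .collect()
    }
}
```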


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs`:
- Around line 112-129: The loop that builds TypeScript definitions currently
always writes "interface {name} {body}", producing invalid TS for non-object
schemas; change the branch after calling parse_parameter_type(...) /
obj_type.to_typescript(...) to detect whether the parsed type is an object
(match on the ParameterType/Type enum variant used by parse_parameter_type,
e.g., ParameterType::Object or similar) and only emit "interface {name} {body}"
for that case; for all other variants emit a type alias like "type {name} =
{body};" (preserve the description handling and writing into def_str and pushing
into interfaces).
- Around line 133-143: The emitted TypeScript uses raw function and property
names which can produce invalid identifiers (e.g., hyphens); sanitize/normalize
names before interpolation: replace/normalize invalid identifier characters in
the computed function_name used in the type alias (variable function_name) to
produce a valid ECMAScript identifier (e.g., replace non-[A-Za-z0-9_$] with
underscore and ensure it doesn’t start with a digit, or prefix with `_`). For
object/parameter keys coming from parsed.to_typescript(...) (used when
root_interface_name is None and around the parameter emission at lines
~346-350), ensure property keys are emitted quoted if they are not valid
identifiers—either update parsed.to_typescript to quote keys or post-process its
output to wrap any non-identifier keys in quotes. Apply these changes where
function_name is created and where parsed.to_typescript is used so both type
alias names and object property keys are valid TypeScript.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bebd994c-8f2e-4119-8874-5c85c10e67f6

📥 Commits

Reviewing files that changed from the base of the PR and between 99c9c81 and 6fcd8e9.

📒 Files selected for processing (2)
  • crates/tokenizer/src/encoders/kimi_k25_tools.rs
  • crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja

Comment on lines +112 to +129
for (name, schema) in defs_clone {
let obj_type = parse_parameter_type(&schema, &mut registry);
let body = obj_type.to_typescript("", &registry);
let mut def_str = String::new();
if let Some(desc) = schema.get("description").and_then(Value::as_str) {
if !desc.is_empty() {
def_str.push_str(&format_description(desc, ""));
def_str.push('\n');
}
}
#[expect(
clippy::unwrap_used,
reason = "write!/writeln! into String cannot fail"
)]
{
write!(def_str, "interface {name} {body}").unwrap();
}
interfaces.push(def_str);

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Is `interface Color "red" | "blue"` valid TypeScript syntax, or must non-object schema definitions be emitted as `type Color = ...`?

💡 Result:

interface Color "red" | "blue" is not valid TypeScript syntax. You must use a type alias for a union of non-object types. In TypeScript, interface is for describing object shapes, while unions are expressed via type aliases (e.g. type Color = "red" | "blue"). Interfaces can include union types only as property types, not as the interface's own definition. Example:

  • Valid: type Color = "red" | "blue" [1][2]
  • Valid (union inside an interface property): interface Paint { color: "red" | "blue" } [2]
  • Not valid: interface Color = A | B (a top-level union requires type) [3]

So if your schema generation emits something like interface Color "red" | "blue", it should instead emit type Color = "red" | "blue". [4]

🏁 Script executed:

# Search for ParameterType definition
rg "enum ParameterType|struct ParameterType|type ParameterType" -A 10

# Also search for where Object variant is used
rg "ParameterType::Object" -B 2 -A 2

Repository: lightseekorg/smg

Length of output: 7916


🏁 Script executed:

# Read the context around lines 112-129
sed -n '100,140p' crates/tokenizer/src/encoders/kimi_k25_tools.rs | cat -n

Repository: lightseekorg/smg

Length of output: 1750


Use type aliases for non-object $defs entries.

The code currently renders all schema definitions as interface {name} {body}, which produces invalid TypeScript for non-object types (enums, scalars, arrays, unions). For example, interface Color "red" | "blue" or interface Tags Array<string> are invalid syntax. TypeScript requires non-object types to use type aliases instead.

Update lines 112–129 to emit interface only for Object variants and type aliases for all others:

Suggested fix
     for (name, schema) in defs_clone {
         let obj_type = parse_parameter_type(&schema, &mut registry);
         let body = obj_type.to_typescript("", &registry);
         let mut def_str = String::new();
         if let Some(desc) = schema.get("description").and_then(Value::as_str) {
             if !desc.is_empty() {
                 def_str.push_str(&format_description(desc, ""));
                 def_str.push('\n');
             }
         }
         #[expect(
             clippy::unwrap_used,
             reason = "write!/writeln! into String cannot fail"
         )]
         {
-            write!(def_str, "interface {name} {body}").unwrap();
+            if matches!(&obj_type, ParameterType::Object(_)) {
+                write!(def_str, "interface {name} {body}").unwrap();
+            } else {
+                write!(def_str, "type {name} = {body};").unwrap();
+            }
         }
         interfaces.push(def_str);
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs` around lines 112 - 129, The
loop that builds TypeScript definitions currently always writes "interface
{name} {body}", producing invalid TS for non-object schemas; change the branch
after calling parse_parameter_type(...) / obj_type.to_typescript(...) to detect
whether the parsed type is an object (match on the ParameterType/Type enum
variant used by parse_parameter_type, e.g., ParameterType::Object or similar)
and only emit "interface {name} {body}" for that case; for all other variants
emit a type alias like "type {name} = {body};" (preserve the description
handling and writing into def_str and pushing into interfaces).

Comment on lines +133 to +143
let function_name = function
.get("name")
.and_then(Value::as_str)
.unwrap_or("function");
let type_def = match root_interface_name {
Some(n) => format!("type {function_name} = (_: {n}) => any;"),
None => format!(
"type {function_name} = (_: {}) => any;",
parsed.to_typescript("", &registry)
),
};
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In TypeScript, do object property names containing `-` or spaces need to be quoted, and can a type alias name like `get-user` be emitted without normalization?

💡 Result:

Yes, object property names containing - or spaces must be quoted in TypeScript object literals, as they are invalid JavaScript identifiers.[1][2][3][4][5] Unquoted, the parser interprets - as subtraction or spaces as invalid token separators, causing syntax errors.[6][4] No, a type alias name like get-user cannot be emitted without normalization (i.e., cannot be used as-is). Type alias names must be valid ECMAScript Identifiers, which cannot contain - or spaces; hyphens and spaces are not allowed in unquoted identifiers per ECMAScript grammar.[7][8][9][10] Attempting type get-user = string; results in a syntax error, as get-user is not a valid IdentifierName. TypeScript follows these rules for declaration parsing and emit (type aliases are erased in JS output but invalid syntax prevents compilation).[11][12][10][13]


🏁 Scripts executed against lightseekorg/smg (outputs collapsed):

wc -l crates/tokenizer/src/encoders/kimi_k25_tools.rs
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '125,155p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '340,360p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '1,50p'
grep -n "struct Parameter" crates/tokenizer/src/encoders/kimi_k25_tools.rs -A 5
grep -n "impl.*BaseType\|fn to_typescript" crates/tokenizer/src/encoders/kimi_k25_tools.rs | head -20
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '200,250p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '358,425p'
grep -n "fn from_json\|Parameter {" crates/tokenizer/src/encoders/kimi_k25_tools.rs | head -20
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '620,680p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '315,356p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '420,450p'
grep -rn "get-user\|foo-bar\|-" crates/tokenizer/tests/ | grep -i kimi | head -20
find . -name "*kimi*test*" -o -name "*test*kimi*" 2>/dev/null
cat -n crates/tokenizer/tests/kimi_k25.rs 2>/dev/null | head -50
cat -n crates/tokenizer/tests/kimi_k25.rs | sed -n '50,150p'
grep -n "hyphen\|-user\|foo-bar" crates/tokenizer/tests/kimi_k25.rs
tail -100 crates/tokenizer/tests/kimi_k25.rs

Sanitize function and property names before emitting TypeScript code.

Function names (line 138, 140) and parameter names (line 349) are interpolated without sanitization. Names like get-user or foo-bar produce invalid TypeScript (type get-user = ... or foo-bar: string). Type alias names must be valid ECMAScript identifiers (hyphens not allowed), and object property names with hyphens must be quoted. Normalize or alias function names, and quote property keys in the output.

Also applies to: 346-350
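One way to normalize names is a small identifier sanitizer. A hedged sketch of such a helper (hypothetical, not part of the PR; the alias approach would additionally need a mapping back to the original tool name when decoding calls):

```rust
/// Normalize an arbitrary tool name into a valid ECMAScript identifier:
/// replace any character outside [A-Za-z0-9_$] with `_`, and prefix `_`
/// if the result would start with a digit (or be empty).
fn sanitize_identifier(name: &str) -> String {
    let mut out: String = name
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == '$' { c } else { '_' })
        .collect();
    if out.chars().next().map_or(true, |c| c.is_ascii_digit()) {
        out.insert(0, '_');
    }
    out
}
```

Property keys are simpler: they can stay as-is if wrapped in quotes whenever they are not valid identifiers.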

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs` around lines 133–143: the
emitted TypeScript uses raw function and property names which can produce
invalid identifiers (e.g., hyphens); sanitize/normalize names before
interpolation: replace/normalize invalid identifier characters in the computed
function_name used in the type alias (variable function_name) to produce a valid
ECMAScript identifier (e.g., replace non-[A-Za-z0-9_$] with underscore and
ensure it doesn’t start with a digit, or prefix with `_`). For object/parameter
keys coming from parsed.to_typescript(...) (used when root_interface_name is
None and around the parameter emission at lines ~346-350), ensure property keys
are emitted quoted if they are not valid identifiers—either update
parsed.to_typescript to quote keys or post-process its output to wrap any
non-identifier keys in quotes. Apply these changes where function_name is
created and where parsed.to_typescript is used so both type alias names and
object property keys are valid TypeScript.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7171022e63

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"parameters".to_string()
}
Some(_) => reference.rsplit('/').next().unwrap_or("").to_string(),
None => "any".to_string(),

P1: Fail unresolved $ref instead of silently emitting any

When a tool schema contains an unresolved or unsupported $ref, this code converts it to any and still emits tools_ts_str, which keeps the template on the TypeScript branch with lossy declarations. In Kimi’s Python tokenizer, unresolved refs raise during conversion and apply_chat_template falls back to the JSON tool declaration path, so this behavior diverges in exactly the cases where schema fidelity is already fragile and can degrade tool-call correctness.



Investigation: Verified against upstream — resolve_ref raises:

def resolve_ref(self, ref: str) -> dict[str, Any]:
    if ref == "#":
        self.has_self_ref = True
        return {"$self_ref": True}
    elif ref.startswith("#/$defs/"):
        def_name = ref.split("/")[-1]
        if def_name not in self.definitions:
            raise ValueError(f"Reference not found: {ref}")
        return self.definitions[def_name]
    else:
        raise ValueError(f"Unsupported reference format: {ref}")

There is no fallback to None or "any". The Python apply_chat_template catches this and drops to the JSON tool-declaration path, so byte-equivalence with upstream is violated in exactly the cases where TS-namespace fidelity is already broken.

The current Rust path:

  • resolve_ref returns None for unknown / unsupported prefixes
  • ParameterTypeRef::parse maps that to "any" and stays on the TS branch with lossy types

Trade-off: raising in apply_chat_template requires the caller (server) to know how to fall back to JSON. Cheapest improvement that preserves observability: emit tracing::warn! when this happens so the silent degradation is visible. Better: propagate Err and let apply_chat_template decide the fallback policy. cc @key4ng
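The Err-propagating variant could look roughly like this. A minimal sketch under assumed types (`String`-valued definitions instead of the real schema values; the upstream `has_self_ref` bookkeeping is reduced to a placeholder return):

```rust
use std::collections::HashMap;

/// Fallible $ref resolution mirroring the upstream Python behavior:
/// "#" is a self-reference, "#/$defs/<name>" must resolve, and every
/// other format is an error instead of a silent `any`.
fn resolve_ref<'a>(
    reference: &str,
    definitions: &'a HashMap<String, String>,
) -> Result<&'a str, String> {
    if reference == "#" {
        // Upstream sets has_self_ref and returns a marker here.
        Ok("$self")
    } else if let Some(def_name) = reference.strip_prefix("#/$defs/") {
        definitions
            .get(def_name)
            .map(String::as_str)
            .ok_or_else(|| format!("Reference not found: {reference}"))
    } else {
        Err(format!("Unsupported reference format: {reference}"))
    }
}
```

With this shape, `apply_kimi_k25_tools` can catch the `Err` and skip `tools_ts_str`, which is exactly what pushes the template onto the JSON fallback branch.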

Comment on lines +662 to +664
// Fallthrough: schemas with no type/anyOf/enum/$ref. Degrade to `any`
// permissively rather than erroring — matches the Python reference.
ParameterType::Scalar(ParameterTypeScalar::any())

P1: Stop coercing unsupported schema objects to any

The fallback here turns any non-empty schema object without type/enum/anyOf/$ref into any (for example schemas using oneOf, allOf, or const), so conversion succeeds and the template stays on the TypeScript path with dropped constraints. Upstream Kimi logic raises on these invalid/unsupported shapes and then falls back to JSON tool declarations, so this permissive coercion produces a different prompt format in real schemas that should not be silently approximated.



Investigation: Verified — the in-code comment is misleading. Upstream raises:

def _parse_parameter_type(json_schema_object, registry=None):
    # ... checks for $ref, anyOf, enum, type ...
    elif json_schema_object == {}:
        return _ParameterTypeScalar(type="any")
    else:
        raise ValueError(f"Invalid JSON Schema object: {json_schema_object}")

So upstream only coerces an empty object {} to any. Schemas with oneOf / allOf / const / any other non-empty content raise, and apply_chat_template then falls back to JSON tool declarations.

The Rust code:

if obj.is_empty() {
    return ParameterType::Scalar(ParameterTypeScalar::any());
}
// Fallthrough: schemas with no type/anyOf/enum/$ref. Degrade to `any`
// permissively rather than erroring — matches the Python reference.
ParameterType::Scalar(ParameterTypeScalar::any())

The empty-object branch matches upstream. The fallthrough does not — the inline comment "matches the Python reference" is incorrect.

Suggested minimum: at least fix the misleading comment + tracing::warn! when this path fires. Better: return Err and let apply_chat_template decide JSON-fallback policy (same trade-off as the unresolved-$ref comment above). cc @key4ng
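The upstream-strict fallthrough is easy to state in isolation. A sketch using the schema object's keys as a stand-in for the `serde_json` map (hypothetical signature, for illustration only):

```rust
/// Upstream-strict fallthrough: only the empty object `{}` coerces to
/// `any`; any other unrecognized shape (oneOf, allOf, const, ...) is an
/// error so the caller can fall back to JSON tool declarations.
fn parse_fallthrough(schema_keys: &[&str]) -> Result<&'static str, String> {
    if schema_keys.is_empty() {
        Ok("any")
    } else {
        Err(format!("Invalid JSON Schema object with keys: {schema_keys:?}"))
    }
}
```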

Comment thread crates/tokenizer/src/encoders/kimi_k25_tools.rs
@@ -0,0 +1,131 @@
//! Essential coverage for the Kimi-K2.5 tool renderer:

Let's rename this one to kimi_k25_renderer_detection.rs?

/// use. Missing / unreadable / malformed config falls back to `Renderer::Jinja`
/// silently with a debug log, mirroring `huggingface.rs::detect_renderer_from_config`.
fn detect_renderer_from_config(dir: &Path) -> Renderer {
let path = dir.join("config.json");

Maybe we extract a small helper architectures_from_config and have both detect_renderer_from_config implementations call it, to reduce the code duplication. Your call; it is not much duplication right now.
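For illustration, the shared helper could take the config text and return the architecture list. This sketch parses naively with string searches so it stays dependency-free; the real helper would deserialize config.json with serde_json, and the name comes from the comment above, not from the PR:

```rust
/// Naive sketch: extract the "architectures" array from a config.json
/// string. Real code should use serde_json; this only handles the flat,
/// well-formed case and returns an empty Vec otherwise.
fn architectures_from_config(config_json: &str) -> Vec<String> {
    let Some(start) = config_json.find("\"architectures\"") else {
        return Vec::new();
    };
    let rest = &config_json[start..];
    let Some(open) = rest.find('[') else { return Vec::new() };
    let Some(close) = rest[open..].find(']') else { return Vec::new() };
    rest[open + 1..open + close]
        .split(',')
        .filter_map(|entry| {
            let name = entry.trim().trim_matches('"');
            (!name.is_empty()).then(|| name.to_string())
        })
        .collect()
}
```

Both renderer detectors would then just match on the returned names (e.g. "KimiK25ForConditionalGeneration").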


Labels

tests Test changes tokenizer Tokenizer related changes
