fix(tokenizer): inject tools_ts_str for Kimi-K2.5 chat templates #1448
key4ng wants to merge 18 commits into
Conversation
📝 Walkthrough
Adds a Kimi-K2.5 TypeScript tool-declaration encoder, detects a renderer from tokenizer config, routes TiktokenTokenizer to the new encoder or Jinja template accordingly, and includes a chat-template fixture plus unit and end-to-end tests.
Changes: Kimi-K2.5 Tool Encoder Integration
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant TiktokenTokenizer
    participant ConfigDetection
    participant KimiEncoder as KimiK25Tools
    participant SchemaRegistry
    participant TemplateEngine
    Client->>TiktokenTokenizer: apply_chat_template(messages, params)
    TiktokenTokenizer->>ConfigDetection: detect_renderer_from_config(dir)
    ConfigDetection-->>TiktokenTokenizer: Renderer::KimiK25Tools
    TiktokenTokenizer->>KimiEncoder: apply_kimi_k25_tools(chat_template, messages, params)
    KimiEncoder->>KimiEncoder: encode_tools_to_typescript(tools)
    KimiEncoder->>SchemaRegistry: parse JSON schemas / $defs
    SchemaRegistry-->>KimiEncoder: ParameterType nodes
    KimiEncoder-->>TiktokenTokenizer: tools_ts_str
    TiktokenTokenizer->>TemplateEngine: apply(chat_template, messages, template_kwargs + tools_ts_str)
    TemplateEngine-->>TiktokenTokenizer: rendered output
    TiktokenTokenizer-->>Client: rendered chat with TS tools
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Code Review
This pull request implements a specialized tool-declaration encoder for Kimi-K2.5 models, enabling the conversion of OpenAI-style tool definitions into the TypeScript-namespace format required by the model. The TiktokenTokenizer has been updated to support architecture-based renderer detection and dispatch. Review feedback identified a potential stack overflow vulnerability due to the lack of a recursion depth limit in the encoder and suggested an optimization to avoid unnecessary allocations during architecture string matching.
```rust
    }
}

fn parse_parameter_type(schema: &Value, registry: &mut SchemaRegistry) -> ParameterType {
```
The tool-declaration encoder performs recursive descent parsing without a depth limit. Although the design specification (docs/superpowers/specs/2026-05-05-kimi-k25-tool-format-design.md, line 133) explicitly mentions a recursion budget of 32 levels, it is not implemented in the code. A deeply nested or malicious tool schema could cause a stack overflow during both the parsing and rendering phases. Please implement a recursion depth check in parse_parameter_type and its recursive callers, falling back to a safe type (e.g., any) when the limit is reached. Additionally, ensure that protocol-surface types are decorated with #[serde(deny_unknown_fields)] to surface spec divergences, and that tool names are JSON-escaped before interpolation.
References
- Protocol-surface types should be decorated with #[serde(deny_unknown_fields)] to ensure that spec divergences are surfaced as errors.
- Tool names should be JSON-escaped before interpolation to correctly handle special characters like quotes, backslashes, and unicode.
Investigation: The depth-limit half of this comment is addressed in the current code:

```rust
const MAX_RECURSION_DEPTH: usize = 32;

fn parse_parameter_type(schema: &Value, registry: &mut SchemaRegistry) -> ParameterType {
    if registry.depth >= MAX_RECURSION_DEPTH {
        return ParameterType::Scalar(ParameterTypeScalar::any());
    }
    registry.depth += 1;
    let result = parse_parameter_type_inner(schema, registry);
    registry.depth -= 1;
    ...
}
```

`MAX_RECURSION_DEPTH = 32` matches the spec budget. Adversarial schemas degrade to `any` instead of stack-overflowing. ✅
The other two items in this comment still apply:
- `#[serde(deny_unknown_fields)]` on protocol-surface types — I didn't see any of the schema-shape structs (`ParameterTypeObject`, `ParameterTypeArray`, etc.) tagged with this. They're parsed via `serde_json::Value` lookups rather than typed deserialization, so `deny_unknown_fields` doesn't directly apply, but if you ever switch to typed deserialization for the function-call surface, add it then.
- Tool names should be JSON-escaped before interpolation. I didn't spot escaping at the namespace-emit site. A function named e.g. `say "hi"` or one with a backslash would produce invalid TypeScript / Markdown. Worth a quick audit. cc @key4ng
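For the escaping audit, here is a minimal std-only sketch of the kind of JSON-style escaping meant above. The function name `json_escape_name` is hypothetical, not the PR's API; a real implementation could reuse serde_json's string encoder instead.

```rust
// Hypothetical sketch: escape a tool name JSON-string-style before it is
// interpolated into generated TypeScript/Markdown.
fn json_escape_name(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 2);
    out.push('"');
    for c in name.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Other control characters get \u00XX escapes.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

fn main() {
    // A name containing a quote no longer breaks the emitted declaration.
    println!("{}", json_escape_name("say \"hi\""));
}
```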
```rust
let arch_strs: Vec<&str> = architectures
    .map(|a| a.iter().filter_map(|v| v.as_str()).collect())
    .unwrap_or_default();
if arch_strs.contains(&"KimiK25ForConditionalGeneration") {
```
This block allocates a temporary Vec<&str> just to check for the presence of a specific architecture string. You can avoid this allocation by using any directly on the iterator. This is more efficient and aligns with repository practices for refactoring logic to avoid duplication and unnecessary intermediate allocations.
```diff
-let arch_strs: Vec<&str> = architectures
-    .map(|a| a.iter().filter_map(|v| v.as_str()).collect())
-    .unwrap_or_default();
-if arch_strs.contains(&"KimiK25ForConditionalGeneration") {
+let is_kimi = architectures.is_some_and(|a| {
+    a.iter().any(|v| v.as_str() == Some("KimiK25ForConditionalGeneration"))
+});
+if is_kimi {
```
References
- Refactor match statements to avoid duplication. When arms have common logic, use the match to return the differing value and perform the common logic once.
- When computing hashes from a slice of primitives (e.g., [u32]), avoid unnecessary intermediate allocations like collecting all bytes into a Vec. Instead, use the streaming interface of the hashing library (e.g., xxhash-rust's Xxh3::with_seed) to write bytes incrementally, improving performance.
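The suggested pattern can be shown as a runnable miniature, with plain string slices standing in for the serde_json values (function name is illustrative):

```rust
// Sketch of the allocation-free membership check: `is_some_and` + `any`
// short-circuits without collecting an intermediate Vec.
fn is_kimi_arch(architectures: Option<&[&str]>) -> bool {
    architectures.is_some_and(|archs| {
        archs.iter().any(|&a| a == "KimiK25ForConditionalGeneration")
    })
}

fn main() {
    println!("{}", is_kimi_arch(Some(&["KimiK25ForConditionalGeneration"])));
    println!("{}", is_kimi_arch(None));
}
```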
Signed-off-by: key4ng <[email protected]>
bdb113c to 99c9c81
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99c9c8158b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs`:
- Around line 174-183: The resolve_ref method currently only handles "#" and
"#/$defs/" and silently returns None for other $ref patterns; update resolve_ref
(and related state like has_self_ref and the definitions map) to (1) recognize
the legacy "#/definitions/" prefix by returning
self.definitions.get(name).cloned() for that prefix and (2) add a debug/trace
log when an unrecognized reference is encountered (include the raw reference
string) so callers (e.g., ParameterTypeRef emission) can trace why a ref fell
back to "any"; locate the function resolve_ref and add the logging path before
returning None, and reuse the existing logger/tracing facility or add a simple
trace!()/log::debug!() call consistent with the crate.
- Line 7: The file currently uses a module-level #![allow(clippy::unwrap_used)]
which suppresses unwrap warnings everywhere; narrow this by removing the
module-level attribute and instead add #[allow(clippy::unwrap_used)] only around
the safe, known-to-never-fail write! usages (the blocks calling write! into a
String) or annotate the specific functions that contain those write! calls, and
replace unsafe unwrap patterns used for schema parsing (notably the
as_str().unwrap_or("") usage) with explicit handling: either use
Option::map_or("", |s| s) or return/propagate a Result so parsing paths handle
missing values instead of unwrapping. Ensure annotations reference the exact
functions/blocks with write! and the code calling as_str() so linting remains
enabled elsewhere.
- Around line 54-66: The current match on (params.template_kwargs,
ts_str.as_ref()) drops caller-provided template kwargs when ts_str is None by
returning None; update the match so that when params.template_kwargs is
Some(existing) and ts_str is None you return Some(existing.clone()) instead of
None (keep the existing branches for (Some, Some) and (None, Some) and leave
(None, None) -> None), ensuring the variable owned preserves existing kwargs
when ts_str is absent.
In `@crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja`:
- Line 40: Fix the typo in the Jinja comment that reads "{# Find last
non-tool-call assisitant message #}" by changing "assisitant" to "assistant" so
the comment becomes "{# Find last non-tool-call assistant message #}"—update the
comment text in the chat_template.jinja file where that exact comment appears.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: eabf009d-5581-4f18-a421-3e9cc9be3898
📒 Files selected for processing (5)
- crates/tokenizer/src/encoders/kimi_k25_tools.rs
- crates/tokenizer/src/encoders/mod.rs
- crates/tokenizer/src/tiktoken.rs
- crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja
- crates/tokenizer/tests/kimi_k25.rs
```rust
fn resolve_ref(&mut self, reference: &str) -> Option<Value> {
    if reference == "#" {
        self.has_self_ref = true;
        return Some(serde_json::json!({"$self_ref": true}));
    }
    if let Some(name) = reference.strip_prefix("#/$defs/") {
        return self.definitions.get(name).cloned();
    }
    None
}
```
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Limited $ref pattern support may silently degrade schemas.
resolve_ref only handles "#" and "#/$defs/" patterns. Other valid JSON Schema references (e.g., "#/definitions/" used by older OpenAPI specs, or external file refs) return None, causing ParameterTypeRef to emit "any". This is likely intentional to match the Python reference, but consider adding a debug/trace log for unrecognized refs to aid debugging during integration.
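A std-only sketch of the suggested shape — `HashMap<String, String>` stands in for the real `definitions: HashMap<String, Value>`, and `eprintln!` stands in for the crate's tracing facility:

```rust
use std::collections::HashMap;

// Sketch: handle "#", "#/$defs/", and the legacy "#/definitions/" prefix,
// and log unrecognized refs instead of silently returning None.
fn resolve_ref(definitions: &HashMap<String, String>, reference: &str) -> Option<String> {
    if reference == "#" {
        return Some("$self_ref".to_string());
    }
    for prefix in ["#/$defs/", "#/definitions/"] {
        if let Some(name) = reference.strip_prefix(prefix) {
            return definitions.get(name).cloned();
        }
    }
    // Stand-in for trace!/debug!: surface why a ref will fall back to `any`.
    eprintln!("unrecognized $ref pattern: {reference}");
    None
}

fn main() {
    let mut defs = HashMap::new();
    defs.insert("User".to_string(), "interface User {}".to_string());
    println!("{:?}", resolve_ref(&defs, "#/definitions/User"));
    println!("{:?}", resolve_ref(&defs, "external.json#/Foo"));
}
```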
@@ -0,0 +1,658 @@

```rust
//! Kimi-K2.5 tool-declaration encoder. See module-level docs.
//!
//! Mirrors `tool_declaration_ts.py` from the Kimi-K2.5 model snapshot
```
Can we add a url link to this file?
- Add upstream Kimi-K2.5 reference URL to module docs.
- Replace module-level `#![allow(clippy::unwrap_used)]` with narrowly-scoped `#[expect(...)]` on infallible `write!`/`writeln!` sites.
- Bound JSON-schema recursion at 32 levels in `parse_parameter_type`; degrade to `any` on overflow to prevent stack overflow on adversarial tool schemas.
- Fix `assisitant` typo in chat_template.jinja fixture.

Signed-off-by: key4ng <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6fcd8e9551
```rust
    .iter()
    .map(|(k, v)| (k.clone(), v.clone()))
    .collect();
defs_clone.sort_by(|a, b| a.0.cmp(&b.0));
```
Preserve $defs order when emitting interfaces
When a Kimi tool schema contains multiple $defs whose declaration order is not alphabetical, this sorts the generated interface blocks by name instead of preserving the order registered from the schema. The upstream Kimi encoder iterates its copied definitions in insertion order, so these prompts stop being byte-equivalent for such schemas; please keep an ordered registry/iteration rather than sorting the cloned definitions.
Investigation: Verified against upstream moonshotai/Kimi-K2.5/tool_declaration_ts.py — upstream preserves insertion order:

```python
definitions_copy = dict(registry.definitions)
for def_name, def_schema in definitions_copy.items():
    ...
```

`dict(...).items()` is insertion-ordered in Python 3.7+. The PR's `defs_clone.sort_by(|a, b| a.0.cmp(&b.0))` breaks byte-equivalence for any tool schema whose $defs keys aren't already alphabetical.

Suggested fix: swap `SchemaRegistry::definitions: HashMap<String, Value>` for `IndexMap<String, Value>` (or track a parallel `Vec<String>` of insertion order) and drop the sort.
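If pulling in `indexmap` is undesirable, the parallel-Vec variant is only a few lines. A std-only sketch — `OrderedDefs` and the `String` values are illustrative stand-ins for the registry's real types:

```rust
use std::collections::HashMap;

// Sketch: insertion-ordered definitions via a HashMap plus a parallel Vec
// of keys, mirroring Python dict insertion order without indexmap.
struct OrderedDefs {
    order: Vec<String>,
    map: HashMap<String, String>,
}

impl OrderedDefs {
    fn new() -> Self {
        Self { order: Vec::new(), map: HashMap::new() }
    }

    fn insert(&mut self, name: &str, schema: &str) {
        if !self.map.contains_key(name) {
            self.order.push(name.to_string());
        }
        self.map.insert(name.to_string(), schema.to_string());
    }

    // Iterate in insertion order, as upstream's `dict(...).items()` does.
    fn iter<'a>(&'a self) -> impl Iterator<Item = (&'a str, &'a str)> + 'a {
        self.order.iter().map(|k| (k.as_str(), self.map[k].as_str()))
    }
}

fn main() {
    let mut defs = OrderedDefs::new();
    defs.insert("Zeta", "{}");
    defs.insert("Alpha", "{}");
    let names: Vec<&str> = defs.iter().map(|(k, _)| k).collect();
    println!("{names:?}");
}
```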
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/tokenizer/src/encoders/kimi_k25_tools.rs`:
- Around line 112-129: The loop that builds TypeScript definitions currently
always writes "interface {name} {body}", producing invalid TS for non-object
schemas; change the branch after calling parse_parameter_type(...) /
obj_type.to_typescript(...) to detect whether the parsed type is an object
(match on the ParameterType/Type enum variant used by parse_parameter_type,
e.g., ParameterType::Object or similar) and only emit "interface {name} {body}"
for that case; for all other variants emit a type alias like "type {name} =
{body};" (preserve the description handling and writing into def_str and pushing
into interfaces).
- Around line 133-143: The emitted TypeScript uses raw function and property
names which can produce invalid identifiers (e.g., hyphens); sanitize/normalize
names before interpolation: replace/normalize invalid identifier characters in
the computed function_name used in the type alias (variable function_name) to
produce a valid ECMAScript identifier (e.g., replace non-[A-Za-z0-9_$] with
underscore and ensure it doesn’t start with a digit, or prefix with `_`). For
object/parameter keys coming from parsed.to_typescript(...) (used when
root_interface_name is None and around the parameter emission at lines
~346-350), ensure property keys are emitted quoted if they are not valid
identifiers—either update parsed.to_typescript to quote keys or post-process its
output to wrap any non-identifier keys in quotes. Apply these changes where
function_name is created and where parsed.to_typescript is used so both type
alias names and object property keys are valid TypeScript.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: bebd994c-8f2e-4119-8874-5c85c10e67f6
📒 Files selected for processing (2)
- crates/tokenizer/src/encoders/kimi_k25_tools.rs
- crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja
```rust
for (name, schema) in defs_clone {
    let obj_type = parse_parameter_type(&schema, &mut registry);
    let body = obj_type.to_typescript("", &registry);
    let mut def_str = String::new();
    if let Some(desc) = schema.get("description").and_then(Value::as_str) {
        if !desc.is_empty() {
            def_str.push_str(&format_description(desc, ""));
            def_str.push('\n');
        }
    }
    #[expect(
        clippy::unwrap_used,
        reason = "write!/writeln! into String cannot fail"
    )]
    {
        write!(def_str, "interface {name} {body}").unwrap();
    }
    interfaces.push(def_str);
```
🧩 Analysis chain
🌐 Web query:
Is `interface Color "red" | "blue"` valid TypeScript syntax, or must non-object schema definitions be emitted as `type Color = ...`?
💡 Result:
`interface Color "red" | "blue"` is not valid TypeScript syntax. You must use a type alias for a union of non-object types. In TypeScript, `interface` is for describing object shapes, while unions are expressed via type aliases (e.g. `type Color = "red" | "blue"`). Interfaces can include union types only as property types, not as the interface's own definition. Examples:
- Valid: `type Color = "red" | "blue"` [1][2]
- Valid (union inside an interface property): `interface Paint { color: "red" | "blue" }` [2]
- Not valid: `interface Color = A | B` (union at the top level requires `type`) [3]
So if your schema generation emits something like `interface Color "red" | "blue"`, it should instead emit `type Color = "red" | "blue"`. [4]
Citations:
- 1: https://typescriptlang.org/docs/handbook/unions-and-intersections.html
- 2: https://www.typescriptlang.org/docs/handbook/literal-types.html
- 3: https://stackoverflow.com/questions/42156270/create-typescript-interface-as-union-of-other-interfaces
- 4: https://www.typescriptlang.org/docs/handbook/advanced-types.html
🏁 Script executed (repository: lightseekorg/smg):

```shell
# Search for ParameterType definition
rg "enum ParameterType|struct ParameterType|type ParameterType" -A 10
# Also search for where Object variant is used
rg "ParameterType::Object" -B 2 -A 2
```
🏁 Script executed:

```shell
# Read the context around lines 112-129
sed -n '100,140p' crates/tokenizer/src/encoders/kimi_k25_tools.rs | cat -n
```
Use type aliases for non-object $defs entries.
The code currently renders all schema definitions as interface {name} {body}, which produces invalid TypeScript for non-object types (enums, scalars, arrays, unions). For example, interface Color "red" | "blue" or interface Tags Array<string> are invalid syntax. TypeScript requires non-object types to use type aliases instead.
Update lines 112–129 to emit interface only for Object variants and type aliases for all others:
Suggested fix
```diff
 for (name, schema) in defs_clone {
     let obj_type = parse_parameter_type(&schema, &mut registry);
     let body = obj_type.to_typescript("", &registry);
     let mut def_str = String::new();
     if let Some(desc) = schema.get("description").and_then(Value::as_str) {
         if !desc.is_empty() {
             def_str.push_str(&format_description(desc, ""));
             def_str.push('\n');
         }
     }
     #[expect(
         clippy::unwrap_used,
         reason = "write!/writeln! into String cannot fail"
     )]
     {
-        write!(def_str, "interface {name} {body}").unwrap();
+        if matches!(&obj_type, ParameterType::Object(_)) {
+            write!(def_str, "interface {name} {body}").unwrap();
+        } else {
+            write!(def_str, "type {name} = {body};").unwrap();
+        }
     }
     interfaces.push(def_str);
 }
```

🤖 Prompt for AI Agents
```rust
let function_name = function
    .get("name")
    .and_then(Value::as_str)
    .unwrap_or("function");
let type_def = match root_interface_name {
    Some(n) => format!("type {function_name} = (_: {n}) => any;"),
    None => format!(
        "type {function_name} = (_: {}) => any;",
        parsed.to_typescript("", &registry)
    ),
};
```
🧩 Analysis chain
🌐 Web query:
In TypeScript, do object property names containing `-` or spaces need to be quoted, and can a type alias name like `get-user` be emitted without normalization?
💡 Result:
Yes, object property names containing `-` or spaces must be quoted in TypeScript object literals, as they are invalid JavaScript identifiers.[1][2][3][4][5] Unquoted, the parser interprets `-` as subtraction or spaces as invalid token separators, causing syntax errors.[6][4] No, a type alias name like `get-user` cannot be emitted without normalization (i.e., cannot be used as-is). Type alias names must be valid ECMAScript identifiers, which cannot contain `-` or spaces; hyphens and spaces are not allowed in unquoted identifiers per ECMAScript grammar.[7][8][9][10] Attempting `type get-user = string;` results in a syntax error, as `get-user` is not a valid IdentifierName. TypeScript follows these rules for declaration parsing and emit (type aliases are erased in JS output but invalid syntax prevents compilation).[11][12][10][13]
Citations:
- 1: Better handlng of hyphenated properties microsoft/TypeScript#4032
- 2: https://www.typescripts.org/docs/handbook/2/objects.html
- 3: https://www.typescriptlang.org/docs/handbook/interfaces
- 4: https://stackoverflow.com/questions/38441719/how-can-i-construct-an-object-with-keys-that-contain-hyphens
- 5: https://designcise.com/web/tutorial/can-javascript-object-property-name-have-dashes
- 6: https://stackoverflow.com/questions/7122609/how-do-i-reference-a-javascript-object-property-with-a-hyphen-in-it
- 7: JSDoc Property Name with Dash Not Recognized microsoft/TypeScript#14395
- 8: Typescript supports less valid variable names characters than javascript microsoft/TypeScript#42452
- 9: https://tc39.es/ecma262/multipage/ecmascript-language-lexical-grammar.html
- 10: https://www.typescriptlang.org/docs/handbook/2/everyday-types.html
- 11: Better error message for "Cannot find name" when looking for a type microsoft/TypeScript#7900
- 12: https://www.typescriptlang.org/docs/handbook/2/objects.html
- 13: https://github.com/microsoft/TypeScript/blob/master/src/compiler/emitter.ts
🏁 Scripts executed:

```shell
wc -l crates/tokenizer/src/encoders/kimi_k25_tools.rs
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '125,155p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '340,360p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '1,50p'
grep -n "struct Parameter" crates/tokenizer/src/encoders/kimi_k25_tools.rs -A 5
grep -n "impl.*BaseType\|fn to_typescript" crates/tokenizer/src/encoders/kimi_k25_tools.rs | head -20
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '200,250p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '358,425p'
grep -n "fn from_json\|Parameter {" crates/tokenizer/src/encoders/kimi_k25_tools.rs | head -20
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '620,680p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '315,356p'
cat -n crates/tokenizer/src/encoders/kimi_k25_tools.rs | sed -n '420,450p'
grep -rn "get-user\|foo-bar\|-" crates/tokenizer/tests/ | grep -i kimi | head -20
find . -name "*kimi*test*" -o -name "*test*kimi*" 2>/dev/null
cat -n crates/tokenizer/tests/kimi_k25.rs 2>/dev/null | head -50
cat -n crates/tokenizer/tests/kimi_k25.rs | sed -n '50,150p'
grep -n "hyphen\|-user\|foo-bar" crates/tokenizer/tests/kimi_k25.rs
tail -100 crates/tokenizer/tests/kimi_k25.rs
```
Sanitize function and property names before emitting TypeScript code.
Function names (line 138, 140) and parameter names (line 349) are interpolated without sanitization. Names like get-user or foo-bar produce invalid TypeScript (type get-user = ... or foo-bar: string). Type alias names must be valid ECMAScript identifiers (hyphens not allowed), and object property names with hyphens must be quoted. Normalize or alias function names, and quote property keys in the output.
Also applies to: 346-350
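One way to meet both constraints, as a std-only sketch. ASCII-only normalization is an assumption, and neither `sanitize_ident` nor `quote_key_if_needed` is a name from the PR:

```rust
// Sketch: normalize an arbitrary tool name into a valid ECMAScript
// identifier: map chars outside [A-Za-z0-9_$] to '_', and prefix '_'
// if the result would start with a digit.
fn sanitize_ident(name: &str) -> String {
    let mut out: String = name
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == '$' { c } else { '_' })
        .collect();
    if out.chars().next().is_some_and(|c| c.is_ascii_digit()) {
        out.insert(0, '_');
    }
    if out.is_empty() {
        out.push('_');
    }
    out
}

// Sketch: quote an object property key only when it is not already a
// valid identifier, so `foo-bar: string` becomes `"foo-bar": string`.
fn quote_key_if_needed(key: &str) -> String {
    let valid = !key.is_empty()
        && !key.starts_with(|c: char| c.is_ascii_digit())
        && key.chars().all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '$');
    if valid { key.to_string() } else { format!("{key:?}") }
}

fn main() {
    println!("{}", sanitize_ident("get-user"));
    println!("{}", quote_key_if_needed("foo-bar"));
}
```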
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7171022e63
```rust
        "parameters".to_string()
    }
    Some(_) => reference.rsplit('/').next().unwrap_or("").to_string(),
    None => "any".to_string(),
```
Fail unresolved $ref instead of silently emitting any
When a tool schema contains an unresolved or unsupported $ref, this code converts it to any and still emits tools_ts_str, which keeps the template on the TypeScript branch with lossy declarations. In Kimi’s Python tokenizer, unresolved refs raise during conversion and apply_chat_template falls back to the JSON tool declaration path, so this behavior diverges in exactly the cases where schema fidelity is already fragile and can degrade tool-call correctness.
Investigation: Verified against upstream — resolve_ref raises:

```python
def resolve_ref(self, ref: str) -> dict[str, Any]:
    if ref == "#":
        self.has_self_ref = True
        return {"$self_ref": True}
    elif ref.startswith("#/$defs/"):
        def_name = ref.split("/")[-1]
        if def_name not in self.definitions:
            raise ValueError(f"Reference not found: {ref}")
        return self.definitions[def_name]
    else:
        raise ValueError(f"Unsupported reference format: {ref}")
```

There is no fallback to None or "any". The Python apply_chat_template catches this and drops to the JSON tool-declaration path, so byte-equivalence with upstream is violated in exactly the cases where TS-namespace fidelity is already broken.

The current Rust path:
- `resolve_ref` returns `None` for unknown / unsupported prefixes
- `ParameterTypeRef::parse` maps that to `"any"` and stays on the TS branch with lossy types

Trade-off: raising in apply_chat_template requires the caller (server) to know how to fall back to JSON. Cheapest improvement that preserves observability: emit `tracing::warn!` when this happens so the silent degradation is visible. Better: propagate `Err` and let `apply_chat_template` decide the fallback policy. cc @key4ng
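A sketch of the Err-propagating shape, mirroring the upstream raise. This is std-only and illustrative: the `String` error type and `HashMap<String, String>` stand in for the crate's real error and `Value` types.

```rust
use std::collections::HashMap;

// Sketch: fallible ref resolution so apply_chat_template can decide the
// JSON-fallback policy instead of silently degrading to `any`.
fn resolve_ref(
    definitions: &HashMap<String, String>,
    reference: &str,
) -> Result<String, String> {
    if reference == "#" {
        return Ok("$self_ref".to_string());
    }
    if let Some(name) = reference.strip_prefix("#/$defs/") {
        return definitions
            .get(name)
            .cloned()
            .ok_or_else(|| format!("Reference not found: {reference}"));
    }
    Err(format!("Unsupported reference format: {reference}"))
}

fn main() {
    let defs = HashMap::from([("User".to_string(), "{...}".to_string())]);
    // The caller can now match on Err and fall back to JSON declarations.
    println!("{:?}", resolve_ref(&defs, "external.json#/Foo"));
}
```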
```rust
// Fallthrough: schemas with no type/anyOf/enum/$ref. Degrade to `any`
// permissively rather than erroring — matches the Python reference.
ParameterType::Scalar(ParameterTypeScalar::any())
```
Stop coercing unsupported schema objects to any
The fallback here turns any non-empty schema object without type/enum/anyOf/$ref into any (for example schemas using oneOf, allOf, or const), so conversion succeeds and the template stays on the TypeScript path with dropped constraints. Upstream Kimi logic raises on these invalid/unsupported shapes and then falls back to JSON tool declarations, so this permissive coercion produces a different prompt format in real schemas that should not be silently approximated.
Investigation: Verified — the in-code comment is misleading. Upstream raises:

```python
def _parse_parameter_type(json_schema_object, registry=None):
    # ... checks for $ref, anyOf, enum, type ...
    elif json_schema_object == {}:
        return _ParameterTypeScalar(type="any")
    else:
        raise ValueError(f"Invalid JSON Schema object: {json_schema_object}")
```

So upstream only coerces an empty object `{}` to `any`. Schemas with `oneOf` / `allOf` / `const` / any other non-empty content raise, and `apply_chat_template` then falls back to JSON tool declarations.
The Rust code:

```rust
if obj.is_empty() {
    return ParameterType::Scalar(ParameterTypeScalar::any());
}
// Fallthrough: schemas with no type/anyOf/enum/$ref. Degrade to `any`
// permissively rather than erroring — matches the Python reference.
ParameterType::Scalar(ParameterTypeScalar::any())
```

The empty-object branch matches upstream. The fallthrough does not — the inline comment "matches the Python reference" is incorrect.
Suggested minimum: at least fix the misleading comment and add a `tracing::warn!` when this path fires. Better: return `Err` and let `apply_chat_template` decide the JSON-fallback policy (same trade-off as the unresolved-`$ref` comment above). cc @key4ng
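A minimal sketch of the stricter fallthrough, using hypothetical simplified types (the real `ParameterType` is richer, and schemas here are flattened to string maps purely for illustration): only the empty object `{}` coerces to `any`; any other unrecognized shape errors so the caller can fall back to JSON declarations.

```rust
use std::collections::BTreeMap;

// Hypothetical simplified type, not the crate's actual enum.
#[derive(Debug, PartialEq)]
enum ParameterType {
    Any,
    Scalar(String),
}

fn parse_parameter_type(obj: &BTreeMap<String, String>) -> Result<ParameterType, String> {
    if let Some(t) = obj.get("type") {
        return Ok(ParameterType::Scalar(t.clone()));
    }
    // ($ref / anyOf / enum handling elided for brevity.)
    if obj.is_empty() {
        return Ok(ParameterType::Any); // matches upstream: {} -> any
    }
    // Non-empty schema without type/enum/anyOf/$ref (e.g. oneOf, allOf,
    // const): error like upstream instead of silently degrading to `any`.
    Err(format!("invalid JSON Schema object: {obj:?}"))
}
```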
```diff
@@ -0,0 +1,131 @@
+//! Essential coverage for the Kimi-K2.5 tool renderer:
```
Let's rename this one to `kimi_k25_renderer_detection.rs`?
```rust
/// use. Missing / unreadable / malformed config falls back to `Renderer::Jinja`
/// silently with a debug log, mirroring `huggingface.rs::detect_renderer_from_config`.
fn detect_renderer_from_config(dir: &Path) -> Renderer {
    let path = dir.join("config.json");
```
Maybe we extract a small helper `architectures_from_config` and have both `detect_renderer_from_config` implementations call it, so we can reduce some of the code duplication here. Your call; it is not a big duplication now.
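A dependency-free sketch of what that shared helper could look like. The names follow the suggestion above but are hypothetical; a real implementation would deserialize `config.json` with `serde_json` rather than scanning raw text, which is done here only to keep the example self-contained.

```rust
// Hypothetical helper: pull the first entry of the "architectures" array
// out of a config.json string. Naive text scan for illustration only.
fn first_architecture(config_json: &str) -> Option<String> {
    let idx = config_json.find("\"architectures\"")?;
    let rest = &config_json[idx..];
    let open = rest.find('[')?;
    let seg = &rest[open..];
    let q1 = seg.find('"')? + 1;
    let q2 = q1 + seg[q1..].find('"')?;
    Some(seg[q1..q2].to_string())
}

// Both detect_renderer_from_config implementations could then share it;
// missing or unknown architectures fall back to the Jinja renderer.
fn renderer_for(config_json: &str) -> &'static str {
    match first_architecture(config_json).as_deref() {
        Some("KimiK25ForConditionalGeneration") => "KimiK25Tools",
        _ => "Jinja",
    }
}
```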
Description
Problem
Kimi-K2.5's `chat_template.jinja` branches on a `tools_ts_str` variable that the model's custom Python tokenizer (`tokenization_kimi.py` + `tool_declaration_ts.py`) injects via `apply_chat_template`. SMG's Rust path renders the template directly with minijinja and never executes the Python preprocessor, so `tools_ts_str` is undefined and the template falls into the JSON fallback. Tools are emitted as raw JSON instead of the trained TypeScript-namespace format → ~1.7 pp BFCL drop on `simple_python`.

Solution
Hand-port the Python TS encoder to Rust and wire it into the Tiktoken loader behind a new renderer variant. Detection is driven by `config.json::architectures == "KimiK25ForConditionalGeneration"`, so other models (including Kimi-K2-Instruct and Kimi-K2-Thinking) are unaffected — they don't ship `tool_declaration_ts.py` and don't reference `tools_ts_str` in their templates.

Changes
- `crates/tokenizer/src/encoders/kimi_k25_tools.rs`: new TS-namespace encoder + `apply_kimi_k25_tools` wrapper that computes `tools_ts_str` and merges it into `template_kwargs` before delegating to the existing minijinja render.
- `crates/tokenizer/src/tiktoken.rs`: new `Renderer` enum (`Jinja` / `KimiK25Tools`) with architecture-based detection, mirroring the pattern in `huggingface.rs`.
- `crates/tokenizer/tests/kimi_k25.rs`: consolidated test suite — encoder coverage for nested objects + optional fields, enum/union encoding, and an end-to-end render that proves dispatch wiring (transitively covers detection).
- `crates/tokenizer/tests/fixtures/kimi_k25/chat_template.jinja`: vendored chat template for the e2e test.

Test Plan
- `cargo test -p llm-tokenizer --test kimi_k25` → 3 tests pass.
- `cargo clippy -p llm-tokenizer --tests --all-features -- -D warnings` clean.
- `cargo +nightly fmt -p llm-tokenizer -- --check` clean.
- `RUST_LOG=tokenizer=debug`: logs show `selected KimiK25Tools chat-template renderer` at startup; rendered prompts contain `# Tools\n\n## functions\nnamespace functions {`.
- `simple_python` vs. unfixed baseline. Numbers will be attached as a comment.

Checklist
- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes
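For context, the prompt framing that the test plan greps for (`# Tools\n\n## functions\nnamespace functions {`) can be sketched as below. The per-tool declaration bodies are placeholders, not the real encoder's output; only the header framing is taken from the source.

```rust
// Illustrative sketch of the TS-namespace prompt framing; the actual
// encoder lives in crates/tokenizer/src/encoders/kimi_k25_tools.rs and
// the declaration signatures here are assumptions.
fn tools_ts_header(tool_names: &[&str]) -> String {
    let mut s = String::from("# Tools\n\n## functions\nnamespace functions {\n");
    for name in tool_names {
        // Each tool becomes a typed declaration inside the namespace
        // (placeholder signature, not the real typed parameters).
        s.push_str(&format!("type {name} = (_: object) => any;\n"));
    }
    s.push_str("} // namespace functions\n");
    s
}
```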
New Features
Tests