
fix: CJK-aware token estimation in estimateTokens#47

Open
lishixiang0705 wants to merge 1 commit into Martian-Engineering:main from lishixiang0705:fix/cjk-token-estimation

Conversation

@lishixiang0705

Problem

The current estimateTokens function uses Math.ceil(text.length / 4), which assumes ~4 characters per token. This is accurate for English/ASCII text but underestimates token counts for CJK (Chinese/Japanese/Korean) content by ~3x.

Root cause

In JavaScript, string.length returns UTF-16 code units where each CJK character = 1 unit. However, LLM tokenizers (Claude, GPT, etc.) encode CJK characters at ~1.5 tokens per character, not 0.25.

Production impact

In a Chinese-heavy OpenClaw session:

  • LCM estimated context at 59k tokens
  • Actual API usage was 174k tokens (3x underestimate)
  • Compaction triggered far too late → session hit the 200k hard limit and was force-reset

Verified with real data, using a message containing 40% CJK and 53% ASCII (28,949 chars):

| Method | Tokens |
| --- | --- |
| Old (`length / 4`) | 7,238 |
| New (CJK-aware) | 22,180 |
| Ratio | 3.1x |

Fix

Apply per-character weighting in both assembler.ts and compaction.ts:

  • CJK Unified Ideographs (U+4E00–U+9FFF) + Extension A (U+3400–U+4DBF): 1.5 tokens/char
  • All other characters: 0.25 tokens/char (unchanged)

Zero dependencies added. Backwards compatible — only affects estimation accuracy.
