
fix: CJK-aware token estimation in estimateTokens#47

Open
lishixiang0705 wants to merge 1 commit into Martian-Engineering:main from lishixiang0705:fix/cjk-token-estimation

Conversation

@lishixiang0705

Problem

The current estimateTokens function uses Math.ceil(text.length / 4), which assumes ~4 characters per token. This is accurate for English/ASCII text but underestimates token counts for CJK (Chinese/Japanese/Korean) content by ~3x.

Root cause

In JavaScript, string.length returns UTF-16 code units where each CJK character = 1 unit. However, LLM tokenizers (Claude, GPT, etc.) encode CJK characters at ~1.5 tokens per character, not 0.25.

Production impact

In a Chinese-heavy OpenClaw session:

  • LCM estimated context at 59k tokens
  • Actual API usage was 174k tokens (3x underestimate)
  • Compaction triggered far too late → session hit the 200k hard limit and was force-reset

Verified with real data, using a message containing 40% CJK and 53% ASCII (28,949 chars):

| Method | Tokens |
| --- | --- |
| Old (`length / 4`) | 7,238 |
| New (CJK-aware) | 22,180 |
| Ratio | 3.1x |

Fix

Apply per-character weighting in both assembler.ts and compaction.ts:

  • CJK Unified Ideographs (U+4E00–U+9FFF) + Extension A (U+3400–U+4DBF): 1.5 tokens/char
  • All other characters: 0.25 tokens/char (unchanged)

Zero dependencies added. Backwards compatible — only affects estimation accuracy.
