feat: add HuggingFace tokenizer integration for precise token counting#28

Open
liuy wants to merge 3 commits into Martian-Engineering:main from liuy:feat/huggingface-tokenizer

Conversation


liuy commented Mar 10, 2026

Add HuggingFace Tokenizer Integration for Precise Token Counting

Background & Motivation

Lossless-Claw currently uses a simple heuristic (chars / 4) to estimate token counts for context compaction decisions. This approach works reasonably well for English text but has significant accuracy issues for other content types:

  • Chinese text: Underestimates by ~45.8%
  • Mixed content: Underestimates by ~28.0%
  • Code: Underestimates by ~19.7%
  • English text: Overestimates by ~23.1%
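The heuristic in question can be sketched in a couple of lines (the function name here is illustrative, not the plugin's actual identifier; rounding up matches the counts reported in the accuracy table below):

```typescript
// chars/4 heuristic: roughly one token per four characters,
// rounded up so short strings never count as zero tokens.
function heuristicTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

// e.g. 128 characters of English -> 32 estimated tokens,
// while a real tokenizer produces 26 (see the accuracy table below).
```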

Inaccurate token counting leads to:

  • Premature compaction (wasting context window capacity)
  • Delayed compaction (exceeding model context limits)
  • Suboptimal summarization decisions

This PR adds optional precise token counting using HuggingFace tokenizers, which provides accurate token counts for all content types while maintaining acceptable latency.

Usage

Configuration

Enable precise tokenizer in your OpenClaw config (~/.openclaw/openclaw.json):

{
  "plugins": {
    "entries": {
      "lossless-claw": {
        "enabled": true,
        "config": {
          "useTokenizer": true,
          "proxy": "http://127.0.0.1:7890"
        }
      }
    }
  }
}

Options:

  • useTokenizer (boolean, default: false): Enable precise token counting
  • proxy (string, optional): HTTP(S) proxy for downloading tokenizers from HuggingFace (needed in China)

How It Works

  1. When useTokenizer: true, LCM reads your OpenClaw model config
  2. Automatically downloads the matching tokenizer from HuggingFace on first use
  3. Caches tokenizer in ~/.openclaw/tokenizers/ for future use
  4. Falls back to the chars/4 heuristic if the tokenizer fails or the model is not supported
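The counting flow above can be sketched as follows. `countTokens` is the function named under Files Changed, but the `Tokenizer` interface and the exact signature here are assumptions for illustration, not the PR's actual API:

```typescript
interface Tokenizer {
  // Hypothetical interface: returns token ids for the given text.
  encode(text: string): Promise<number[]>;
}

// Sketch of the fallback behavior described above (assumed shape,
// not the PR's exact implementation):
async function countTokens(text: string, tokenizer?: Tokenizer): Promise<number> {
  if (tokenizer) {
    try {
      const ids = await tokenizer.encode(text);
      return ids.length; // precise count from the real tokenizer
    } catch {
      // swallow the error and fall through to the heuristic
    }
  }
  return Math.ceil(text.length / 4); // chars/4 fallback
}
```

The key property is that a missing or failing tokenizer never breaks counting; callers always get a number back.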

Supported Models

| OpenClaw Model | HuggingFace Tokenizer (Auto-Detected) |
| --- | --- |
| zai/glm-5 | zai-org/GLM-5 |
| zai/glm-4.7 | zai-org/GLM-4.7 |
| minimax/MiniMax-M2.5 | MiniMaxAI/MiniMax-M2.5 |
| minimax/MiniMax-M2.1 | MiniMaxAI/MiniMax-M2.1 |
| deepseek/DeepSeek-V3.2 | deepseek-ai/DeepSeek-V3.2 |
| deepseek/DeepSeek-V3.1 | deepseek-ai/DeepSeek-V3.1 |

Fallback: For unsupported models, we fall back to chars/4 heuristic automatically.
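The table above amounts to a small lookup from OpenClaw model id to HuggingFace repo; a miss signals the caller to use the heuristic. A minimal sketch (the map and function names are assumptions, not the PR's actual identifiers):

```typescript
// Model -> tokenizer mapping from the table above.
const TOKENIZER_REPOS: Record<string, string> = {
  "zai/glm-5": "zai-org/GLM-5",
  "zai/glm-4.7": "zai-org/GLM-4.7",
  "minimax/MiniMax-M2.5": "MiniMaxAI/MiniMax-M2.5",
  "minimax/MiniMax-M2.1": "MiniMaxAI/MiniMax-M2.1",
  "deepseek/DeepSeek-V3.2": "deepseek-ai/DeepSeek-V3.2",
  "deepseek/DeepSeek-V3.1": "deepseek-ai/DeepSeek-V3.1",
};

// Returns the HuggingFace repo for a model, or undefined to tell the
// caller to fall back to the chars/4 heuristic.
function resolveTokenizerRepo(model: string): string | undefined {
  return TOKENIZER_REPOS[model];
}
```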

Performance Results

Accuracy Comparison

| Type | Chars | chars/4 | Precise | Diff |
| --- | --- | --- | --- | --- |
| English | 128 | 32 | 26 | +23.1% |
| Chinese | 51 | 13 | 24 | -45.8% |
| Mixed | 69 | 18 | 25 | -28.0% |
| Code | 211 | 53 | 66 | -19.7% |

Key Finding: The chars/4 heuristic underestimates Chinese text by 45.8%, which can cause compaction to trigger too late and exceed context limits.

Latency Comparison

| Type | chars/4 (ms) | tokenizer (ms) |
| --- | --- | --- |
| English | 0.026 | 0.415 |
| Chinese | 0.003 | 0.415 |
| Mixed | 0.002 | 0.125 |
| Code | 0.002 | 0.258 |

Key Finding: Precise tokenizer adds 0.125-0.415ms latency, which is acceptable for compaction scenarios (not in the critical path).

Implementation Details

  • Auto-detection: Reads OpenClaw agents.defaults.model config to select matching tokenizer
  • Opt-in: Default behavior unchanged (uses chars/4 heuristic)
  • Fallback: If tokenizer fails or model not supported, automatically falls back to heuristic
  • Cache: Tokenizers cached in ~/.openclaw/tokenizers/ after first download
  • Async: All token counting is async, no blocking operations
  • Proxy support: HTTP(S) proxy for downloading tokenizers in restricted networks
  • Config schema: openclaw.plugin.json updated with all config options (useTokenizer, proxy, etc.)
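For the cache location mentioned above, the directory can be resolved portably from the user's home directory. A tiny sketch (the helper name and the layout inside the directory are assumptions):

```typescript
import { homedir } from "node:os";
import { join } from "node:path";

// Resolves the tokenizer cache directory described above:
// ~/.openclaw/tokenizers/
function tokenizerCacheDir(): string {
  return join(homedir(), ".openclaw", "tokenizers");
}
```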

Files Changed

  • src/token-utils.ts: New countTokens() function with tokenizer support
  • src/tokenizers/huggingface.ts: HuggingFace tokenizer wrapper with auto-detection
  • src/db/config.ts: Added useTokenizer and proxy config options
  • src/types.ts: Added TokenizerService interface
  • index.ts: Create tokenizer instance with auto-detected model, inject into dependencies
  • openclaw.plugin.json: Updated configSchema with all LCM config options
  • scripts/generate-manifest.ts: Script to auto-generate configSchema from code
  • src/assembler.ts, src/compaction.ts, src/retrieval.ts, src/summarize.ts: Updated to use countTokens()

Tests

  • 204 tokenizer-related tests passing
  • Added unit tests for token counting logic
  • Added performance tests with real GLM-5 tokenizer

Dependencies

  • @huggingface/tokenizers 0.1.2: Pure JavaScript tokenizer library (~5MB)

liuy force-pushed the feat/huggingface-tokenizer branch 9 times, most recently from 85f0d40 to 386a40f on March 11, 2026 01:02

liuy commented Mar 12, 2026

@jalehman any input? This patch looks big because of the GLM-5 tokenizer.json used for the perf test; if you'd rather not include it, I can remove the perf test and tokenizer.json. Thanks for your time :)

@jalehman
Contributor

Yeah, I think this is a bit too much to add for comfort.

- Add HuggingFace tokenizer integration for accurate token counting
- Sync tokenizer API and load from cache in constructor
- Pass useTokenizer and tokenizer to estimateSessionTokenCountForAfterTurn
- Add comprehensive tokenizer integration tests
liuy force-pushed the feat/huggingface-tokenizer branch from 386a40f to b6958c7 on March 14, 2026 12:37

liuy commented Mar 14, 2026

Yeah, I think this is a bit too much to add for comfort.

@jalehman Done. I've also been using the tokenizer locally for almost a week and haven't seen a compaction since. Please review when you have a moment :)


liuy commented Mar 18, 2026

@jalehman any input to proceed? I noticed you had 2 commits added. Thanks for your time
