feat: add HuggingFace tokenizer integration for precise token counting#28

Open
liuy wants to merge 3 commits into Martian-Engineering:main from liuy:feat/huggingface-tokenizer

Conversation


liuy commented Mar 10, 2026

Add HuggingFace Tokenizer Integration for Precise Token Counting

Background & Motivation

Lossless-Claw currently uses a simple heuristic (chars / 4) to estimate token counts for context compaction decisions. This approach works reasonably well for English text but has significant accuracy issues for other content types:

  • Chinese text: Underestimates by ~45.8%
  • Mixed content: Underestimates by ~28.0%
  • Code: Underestimates by ~19.7%
  • English text: Overestimates by ~23.1%
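The heuristic in question can be sketched in a couple of lines (the function name here is illustrative, not the plugin's actual identifier; rounding up matches the counts reported in the accuracy table below):

```typescript
// chars/4 heuristic: roughly one token per four characters,
// rounded up so short strings never count as zero tokens.
function heuristicTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

// e.g. 128 characters of English -> 32 estimated tokens,
// while a real tokenizer produces 26 (see the accuracy table below).
```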

Inaccurate token counting leads to:

  • Premature compaction (wasting context window capacity)
  • Delayed compaction (exceeding model context limits)
  • Suboptimal summarization decisions

This PR adds optional precise token counting using HuggingFace tokenizers, which provides accurate token counts for all content types while maintaining acceptable latency.

Usage

Configuration

Enable precise tokenizer in your OpenClaw config (~/.openclaw/openclaw.json):

{
  "plugins": {
    "entries": {
      "lossless-claw": {
        "enabled": true,
        "config": {
          "useTokenizer": true,
          "proxy": "http://127.0.0.1:7890"
        }
      }
    }
  }
}

Options:

  • useTokenizer (boolean, default: false): Enable precise token counting
  • proxy (string, optional): HTTP(S) proxy for downloading tokenizers from HuggingFace (needed in China)

How It Works

  1. When useTokenizer: true, LCM reads your OpenClaw model config
  2. Automatically downloads the matching tokenizer from HuggingFace on first use
  3. Caches tokenizer in ~/.openclaw/tokenizers/ for future use
  4. Falls back to the chars/4 heuristic if the tokenizer fails or the model is not supported
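The counting flow above can be sketched as follows. `countTokens` is the function named under Files Changed, but the `Tokenizer` interface and the exact signature here are assumptions for illustration, not the PR's actual API:

```typescript
interface Tokenizer {
  // Hypothetical interface: returns token ids for the given text.
  encode(text: string): Promise<number[]>;
}

// Sketch of the fallback behavior described above (assumed shape,
// not the PR's exact implementation):
async function countTokens(text: string, tokenizer?: Tokenizer): Promise<number> {
  if (tokenizer) {
    try {
      const ids = await tokenizer.encode(text);
      return ids.length; // precise count from the real tokenizer
    } catch {
      // swallow the error and fall through to the heuristic
    }
  }
  return Math.ceil(text.length / 4); // chars/4 fallback
}
```

The key property is that a missing or failing tokenizer never breaks counting; callers always get a number back.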

Supported Models

| OpenClaw Model | HuggingFace Tokenizer (Auto-Detected) |
| --- | --- |
| zai/glm-5 | zai-org/GLM-5 |
| zai/glm-4.7 | zai-org/GLM-4.7 |
| minimax/MiniMax-M2.5 | MiniMaxAI/MiniMax-M2.5 |
| minimax/MiniMax-M2.1 | MiniMaxAI/MiniMax-M2.1 |
| deepseek/DeepSeek-V3.2 | deepseek-ai/DeepSeek-V3.2 |
| deepseek/DeepSeek-V3.1 | deepseek-ai/DeepSeek-V3.1 |

Fallback: For unsupported models, we fall back to chars/4 heuristic automatically.
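The table above amounts to a small lookup from OpenClaw model id to HuggingFace repo; a miss signals the caller to use the heuristic. A minimal sketch (the map and function names are assumptions, not the PR's actual identifiers):

```typescript
// Model -> tokenizer mapping from the table above.
const TOKENIZER_REPOS: Record<string, string> = {
  "zai/glm-5": "zai-org/GLM-5",
  "zai/glm-4.7": "zai-org/GLM-4.7",
  "minimax/MiniMax-M2.5": "MiniMaxAI/MiniMax-M2.5",
  "minimax/MiniMax-M2.1": "MiniMaxAI/MiniMax-M2.1",
  "deepseek/DeepSeek-V3.2": "deepseek-ai/DeepSeek-V3.2",
  "deepseek/DeepSeek-V3.1": "deepseek-ai/DeepSeek-V3.1",
};

// Returns the HuggingFace repo for a model, or undefined to tell the
// caller to fall back to the chars/4 heuristic.
function resolveTokenizerRepo(model: string): string | undefined {
  return TOKENIZER_REPOS[model];
}
```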

Performance Results

Accuracy Comparison

| Type | Chars | chars/4 | Precise | Diff |
| --- | --- | --- | --- | --- |
| English | 128 | 32 | 26 | +23.1% |
| Chinese | 51 | 13 | 24 | -45.8% |
| Mixed | 69 | 18 | 25 | -28.0% |
| Code | 211 | 53 | 66 | -19.7% |

Key Finding: The chars/4 heuristic underestimates Chinese text by 45.8%, which can cause compaction to trigger too late and exceed context limits.

Latency Comparison

| Type | chars/4 (ms) | tokenizer (ms) |
| --- | --- | --- |
| English | 0.026 | 0.415 |
| Chinese | 0.003 | 0.415 |
| Mixed | 0.002 | 0.125 |
| Code | 0.002 | 0.258 |

Key Finding: Precise tokenizer adds 0.125-0.415ms latency, which is acceptable for compaction scenarios (not in the critical path).

Implementation Details

  • Auto-detection: Reads OpenClaw agents.defaults.model config to select matching tokenizer
  • Opt-in: Default behavior unchanged (uses chars/4 heuristic)
  • Fallback: If tokenizer fails or model not supported, automatically falls back to heuristic
  • Cache: Tokenizers cached in ~/.openclaw/tokenizers/ after first download
  • Async: All token counting is async, no blocking operations
  • Proxy support: HTTP(S) proxy for downloading tokenizers in restricted networks
  • Config schema: openclaw.plugin.json updated with all config options (useTokenizer, proxy, etc.)
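For the cache location mentioned above, the directory can be resolved portably from the user's home directory. A tiny sketch (the helper name and the layout inside the directory are assumptions):

```typescript
import { homedir } from "node:os";
import { join } from "node:path";

// Resolves the tokenizer cache directory described above:
// ~/.openclaw/tokenizers/
function tokenizerCacheDir(): string {
  return join(homedir(), ".openclaw", "tokenizers");
}
```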

Files Changed

  • src/token-utils.ts: New countTokens() function with tokenizer support
  • src/tokenizers/huggingface.ts: HuggingFace tokenizer wrapper with auto-detection
  • src/db/config.ts: Added useTokenizer and proxy config options
  • src/types.ts: Added TokenizerService interface
  • index.ts: Create tokenizer instance with auto-detected model, inject into dependencies
  • openclaw.plugin.json: Updated configSchema with all LCM config options
  • scripts/generate-manifest.ts: Script to auto-generate configSchema from code
  • src/assembler.ts, src/compaction.ts, src/retrieval.ts, src/summarize.ts: Updated to use countTokens()

Tests

  • 204 tokenizer-related tests passing
  • Added unit tests for token counting logic
  • Added performance tests with real GLM-5 tokenizer

Dependencies

  • @huggingface/tokenizers 0.1.2: Pure JavaScript tokenizer library (~5MB)

liuy force-pushed the feat/huggingface-tokenizer branch 9 times, most recently from 85f0d40 to 386a40f on March 11, 2026 01:02

liuy commented Mar 12, 2026

@jalehman any input? This patch looks big because of the GLM-5 tokenizer.json used for the perf test; if you'd rather not include it, I can remove the perf test and tokenizer.json. Thanks for your time :)

@jalehman
Contributor

Yeah, I think this is a bit too much to add for comfort.

- Add HuggingFace tokenizer integration for accurate token counting
- Sync tokenizer API and load from cache in constructor
- Pass useTokenizer and tokenizer to estimateSessionTokenCountForAfterTurn
- Add comprehensive tokenizer integration tests
liuy force-pushed the feat/huggingface-tokenizer branch from 386a40f to b6958c7 on March 14, 2026 12:37

liuy commented Mar 14, 2026

Yeah, I think this is a bit too much to add for comfort.

@jalehman Done. I've also been using the tokenizer locally for almost a week and haven't seen a compaction since. Please review when you have a moment :)


liuy commented Mar 18, 2026

@jalehman any input to proceed? I noticed you had 2 commits added. Thanks for your time
