feat: add HuggingFace tokenizer integration for precise token counting #28
Open
liuy wants to merge 3 commits into Martian-Engineering:main from
Conversation
85f0d40 to 386a40f
Contributor (Author)
@jalehman any inputs? This patch looks big because of the glm5 tokenizer.json for the perf test; if you don't like it, I can remove the perf test and tokenizer.json. Thanks for your time :)
Contributor
Yeah, I think this is a bit too much to add for comfort.
- Add HuggingFace tokenizer integration for accurate token counting
- Sync tokenizer API and load from cache in constructor
- Pass useTokenizer and tokenizer to estimateSessionTokenCountForAfterTurn
- Add comprehensive tokenizer integration tests
386a40f to b6958c7
Contributor (Author)
@jalehman Done. I've been using the tokenizer locally for almost a week and have not seen a compaction since then. Please review if you have a moment :)
Contributor (Author)
@jalehman Any input to proceed? I noticed you had 2 commits added. Thanks for your time.
Add HuggingFace Tokenizer Integration for Precise Token Counting
Background & Motivation
Lossless-Claw currently uses a simple heuristic (chars / 4) to estimate token counts for context compaction decisions. This approach works reasonably for English text but has significant accuracy issues for other content types. Inaccurate token counting can cause compaction to trigger too late, letting a session exceed the model's context limit.
This PR adds optional precise token counting using HuggingFace tokenizers, which provides accurate token counts for all content types while maintaining acceptable latency.
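For reference, the existing heuristic is trivial to sketch (the function name here is illustrative, not the plugin's actual export):

```typescript
// Illustrative sketch of the chars/4 heuristic this PR makes optional.
// Works tolerably for English, but badly underestimates CJK text, where
// a single character is often one or more tokens on its own.
function estimateTokensHeuristic(text: string): number {
  return Math.ceil(text.length / 4);
}

// "hello world" is 11 characters -> estimated 3 tokens.
```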
Usage
Configuration
Enable the precise tokenizer in your OpenClaw config (~/.openclaw/openclaw.json):

```json
{
  "plugins": {
    "entries": {
      "lossless-claw": {
        "enabled": true,
        "config": {
          "useTokenizer": true,
          "proxy": "http://127.0.0.1:7890"
        }
      }
    }
  }
}
```

Options:
- useTokenizer (boolean, default: false): Enable precise token counting
- proxy (string, optional): HTTP(S) proxy for downloading tokenizers from HuggingFace (needed in China)

How It Works
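As an illustration, extracting this plugin config from a parsed openclaw.json could look like the following sketch (the helper name and interface are hypothetical; the PR's real config handling lives in src/db/config.ts):

```typescript
// Hypothetical helper mirroring the config shape shown in the sample above.
interface LcmConfig {
  useTokenizer?: boolean; // default: false
  proxy?: string;         // optional HTTP(S) proxy for HuggingFace downloads
}

function readLcmConfig(openclawJson: any): LcmConfig {
  return openclawJson?.plugins?.entries?.["lossless-claw"]?.config ?? {};
}
```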
1. When useTokenizer: true, LCM reads your OpenClaw model config to select the matching tokenizer
2. The tokenizer is downloaded from HuggingFace and cached in ~/.openclaw/tokenizers/ for future use
3. LCM falls back to the chars/4 heuristic if the tokenizer fails or the model is not supported

Supported Models
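The model auto-detection can be sketched as a lookup from OpenClaw model IDs to HuggingFace tokenizer repositories (the repo names are taken from this PR's table; the function name is illustrative):

```typescript
// Supported OpenClaw models mapped to their HuggingFace tokenizer repos.
const TOKENIZER_REPOS: Record<string, string> = {
  "zai/glm-5": "zai-org/GLM-5",
  "zai/glm-4.7": "zai-org/GLM-4.7",
  "minimax/MiniMax-M2.5": "MiniMaxAI/MiniMax-M2.5",
  "minimax/MiniMax-M2.1": "MiniMaxAI/MiniMax-M2.1",
  "deepseek/DeepSeek-V3.2": "deepseek-ai/DeepSeek-V3.2",
  "deepseek/DeepSeek-V3.1": "deepseek-ai/DeepSeek-V3.1",
};

// Returns undefined for unsupported models, signalling the chars/4 fallback.
function resolveTokenizerRepo(model: string): string | undefined {
  return TOKENIZER_REPOS[model];
}
```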
| OpenClaw model | HuggingFace tokenizer |
| --- | --- |
| zai/glm-5 | zai-org/GLM-5 |
| zai/glm-4.7 | zai-org/GLM-4.7 |
| minimax/MiniMax-M2.5 | MiniMaxAI/MiniMax-M2.5 |
| minimax/MiniMax-M2.1 | MiniMaxAI/MiniMax-M2.1 |
| deepseek/DeepSeek-V3.2 | deepseek-ai/DeepSeek-V3.2 |
| deepseek/DeepSeek-V3.1 | deepseek-ai/DeepSeek-V3.1 |

Fallback: For unsupported models, we fall back to the chars/4 heuristic automatically.

Performance Results
Accuracy Comparison
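The underestimation percentage reported below can be reproduced with simple arithmetic (the token counts in the example are hypothetical, chosen only to show how the figure is computed):

```typescript
// Percentage by which an estimate undershoots the true token count.
function underestimatePct(trueTokens: number, estimate: number): number {
  return ((trueTokens - estimate) / trueTokens) * 100;
}

// e.g. a 400-char Chinese passage: chars/4 estimates 100 tokens;
// if the real tokenizer reports 185, the heuristic undershoots by ~45.9%.
```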
Key Finding: The chars/4 heuristic underestimates Chinese text by 45.8%, which can cause compaction to trigger too late and exceed context limits.

Latency Comparison
Key Finding: Precise tokenizer adds 0.125-0.415ms latency, which is acceptable for compaction scenarios (not in the critical path).
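Latency figures like these typically come from a micro-benchmark along the following lines (a sketch only, not the PR's actual perf test):

```typescript
// Rough per-call latency in milliseconds, averaged over `iters` runs.
function avgLatencyMs(fn: () => void, iters = 1000): number {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iters; i++) fn();
  const end = process.hrtime.bigint();
  return Number(end - start) / 1e6 / iters;
}

// Usage (tokenizer/sampleText are placeholders):
//   avgLatencyMs(() => tokenizer.encode(sampleText).length)   // precise
//   avgLatencyMs(() => Math.ceil(sampleText.length / 4))      // heuristic
```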
Implementation Details
- Reads the agents.defaults.model config to select the matching tokenizer
- Graceful fallback (chars/4 heuristic) when no tokenizer is available
- Tokenizers are cached in ~/.openclaw/tokenizers/ after the first download
- openclaw.plugin.json updated with all config options (useTokenizer, proxy, etc.)

Files Changed
- src/token-utils.ts: New countTokens() function with tokenizer support
- src/tokenizers/huggingface.ts: HuggingFace tokenizer wrapper with auto-detection
- src/db/config.ts: Added useTokenizer and proxy config options
- src/types.ts: Added TokenizerService interface
- index.ts: Create tokenizer instance with auto-detected model, inject into dependencies
- openclaw.plugin.json: Updated configSchema with all LCM config options
- scripts/generate-manifest.ts: Script to auto-generate configSchema from code
- src/assembler.ts, src/compaction.ts, src/retrieval.ts, src/summarize.ts: Updated to use countTokens()

Tests
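The tests themselves aren't shown in this excerpt. A hedged sketch of the kind of sanity check a tokenizer integration test might contain (the countTokens signature here is assumed from the file list above, not copied from the PR):

```typescript
// Assumed shape: countTokens uses the tokenizer when provided, else chars/4.
interface Tokenizer { encode(text: string): number[]; }

function countTokens(text: string, tokenizer?: Tokenizer): number {
  return tokenizer ? tokenizer.encode(text).length : Math.ceil(text.length / 4);
}

// A fake tokenizer yielding one token per character (as many CJK cases do).
const onePerChar: Tokenizer = { encode: (t) => Array.from(t).map((_, i) => i) };

console.assert(countTokens("abcdefgh") === 2);              // heuristic: 8 / 4
console.assert(countTokens("abcdefgh", onePerChar) === 8);  // precise path
```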
Dependencies
- @huggingface/tokenizers 0.1.2: Pure JavaScript tokenizer library (~5MB)