
Conversation

@burke (Contributor) commented on Jan 4, 2026

Instead of calling countTokens() repeatedly during the binary search for each chunk boundary, tokenize the document once upfront and slice the token array. This reduces tokenizer calls from O(chunks × binary-search iterations) to O(1) per document.

Reduces full rechunking time on my collections from ~70s to ~20s.
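
For illustration, a minimal sketch of the single-pass approach, assuming a tiktoken-style tokenizer with encode()/decode(); the `Tokenizer` interface and `chunkByTokens` name here are hypothetical, not the actual code in this PR:

```typescript
// Sketch of single-pass token chunking. The Tokenizer interface is an
// assumption (tiktoken-style encode/decode), not the PR's real API.

interface Tokenizer {
  encode(text: string): number[];   // text -> token ids
  decode(tokens: number[]): string; // token ids -> text
}

function chunkByTokens(
  doc: string,
  tokenizer: Tokenizer,
  maxTokens: number,
): string[] {
  // One tokenizer call per document, instead of O(log n) countTokens()
  // calls per chunk while binary-searching each boundary.
  const tokens = tokenizer.encode(doc);
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += maxTokens) {
    // Slicing the token array is cheap; only decode touches the tokenizer.
    chunks.push(tokenizer.decode(tokens.slice(start, start + maxTokens)));
  }
  return chunks;
}
```

One caveat worth noting: with BPE-style tokenizers, decoding an arbitrary token slice can land mid-character at a chunk edge, so a real implementation may need to nudge slice boundaries to keep chunks valid text.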


🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@tobi merged commit fe0fd08 into tobi:main on Jan 9, 2026
