Optimize chunking to tokenize once per document #9
+23 −54
Instead of calling countTokens() repeatedly while binary-searching for chunk boundaries, tokenize the document once upfront and slice the resulting token array. This reduces tokenizer calls from O(chunks × binary-search iterations) to a single call per document.

Reduces a full rechunking pass on my collections from ~70s to ~20s.
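A minimal sketch of the idea (not the PR's actual code): it assumes a tokenizer exposing `encode`/`decode`, and the `Tokenizer` interface, `maxTokens`, and `overlap` parameters are illustrative.

```ts
// Hypothetical tokenizer interface; the project's real API may differ.
interface Tokenizer {
  encode(text: string): number[];   // text -> token ids
  decode(tokens: number[]): string; // token ids -> text
}

// Encode the whole document once, then slice the token array into
// fixed-size windows and decode each slice back to text. The old path
// called the tokenizer on every binary-search probe per chunk boundary.
function chunkDocument(
  tokenizer: Tokenizer,
  text: string,
  maxTokens: number,
  overlap = 0,
): string[] {
  const tokens = tokenizer.encode(text); // single tokenizer call per document
  const chunks: string[] = [];
  const step = Math.max(1, maxTokens - overlap);
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokenizer.decode(tokens.slice(start, start + maxTokens)));
    if (start + maxTokens >= tokens.length) break;
  }
  return chunks;
}
```

One caveat with this kind of change: decoding token slices can split text at slightly different character positions than boundary search over the original string, so chunk boundaries may shift by a few characters even though token counts per chunk stay within the limit.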