-
Notifications
You must be signed in to change notification settings - Fork 2
Yahya/token healing #12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
making some changes for multi splits... |
|
O(|P|) per k (greedy scan), worst‑case O(|P|^2) across k — but |P| is small (bytes since last boundary), so this is negligible. |
|
@yahya010 Are you planning on fixing the code coverage? |
|
@benlebrun yeah, got distracted |
|
alright, is that enough? rest are verbose and such |
|
Hi, we are using your library in our work -- and it's amazing, many thanks! If possible, can you please let me know by when this feature can be merged? It's something that would be extremely useful for our work. |
|
Hey, thanks a lot! I need to clean and add tests for code coverage and should be good to be reviewed. It's ready to use as is from the branch if needed, but I'll get it finalized soon |
Adaptive token healing (@timvieira's idea) to handle any dead ends whete a tokenization makes the next byte unreachable. The boundary is moved earlier in the current partial bytes, commit an EOT, rematerialize and generate.
It first does a precheck O(len(P)) trie lookups and 0 LM calls, to find a valid earlier boundary where EOT + replay + target byte are available, then a materialize so only <=1 LM call needed.