Summary
This issue proposes completing the SentencePieceTokenizer introduced in #17. The class currently supports training only; it is missing encode(), decode(), load(), input validation, special-token handling, and creation of the save_path directory.
Background
PR #17 introduced a modular tokenizer architecture with SentencePieceTokenizer as one of the two implementations.
However, the current SentencePieceTokenizer covers only the training step.
A tokenizer that cannot encode or decode text cannot be used in any downstream pipeline task.
Problems in Current Implementation
1. No encode() or decode()
2. No load()
3. Missing special tokens in training
4. No input validation
5. save_path directory never created
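To make problems 1, 2, and 4 concrete, here is a minimal sketch of what the missing methods could look like. The class and method names (SentencePieceTokenizer, encode, decode, load) come from the issue; the internal attribute name and error messages are assumptions, and the underlying calls use the public sentencepiece Python API (SentencePieceProcessor, encode with out_type, decode):

```python
# Hedged sketch of the missing encode()/decode()/load() methods.
# Internal details (_sp attribute, error messages) are assumptions.
from pathlib import Path


class SentencePieceTokenizer:
    def __init__(self) -> None:
        self._sp = None  # underlying sentencepiece.SentencePieceProcessor

    def load(self, model_path: str) -> None:
        """Load a trained .model file produced by the existing train() step."""
        path = Path(model_path)
        if not path.is_file():
            raise FileNotFoundError(f"no SentencePiece model at {path}")
        # Lazy import keeps the module importable when sentencepiece is absent.
        import sentencepiece as spm
        self._sp = spm.SentencePieceProcessor(model_file=str(path))

    def encode(self, text: str) -> list[int]:
        """Convert text to token ids, with the input validation noted above."""
        if self._sp is None:
            raise RuntimeError("tokenizer not loaded; call load() first")
        if not isinstance(text, str):
            raise TypeError(f"expected str, got {type(text).__name__}")
        return self._sp.encode(text, out_type=int)

    def decode(self, ids: list[int]) -> str:
        """Convert token ids back to text."""
        if self._sp is None:
            raise RuntimeError("tokenizer not loaded; call load() first")
        if not all(isinstance(i, int) for i in ids):
            raise TypeError("ids must be a list of ints")
        return self._sp.decode(ids)
```

Raising explicit errors for an unloaded model and for wrong input types keeps failures at the tokenizer boundary instead of deep inside the native library.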
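Problems 3 through 5 are about the training side. For special tokens, the sentencepiece trainer accepts options such as pad_id, unk_id, bos_id, eos_id, and user_defined_symbols, which training would need to pass through. The directory fix and input validation can be sketched as a small helper; the function name prepare_training is hypothetical and not part of the existing code:

```python
# Hedged sketch: validate training inputs and create the save_path
# directory before handing off to SentencePieceTrainer.train().
# prepare_training is a hypothetical helper name.
from pathlib import Path


def prepare_training(corpus_path: str, save_path: str, vocab_size: int) -> Path:
    """Validate inputs and ensure the output directory exists."""
    if not isinstance(vocab_size, int) or vocab_size <= 0:
        raise ValueError(f"vocab_size must be a positive int, got {vocab_size!r}")
    corpus = Path(corpus_path)
    if not corpus.is_file():
        raise FileNotFoundError(f"training corpus not found: {corpus}")
    out = Path(save_path)
    # Fix for problem 5: the save_path directory was never created.
    out.parent.mkdir(parents=True, exist_ok=True)
    return out
```

Creating the directory with parents=True and exist_ok=True makes the step idempotent, so repeated training runs into the same location do not fail.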
Why This Matters
SentencePiece is used by major modern LLMs. Without a complete SentencePieceTokenizer, the project cannot support cryptographic verification of pipelines built on these models, which represent the majority of current open-source LLMs.
Scope
This change touches only:
- sentencepiece_tokenizer.py
No overlap with any existing open PRs.
Related
- PR #17: feat: deterministic tokenizer training and config hashing
Code of Conduct
- I have joined the Discord server and will post updates there
- I have searched existing issues to avoid duplicates