Summary
This issue proposes completing the SentencePieceTokenizer introduced in #17. The class currently supports training only; it is missing encode(), decode(), load(), input validation, special-token handling, and creation of the save_path directory.
Background
PR #17 introduced a modular tokenizer architecture with SentencePieceTokenizer as one of the two implementations.
However, the current SentencePieceTokenizer covers only the training step.
A tokenizer that cannot encode or decode text cannot be used in any downstream pipeline task.
Problems in Current Implementation
1. No encode() or decode()
2. No load()
3. Missing special tokens in training
4. No input validation
5. save_path directory never created
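To make problems 1, 2, and 4 concrete, here is a minimal sketch of what the missing methods could look like. The class and method names (SentencePieceTokenizer, encode, decode, load) come from the issue; the internal attribute name and error messages are assumptions, and the underlying calls use the public sentencepiece Python API (SentencePieceProcessor, encode with out_type, decode):

```python
# Hedged sketch of the missing encode()/decode()/load() methods.
# Internal details (_sp attribute, error messages) are assumptions.
from pathlib import Path


class SentencePieceTokenizer:
    def __init__(self) -> None:
        self._sp = None  # underlying sentencepiece.SentencePieceProcessor

    def load(self, model_path: str) -> None:
        """Load a trained .model file produced by the existing train() step."""
        path = Path(model_path)
        if not path.is_file():
            raise FileNotFoundError(f"no SentencePiece model at {path}")
        # Lazy import keeps the module importable when sentencepiece is absent.
        import sentencepiece as spm
        self._sp = spm.SentencePieceProcessor(model_file=str(path))

    def encode(self, text: str) -> list[int]:
        """Convert text to token ids, with the input validation noted above."""
        if self._sp is None:
            raise RuntimeError("tokenizer not loaded; call load() first")
        if not isinstance(text, str):
            raise TypeError(f"expected str, got {type(text).__name__}")
        return self._sp.encode(text, out_type=int)

    def decode(self, ids: list[int]) -> str:
        """Convert token ids back to text."""
        if self._sp is None:
            raise RuntimeError("tokenizer not loaded; call load() first")
        if not all(isinstance(i, int) for i in ids):
            raise TypeError("ids must be a list of ints")
        return self._sp.decode(ids)
```

Raising explicit errors for an unloaded model and for wrong input types keeps failures at the tokenizer boundary instead of deep inside the native library.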
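Problems 3 through 5 are about the training side. For special tokens, the sentencepiece trainer accepts options such as pad_id, unk_id, bos_id, eos_id, and user_defined_symbols, which training would need to pass through. The directory fix and input validation can be sketched as a small helper; the function name prepare_training is hypothetical and not part of the existing code:

```python
# Hedged sketch: validate training inputs and create the save_path
# directory before handing off to SentencePieceTrainer.train().
# prepare_training is a hypothetical helper name.
from pathlib import Path


def prepare_training(corpus_path: str, save_path: str, vocab_size: int) -> Path:
    """Validate inputs and ensure the output directory exists."""
    if not isinstance(vocab_size, int) or vocab_size <= 0:
        raise ValueError(f"vocab_size must be a positive int, got {vocab_size!r}")
    corpus = Path(corpus_path)
    if not corpus.is_file():
        raise FileNotFoundError(f"training corpus not found: {corpus}")
    out = Path(save_path)
    # Fix for problem 5: the save_path directory was never created.
    out.parent.mkdir(parents=True, exist_ok=True)
    return out
```

Creating the directory with parents=True and exist_ok=True makes the step idempotent, so repeated training runs into the same location do not fail.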
Why This Matters
SentencePiece is used by major modern LLMs. Without a complete SentencePieceTokenizer, the project cannot support cryptographic verification of pipelines built on these models, which represent the majority of current open-source LLMs.
Scope
This change touches only:
- sentencepiece_tokenizer.py
No overlap with any existing open PRs.
Related
- PR #17: feat: deterministic tokenizer training and config hashing
Code of Conduct
- I have joined the Discord server and will post updates there
- I have searched existing issues to avoid duplicates