feat: deterministic streaming dataset tokenization #72
Varshiniputtabakula wants to merge 2 commits into AOSSIE-Org:main from
Conversation
Walkthrough
Adds a streaming dataset tokenization helper.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant "File I/O" as FileIO
    participant "tokenize_dataset" as Tokenize
    participant Tokenizer
    participant NumPy
    participant "Output File" as OutFile
    User->>Tokenize: call tokenize_dataset(input, tokenizer, output)
    Tokenize->>FileIO: check input exists
    FileIO-->>Tokenize: exists / raise FileNotFoundError
    Tokenize->>FileIO: open and stream input file
    loop per line
        FileIO-->>Tokenize: line
        Tokenize->>Tokenize: strip & skip empty
        Tokenize->>Tokenizer: encode(text)
        Tokenizer-->>Tokenize: token IDs (list or object.ids)
        Tokenize->>Tokenize: normalize IDs to list
        Tokenize->>NumPy: create uint32 array from IDs
        NumPy-->>Tokenize: array
        Tokenize->>OutFile: write array to binary
    end
    Tokenize-->>User: return total token count
```
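The streaming flow in the diagram above can be sketched as a short Python function. This is an illustration reconstructed from the walkthrough, not the PR's exact implementation; the uint32 dtype and the `.ids` fallback follow the diagram's notes.

```python
from pathlib import Path

import numpy as np


def tokenize_dataset(input_file, tokenizer, output_file):
    """Stream-tokenize a text file into a flat binary file of uint32 ids.

    Returns the total number of tokens written.
    """
    input_path = Path(input_file)
    output_path = Path(output_file)
    if not input_path.exists():
        raise FileNotFoundError(f"Dataset file not found: {input_path}")

    total_tokens = 0
    # open dataset for streaming so the whole file is never held in memory
    with input_path.open("r", encoding="utf-8") as fin, \
            output_path.open("wb") as fout:
        for line in fin:
            text = line.strip()
            if not text:
                continue  # skip blank lines
            encoded = tokenizer.encode(text)
            # Normalize: HF `tokenizers` returns an Encoding with `.ids`,
            # other tokenizers may return a plain list of ints.
            ids = encoded.ids if hasattr(encoded, "ids") else encoded
            fout.write(np.asarray(ids, dtype=np.uint32).tobytes())
            total_tokens += len(ids)
    return total_tokens
```

Because every line is encoded independently and written in file order, re-running on the same input with the same tokenizer yields byte-identical output, which is what the determinism test checks.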
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Suggested labels
🚥 Pre-merge checks: ✅ 2 passed
Pull request overview
Adds a dataset tokenization helper to the openverifiablellm.tokenizer package and introduces tests verifying output creation and determinism.
Changes:
- Add `tokenize_dataset()` to stream-tokenize a text dataset into a binary token file.
- Add unit tests for `tokenize_dataset()` output existence and deterministic output.
- Add NumPy as a project dependency (used for binary token writing/reading).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| openverifiablellm/tokenizer/tokenize_dataset.py | New streaming tokenization helper that writes uint32 tokens to a binary file. |
| openverifiablellm/tokenizer/__init__.py | Re-exports tokenize_dataset from the tokenizer package. |
| tests/test_tokenizer.py | Adds tests for the new dataset tokenization helper. |
| pyproject.toml | Adds NumPy dependency to support token binary I/O. |
tests/test_tokenizer.py (Outdated)

```python
from pathlib import Path
import tempfile
import numpy as np

from openverifiablellm.tokenizer.tokenize_dataset import tokenize_dataset
```
openverifiablellm/tokenizer/__init__.py

```diff
@@ -4,3 +4,5 @@
     "train_tokenizer",
     "hash_tokenizer_config",
```

openverifiablellm/tokenizer/tokenize_dataset.py

```python
from pathlib import Path
import numpy as np

    input_path = Path(input_file)
    output_path = Path(output_file)

    if not input_path.exists():
        raise FileNotFoundError(f"Dataset file not found: {input_path}")

    total_tokens = 0

    # open dataset for streaming
    with input_path.open("r", encoding="utf-8") as fin, \
            output_path.open("wb") as fout:
```
pyproject.toml

```diff
 dependencies = [
     "defusedxml",
     "sentencepiece",
-    "tokenizers==0.15.2"
+    "tokenizers==0.15.2",
+    "numpy"
 ]
```
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/__init__.py (1)

3-8: ⚠️ Potential issue | 🟡 Minor

Add `tokenize_dataset` to `__all__` for consistency.

The `tokenize_dataset` function is imported but not added to `__all__`. This makes it inconsistent with `train_tokenizer` and `hash_tokenizer_config`, and it won't be included in wildcard imports.

🔧 Proposed fix

```diff
 from .train import hash_tokenizer_config, train_tokenizer
+from .tokenize_dataset import tokenize_dataset

 __all__ = [
     "train_tokenizer",
     "hash_tokenizer_config",
+    "tokenize_dataset",
 ]
-
-from .tokenize_dataset import tokenize_dataset
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/tokenizer/__init__.py` around lines 3 - 8, The __all__ list is missing the exported symbol tokenize_dataset; update the module's __all__ to include "tokenize_dataset" alongside "train_tokenizer" and "hash_tokenizer_config" so that tokenize_dataset is exported for wildcard imports and remains consistent with the import from .tokenize_dataset.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer/tokenize_dataset.py`:
- Around line 50-54: The code handling tokenizer outputs in tokenize_dataset.py
currently assumes encoded is a list or has an ids attribute (assigned to
tokens); add explicit validation after calling tokenizer.encode() (check
isinstance(encoded, list) first, then hasattr(encoded, "ids") and that
encoded.ids is a list/sequence) and if neither condition is met raise a clear
TypeError (or ValueError) that includes the actual type/value of encoded and
mentions tokenizer.encode() so callers know the return shape is unsupported;
update any callers or unit tests relying on encoded to expect this error path.
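A hedged sketch of the validation this prompt asks for; the helper name `normalize_token_ids` is invented for illustration, and the accepted shapes (a plain list, or an object with a list-like `.ids`) follow the prompt's description.

```python
def normalize_token_ids(encoded):
    """Normalize a tokenizer.encode() result to a plain list of ints.

    Accepts either a list of ids or an object exposing a list/tuple `.ids`
    (e.g. a Hugging Face `tokenizers` Encoding); anything else raises a
    TypeError that names the offending type.
    """
    if isinstance(encoded, list):
        return encoded
    if hasattr(encoded, "ids") and isinstance(encoded.ids, (list, tuple)):
        return list(encoded.ids)
    raise TypeError(
        "Unsupported return type from tokenizer.encode(): "
        f"{type(encoded).__name__!r}; expected a list or an object with .ids"
    )
```

Raising `TypeError` here fails fast with a clear message instead of letting NumPy raise a less specific error when it tries to build a uint32 array from an unexpected object.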
In `@pyproject.toml`:
- Around line 17-18: The pyproject dependency list currently pins
"tokenizers==0.15.2" but leaves "numpy" unversioned; update the numpy
declaration to include a minimum version (e.g., change "numpy" to "numpy>=1.20")
in the pyproject.toml dependencies to improve reproducibility and align with the
existing pinning practice.
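Applied to the dependency list, the suggested change would look like the fragment below; the `>=1.20` floor is the reviewer's example, not a tested minimum.

```toml
dependencies = [
    "defusedxml",
    "sentencepiece",
    "tokenizers==0.15.2",
    "numpy>=1.20",
]
```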
In `@tests/test_tokenizer.py`:
- Around line 171-175: Move the stray import of tokenize_dataset into the module
import block at the top of tests/test_tokenizer.py so it sits with the other
openverifiablellm imports (e.g., alongside hash_tokenizer_config and
train_tokenizer); remove the duplicate import lines currently at lines 171-175
and add a single line "from openverifiablellm.tokenizer.tokenize_dataset import
tokenize_dataset" to the grouped imports at the top of the file to satisfy PEP8
and keep imports organized.
- Around line 183-214: Add tests to cover error and edge cases for
tokenize_dataset: add a test that calls tokenize_dataset with a non-existent
Path and asserts it raises FileNotFoundError (mirroring train_tokenizer
behavior), a test that writes an empty file and asserts tokenize_dataset returns
0 and produces an empty output file, and a test that exercises the branch where
the tokenizer returns an object with an .ids attribute (use DummyTokenizer or a
small stub that returns an object with .ids) to verify correct
output/determinism; reference the tokenize_dataset function and DummyTokenizer
test fixtures in tests/test_tokenizer.py when adding these cases.
- Around line 4-6: The file imports an unused module: remove the standalone
"import tempfile" import from the top-level imports (alongside "from pathlib
import Path" and "import numpy as np") in tests/test_tokenizer.py so the file no
longer contains the unused symbol "tempfile" and relies on pytest's tmp_path
fixture instead.
- Line 186: The call to dataset.write_text("hello\nworld") should explicitly
pass encoding="utf-8" to match tokenize_dataset's file reading and ensure
consistent cross-platform behavior; update the dataset.write_text invocation
(and the other write_text call in the same test) to include encoding="utf-8" so
both writes use UTF-8 encoding.
---
Outside diff comments:
In `@openverifiablellm/tokenizer/__init__.py`:
- Around line 3-8: The __all__ list is missing the exported symbol
tokenize_dataset; update the module's __all__ to include "tokenize_dataset"
alongside "train_tokenizer" and "hash_tokenizer_config" so that tokenize_dataset
is exported for wildcard imports and remains consistent with the import from
.tokenize_dataset.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9860997b-5f54-4164-8cd9-1bd8e68eaf40
📒 Files selected for processing (4)
- openverifiablellm/tokenizer/__init__.py
- openverifiablellm/tokenizer/tokenize_dataset.py
- pyproject.toml
- tests/test_tokenizer.py
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/test_tokenizer.py (1)
174-230: 🧹 Nitpick | 🔵 Trivial

Add one test for tokenizer outputs that expose `.ids`.

The current `DummyTokenizer` only exercises the list-return path. Please add a small stub returning an object with `.ids` so the normalization branch in `tokenize_dataset` is covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_tokenizer.py` around lines 174 - 230, Add a new test that exercises the branch in tokenize_dataset where the tokenizer returns an object with an .ids attribute: create a small stub class (e.g., DummyTokenizerWithIds) whose encode(text) returns an object with an .ids list/array of token ints, then call tokenize_dataset(dataset, DummyTokenizerWithIds(), output) and assert the produced output file and token counts match expectations (including deterministic behavior if desired); reference tokenize_dataset and DummyTokenizer to locate the relevant tests and ensure the new test mirrors existing checks (file existence, total tokens, and binary contents) but using the .ids-returning stub to cover that normalization branch.
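Such a test might look like the following self-contained sketch. The inline `tokenize_dataset` is a stand-in for the package function (so the example runs on its own), and `DummyTokenizerWithIds` is the hypothetical stub name the prompt suggests.

```python
from pathlib import Path

import numpy as np


# Stand-in for openverifiablellm.tokenizer.tokenize_dataset; the real test
# would import the package version instead.
def tokenize_dataset(input_file, tokenizer, output_file):
    total = 0
    with Path(input_file).open("r", encoding="utf-8") as fin, \
            Path(output_file).open("wb") as fout:
        for line in fin:
            text = line.strip()
            if not text:
                continue
            encoded = tokenizer.encode(text)
            ids = encoded.ids if hasattr(encoded, "ids") else encoded
            fout.write(np.asarray(ids, dtype=np.uint32).tobytes())
            total += len(ids)
    return total


class _Encoding:
    """Minimal stand-in for a tokenizers-style Encoding object."""
    def __init__(self, ids):
        self.ids = ids


class DummyTokenizerWithIds:
    """Stub whose encode() returns an object exposing .ids."""
    def encode(self, text):
        return _Encoding([ord(c) for c in text])


def test_tokenize_dataset_handles_ids_attribute(tmp_path):
    dataset = tmp_path / "data.txt"
    dataset.write_text("hi\n", encoding="utf-8")
    output = tmp_path / "tokens.bin"
    total = tokenize_dataset(dataset, DummyTokenizerWithIds(), output)
    assert output.exists()
    assert total == 2
    assert np.fromfile(output, dtype=np.uint32).tolist() == [104, 105]
```

Mirroring the existing tests (file existence, token count, binary contents) keeps the new case consistent while covering the `.ids` normalization branch.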
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer/tokenize_dataset.py`:
- Line 5: tokenize_dataset currently calls tokenizer.encode(...) which doesn't
exist on the repository tokenizers; update tokenize_dataset to handle the
tokenizer interface robustly: check for methods in order (hasattr(tokenizer,
"encode") -> use it; elif hasattr(tokenizer, "tokenize") -> call
tokenizer.tokenize(...) and then convert tokens to ids via
tokenizer.convert_tokens_to_ids or tokenizer.tokens_to_ids if present; elif
hasattr(tokenizer, "encode_batch") use that; otherwise raise a clear error
mentioning tokenize_dataset and the missing methods. Ensure you reference the
tokenizer instance and preserve behavior for batching and special tokens when
available.
---
Duplicate comments:
In `@tests/test_tokenizer.py`:
- Around line 174-230: Add a new test that exercises the branch in
tokenize_dataset where the tokenizer returns an object with an .ids attribute:
create a small stub class (e.g., DummyTokenizerWithIds) whose encode(text)
returns an object with an .ids list/array of token ints, then call
tokenize_dataset(dataset, DummyTokenizerWithIds(), output) and assert the
produced output file and token counts match expectations (including
deterministic behavior if desired); reference tokenize_dataset and
DummyTokenizer to locate the relevant tests and ensure the new test mirrors
existing checks (file existence, total tokens, and binary contents) but using
the .ids-returning stub to cover that normalization branch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c7b76a18-8007-4760-9d23-74752a2b7744
📒 Files selected for processing (4)
- openverifiablellm/tokenizer/__init__.py
- openverifiablellm/tokenizer/tokenize_dataset.py
- pyproject.toml
- tests/test_tokenizer.py
openverifiablellm/tokenizer/tokenize_dataset.py

```python
import numpy as np


def tokenize_dataset(input_file, tokenizer, output_file):
```
tokenize_dataset is not compatible with the repository tokenizer classes.
Line 47 assumes tokenizer.encode(...) exists, but openverifiablellm/tokenizer/factory.py returns tokenizer classes (e.g., openverifiablellm/tokenizer/bpe_tokenizer.py::BPETokenizer) that do not expose encode. This will fail at runtime for the package’s own tokenizer instances and breaks the intended trained-tokenizer workflow.
Suggested hardening (clear failure mode in this function)
```diff
 def tokenize_dataset(input_file, tokenizer, output_file):
+    if not hasattr(tokenizer, "encode"):
+        raise TypeError(
+            f"tokenizer must expose encode(text); got {type(tokenizer).__name__}"
+        )
```

Also applies to: 47-47
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/tokenizer/tokenize_dataset.py` at line 5, tokenize_dataset
currently calls tokenizer.encode(...) which doesn't exist on the repository
tokenizers; update tokenize_dataset to handle the tokenizer interface robustly:
check for methods in order (hasattr(tokenizer, "encode") -> use it; elif
hasattr(tokenizer, "tokenize") -> call tokenizer.tokenize(...) and then convert
tokens to ids via tokenizer.convert_tokens_to_ids or tokenizer.tokens_to_ids if
present; elif hasattr(tokenizer, "encode_batch") use that; otherwise raise a
clear error mentioning tokenize_dataset and the missing methods. Ensure you
reference the tokenizer instance and preserve behavior for batching and special
tokens when available.
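The fallback chain described in this prompt could be sketched as below. Method names other than `encode` (`tokenize`, `convert_tokens_to_ids`, `tokens_to_ids`) are assumptions about the repository's tokenizer classes, as the prompt itself notes, so verify them against the actual interfaces before adopting this.

```python
def encode_to_ids(tokenizer, text):
    """Resolve token ids from whichever interface the tokenizer exposes.

    Tries encode() first, then tokenize() plus a token-to-id converter,
    and raises a clear TypeError if nothing usable is found.
    """
    if hasattr(tokenizer, "encode"):
        encoded = tokenizer.encode(text)
        # encode() may return a list or an Encoding-like object with .ids
        return encoded.ids if hasattr(encoded, "ids") else encoded
    if hasattr(tokenizer, "tokenize"):
        tokens = tokenizer.tokenize(text)
        if hasattr(tokenizer, "convert_tokens_to_ids"):
            return tokenizer.convert_tokens_to_ids(tokens)
        if hasattr(tokenizer, "tokens_to_ids"):
            return tokenizer.tokens_to_ids(tokens)
    raise TypeError(
        "tokenize_dataset requires a tokenizer with encode(), or tokenize() "
        f"plus a token-to-id converter; got {type(tokenizer).__name__}"
    )
```

Duck-typing the dispatch this way keeps `tokenize_dataset` usable with both the package's own tokenizer classes and Hugging Face `tokenizers` instances, while still failing loudly on unsupported objects.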
This PR implements deterministic dataset tokenization using the trained tokenizer.
Features:
This bridges the gap between tokenizer training and downstream training infrastructure.
All existing tests pass and tokenizer tests are consolidated in tests/test_tokenizer.py.
Summary by CodeRabbit
New Features
Dependencies
Tests