
feat: deterministic streaming dataset tokenization #72

Open
Varshiniputtabakula wants to merge 2 commits into AOSSIE-Org:main from Varshiniputtabakula:feat-dataset-tokenization-clean

Conversation

Varshiniputtabakula commented Mar 14, 2026

This PR implements deterministic dataset tokenization using the trained tokenizer.

Features:

  • streaming tokenization for large datasets
  • deterministic encoding
  • binary token storage
  • additional tokenizer tests

This bridges the gap between tokenizer training and downstream training infrastructure.

All existing tests pass and tokenizer tests are consolidated in tests/test_tokenizer.py.
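The described behavior can be sketched end to end. Everything below is illustrative: DummyTokenizer and this mini tokenize_dataset merely restate what the PR describes (stream lines, skip empties, write uint32 token IDs, return the total count) and are not the merged implementation.

```python
# Illustrative sketch only; DummyTokenizer and this loop are stand-ins, not the PR's code.
import tempfile
from pathlib import Path

import numpy as np


class DummyTokenizer:
    """Stand-in for a trained tokenizer: maps each character to its code point."""

    def encode(self, text):
        return [ord(c) for c in text]


def tokenize_dataset(input_file, tokenizer, output_file):
    """Stream lines from input_file, encode them, and append uint32 IDs to output_file."""
    input_path = Path(input_file)
    if not input_path.exists():
        raise FileNotFoundError(f"Dataset file not found: {input_path}")
    total_tokens = 0
    with input_path.open("r", encoding="utf-8") as fin, Path(output_file).open("wb") as fout:
        for line in fin:
            text = line.strip()
            if not text:  # skip empty lines, as the PR describes
                continue
            encoded = tokenizer.encode(text)
            # Normalize: plain list, or an object exposing .ids
            ids = encoded if isinstance(encoded, list) else list(encoded.ids)
            arr = np.asarray(ids, dtype=np.uint32)
            arr.tofile(fout)
            total_tokens += arr.size
    return total_tokens


tmp = Path(tempfile.mkdtemp())
dataset = tmp / "data.txt"
dataset.write_text("hi\n\nok\n", encoding="utf-8")

out1, out2 = tmp / "run1.bin", tmp / "run2.bin"
count1 = tokenize_dataset(dataset, DummyTokenizer(), out1)
count2 = tokenize_dataset(dataset, DummyTokenizer(), out2)

# Determinism: two runs over the same input produce byte-identical output.
deterministic = out1.read_bytes() == out2.read_bytes()
tokens = np.fromfile(out1, dtype=np.uint32).tolist()
```

Reading the binary back with np.fromfile(..., dtype=np.uint32) is also how the PR's tests can compare outputs byte for byte.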

Summary by CodeRabbit

  • New Features

    • Added a dataset tokenization utility that streams text line by line and deterministically encodes it into a binary file of token IDs.
    • Made the tokenization utility accessible from the package for easier use.
  • Dependencies

    • Added numpy as a runtime dependency.
  • Tests

    • Added tests covering output creation, determinism, missing/empty input handling.

Copilot AI review requested due to automatic review settings March 14, 2026 05:50
coderabbitai bot (Contributor) commented Mar 14, 2026

Walkthrough

Adds a streaming tokenize_dataset utility that encodes text lines with a provided tokenizer and writes token ID arrays to a binary file; re-exports the function at package level, adds numpy dependency, and introduces tests covering functionality and determinism.

Changes

  • Tokenization utility (openverifiablellm/tokenizer/tokenize_dataset.py): new streaming function tokenize_dataset(input_file, tokenizer, output_file) that reads input lines, skips empties, encodes via tokenizer.encode, normalizes token IDs, converts them to uint32 numpy arrays, writes binary output, and returns the total number of tokens written.
  • Package export (openverifiablellm/tokenizer/__init__.py): imports tokenize_dataset and adds it to the package __all__, exposing it at the package level.
  • Dependencies (pyproject.toml): adds numpy>=1.20 to project dependencies.
  • Tests (tests/test_tokenizer.py): adds tests for output creation, determinism, missing-input-file errors, and empty-input behavior, using a DummyTokenizer and numpy for binary comparisons.

Sequence Diagram

sequenceDiagram
    actor User
    participant "File I/O" as FileIO
    participant "tokenize_dataset" as Tokenize
    participant Tokenizer
    participant NumPy
    participant "Output File" as OutFile

    User->>Tokenize: call tokenize_dataset(input, tokenizer, output)
    Tokenize->>FileIO: check input exists
    FileIO-->>Tokenize: exists / raise FileNotFoundError
    Tokenize->>FileIO: open and stream input file
    loop per line
        FileIO-->>Tokenize: line
        Tokenize->>Tokenize: strip & skip empty
        Tokenize->>Tokenizer: encode(text)
        Tokenizer-->>Tokenize: token IDs (list or object.ids)
        Tokenize->>Tokenize: normalize IDs to list
        Tokenize->>NumPy: create uint32 array from IDs
        NumPy-->>Tokenize: array
        Tokenize->>OutFile: write array to binary
    end
    Tokenize-->>User: return total token count

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Python Lang

Poem

🐇 I hop through lines both short and long,

I nibble tokens, tidy and strong,
I write them down in binary song,
Deterministic paw-steps all along,
Hooray — new token trails to romp and prong!

🚥 Pre-merge checks | ✅ 2 passed
  • Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'feat: deterministic streaming dataset tokenization' accurately describes the main change: adding a new tokenize_dataset function with streaming capabilities and deterministic behavior.



Copilot AI left a comment


Pull request overview

Adds a dataset tokenization helper to the openverifiablellm.tokenizer package and introduces tests verifying output creation and determinism.

Changes:

  • Add tokenize_dataset() to stream-tokenize a text dataset into a binary token file.
  • Add unit tests for tokenize_dataset() output existence and deterministic output.
  • Add NumPy as a project dependency (used for binary token writing/reading).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

  • openverifiablellm/tokenizer/tokenize_dataset.py: new streaming tokenization helper that writes uint32 tokens to a binary file.
  • openverifiablellm/tokenizer/__init__.py: re-exports tokenize_dataset from the tokenizer package.
  • tests/test_tokenizer.py: adds tests for the new dataset tokenization helper.
  • pyproject.toml: adds NumPy dependency to support token binary I/O.


Comment anchors (code excerpts the review comments attach to):

Comment on lines +4 to +6 (tests/test_tokenizer.py):

    from pathlib import Path
    import tempfile
    import numpy as np

Comment on lines +171 to +176 (tests/test_tokenizer.py; note the duplicated import):

    from openverifiablellm.tokenizer.tokenize_dataset import tokenize_dataset

    from openverifiablellm.tokenizer.tokenize_dataset import tokenize_dataset

Comment on @@ -4,3 +4,5 @@ (openverifiablellm/tokenizer/__init__.py):

    "train_tokenizer",
    "hash_tokenizer_config",

Comment on lines +29 to +40 (openverifiablellm/tokenizer/tokenize_dataset.py):

    input_path = Path(input_file)
    output_path = Path(output_file)

    if not input_path.exists():
        raise FileNotFoundError(f"Dataset file not found: {input_path}")

    total_tokens = 0

    # open dataset for streaming
    with input_path.open("r", encoding="utf-8") as fin, \
         output_path.open("wb") as fout:

Comment on lines +1 to +2:

    from pathlib import Path
    import numpy as np

Comment on lines 14 to 19 (pyproject.toml; the diff adds the missing comma and numpy):

    dependencies = [
        "defusedxml",
        "sentencepiece",
    -    "tokenizers==0.15.2"
    +    "tokenizers==0.15.2",
    +    "numpy"
    ]
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/__init__.py (1)

3-8: ⚠️ Potential issue | 🟡 Minor

Add tokenize_dataset to __all__ for consistency.

The tokenize_dataset function is imported but not added to __all__. This makes it inconsistent with train_tokenizer and hash_tokenizer_config, and it won't be included in wildcard imports.

🔧 Proposed fix
 from .train import hash_tokenizer_config, train_tokenizer
+from .tokenize_dataset import tokenize_dataset

 __all__ = [
     "train_tokenizer",
     "hash_tokenizer_config",
+    "tokenize_dataset",
 ]
-
-from .tokenize_dataset import tokenize_dataset
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/__init__.py` around lines 3 - 8, The __all__ list
is missing the exported symbol tokenize_dataset; update the module's __all__ to
include "tokenize_dataset" alongside "train_tokenizer" and
"hash_tokenizer_config" so that tokenize_dataset is exported for wildcard
imports and remains consistent with the import from .tokenize_dataset.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/tokenize_dataset.py`:
- Around line 50-54: The code handling tokenizer outputs in tokenize_dataset.py
currently assumes encoded is a list or has an ids attribute (assigned to
tokens); add explicit validation after calling tokenizer.encode() (check
isinstance(encoded, list) first, then hasattr(encoded, "ids") and that
encoded.ids is a list/sequence) and if neither condition is met raise a clear
TypeError (or ValueError) that includes the actual type/value of encoded and
mentions tokenizer.encode() so callers know the return shape is unsupported;
update any callers or unit tests relying on encoded to expect this error path.

In `@pyproject.toml`:
- Around line 17-18: The pyproject dependency list currently pins
"tokenizers==0.15.2" but leaves "numpy" unversioned; update the numpy
declaration to include a minimum version (e.g., change "numpy" to "numpy>=1.20")
in the pyproject.toml dependencies to improve reproducibility and align with the
existing pinning practice.

In `@tests/test_tokenizer.py`:
- Around line 171-175: Move the stray import of tokenize_dataset into the module
import block at the top of tests/test_tokenizer.py so it sits with the other
openverifiablellm imports (e.g., alongside hash_tokenizer_config and
train_tokenizer); remove the duplicate import lines currently at lines 171-175
and add a single line "from openverifiablellm.tokenizer.tokenize_dataset import
tokenize_dataset" to the grouped imports at the top of the file to satisfy PEP8
and keep imports organized.
- Around line 183-214: Add tests to cover error and edge cases for
tokenize_dataset: add a test that calls tokenize_dataset with a non-existent
Path and asserts it raises FileNotFoundError (mirroring train_tokenizer
behavior), a test that writes an empty file and asserts tokenize_dataset returns
0 and produces an empty output file, and a test that exercises the branch where
the tokenizer returns an object with an .ids attribute (use DummyTokenizer or a
small stub that returns an object with .ids) to verify correct
output/determinism; reference the tokenize_dataset function and DummyTokenizer
test fixtures in tests/test_tokenizer.py when adding these cases.
- Around line 4-6: The file imports an unused module: remove the standalone
"import tempfile" import from the top-level imports (alongside "from pathlib
import Path" and "import numpy as np") in tests/test_tokenizer.py so the file no
longer contains the unused symbol "tempfile" and relies on pytest's tmp_path
fixture instead.
- Line 186: The call to dataset.write_text("hello\nworld") should explicitly
pass encoding="utf-8" to match tokenize_dataset's file reading and ensure
consistent cross-platform behavior; update the dataset.write_text invocation
(and the other write_text call in the same test) to include encoding="utf-8" so
both writes use UTF-8 encoding.

---

Outside diff comments:
In `@openverifiablellm/tokenizer/__init__.py`:
- Around line 3-8: The __all__ list is missing the exported symbol
tokenize_dataset; update the module's __all__ to include "tokenize_dataset"
alongside "train_tokenizer" and "hash_tokenizer_config" so that tokenize_dataset
is exported for wildcard imports and remains consistent with the import from
.tokenize_dataset.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9860997b-5f54-4164-8cd9-1bd8e68eaf40

📥 Commits

Reviewing files that changed from the base of the PR and between c352df0 and a8473a3.

📒 Files selected for processing (4)
  • openverifiablellm/tokenizer/__init__.py
  • openverifiablellm/tokenizer/tokenize_dataset.py
  • pyproject.toml
  • tests/test_tokenizer.py

github-actions bot added size/M and removed size/M labels Mar 16, 2026
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/test_tokenizer.py (1)

174-230: 🧹 Nitpick | 🔵 Trivial

Add one test for tokenizer outputs that expose .ids.

Current DummyTokenizer only exercises the list-return path. Please add a small stub returning an object with .ids so the normalization branch in tokenize_dataset is covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_tokenizer.py` around lines 174 - 230, Add a new test that
exercises the branch in tokenize_dataset where the tokenizer returns an object
with an .ids attribute: create a small stub class (e.g., DummyTokenizerWithIds)
whose encode(text) returns an object with an .ids list/array of token ints, then
call tokenize_dataset(dataset, DummyTokenizerWithIds(), output) and assert the
produced output file and token counts match expectations (including
deterministic behavior if desired); reference tokenize_dataset and
DummyTokenizer to locate the relevant tests and ensure the new test mirrors
existing checks (file existence, total tokens, and binary contents) but using
the .ids-returning stub to cover that normalization branch.
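The requested test could look roughly like this. Encoding and DummyTokenizerWithIds are hypothetical stubs, and the mini tokenize_dataset restates the PR's described loop only so the example is self-contained; it is not the actual module under test.

```python
# Hypothetical test sketch for the `.ids`-returning normalization branch.
import tempfile
from pathlib import Path

import numpy as np


class Encoding:
    """Mimics a `tokenizers`-style result object that exposes .ids."""

    def __init__(self, ids):
        self.ids = ids


class DummyTokenizerWithIds:
    def encode(self, text):
        # Returns an object with .ids instead of a plain list.
        return Encoding([ord(c) for c in text])


def tokenize_dataset(input_file, tokenizer, output_file):
    # Restated mini version of the PR's loop (assumption, for self-containment).
    input_path = Path(input_file)
    if not input_path.exists():
        raise FileNotFoundError(f"Dataset file not found: {input_path}")
    total = 0
    with input_path.open("r", encoding="utf-8") as fin, Path(output_file).open("wb") as fout:
        for line in fin:
            text = line.strip()
            if not text:
                continue
            encoded = tokenizer.encode(text)
            # Branch under test: object with .ids, not a plain list.
            ids = encoded if isinstance(encoded, list) else list(encoded.ids)
            arr = np.asarray(ids, dtype=np.uint32)
            arr.tofile(fout)
            total += arr.size
    return total


tmp = Path(tempfile.mkdtemp())
(tmp / "data.txt").write_text("ab\n", encoding="utf-8")
out = tmp / "tokens.bin"
count = tokenize_dataset(tmp / "data.txt", DummyTokenizerWithIds(), out)
tokens = np.fromfile(out, dtype=np.uint32).tolist()
```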
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/tokenize_dataset.py`:
- Line 5: tokenize_dataset currently calls tokenizer.encode(...) which doesn't
exist on the repository tokenizers; update tokenize_dataset to handle the
tokenizer interface robustly: check for methods in order (hasattr(tokenizer,
"encode") -> use it; elif hasattr(tokenizer, "tokenize") -> call
tokenizer.tokenize(...) and then convert tokens to ids via
tokenizer.convert_tokens_to_ids or tokenizer.tokens_to_ids if present; elif
hasattr(tokenizer, "encode_batch") use that; otherwise raise a clear error
mentioning tokenize_dataset and the missing methods. Ensure you reference the
tokenizer instance and preserve behavior for batching and special tokens when
available.

---

Duplicate comments:
In `@tests/test_tokenizer.py`:
- Around line 174-230: Add a new test that exercises the branch in
tokenize_dataset where the tokenizer returns an object with an .ids attribute:
create a small stub class (e.g., DummyTokenizerWithIds) whose encode(text)
returns an object with an .ids list/array of token ints, then call
tokenize_dataset(dataset, DummyTokenizerWithIds(), output) and assert the
produced output file and token counts match expectations (including
deterministic behavior if desired); reference tokenize_dataset and
DummyTokenizer to locate the relevant tests and ensure the new test mirrors
existing checks (file existence, total tokens, and binary contents) but using
the .ids-returning stub to cover that normalization branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c7b76a18-8007-4760-9d23-74752a2b7744

📥 Commits

Reviewing files that changed from the base of the PR and between a8473a3 and 5532f40.

📒 Files selected for processing (4)
  • openverifiablellm/tokenizer/__init__.py
  • openverifiablellm/tokenizer/tokenize_dataset.py
  • pyproject.toml
  • tests/test_tokenizer.py

openverifiablellm/tokenizer/tokenize_dataset.py (excerpt the comment below is anchored to):

    import numpy as np

    def tokenize_dataset(input_file, tokenizer, output_file):

⚠️ Potential issue | 🟠 Major

tokenize_dataset is not compatible with the repository tokenizer classes.

Line 47 assumes tokenizer.encode(...) exists, but openverifiablellm/tokenizer/factory.py returns tokenizer classes (e.g., openverifiablellm/tokenizer/bpe_tokenizer.py::BPETokenizer) that do not expose encode. This will fail at runtime for the package’s own tokenizer instances and breaks the intended trained-tokenizer workflow.

Suggested hardening (clear failure mode in this function)
 def tokenize_dataset(input_file, tokenizer, output_file):
+    if not hasattr(tokenizer, "encode"):
+        raise TypeError(
+            f"tokenizer must expose encode(text); got {type(tokenizer).__name__}"
+        )

Also applies to: 47-47

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/tokenize_dataset.py` at line 5, tokenize_dataset
currently calls tokenizer.encode(...) which doesn't exist on the repository
tokenizers; update tokenize_dataset to handle the tokenizer interface robustly:
check for methods in order (hasattr(tokenizer, "encode") -> use it; elif
hasattr(tokenizer, "tokenize") -> call tokenizer.tokenize(...) and then convert
tokens to ids via tokenizer.convert_tokens_to_ids or tokenizer.tokens_to_ids if
present; elif hasattr(tokenizer, "encode_batch") use that; otherwise raise a
clear error mentioning tokenize_dataset and the missing methods. Ensure you
reference the tokenizer instance and preserve behavior for batching and special
tokens when available.
