
feat: deterministic streaming dataset tokenization #72

Open
Varshiniputtabakula wants to merge 2 commits into AOSSIE-Org:main from Varshiniputtabakula:feat-dataset-tokenization-clean

Conversation

Varshiniputtabakula commented Mar 14, 2026

This PR implements deterministic dataset tokenization using the trained tokenizer.

Features:

  • streaming tokenization for large datasets
  • deterministic encoding
  • binary token storage
  • additional tokenizer tests

This bridges the gap between tokenizer training and downstream training infrastructure.

All existing tests pass and tokenizer tests are consolidated in tests/test_tokenizer.py.
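The described behavior can be sketched end to end. Everything below is illustrative: DummyTokenizer and this mini tokenize_dataset merely restate what the PR describes (stream lines, skip empties, write uint32 token IDs, return the total count) and are not the merged implementation.

```python
# Illustrative sketch only; DummyTokenizer and this loop are stand-ins, not the PR's code.
import tempfile
from pathlib import Path

import numpy as np


class DummyTokenizer:
    """Stand-in for a trained tokenizer: maps each character to its code point."""

    def encode(self, text):
        return [ord(c) for c in text]


def tokenize_dataset(input_file, tokenizer, output_file):
    """Stream lines from input_file, encode them, and append uint32 IDs to output_file."""
    input_path = Path(input_file)
    if not input_path.exists():
        raise FileNotFoundError(f"Dataset file not found: {input_path}")
    total_tokens = 0
    with input_path.open("r", encoding="utf-8") as fin, Path(output_file).open("wb") as fout:
        for line in fin:
            text = line.strip()
            if not text:  # skip empty lines, as the PR describes
                continue
            encoded = tokenizer.encode(text)
            # Normalize: plain list, or an object exposing .ids
            ids = encoded if isinstance(encoded, list) else list(encoded.ids)
            arr = np.asarray(ids, dtype=np.uint32)
            arr.tofile(fout)
            total_tokens += arr.size
    return total_tokens


tmp = Path(tempfile.mkdtemp())
dataset = tmp / "data.txt"
dataset.write_text("hi\n\nok\n", encoding="utf-8")

out1, out2 = tmp / "run1.bin", tmp / "run2.bin"
count1 = tokenize_dataset(dataset, DummyTokenizer(), out1)
count2 = tokenize_dataset(dataset, DummyTokenizer(), out2)

# Determinism: two runs over the same input produce byte-identical output.
deterministic = out1.read_bytes() == out2.read_bytes()
tokens = np.fromfile(out1, dtype=np.uint32).tolist()
```

Reading the binary back with np.fromfile(..., dtype=np.uint32) is also how the PR's tests can compare outputs byte for byte.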

Summary by CodeRabbit

  • New Features

    • Added a dataset tokenization utility that streams text line by line and deterministically encodes it into a binary file of token IDs.
    • Made the tokenization utility accessible from the package for easier use.
  • Dependencies

    • Added numpy as a runtime dependency.
  • Tests

    • Added tests covering output creation, determinism, missing/empty input handling.

Copilot AI review requested due to automatic review settings March 14, 2026 05:50
coderabbitai bot (Contributor) commented Mar 14, 2026

Walkthrough

Adds a streaming tokenize_dataset utility that encodes text lines with a provided tokenizer and writes token ID arrays to a binary file; re-exports the function at package level, adds numpy dependency, and introduces tests covering functionality and determinism.

Changes

  • Tokenization utility (openverifiablellm/tokenizer/tokenize_dataset.py): new streaming function tokenize_dataset(input_file, tokenizer, output_file) that reads input lines, skips empties, encodes via tokenizer.encode, normalizes token IDs, converts them to uint32 numpy arrays, writes binary output, and returns the total number of tokens written.
  • Package export (openverifiablellm/tokenizer/__init__.py): imports tokenize_dataset and adds it to the package __all__, exposing it at the package level.
  • Dependencies (pyproject.toml): adds numpy>=1.20 to project dependencies.
  • Tests (tests/test_tokenizer.py): adds tests for output creation, determinism, missing-input-file errors, and empty-input behavior, using a DummyTokenizer and numpy for binary comparisons.

Sequence Diagram

sequenceDiagram
    actor User
    participant "File I/O" as FileIO
    participant "tokenize_dataset" as Tokenize
    participant Tokenizer
    participant NumPy
    participant "Output File" as OutFile

    User->>Tokenize: call tokenize_dataset(input, tokenizer, output)
    Tokenize->>FileIO: check input exists
    FileIO-->>Tokenize: exists / raise FileNotFoundError
    Tokenize->>FileIO: open and stream input file
    loop per line
        FileIO-->>Tokenize: line
        Tokenize->>Tokenize: strip & skip empty
        Tokenize->>Tokenizer: encode(text)
        Tokenizer-->>Tokenize: token IDs (list or object.ids)
        Tokenize->>Tokenize: normalize IDs to list
        Tokenize->>NumPy: create uint32 array from IDs
        NumPy-->>Tokenize: array
        Tokenize->>OutFile: write array to binary
    end
    Tokenize-->>User: return total token count

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Python Lang

Poem

🐇 I hop through lines both short and long,

I nibble tokens, tidy and strong,
I write them down in binary song,
Deterministic paw-steps all along,
Hooray — new token trails to romp and prong!

🚥 Pre-merge checks | ✅ 2 passed
  • Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'feat: deterministic streaming dataset tokenization' accurately describes the main change: adding a new tokenize_dataset function with streaming capabilities and deterministic behavior.



Copilot AI left a comment


Pull request overview

Adds a dataset tokenization helper to the openverifiablellm.tokenizer package and introduces tests verifying output creation and determinism.

Changes:

  • Add tokenize_dataset() to stream-tokenize a text dataset into a binary token file.
  • Add unit tests for tokenize_dataset() output existence and deterministic output.
  • Add NumPy as a project dependency (used for binary token writing/reading).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

  • openverifiablellm/tokenizer/tokenize_dataset.py: new streaming tokenization helper that writes uint32 tokens to a binary file.
  • openverifiablellm/tokenizer/__init__.py: re-exports tokenize_dataset from the tokenizer package.
  • tests/test_tokenizer.py: adds tests for the new dataset tokenization helper.
  • pyproject.toml: adds NumPy dependency to support token binary I/O.


Comment anchors (code excerpts the review comments attach to):

Comment on lines +4 to +6 (tests/test_tokenizer.py):

    from pathlib import Path
    import tempfile
    import numpy as np

Comment on lines +171 to +176 (tests/test_tokenizer.py; note the duplicated import):

    from openverifiablellm.tokenizer.tokenize_dataset import tokenize_dataset

    from openverifiablellm.tokenizer.tokenize_dataset import tokenize_dataset

Comment on @@ -4,3 +4,5 @@ (openverifiablellm/tokenizer/__init__.py):

    "train_tokenizer",
    "hash_tokenizer_config",

Comment on lines +29 to +40 (openverifiablellm/tokenizer/tokenize_dataset.py):

    input_path = Path(input_file)
    output_path = Path(output_file)

    if not input_path.exists():
        raise FileNotFoundError(f"Dataset file not found: {input_path}")

    total_tokens = 0

    # open dataset for streaming
    with input_path.open("r", encoding="utf-8") as fin, \
         output_path.open("wb") as fout:

Comment on lines +1 to +2:

    from pathlib import Path
    import numpy as np

Comment on lines 14 to 19 (pyproject.toml; the diff adds the missing comma and numpy):

    dependencies = [
        "defusedxml",
        "sentencepiece",
    -    "tokenizers==0.15.2"
    +    "tokenizers==0.15.2",
    +    "numpy"
    ]
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/tokenizer/__init__.py (1)

3-8: ⚠️ Potential issue | 🟡 Minor

Add tokenize_dataset to __all__ for consistency.

The tokenize_dataset function is imported but not added to __all__. This makes it inconsistent with train_tokenizer and hash_tokenizer_config, and it won't be included in wildcard imports.

🔧 Proposed fix
 from .train import hash_tokenizer_config, train_tokenizer
+from .tokenize_dataset import tokenize_dataset

 __all__ = [
     "train_tokenizer",
     "hash_tokenizer_config",
+    "tokenize_dataset",
 ]
-
-from .tokenize_dataset import tokenize_dataset
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/__init__.py` around lines 3 - 8, The __all__ list
is missing the exported symbol tokenize_dataset; update the module's __all__ to
include "tokenize_dataset" alongside "train_tokenizer" and
"hash_tokenizer_config" so that tokenize_dataset is exported for wildcard
imports and remains consistent with the import from .tokenize_dataset.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/tokenize_dataset.py`:
- Around line 50-54: The code handling tokenizer outputs in tokenize_dataset.py
currently assumes encoded is a list or has an ids attribute (assigned to
tokens); add explicit validation after calling tokenizer.encode() (check
isinstance(encoded, list) first, then hasattr(encoded, "ids") and that
encoded.ids is a list/sequence) and if neither condition is met raise a clear
TypeError (or ValueError) that includes the actual type/value of encoded and
mentions tokenizer.encode() so callers know the return shape is unsupported;
update any callers or unit tests relying on encoded to expect this error path.

In `@pyproject.toml`:
- Around line 17-18: The pyproject dependency list currently pins
"tokenizers==0.15.2" but leaves "numpy" unversioned; update the numpy
declaration to include a minimum version (e.g., change "numpy" to "numpy>=1.20")
in the pyproject.toml dependencies to improve reproducibility and align with the
existing pinning practice.

In `@tests/test_tokenizer.py`:
- Around line 171-175: Move the stray import of tokenize_dataset into the module
import block at the top of tests/test_tokenizer.py so it sits with the other
openverifiablellm imports (e.g., alongside hash_tokenizer_config and
train_tokenizer); remove the duplicate import lines currently at lines 171-175
and add a single line "from openverifiablellm.tokenizer.tokenize_dataset import
tokenize_dataset" to the grouped imports at the top of the file to satisfy PEP8
and keep imports organized.
- Around line 183-214: Add tests to cover error and edge cases for
tokenize_dataset: add a test that calls tokenize_dataset with a non-existent
Path and asserts it raises FileNotFoundError (mirroring train_tokenizer
behavior), a test that writes an empty file and asserts tokenize_dataset returns
0 and produces an empty output file, and a test that exercises the branch where
the tokenizer returns an object with an .ids attribute (use DummyTokenizer or a
small stub that returns an object with .ids) to verify correct
output/determinism; reference the tokenize_dataset function and DummyTokenizer
test fixtures in tests/test_tokenizer.py when adding these cases.
- Around line 4-6: The file imports an unused module: remove the standalone
"import tempfile" import from the top-level imports (alongside "from pathlib
import Path" and "import numpy as np") in tests/test_tokenizer.py so the file no
longer contains the unused symbol "tempfile" and relies on pytest's tmp_path
fixture instead.
- Line 186: The call to dataset.write_text("hello\nworld") should explicitly
pass encoding="utf-8" to match tokenize_dataset's file reading and ensure
consistent cross-platform behavior; update the dataset.write_text invocation
(and the other write_text call in the same test) to include encoding="utf-8" so
both writes use UTF-8 encoding.

---

Outside diff comments:
In `@openverifiablellm/tokenizer/__init__.py`:
- Around line 3-8: The __all__ list is missing the exported symbol
tokenize_dataset; update the module's __all__ to include "tokenize_dataset"
alongside "train_tokenizer" and "hash_tokenizer_config" so that tokenize_dataset
is exported for wildcard imports and remains consistent with the import from
.tokenize_dataset.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9860997b-5f54-4164-8cd9-1bd8e68eaf40

📥 Commits

Reviewing files that changed from the base of the PR and between c352df0 and a8473a3.

📒 Files selected for processing (4)
  • openverifiablellm/tokenizer/__init__.py
  • openverifiablellm/tokenizer/tokenize_dataset.py
  • pyproject.toml
  • tests/test_tokenizer.py

github-actions bot added size/M and removed size/M labels Mar 16, 2026
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/test_tokenizer.py (1)

174-230: 🧹 Nitpick | 🔵 Trivial

Add one test for tokenizer outputs that expose .ids.

Current DummyTokenizer only exercises the list-return path. Please add a small stub returning an object with .ids so the normalization branch in tokenize_dataset is covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_tokenizer.py` around lines 174 - 230, Add a new test that
exercises the branch in tokenize_dataset where the tokenizer returns an object
with an .ids attribute: create a small stub class (e.g., DummyTokenizerWithIds)
whose encode(text) returns an object with an .ids list/array of token ints, then
call tokenize_dataset(dataset, DummyTokenizerWithIds(), output) and assert the
produced output file and token counts match expectations (including
deterministic behavior if desired); reference tokenize_dataset and
DummyTokenizer to locate the relevant tests and ensure the new test mirrors
existing checks (file existence, total tokens, and binary contents) but using
the .ids-returning stub to cover that normalization branch.
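The requested test could look roughly like this. Encoding and DummyTokenizerWithIds are hypothetical stubs, and the mini tokenize_dataset restates the PR's described loop only so the example is self-contained; it is not the actual module under test.

```python
# Hypothetical test sketch for the `.ids`-returning normalization branch.
import tempfile
from pathlib import Path

import numpy as np


class Encoding:
    """Mimics a `tokenizers`-style result object that exposes .ids."""

    def __init__(self, ids):
        self.ids = ids


class DummyTokenizerWithIds:
    def encode(self, text):
        # Returns an object with .ids instead of a plain list.
        return Encoding([ord(c) for c in text])


def tokenize_dataset(input_file, tokenizer, output_file):
    # Restated mini version of the PR's loop (assumption, for self-containment).
    input_path = Path(input_file)
    if not input_path.exists():
        raise FileNotFoundError(f"Dataset file not found: {input_path}")
    total = 0
    with input_path.open("r", encoding="utf-8") as fin, Path(output_file).open("wb") as fout:
        for line in fin:
            text = line.strip()
            if not text:
                continue
            encoded = tokenizer.encode(text)
            # Branch under test: object with .ids, not a plain list.
            ids = encoded if isinstance(encoded, list) else list(encoded.ids)
            arr = np.asarray(ids, dtype=np.uint32)
            arr.tofile(fout)
            total += arr.size
    return total


tmp = Path(tempfile.mkdtemp())
(tmp / "data.txt").write_text("ab\n", encoding="utf-8")
out = tmp / "tokens.bin"
count = tokenize_dataset(tmp / "data.txt", DummyTokenizerWithIds(), out)
tokens = np.fromfile(out, dtype=np.uint32).tolist()
```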
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/tokenize_dataset.py`:
- Line 5: tokenize_dataset currently calls tokenizer.encode(...) which doesn't
exist on the repository tokenizers; update tokenize_dataset to handle the
tokenizer interface robustly: check for methods in order (hasattr(tokenizer,
"encode") -> use it; elif hasattr(tokenizer, "tokenize") -> call
tokenizer.tokenize(...) and then convert tokens to ids via
tokenizer.convert_tokens_to_ids or tokenizer.tokens_to_ids if present; elif
hasattr(tokenizer, "encode_batch") use that; otherwise raise a clear error
mentioning tokenize_dataset and the missing methods. Ensure you reference the
tokenizer instance and preserve behavior for batching and special tokens when
available.

---

Duplicate comments:
In `@tests/test_tokenizer.py`:
- Around line 174-230: Add a new test that exercises the branch in
tokenize_dataset where the tokenizer returns an object with an .ids attribute:
create a small stub class (e.g., DummyTokenizerWithIds) whose encode(text)
returns an object with an .ids list/array of token ints, then call
tokenize_dataset(dataset, DummyTokenizerWithIds(), output) and assert the
produced output file and token counts match expectations (including
deterministic behavior if desired); reference tokenize_dataset and
DummyTokenizer to locate the relevant tests and ensure the new test mirrors
existing checks (file existence, total tokens, and binary contents) but using
the .ids-returning stub to cover that normalization branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c7b76a18-8007-4760-9d23-74752a2b7744

📥 Commits

Reviewing files that changed from the base of the PR and between a8473a3 and 5532f40.

📒 Files selected for processing (4)
  • openverifiablellm/tokenizer/__init__.py
  • openverifiablellm/tokenizer/tokenize_dataset.py
  • pyproject.toml
  • tests/test_tokenizer.py

openverifiablellm/tokenizer/tokenize_dataset.py (excerpt the comment below is anchored to):

    import numpy as np

    def tokenize_dataset(input_file, tokenizer, output_file):

⚠️ Potential issue | 🟠 Major

tokenize_dataset is not compatible with the repository tokenizer classes.

Line 47 assumes tokenizer.encode(...) exists, but openverifiablellm/tokenizer/factory.py returns tokenizer classes (e.g., openverifiablellm/tokenizer/bpe_tokenizer.py::BPETokenizer) that do not expose encode. This will fail at runtime for the package’s own tokenizer instances and breaks the intended trained-tokenizer workflow.

Suggested hardening (clear failure mode in this function)
 def tokenize_dataset(input_file, tokenizer, output_file):
+    if not hasattr(tokenizer, "encode"):
+        raise TypeError(
+            f"tokenizer must expose encode(text); got {type(tokenizer).__name__}"
+        )

Also applies to: 47-47

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/tokenizer/tokenize_dataset.py` at line 5, tokenize_dataset
currently calls tokenizer.encode(...) which doesn't exist on the repository
tokenizers; update tokenize_dataset to handle the tokenizer interface robustly:
check for methods in order (hasattr(tokenizer, "encode") -> use it; elif
hasattr(tokenizer, "tokenize") -> call tokenizer.tokenize(...) and then convert
tokens to ids via tokenizer.convert_tokens_to_ids or tokenizer.tokens_to_ids if
present; elif hasattr(tokenizer, "encode_batch") use that; otherwise raise a
clear error mentioning tokenize_dataset and the missing methods. Ensure you
reference the tokenizer instance and preserve behavior for batching and special
tokens when available.
