
[FIX] : Add Unit Tests for create_tokenizer and load_merkle_proof #78

Draft
Shubhamx404 wants to merge 36 commits into AOSSIE-Org:main from Shubhamx404:fix-tokenizer-tests

Conversation


@Shubhamx404 Shubhamx404 commented Mar 17, 2026

Addressed Issues:

Fixes #77

Overview

The repository currently lacks unit tests for two utility functions: create_tokenizer (covered in tests/test_tokenizer.py) and load_merkle_proof (covered in tests/test_util.py).

  • This gap creates blind spots: the code becomes harder to refactor safely, and bugs can be introduced without the test suite detecting them.

Solution

Changes Made

1. Export create_tokenizer

Added create_tokenizer to openverifiablellm/tokenizer/__init__.py, enabling clean imports:

from openverifiablellm.tokenizer import create_tokenizer

Tests for create_tokenizer (tests/test_tokenizer.py)

Added unit tests to validate factory behavior:

  • Returns BPETokenizer when tokenizer_type="bpe"
  • Returns SentencePieceTokenizer when tokenizer_type="sentencepiece"
  • Raises ValueError for unsupported tokenizer types
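These behaviors can be sketched with a self-contained stand-in for the factory (the real create_tokenizer, BPETokenizer, and SentencePieceTokenizer live in openverifiablellm.tokenizer; the constructor details below are illustrative only):

```python
# Illustrative stand-in for the factory under test; constructor details are
# invented here, only the three behaviours listed above are taken from the PR.
class BPETokenizer:
    def __init__(self, vocab_size, min_frequency):
        self.vocab_size = vocab_size
        self.min_frequency = min_frequency

class SentencePieceTokenizer(BPETokenizer):
    pass

def create_tokenizer(tokenizer_type, vocab_size, min_frequency):
    tokenizer_type = tokenizer_type.lower()
    if tokenizer_type == "bpe":
        return BPETokenizer(vocab_size, min_frequency)
    if tokenizer_type == "sentencepiece":
        return SentencePieceTokenizer(vocab_size, min_frequency)
    raise ValueError(f"Unsupported tokenizer type: {tokenizer_type}")

# The three behaviours as plain assertions (the real tests would use
# pytest and pytest.raises instead):
assert isinstance(create_tokenizer("bpe", 1000, 2), BPETokenizer)
assert isinstance(create_tokenizer("SentencePiece", 1000, 2), SentencePieceTokenizer)
try:
    create_tokenizer("wordpiece", 1000, 2)
    raise AssertionError("expected ValueError")
except ValueError:
    pass
```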

Tests for load_merkle_proof (tests/test_util.py)

  • Successfully loads valid JSON Merkle proof
  • Raises FileNotFoundError for missing files
  • Raises json.JSONDecodeError for malformed JSON
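A self-contained sketch of the same three behaviours; the stand-in load_merkle_proof here only mirrors the documented contract of the real helper (load JSON from a path and let FileNotFoundError / json.JSONDecodeError propagate):

```python
import json
import tempfile
from pathlib import Path

# Stand-in mirroring the documented contract of the package's load_merkle_proof;
# the real implementation lives in the openverifiablellm package.
def load_merkle_proof(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as tmp:
    # 1. Successfully loads a valid JSON Merkle proof.
    good = Path(tmp) / "proof.json"
    good.write_text('{"root": "abc123", "siblings": []}', encoding="utf-8")
    assert load_merkle_proof(good)["root"] == "abc123"

    # 2. Raises FileNotFoundError for missing files.
    try:
        load_merkle_proof(Path(tmp) / "missing.json")
        raise AssertionError("expected FileNotFoundError")
    except FileNotFoundError:
        pass

    # 3. Raises json.JSONDecodeError for malformed JSON.
    bad = Path(tmp) / "bad.json"
    bad.write_text("{not valid json", encoding="utf-8")
    try:
        load_merkle_proof(bad)
        raise AssertionError("expected JSONDecodeError")
    except json.JSONDecodeError:
        pass
```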

Screenshots/Recordings:

(Screenshot: ChatGPT Image Mar 17, 2026, 12_21_43 PM)

Additional Notes:

Checklist

  • [x] My code follows the project's code style and conventions
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings or errors
  • [x] I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • [x] I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Added manifest chain hashing & verification to detect tampering and link manifests
    • Checkpointable resuming for large text extraction and improved manifest linkage
    • New CLIs to preprocess dumps and verify datasets; previous-manifest support
  • Improvements

    • Exposed a tokenizer factory function and silenced a third‑party deprecation warning during import
    • Added raw-bytes SHA256 helper and centralized hashing
  • Tests

    • Comprehensive tests for manifest chain, tokenizer factory, utilities, and verification flows
  • Chores

    • Added numpy to dev dependencies

@github-actions github-actions bot added size/L and removed size/L labels Mar 17, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/__init__.py`:
- Line 1: The package-level import eagerly re-exports create_tokenizer which
causes SentencePieceTokenizer (and the optional sentencepiece dependency) to be
imported at module import time; change __init__.py to expose create_tokenizer
via a lazy wrapper that imports create_tokenizer only when called, and modify
factory.py so SentencePieceTokenizer (and any import of sentencepiece) is
imported lazily inside create_tokenizer() only when the sentencepiece-backed
branch is selected; update references to the factory-level create_tokenizer and
the class name SentencePieceTokenizer to ensure the dynamic import path is used.

In `@pyproject.toml`:
- Around line 37-39: There are duplicate TOML tables for
[tool.pytest.ini_options] causing a CI TOML parse error; locate the existing
[tool.pytest.ini_options] table and merge the new keys (e.g. filterwarnings =
[...]) into that single table, then remove this duplicate table entry so only
one [tool.pytest.ini_options] block remains in pyproject.toml.
- Line 26: Move "numpy" out of the development dependency group and add it to
the project's core dependencies: remove "numpy" from dependency-groups.dev and
add it under the main dependencies section (the project dependencies block in
pyproject.toml). Update any existing dependency listing to ensure only one entry
for "numpy" remains (no duplicates) and run the lock/build step to regenerate
the lockfile so runtime installs include numpy.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 54f61b50-1240-48c7-8bba-6da56fa2efee

📥 Commits

Reviewing files that changed from the base of the PR and between c352df0 and 05c6b3d.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • openverifiablellm/tokenizer/__init__.py
  • openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  • pyproject.toml
  • tests/test_tokenizer.py
  • tests/test_util.py
  • tests/test_verify.py

@@ -1,6 +1,8 @@
from .factory import create_tokenizer

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C2 --type=py 'from \.factory import create_tokenizer|from \.sentencepiece_tokenizer import SentencePieceTokenizer|import sentencepiece as spm'

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1413


🏁 Script executed:

cat -n openverifiablellm/tokenizer/factory.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 650


🏁 Script executed:

head -20 openverifiablellm/tokenizer/sentencepiece_tokenizer.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 649


🏁 Script executed:

cat pyproject.toml | head -50

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 738


🏁 Script executed:

cat setup.py 2>/dev/null || echo "setup.py not found"

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 90


🏁 Script executed:

rg -n 'create_tokenizer' --type=py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1406


Decouple optional sentencepiece dependency from package import.

The eager re-export in __init__.py pulls in SentencePieceTokenizer from factory.py, which in turn imports the optional sentencepiece library at module level. This breaks import openverifiablellm.tokenizer for users who only need BPE and don't have sentencepiece installed.

The fix requires lazy importing in two places:

  1. In __init__.py: Wrap the create_tokenizer re-export in a lazy import function
  2. In factory.py: Lazy-import SentencePieceTokenizer inside create_tokenizer() only when needed
Suggested fix

In openverifiablellm/tokenizer/__init__.py:

-from .factory import create_tokenizer
+def create_tokenizer(tokenizer_type, vocab_size, min_frequency):
+    from .factory import create_tokenizer as _create_tokenizer
+    return _create_tokenizer(tokenizer_type, vocab_size, min_frequency)

In openverifiablellm/tokenizer/factory.py:

-from .bpe_tokenizer import BPETokenizer
-from .sentencepiece_tokenizer import SentencePieceTokenizer
-
-
 def create_tokenizer(tokenizer_type, vocab_size, min_frequency):
+    from .bpe_tokenizer import BPETokenizer
 
     tokenizer_type = tokenizer_type.lower()
 
     if tokenizer_type == "bpe":
         return BPETokenizer(vocab_size, min_frequency)
 
     if tokenizer_type == "sentencepiece":
+        from .sentencepiece_tokenizer import SentencePieceTokenizer
         return SentencePieceTokenizer(vocab_size, min_frequency)
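The effect of that pattern can be demonstrated with a self-contained toy (the module and class names below are invented for illustration): importing the factory never touches the optional backend, which is only imported when its branch is actually selected.

```python
import sys
import types

# Simulate an optional dependency that is not installed at import time.
heavy = types.ModuleType("heavy_optional_dep")
heavy.Backend = type("Backend", (), {})

def create_tokenizer(kind):
    if kind == "light":
        return object()
    if kind == "heavy":
        # Imported only when this branch runs, so merely importing the
        # factory's module never requires the optional dependency.
        import heavy_optional_dep
        return heavy_optional_dep.Backend()
    raise ValueError(kind)

# The "light" branch works even though heavy_optional_dep is not importable yet.
assert create_tokenizer("light") is not None

# Once the optional dependency is available, the "heavy" branch works too.
sys.modules["heavy_optional_dep"] = heavy
assert isinstance(create_tokenizer("heavy"), heavy.Backend)
```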

dev = [
"pytest>=7.0",
"ruff>=0.15.4",
"numpy",

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import pathlib, tomllib
data = tomllib.loads(pathlib.Path("pyproject.toml").read_text())
print("project.dependencies =", data.get("project", {}).get("dependencies", []))
print("dependency-groups.dev =", data.get("dependency-groups", {}).get("dev", []))
PY

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 196


Move numpy to core dependencies instead of dev-only.

Per repository policy, numpy must be a core dependency for this LLM-focused project. Keeping it in dependency-groups.dev causes installation and runtime divergence in non-dev environments.
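Concretely, the suggested pyproject.toml layout would look something like this (version pins are illustrative, not taken from the repo):

```toml
[project]
dependencies = [
    "numpy",
]

[dependency-groups]
dev = [
    "pytest>=7.0",
    "ruff>=0.15.4",
]
```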


@Shubhamx404
Contributor Author

@CodeRabbit review


coderabbitai bot commented Mar 17, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@Shubhamx404 Shubhamx404 marked this pull request as draft March 17, 2026 17:57
@github-actions github-actions bot added size/XL and removed size/L labels Mar 18, 2026
@Shubhamx404 Shubhamx404 marked this pull request as ready for review March 18, 2026 14:34
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 18, 2026
@Shubhamx404 Shubhamx404 marked this pull request as draft March 18, 2026 14:40
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
pyproject.toml (1)

23-28: ⚠️ Potential issue | 🟠 Major

Move numpy into core dependencies.

Right now it only lands in dependency-groups.dev, so regular installs still miss it.

This shows whether numpy is listed under [project].dependencies or only under the dev group:

#!/bin/bash
rg -n -C2 '^\[project\]|^dependencies = \[|^\[dependency-groups\]|^dev = \[|^\s*"numpy"|^\s*"pytest>=|^\s*"ruff>=' pyproject.toml

Based on learnings: In pyproject.toml for this repository, keep numpy and torch as core (required) dependencies rather than optional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 23 - 28, The pyproject currently lists "numpy"
only under the dev dependency group so normal installs won't get it; move
"numpy" (and ensure "torch" if present) into the core dependencies list under
[project].dependencies by removing "numpy" from the [dependency-groups] dev
array and adding it to the dependencies = [ ... ] block under [project], keeping
version constraints consistent with the dev entry; update the
[dependency-groups].dev array to remove "numpy" and verify there are no
duplicate entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 90-92: The local variable `hashable` is always a dict (derived
from `manifest_data`), so remove the redundant isinstance check and directly
remove the optional key: call `hashable.pop("parent_manifest_hash", None)` on
`hashable` (created from `manifest_data.copy()`) instead of guarding it with
`isinstance`; update any surrounding comments if present to reflect the
simplification in the logic within the manifest handling code.

In `@openverifiablellm/utils.py`:
- Around line 347-351: Persist the final checkpoint (the file referenced by
checkpoint_path and the in-memory pages_processed state) immediately before
calling generate_manifest(input_path, output_path) so that if
generate_manifest() fails the next run will resume from the correct
pages_processed value; update the flow in the block that currently checks
write_manifest to first write/save the checkpoint (or call the existing
checkpoint save routine) then call generate_manifest, and only unlink
checkpoint_path after the checkpoint has been successfully persisted and
manifest generation completes.
- Around line 291-345: The checkpoint saving calls pass the Path object
input_path to _save_checkpoint, causing JSON serialization errors; compute or
retrieve a string input_identity from the checkpoint (e.g., input_identity =
checkpoint.get("input_identity") or _compute_input_identity(input_path)) right
after _load_checkpoint in extract_text_from_xml, then replace the three calls
_save_checkpoint(checkpoint_path, pages_written, input_path) with
_save_checkpoint(checkpoint_path, pages_written, input_identity) (also update
the initial resume-save call that uses pages_written), ensuring _save_checkpoint
receives a str input_identity rather than a Path.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bd4df894-14b2-4eca-8b18-53229ed4dcbb

📥 Commits

Reviewing files that changed from the base of the PR and between 3123770 and 890d77c.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
  • openverifiablellm/manifest_chain.py
  • openverifiablellm/tokenizer/bpe_tokenizer.py
  • openverifiablellm/tokenizer/factory.py
  • openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  • openverifiablellm/utils.py
  • openverifiablellm/verify.py
  • pyproject.toml
  • scripts/download_dump.py
  • scripts/preprocess_dump.py
  • scripts/verify_dataset.py
  • tests/test_manifest_chain.py
  • tests/test_util.py
  • tests/test_verify.py
💤 Files with no reviewable changes (2)
  • openverifiablellm/tokenizer/factory.py
  • openverifiablellm/tokenizer/bpe_tokenizer.py

Comment on lines +90 to +92
hashable = manifest_data.copy()
if isinstance(hashable, dict):
    hashable.pop("parent_manifest_hash", None)

🧹 Nitpick | 🔵 Trivial

Redundant type check.

At this point in the code flow, hashable is always a dict (either from the input dict or from json.load), making the isinstance(hashable, dict) check unnecessary.

♻️ Simplify by removing redundant check
     hashable = manifest_data.copy()
-    if isinstance(hashable, dict):
-        hashable.pop("parent_manifest_hash", None)
+    hashable.pop("parent_manifest_hash", None)

Comment on lines +291 to +345
 checkpoint_path = _checkpoint_path(output_dir)

 # Load checkpoint — tells us how many pages were already written
 checkpoint = _load_checkpoint(checkpoint_path, input_path, output_path)
 pages_already_done = checkpoint["pages_processed"]

 # If resuming, append to existing output; otherwise start fresh
 write_mode = "a" if pages_already_done > 0 else "w"

 # Auto-detect file type using magic bytes separation
 with open(input_path, "rb") as test_f:
     is_bz2 = test_f.read(3) == b"BZh"

 open_func = bz2.open if is_bz2 else open

-with open_func(input_path, "rb") as f:
-    context = ET.iterparse(f, events=("end",))
-
-    with open(output_path, "w", encoding="utf-8") as out:
-        for _, elem in context:
-            if elem.tag.endswith("page"):
-                text_elem = elem.find(".//{*}text")
-
-                if text_elem is not None and text_elem.text:
-                    cleaned = clean_wikitext(text_elem.text)
-                    if cleaned:
-                        out.write(cleaned + "\n\n")
+pages_seen = 0
+pages_written = pages_already_done
+
+try:
+    with open_func(input_path, "rb") as f:
+        context = ET.iterparse(f, events=("end",))
+
+        with open(output_path, write_mode, encoding="utf-8") as out:
+            for _, elem in context:
+                if elem.tag.endswith("page"):
+                    pages_seen += 1
+
+                    # Skip pages already processed in a previous run
+                    if pages_seen <= pages_already_done:
+                        elem.clear()
+                        continue
+
+                    text_elem = elem.find(".//{*}text")
+
+                    if text_elem is not None and text_elem.text:
+                        cleaned = clean_wikitext(text_elem.text)
+                        if cleaned:
+                            out.write(cleaned + "\n\n")
+
+                        pages_written += 1
+                    elem.clear()
+
+                    # Flush output and save checkpoint periodically
+                    if pages_written % CHECKPOINT_INTERVAL == 0:
+                        out.flush()
+                        _save_checkpoint(checkpoint_path, pages_written, input_path)
+except KeyboardInterrupt:
+    _save_checkpoint(checkpoint_path, pages_written, input_path)
+    logger.warning("Interrupted by user after %d pages. Run again to resume.", pages_written)
+    raise
+except Exception:
+    # Save progress before propagating the exception so the next run can resume
+    _save_checkpoint(checkpoint_path, pages_written, input_path)
+    logger.error("Processing interrupted after %d pages. Run again to resume.", pages_written)
+    raise

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
from pathlib import Path
import json
try:
    json.dumps({"input_identity": Path("dump.xml.bz2")})
except TypeError as exc:
    print(type(exc).__name__, exc)
PY

rg -n -C2 '_save_checkpoint\(checkpoint_path, pages_written, input_path\)|def _save_checkpoint|input_identity' openverifiablellm/utils.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1715


🏁 Script executed:

# Get the full extract_text_from_xml function definition
sed -n '270,350p' openverifiablellm/utils.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 3116


🏁 Script executed:

# Check what happens after line 345 (post-parse logic)
sed -n '345,360p' openverifiablellm/utils.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 547


Pass the hashed input identity into _save_checkpoint() instead of the input path.

Lines 336, 338, and 343 pass input_path to _save_checkpoint(), but the function expects input_identity: str (line 234). Because extract_text_from_xml() normalizes input_path to a Path object at line 283, json.dump() fails to serialize the checkpoint with TypeError: Object of type PosixPath is not JSON serializable, and checkpoint persistence silently fails. This breaks resume functionality on subsequent runs.

Compute or retrieve the string identity once and pass it to all three calls:

checkpoint = _load_checkpoint(checkpoint_path, input_path, output_path)
pages_already_done = checkpoint["pages_processed"]
input_identity = checkpoint.get("input_identity") or _compute_input_identity(input_path)

Then replace all three calls to use input_identity instead of input_path.


Comment on lines +347 to +351
# Processing finished successfully — remove checkpoint so a fresh
# re-run (if ever needed) starts from the beginning
if write_manifest:
    generate_manifest(input_path, output_path)
checkpoint_path.unlink(missing_ok=True)

⚠️ Potential issue | 🟠 Major

Persist the final checkpoint before generate_manifest().

Line 349 can still fail after the last periodic save. When that happens, the next run resumes from a stale pages_processed value and re-appends the tail of wiki_clean.txt.

💡 Suggested fix
-    if write_manifest:
-        generate_manifest(input_path, output_path)
+    _save_checkpoint(checkpoint_path, pages_written, input_identity)
+    if write_manifest:
+        generate_manifest(input_path, output_path)
     checkpoint_path.unlink(missing_ok=True)



Development

Successfully merging this pull request may close these issues.

[FEATURE]: Add Unit Tests for create_tokenizer and load_merkle_proof

3 participants