[FIX]: Add Unit Tests for create_tokenizer and load_merkle_proof #78
Shubhamx404 wants to merge 36 commits into AOSSIE-Org:main from
Conversation
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/tokenizer/__init__.py`:
- Line 1: The package-level import eagerly re-exports create_tokenizer which
causes SentencePieceTokenizer (and the optional sentencepiece dependency) to be
imported at module import time; change __init__.py to expose create_tokenizer
via a lazy wrapper that imports create_tokenizer only when called, and modify
factory.py so SentencePieceTokenizer (and any import of sentencepiece) is
imported lazily inside create_tokenizer() only when the sentencepiece-backed
branch is selected; update references to the factory-level create_tokenizer and
the class name SentencePieceTokenizer to ensure the dynamic import path is used.
In `@pyproject.toml`:
- Around line 37-39: There are duplicate TOML tables for
[tool.pytest.ini_options] causing a CI TOML parse error; locate the existing
[tool.pytest.ini_options] table and merge the new keys (e.g. filterwarnings =
[...]) into that single table, then remove this duplicate table entry so only
one [tool.pytest.ini_options] block remains in pyproject.toml.
- Line 26: Move "numpy" out of the development dependency group and add it to
the project's core dependencies: remove "numpy" from dependency-groups.dev and
add it under the main dependencies section (the project dependencies block in
pyproject.toml). Update any existing dependency listing to ensure only one entry
for "numpy" remains (no duplicates) and run the lock/build step to regenerate
the lockfile so runtime installs include numpy.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 54f61b50-1240-48c7-8bba-6da56fa2efee
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
- openverifiablellm/tokenizer/__init__.py
- openverifiablellm/tokenizer/sentencepiece_tokenizer.py
- pyproject.toml
- tests/test_tokenizer.py
- tests/test_util.py
- tests/test_verify.py
```diff
@@ -1,6 +1,8 @@
 from .factory import create_tokenizer
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
rg -n -C2 --type=py 'from \.factory import create_tokenizer|from \.sentencepiece_tokenizer import SentencePieceTokenizer|import sentencepiece as spm'
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1413

🏁 Script executed:

```shell
cat -n openverifiablellm/tokenizer/factory.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 650

🏁 Script executed:

```shell
head -20 openverifiablellm/tokenizer/sentencepiece_tokenizer.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 649

🏁 Script executed:

```shell
cat pyproject.toml | head -50
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 738

🏁 Script executed:

```shell
cat setup.py 2>/dev/null || echo "setup.py not found"
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 90

🏁 Script executed:

```shell
rg -n 'create_tokenizer' --type=py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1406
Decouple optional sentencepiece dependency from package import.
The eager re-export in `__init__.py` pulls in `SentencePieceTokenizer` from `factory.py`, which in turn imports the optional `sentencepiece` library at module level. This breaks `import openverifiablellm.tokenizer` for users who only need BPE and don't have `sentencepiece` installed.
The fix requires lazy importing in two places:
- In `__init__.py`: wrap the `create_tokenizer` re-export in a lazy import function
- In `factory.py`: lazy-import `SentencePieceTokenizer` inside `create_tokenizer()` only when needed
Suggested fix
In `openverifiablellm/tokenizer/__init__.py`:

```diff
-from .factory import create_tokenizer
+def create_tokenizer(tokenizer_type, vocab_size, min_frequency):
+    from .factory import create_tokenizer as _create_tokenizer
+    return _create_tokenizer(tokenizer_type, vocab_size, min_frequency)
```

In `openverifiablellm/tokenizer/factory.py`:

```diff
-from .bpe_tokenizer import BPETokenizer
-from .sentencepiece_tokenizer import SentencePieceTokenizer
-
-
 def create_tokenizer(tokenizer_type, vocab_size, min_frequency):
+    from .bpe_tokenizer import BPETokenizer
     tokenizer_type = tokenizer_type.lower()
     if tokenizer_type == "bpe":
         return BPETokenizer(vocab_size, min_frequency)
     if tokenizer_type == "sentencepiece":
+        from .sentencepiece_tokenizer import SentencePieceTokenizer
         return SentencePieceTokenizer(vocab_size, min_frequency)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/tokenizer/__init__.py` at line 1, The package-level import
eagerly re-exports create_tokenizer which causes SentencePieceTokenizer (and the
optional sentencepiece dependency) to be imported at module import time; change
__init__.py to expose create_tokenizer via a lazy wrapper that imports
create_tokenizer only when called, and modify factory.py so
SentencePieceTokenizer (and any import of sentencepiece) is imported lazily
inside create_tokenizer() only when the sentencepiece-backed branch is selected;
update references to the factory-level create_tokenizer and the class name
SentencePieceTokenizer to ensure the dynamic import path is used.
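The lazy-wrapper pattern recommended above can be demonstrated in isolation. This sketch uses a stand-in stdlib module (`wave`) rather than the repo's tokenizer modules, purely to show that defining the wrapper imports nothing and that the first call triggers the import:

```python
import importlib
import sys

def make_lazy(module_name, attr):
    # Defer importing module_name until the wrapper is first called,
    # the same idea as the lazy create_tokenizer re-export suggested above.
    def wrapper(*args, **kwargs):
        module = importlib.import_module(module_name)
        return getattr(module, attr)(*args, **kwargs)
    return wrapper

sys.modules.pop("wave", None)          # clean slate for the demo
lazy_open = make_lazy("wave", "open")
deferred = "wave" not in sys.modules   # defining the wrapper imported nothing
try:
    lazy_open("no-such-file.wav")      # first call triggers the import...
except FileNotFoundError:
    pass                               # ...even though opening the file fails
imported = "wave" in sys.modules

print(deferred, imported)  # True True
```

A PEP 562 module-level `__getattr__` in `__init__.py` would achieve the same deferral without changing the call signature, which may be worth considering if the wrapper's explicit parameter list ever drifts from the factory's.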
```toml
dev = [
    "pytest>=7.0",
    "ruff>=0.15.4",
    "numpy",
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
python - <<'PY'
import pathlib, tomllib
data = tomllib.loads(pathlib.Path("pyproject.toml").read_text())
print("project.dependencies =", data.get("project", {}).get("dependencies", []))
print("dependency-groups.dev =", data.get("dependency-groups", {}).get("dev", []))
PY
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 196
Move numpy to core dependencies instead of dev-only.
Per repository policy, numpy must be a core dependency for this LLM-focused project. Keeping it in dependency-groups.dev causes installation and runtime divergence in non-dev environments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` at line 26, Move "numpy" out of the development dependency
group and add it to the project's core dependencies: remove "numpy" from
dependency-groups.dev and add it under the main dependencies section (the
project dependencies block in pyproject.toml). Update any existing dependency
listing to ensure only one entry for "numpy" remains (no duplicates) and run the
lock/build step to regenerate the lockfile so runtime installs include numpy.
@CodeRabbit review

✅ Actions performed: Review triggered.
Implement Manifest Chain Tamper Detection for cryptographically linked dataset manifests.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Incremental preprocessing with checkpoints to avoid a full restart after a halfway failure
Add malformed XML test and fix verify.py regression
Actionable comments posted: 3
♻️ Duplicate comments (1)
pyproject.toml (1)
23-28: ⚠️ Potential issue | 🟠 Major

Move `numpy` into core dependencies.

Right now it only lands in `dependency-groups.dev`, so regular installs still miss it. This shows whether `numpy` is listed under `[project].dependencies` or only under the dev group:

```shell
#!/bin/bash
rg -n -C2 '^\[project\]|^dependencies = \[|^\[dependency-groups\]|^dev = \[|^\s*"numpy"|^\s*"pytest>=|^\s*"ruff>=' pyproject.toml
```

Based on learnings: In `pyproject.toml` for this repository, keep `numpy` and `torch` as core (required) dependencies rather than optional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 23 - 28, The pyproject currently lists "numpy" only under the dev dependency group so normal installs won't get it; move "numpy" (and ensure "torch" if present) into the core dependencies list under [project].dependencies by removing "numpy" from the [dependency-groups] dev array and adding it to the dependencies = [ ... ] block under [project], keeping version constraints consistent with the dev entry; update the [dependency-groups].dev array to remove "numpy" and verify there are no duplicate entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 90-92: The local variable `hashable` is always a dict (derived
from `manifest_data`), so remove the redundant isinstance check and directly
remove the optional key: call `hashable.pop("parent_manifest_hash", None)` on
`hashable` (created from `manifest_data.copy()`) instead of guarding it with
`isinstance`; update any surrounding comments if present to reflect the
simplification in the logic within the manifest handling code.
In `@openverifiablellm/utils.py`:
- Around line 347-351: Persist the final checkpoint (the file referenced by
checkpoint_path and the in-memory pages_processed state) immediately before
calling generate_manifest(input_path, output_path) so that if
generate_manifest() fails the next run will resume from the correct
pages_processed value; update the flow in the block that currently checks
write_manifest to first write/save the checkpoint (or call the existing
checkpoint save routine) then call generate_manifest, and only unlink
checkpoint_path after the checkpoint has been successfully persisted and
manifest generation completes.
- Around line 291-345: The checkpoint saving calls pass the Path object
input_path to _save_checkpoint, causing JSON serialization errors; compute or
retrieve a string input_identity from the checkpoint (e.g., input_identity =
checkpoint.get("input_identity") or _compute_input_identity(input_path)) right
after _load_checkpoint in extract_text_from_xml, then replace the three calls
_save_checkpoint(checkpoint_path, pages_written, input_path) with
_save_checkpoint(checkpoint_path, pages_written, input_identity) (also update
the initial resume-save call that uses pages_written), ensuring _save_checkpoint
receives a str input_identity rather than a Path.
---
Duplicate comments:
In `@pyproject.toml`:
- Around line 23-28: The pyproject currently lists "numpy" only under the dev
dependency group so normal installs won't get it; move "numpy" (and ensure
"torch" if present) into the core dependencies list under [project].dependencies
by removing "numpy" from the [dependency-groups] dev array and adding it to the
dependencies = [ ... ] block under [project], keeping version constraints
consistent with the dev entry; update the [dependency-groups].dev array to
remove "numpy" and verify there are no duplicate entries.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: bd4df894-14b2-4eca-8b18-53229ed4dcbb
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
- openverifiablellm/manifest_chain.py
- openverifiablellm/tokenizer/bpe_tokenizer.py
- openverifiablellm/tokenizer/factory.py
- openverifiablellm/tokenizer/sentencepiece_tokenizer.py
- openverifiablellm/utils.py
- openverifiablellm/verify.py
- pyproject.toml
- scripts/download_dump.py
- scripts/preprocess_dump.py
- scripts/verify_dataset.py
- tests/test_manifest_chain.py
- tests/test_util.py
- tests/test_verify.py
💤 Files with no reviewable changes (2)
- openverifiablellm/tokenizer/factory.py
- openverifiablellm/tokenizer/bpe_tokenizer.py
```python
hashable = manifest_data.copy()
if isinstance(hashable, dict):
    hashable.pop("parent_manifest_hash", None)
```
🧹 Nitpick | 🔵 Trivial
Redundant type check.
At this point in the code flow, hashable is always a dict (either from the input dict or from json.load), making the isinstance(hashable, dict) check unnecessary.
♻️ Simplify by removing redundant check

```diff
 hashable = manifest_data.copy()
-if isinstance(hashable, dict):
-    hashable.pop("parent_manifest_hash", None)
+hashable.pop("parent_manifest_hash", None)
```

📝 Committable suggestion
+ hashable.pop("parent_manifest_hash", None)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-hashable = manifest_data.copy()
-if isinstance(hashable, dict):
-    hashable.pop("parent_manifest_hash", None)
+hashable = manifest_data.copy()
+hashable.pop("parent_manifest_hash", None)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/manifest_chain.py` around lines 90 - 92, The local variable
`hashable` is always a dict (derived from `manifest_data`), so remove the
redundant isinstance check and directly remove the optional key: call
`hashable.pop("parent_manifest_hash", None)` on `hashable` (created from
`manifest_data.copy()`) instead of guarding it with `isinstance`; update any
surrounding comments if present to reflect the simplification in the logic
within the manifest handling code.
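The nitpick above hinges on two `dict` facts worth making concrete: `copy()` produces an independent top-level dict, and `pop(key, None)` is a safe no-op when the key is absent, so neither the caller's data nor a guard clause is needed. A quick demonstration:

```python
# copy() gives an independent top-level dict, so popping on the copy
# never mutates the caller's manifest_data; pop(..., None) is a no-op
# when the key is absent, so no isinstance guard is required.
manifest_data = {"files": ["a.txt"], "parent_manifest_hash": "abc123"}
hashable = manifest_data.copy()
hashable.pop("parent_manifest_hash", None)
hashable.pop("not_a_key", None)  # absent key: no error

print("parent_manifest_hash" in hashable,
      "parent_manifest_hash" in manifest_data)  # False True
```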
```diff
 checkpoint_path = _checkpoint_path(output_dir)

 # Load checkpoint — tells us how many pages were already written
 checkpoint = _load_checkpoint(checkpoint_path, input_path, output_path)
 pages_already_done = checkpoint["pages_processed"]

 # If resuming, append to existing output; otherwise start fresh
 write_mode = "a" if pages_already_done > 0 else "w"

 # Auto-detect file type using magic bytes separation
 with open(input_path, "rb") as test_f:
     is_bz2 = test_f.read(3) == b"BZh"

 open_func = bz2.open if is_bz2 else open

-with open_func(input_path, "rb") as f:
-    context = ET.iterparse(f, events=("end",))
-
-    with open(output_path, "w", encoding="utf-8") as out:
-        for _, elem in context:
-            if elem.tag.endswith("page"):
-                text_elem = elem.find(".//{*}text")
-                if text_elem is not None and text_elem.text:
-                    cleaned = clean_wikitext(text_elem.text)
-                    if cleaned:
-                        out.write(cleaned + "\n\n")
+pages_seen = 0
+pages_written = pages_already_done
+
+try:
+    with open_func(input_path, "rb") as f:
+        context = ET.iterparse(f, events=("end",))
+
+        with open(output_path, write_mode, encoding="utf-8") as out:
+            for _, elem in context:
+                if elem.tag.endswith("page"):
+                    pages_seen += 1
+
+                    # Skip pages already processed in a previous run
+                    if pages_seen <= pages_already_done:
+                        elem.clear()
+                        continue
+
+                    text_elem = elem.find(".//{*}text")
+
+                    if text_elem is not None and text_elem.text:
+                        cleaned = clean_wikitext(text_elem.text)
+                        if cleaned:
+                            out.write(cleaned + "\n\n")
+
+                    pages_written += 1
+                    elem.clear()
+
+                    # Flush output and save checkpoint periodically
+                    if pages_written % CHECKPOINT_INTERVAL == 0:
+                        out.flush()
+                        _save_checkpoint(checkpoint_path, pages_written, input_path)
+except KeyboardInterrupt:
+    _save_checkpoint(checkpoint_path, pages_written, input_path)
+    logger.warning("Interrupted by user after %d pages. Run again to resume.", pages_written)
+    raise
+except Exception:
+    # Save progress before propagating the exception so the next run can resume
+    _save_checkpoint(checkpoint_path, pages_written, input_path)
+    logger.error("Processing interrupted after %d pages. Run again to resume.", pages_written)
+    raise
```
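The skip-and-resume logic in the diff above can be modeled with a toy version. Lists stand in for XML pages and the checkpoint file, and a simulated crash shows that resuming neither duplicates nor drops pages (names here mirror the diff but the code is illustrative, not the repo's):

```python
# Toy model of the checkpoint/resume loop: process a page stream, save
# progress every CHECKPOINT_INTERVAL pages, and on restart skip the pages
# already counted in the checkpoint.
CHECKPOINT_INTERVAL = 2
checkpoint = {"pages_processed": 0}
output = []

def process(pages, crash_after=None):
    pages_already_done = checkpoint["pages_processed"]
    pages_seen = 0
    pages_written = pages_already_done
    try:
        for page in pages:
            pages_seen += 1
            if pages_seen <= pages_already_done:
                continue                      # skip work from the previous run
            if crash_after is not None and pages_written >= crash_after:
                raise RuntimeError("simulated crash")
            output.append(page)
            pages_written += 1
            if pages_written % CHECKPOINT_INTERVAL == 0:
                checkpoint["pages_processed"] = pages_written
    except RuntimeError:
        checkpoint["pages_processed"] = pages_written  # save before propagating
        raise

pages = ["p1", "p2", "p3", "p4", "p5"]
try:
    process(pages, crash_after=3)             # first run dies after 3 pages
except RuntimeError:
    pass
process(pages)                                # resume run finishes the rest

print(output)  # ['p1', 'p2', 'p3', 'p4', 'p5'] — no duplicates, no gaps
```

Note that the model ends with `pages_processed` still at the last periodic save (4, not 5), which is exactly the stale-checkpoint window the final-checkpoint review comment below the diff is concerned with.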
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
python - <<'PY'
from pathlib import Path
import json
try:
    json.dumps({"input_identity": Path("dump.xml.bz2")})
except TypeError as exc:
    print(type(exc).__name__, exc)
PY
rg -n -C2 '_save_checkpoint\(checkpoint_path, pages_written, input_path\)|def _save_checkpoint|input_identity' openverifiablellm/utils.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1715

🏁 Script executed:

```shell
# Get the full extract_text_from_xml function definition
sed -n '270,350p' openverifiablellm/utils.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 3116

🏁 Script executed:

```shell
# Check what happens after line 345 (post-parse logic)
sed -n '345,360p' openverifiablellm/utils.py
```

Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 547
Pass the hashed input identity into `_save_checkpoint()` instead of the input path.
Lines 336, 338, and 343 pass `input_path` to `_save_checkpoint()`, but the function expects `input_identity: str` (line 234). Because `extract_text_from_xml()` normalizes `input_path` to a `Path` object at line 283, `json.dump()` fails to serialize the checkpoint with `TypeError: Object of type PosixPath is not JSON serializable`, and checkpoint persistence silently fails. This breaks resume functionality on subsequent runs.
Compute or retrieve the string identity once and pass it to all three calls:

```python
checkpoint = _load_checkpoint(checkpoint_path, input_path, output_path)
pages_already_done = checkpoint["pages_processed"]
input_identity = checkpoint.get("input_identity") or _compute_input_identity(input_path)
```

Then replace all three calls to use `input_identity` instead of `input_path`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 291 - 345, The checkpoint saving
calls pass the Path object input_path to _save_checkpoint, causing JSON
serialization errors; compute or retrieve a string input_identity from the
checkpoint (e.g., input_identity = checkpoint.get("input_identity") or
_compute_input_identity(input_path)) right after _load_checkpoint in
extract_text_from_xml, then replace the three calls
_save_checkpoint(checkpoint_path, pages_written, input_path) with
_save_checkpoint(checkpoint_path, pages_written, input_identity) (also update
the initial resume-save call that uses pages_written), ensuring _save_checkpoint
receives a str input_identity rather than a Path.
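The failure mode and the fix above are easy to reproduce in isolation. This sketch shows both halves: a `Path` value breaking `json.dumps`, and the string identity round-tripping cleanly (the checkpoint keys mirror the review, the values are illustrative):

```python
import json
from pathlib import Path

# Reproduce the reported failure: a Path value in the checkpoint dict is
# not JSON-serializable, so the checkpoint write raises TypeError.
try:
    json.dumps({"input_identity": Path("dump.xml.bz2")})
    path_serializable = True
except TypeError:
    path_serializable = False

# The fix: persist a plain-string identity, which round-trips cleanly.
blob = json.dumps({"input_identity": str(Path("dump.xml.bz2")),
                   "pages_processed": 5000})
restored = json.loads(blob)

print(path_serializable, restored["input_identity"])  # False dump.xml.bz2
```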
```python
# Processing finished successfully — remove checkpoint so a fresh
# re-run (if ever needed) starts from the beginning
if write_manifest:
    generate_manifest(input_path, output_path)
checkpoint_path.unlink(missing_ok=True)
```
Persist the final checkpoint before generate_manifest().
Line 349 can still fail after the last periodic save. When that happens, the next run resumes from a stale pages_processed value and re-appends the tail of wiki_clean.txt.
💡 Suggested fix

```diff
-    if write_manifest:
-        generate_manifest(input_path, output_path)
+    _save_checkpoint(checkpoint_path, pages_written, input_identity)
+    if write_manifest:
+        generate_manifest(input_path, output_path)
     checkpoint_path.unlink(missing_ok=True)
```

🤖 Prompt for AI Agents
checkpoint_path.unlink(missing_ok=True)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 347 - 351, Persist the final
checkpoint (the file referenced by checkpoint_path and the in-memory
pages_processed state) immediately before calling generate_manifest(input_path,
output_path) so that if generate_manifest() fails the next run will resume from
the correct pages_processed value; update the flow in the block that currently
checks write_manifest to first write/save the checkpoint (or call the existing
checkpoint save routine) then call generate_manifest, and only unlink
checkpoint_path after the checkpoint has been successfully persisted and
manifest generation completes.
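The "persist before the risky step, unlink only after success" ordering argued for above can be shown with stand-in functions (these are not the repo's real API, just a minimal model of the control flow):

```python
import json
import os
import tempfile

def save_checkpoint(path, pages):
    # Stand-in for _save_checkpoint: write progress to disk.
    with open(path, "w") as f:
        json.dump({"pages_processed": pages}, f)

def flaky_generate_manifest():
    # Stand-in for generate_manifest failing after processing finished.
    raise RuntimeError("manifest generation failed")

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
pages_written = 12345

save_checkpoint(path, pages_written)   # persist the final count first
try:
    flaky_generate_manifest()
    os.unlink(path)                    # remove only after full success
except RuntimeError:
    pass                               # crash: checkpoint file survives

with open(path) as f:
    print(json.load(f)["pages_processed"])  # 12345 — correct resume point
```

Had the checkpoint not been saved before the manifest step, the surviving file would hold the last periodic count instead, and the next run would re-append the tail of the output.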
Addressed Issues:
Fixes #77
Overview
The repository currently lacks unit tests for two utility functions:
- `create_tokenizer` (tests added in `tests/test_tokenizer.py`)
- `load_merkle_proof` (tests added in `tests/test_util.py`)
Changes Made
1. Export `create_tokenizer`
Added `create_tokenizer` to `openverifiablellm/tokenizer/__init__.py`. Enables clean imports:
2. Tests for `create_tokenizer` (`tests/test_tokenizer.py`)
3. Tests for `load_merkle_proof` (`tests/test_util.py`)
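The factory pattern that the new export forwards to can be sketched self-contained. The class here is a stand-in (the real `BPETokenizer` lives in the repo), and the `ValueError` for unknown types is added for the sketch's safety rather than taken from the repo's factory:

```python
# Stand-in tokenizer class; the real one is openverifiablellm's BPETokenizer.
class BPETokenizer:
    def __init__(self, vocab_size, min_frequency):
        self.vocab_size = vocab_size
        self.min_frequency = min_frequency

def create_tokenizer(tokenizer_type, vocab_size, min_frequency):
    # Normalize the type string so "BPE" and "bpe" are equivalent,
    # then dispatch to the matching tokenizer class.
    tokenizer_type = tokenizer_type.lower()
    if tokenizer_type == "bpe":
        return BPETokenizer(vocab_size, min_frequency)
    raise ValueError(f"unknown tokenizer type: {tokenizer_type}")

tok = create_tokenizer("BPE", vocab_size=8000, min_frequency=2)
print(type(tok).__name__, tok.vocab_size)  # BPETokenizer 8000
```

Unit tests for such a factory naturally cover the happy path per type, case-insensitivity, and the unknown-type branch.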
Screenshots/Recordings:
Additional Notes:
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Improvements
Tests
Chores