
[FIX] : Add Unit Tests for create_tokenizer and load_merkle_proof #78

Draft
Shubhamx404 wants to merge 36 commits into AOSSIE-Org:main from Shubhamx404:fix-tokenizer-tests

Conversation


@Shubhamx404 Shubhamx404 commented Mar 17, 2026

Addressed Issues:

Fixes #77

Overview

The repository currently lacks unit tests for two utility functions: create_tokenizer (covered in tests/test_tokenizer.py) and load_merkle_proof (covered in tests/test_util.py).

  • This gap creates blind spots: the code becomes harder to refactor safely, and bugs can be introduced without the test suite detecting them.

Solution

Changes Made

1. Export create_tokenizer

Added create_tokenizer to openverifiablellm/tokenizer/__init__.py, enabling clean imports:

from openverifiablellm.tokenizer import create_tokenizer

Tests for create_tokenizer (tests/test_tokenizer.py)

Added unit tests to validate factory behavior:

  • Returns BPETokenizer when tokenizer_type="bpe"
  • Returns SentencePieceTokenizer when tokenizer_type="sentencepiece"
  • Raises ValueError for unsupported tokenizer types
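These behaviors can be sketched with a self-contained stand-in for the factory (the real create_tokenizer, BPETokenizer, and SentencePieceTokenizer live in openverifiablellm.tokenizer; the constructor details below are illustrative only):

```python
# Illustrative stand-in for the factory under test; constructor details are
# invented here, only the three behaviours listed above are taken from the PR.
class BPETokenizer:
    def __init__(self, vocab_size, min_frequency):
        self.vocab_size = vocab_size
        self.min_frequency = min_frequency

class SentencePieceTokenizer(BPETokenizer):
    pass

def create_tokenizer(tokenizer_type, vocab_size, min_frequency):
    tokenizer_type = tokenizer_type.lower()
    if tokenizer_type == "bpe":
        return BPETokenizer(vocab_size, min_frequency)
    if tokenizer_type == "sentencepiece":
        return SentencePieceTokenizer(vocab_size, min_frequency)
    raise ValueError(f"Unsupported tokenizer type: {tokenizer_type}")

# The three behaviours as plain assertions (the real tests would use
# pytest and pytest.raises instead):
assert isinstance(create_tokenizer("bpe", 1000, 2), BPETokenizer)
assert isinstance(create_tokenizer("SentencePiece", 1000, 2), SentencePieceTokenizer)
try:
    create_tokenizer("wordpiece", 1000, 2)
    raise AssertionError("expected ValueError")
except ValueError:
    pass
```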

Tests for load_merkle_proof (tests/test_util.py)

  • Successfully loads valid JSON Merkle proof
  • Raises FileNotFoundError for missing files
  • Raises json.JSONDecodeError for malformed JSON
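A self-contained sketch of the same three behaviours; the stand-in load_merkle_proof here only mirrors the documented contract of the real helper (load JSON from a path and let FileNotFoundError / json.JSONDecodeError propagate):

```python
import json
import tempfile
from pathlib import Path

# Stand-in mirroring the documented contract of the package's load_merkle_proof;
# the real implementation lives in the openverifiablellm package.
def load_merkle_proof(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as tmp:
    # 1. Successfully loads a valid JSON Merkle proof.
    good = Path(tmp) / "proof.json"
    good.write_text('{"root": "abc123", "siblings": []}', encoding="utf-8")
    assert load_merkle_proof(good)["root"] == "abc123"

    # 2. Raises FileNotFoundError for missing files.
    try:
        load_merkle_proof(Path(tmp) / "missing.json")
        raise AssertionError("expected FileNotFoundError")
    except FileNotFoundError:
        pass

    # 3. Raises json.JSONDecodeError for malformed JSON.
    bad = Path(tmp) / "bad.json"
    bad.write_text("{not valid json", encoding="utf-8")
    try:
        load_merkle_proof(bad)
        raise AssertionError("expected JSONDecodeError")
    except json.JSONDecodeError:
        pass
```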

Screenshots/Recordings:

(Screenshot: ChatGPT Image Mar 17, 2026, 12_21_43 PM)

Additional Notes:

Checklist

  • [x] My code follows the project's code style and conventions
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings or errors
  • [x] I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • [x] I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Added manifest chain hashing & verification to detect tampering and link manifests
    • Checkpointable resuming for large text extraction and improved manifest linkage
    • New CLIs to preprocess dumps and verify datasets; previous-manifest support
  • Improvements

    • Exposed a tokenizer factory function and silenced a third‑party deprecation warning during import
    • Added raw-bytes SHA256 helper and centralized hashing
  • Tests

    • Comprehensive tests for manifest chain, tokenizer factory, utilities, and verification flows
  • Chores

    • Added numpy to dev dependencies

@github-actions github-actions bot added size/L and removed size/L labels Mar 17, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/tokenizer/__init__.py`:
- Line 1: The package-level import eagerly re-exports create_tokenizer which
causes SentencePieceTokenizer (and the optional sentencepiece dependency) to be
imported at module import time; change __init__.py to expose create_tokenizer
via a lazy wrapper that imports create_tokenizer only when called, and modify
factory.py so SentencePieceTokenizer (and any import of sentencepiece) is
imported lazily inside create_tokenizer() only when the sentencepiece-backed
branch is selected; update references to the factory-level create_tokenizer and
the class name SentencePieceTokenizer to ensure the dynamic import path is used.

In `@pyproject.toml`:
- Around line 37-39: There are duplicate TOML tables for
[tool.pytest.ini_options] causing a CI TOML parse error; locate the existing
[tool.pytest.ini_options] table and merge the new keys (e.g. filterwarnings =
[...]) into that single table, then remove this duplicate table entry so only
one [tool.pytest.ini_options] block remains in pyproject.toml.
- Line 26: Move "numpy" out of the development dependency group and add it to
the project's core dependencies: remove "numpy" from dependency-groups.dev and
add it under the main dependencies section (the project dependencies block in
pyproject.toml). Update any existing dependency listing to ensure only one entry
for "numpy" remains (no duplicates) and run the lock/build step to regenerate
the lockfile so runtime installs include numpy.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 54f61b50-1240-48c7-8bba-6da56fa2efee

📥 Commits

Reviewing files that changed from the base of the PR and between c352df0 and 05c6b3d.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • openverifiablellm/tokenizer/__init__.py
  • openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  • pyproject.toml
  • tests/test_tokenizer.py
  • tests/test_util.py
  • tests/test_verify.py

@@ -1,6 +1,8 @@
from .factory import create_tokenizer

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C2 --type=py 'from \.factory import create_tokenizer|from \.sentencepiece_tokenizer import SentencePieceTokenizer|import sentencepiece as spm'

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1413


🏁 Script executed:

cat -n openverifiablellm/tokenizer/factory.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 650


🏁 Script executed:

head -20 openverifiablellm/tokenizer/sentencepiece_tokenizer.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 649


🏁 Script executed:

cat pyproject.toml | head -50

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 738


🏁 Script executed:

cat setup.py 2>/dev/null || echo "setup.py not found"

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 90


🏁 Script executed:

rg -n 'create_tokenizer' --type=py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1406


Decouple optional sentencepiece dependency from package import.

The eager re-export in __init__.py pulls in SentencePieceTokenizer from factory.py, which in turn imports the optional sentencepiece library at module level. This breaks import openverifiablellm.tokenizer for users who only need BPE and don't have sentencepiece installed.

The fix requires lazy importing in two places:

  1. In __init__.py: Wrap the create_tokenizer re-export in a lazy import function
  2. In factory.py: Lazy-import SentencePieceTokenizer inside create_tokenizer() only when needed
Suggested fix

In openverifiablellm/tokenizer/__init__.py:

-from .factory import create_tokenizer
+def create_tokenizer(tokenizer_type, vocab_size, min_frequency):
+    from .factory import create_tokenizer as _create_tokenizer
+    return _create_tokenizer(tokenizer_type, vocab_size, min_frequency)

In openverifiablellm/tokenizer/factory.py:

-from .bpe_tokenizer import BPETokenizer
-from .sentencepiece_tokenizer import SentencePieceTokenizer
-
-
 def create_tokenizer(tokenizer_type, vocab_size, min_frequency):
+    from .bpe_tokenizer import BPETokenizer
 
     tokenizer_type = tokenizer_type.lower()
 
     if tokenizer_type == "bpe":
         return BPETokenizer(vocab_size, min_frequency)
 
     if tokenizer_type == "sentencepiece":
+        from .sentencepiece_tokenizer import SentencePieceTokenizer
         return SentencePieceTokenizer(vocab_size, min_frequency)
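The effect of that pattern can be demonstrated with a self-contained toy (the module and class names below are invented for illustration): importing the factory never touches the optional backend, which is only imported when its branch is actually selected.

```python
import sys
import types

# Simulate an optional dependency that is not installed at import time.
heavy = types.ModuleType("heavy_optional_dep")
heavy.Backend = type("Backend", (), {})

def create_tokenizer(kind):
    if kind == "light":
        return object()
    if kind == "heavy":
        # Imported only when this branch runs, so merely importing the
        # factory's module never requires the optional dependency.
        import heavy_optional_dep
        return heavy_optional_dep.Backend()
    raise ValueError(kind)

# The "light" branch works even though heavy_optional_dep is not importable yet.
assert create_tokenizer("light") is not None

# Once the optional dependency is available, the "heavy" branch works too.
sys.modules["heavy_optional_dep"] = heavy
assert isinstance(create_tokenizer("heavy"), heavy.Backend)
```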

dev = [
"pytest>=7.0",
"ruff>=0.15.4",
"numpy",

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import pathlib, tomllib
data = tomllib.loads(pathlib.Path("pyproject.toml").read_text())
print("project.dependencies =", data.get("project", {}).get("dependencies", []))
print("dependency-groups.dev =", data.get("dependency-groups", {}).get("dev", []))
PY

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 196


Move numpy to core dependencies instead of dev-only.

Per repository policy, numpy must be a core dependency for this LLM-focused project. Keeping it in dependency-groups.dev causes installation and runtime divergence in non-dev environments.
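Concretely, the suggested pyproject.toml layout would look something like this (version pins are illustrative, not taken from the repo):

```toml
[project]
dependencies = [
    "numpy",
]

[dependency-groups]
dev = [
    "pytest>=7.0",
    "ruff>=0.15.4",
]
```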


@Shubhamx404
Contributor Author

@CodeRabbit review


coderabbitai bot commented Mar 17, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@Shubhamx404 Shubhamx404 marked this pull request as draft March 17, 2026 17:57
@github-actions github-actions bot added size/XL and removed size/L labels Mar 18, 2026
@Shubhamx404 Shubhamx404 marked this pull request as ready for review March 18, 2026 14:34
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 18, 2026
@Shubhamx404 Shubhamx404 marked this pull request as draft March 18, 2026 14:40
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
pyproject.toml (1)

23-28: ⚠️ Potential issue | 🟠 Major

Move numpy into core dependencies.

Right now it only lands in dependency-groups.dev, so regular installs still miss it.

This shows whether numpy is listed under [project].dependencies or only under the dev group:

#!/bin/bash
rg -n -C2 '^\[project\]|^dependencies = \[|^\[dependency-groups\]|^dev = \[|^\s*"numpy"|^\s*"pytest>=|^\s*"ruff>=' pyproject.toml

Based on learnings: In pyproject.toml for this repository, keep numpy and torch as core (required) dependencies rather than optional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 23 - 28, The pyproject currently lists "numpy"
only under the dev dependency group so normal installs won't get it; move
"numpy" (and ensure "torch" if present) into the core dependencies list under
[project].dependencies by removing "numpy" from the [dependency-groups] dev
array and adding it to the dependencies = [ ... ] block under [project], keeping
version constraints consistent with the dev entry; update the
[dependency-groups].dev array to remove "numpy" and verify there are no
duplicate entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 90-92: The local variable `hashable` is always a dict (derived
from `manifest_data`), so remove the redundant isinstance check and directly
remove the optional key: call `hashable.pop("parent_manifest_hash", None)` on
`hashable` (created from `manifest_data.copy()`) instead of guarding it with
`isinstance`; update any surrounding comments if present to reflect the
simplification in the logic within the manifest handling code.

In `@openverifiablellm/utils.py`:
- Around line 347-351: Persist the final checkpoint (the file referenced by
checkpoint_path and the in-memory pages_processed state) immediately before
calling generate_manifest(input_path, output_path) so that if
generate_manifest() fails the next run will resume from the correct
pages_processed value; update the flow in the block that currently checks
write_manifest to first write/save the checkpoint (or call the existing
checkpoint save routine) then call generate_manifest, and only unlink
checkpoint_path after the checkpoint has been successfully persisted and
manifest generation completes.
- Around line 291-345: The checkpoint saving calls pass the Path object
input_path to _save_checkpoint, causing JSON serialization errors; compute or
retrieve a string input_identity from the checkpoint (e.g., input_identity =
checkpoint.get("input_identity") or _compute_input_identity(input_path)) right
after _load_checkpoint in extract_text_from_xml, then replace the three calls
_save_checkpoint(checkpoint_path, pages_written, input_path) with
_save_checkpoint(checkpoint_path, pages_written, input_identity) (also update
the initial resume-save call that uses pages_written), ensuring _save_checkpoint
receives a str input_identity rather than a Path.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bd4df894-14b2-4eca-8b18-53229ed4dcbb

📥 Commits

Reviewing files that changed from the base of the PR and between 3123770 and 890d77c.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
  • openverifiablellm/manifest_chain.py
  • openverifiablellm/tokenizer/bpe_tokenizer.py
  • openverifiablellm/tokenizer/factory.py
  • openverifiablellm/tokenizer/sentencepiece_tokenizer.py
  • openverifiablellm/utils.py
  • openverifiablellm/verify.py
  • pyproject.toml
  • scripts/download_dump.py
  • scripts/preprocess_dump.py
  • scripts/verify_dataset.py
  • tests/test_manifest_chain.py
  • tests/test_util.py
  • tests/test_verify.py
💤 Files with no reviewable changes (2)
  • openverifiablellm/tokenizer/factory.py
  • openverifiablellm/tokenizer/bpe_tokenizer.py

Comment on lines +90 to +92
hashable = manifest_data.copy()
if isinstance(hashable, dict):
    hashable.pop("parent_manifest_hash", None)

🧹 Nitpick | 🔵 Trivial

Redundant type check.

At this point in the code flow, hashable is always a dict (either from the input dict or from json.load), making the isinstance(hashable, dict) check unnecessary.

♻️ Simplify by removing redundant check
     hashable = manifest_data.copy()
-    if isinstance(hashable, dict):
-        hashable.pop("parent_manifest_hash", None)
+    hashable.pop("parent_manifest_hash", None)

Comment on lines +291 to +345
 checkpoint_path = _checkpoint_path(output_dir)

 # Load checkpoint — tells us how many pages were already written
 checkpoint = _load_checkpoint(checkpoint_path, input_path, output_path)
 pages_already_done = checkpoint["pages_processed"]

 # If resuming, append to existing output; otherwise start fresh
 write_mode = "a" if pages_already_done > 0 else "w"

 # Auto-detect file type using magic bytes separation
 with open(input_path, "rb") as test_f:
     is_bz2 = test_f.read(3) == b"BZh"

 open_func = bz2.open if is_bz2 else open

-with open_func(input_path, "rb") as f:
-    context = ET.iterparse(f, events=("end",))
-
-    with open(output_path, "w", encoding="utf-8") as out:
-        for _, elem in context:
-            if elem.tag.endswith("page"):
-                text_elem = elem.find(".//{*}text")
-
-                if text_elem is not None and text_elem.text:
-                    cleaned = clean_wikitext(text_elem.text)
-                    if cleaned:
-                        out.write(cleaned + "\n\n")
+pages_seen = 0
+pages_written = pages_already_done
+
+try:
+    with open_func(input_path, "rb") as f:
+        context = ET.iterparse(f, events=("end",))
+
+        with open(output_path, write_mode, encoding="utf-8") as out:
+            for _, elem in context:
+                if elem.tag.endswith("page"):
+                    pages_seen += 1
+
+                    # Skip pages already processed in a previous run
+                    if pages_seen <= pages_already_done:
+                        elem.clear()
+                        continue
+
+                    text_elem = elem.find(".//{*}text")
+
+                    if text_elem is not None and text_elem.text:
+                        cleaned = clean_wikitext(text_elem.text)
+                        if cleaned:
+                            out.write(cleaned + "\n\n")
+
+                        pages_written += 1
+                    elem.clear()
+
+                    # Flush output and save checkpoint periodically
+                    if pages_written % CHECKPOINT_INTERVAL == 0:
+                        out.flush()
+                        _save_checkpoint(checkpoint_path, pages_written, input_path)
+except KeyboardInterrupt:
+    _save_checkpoint(checkpoint_path, pages_written, input_path)
+    logger.warning("Interrupted by user after %d pages. Run again to resume.", pages_written)
+    raise
+except Exception:
+    # Save progress before propagating the exception so the next run can resume
+    _save_checkpoint(checkpoint_path, pages_written, input_path)
+    logger.error("Processing interrupted after %d pages. Run again to resume.", pages_written)
+    raise

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
from pathlib import Path
import json
try:
    json.dumps({"input_identity": Path("dump.xml.bz2")})
except TypeError as exc:
    print(type(exc).__name__, exc)
PY

rg -n -C2 '_save_checkpoint\(checkpoint_path, pages_written, input_path\)|def _save_checkpoint|input_identity' openverifiablellm/utils.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 1715


🏁 Script executed:

# Get the full extract_text_from_xml function definition
sed -n '270,350p' openverifiablellm/utils.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 3116


🏁 Script executed:

# Check what happens after line 345 (post-parse logic)
sed -n '345,360p' openverifiablellm/utils.py

Repository: AOSSIE-Org/OpenVerifiableLLM

Length of output: 547


Pass the hashed input identity into _save_checkpoint() instead of the input path.

Lines 336, 338, and 343 pass input_path to _save_checkpoint(), but the function expects input_identity: str (line 234). Because extract_text_from_xml() normalizes input_path to a Path object at line 283, json.dump() fails to serialize the checkpoint with TypeError: Object of type PosixPath is not JSON serializable, and checkpoint persistence silently fails. This breaks resume functionality on subsequent runs.

Compute or retrieve the string identity once and pass it to all three calls:

checkpoint = _load_checkpoint(checkpoint_path, input_path, output_path)
pages_already_done = checkpoint["pages_processed"]
input_identity = checkpoint.get("input_identity") or _compute_input_identity(input_path)

Then replace all three calls to use input_identity instead of input_path.


Comment on lines +347 to +351
# Processing finished successfully — remove checkpoint so a fresh
# re-run (if ever needed) starts from the beginning
if write_manifest:
    generate_manifest(input_path, output_path)
checkpoint_path.unlink(missing_ok=True)

⚠️ Potential issue | 🟠 Major

Persist the final checkpoint before generate_manifest().

Line 349 can still fail after the last periodic save. When that happens, the next run resumes from a stale pages_processed value and re-appends the tail of wiki_clean.txt.

💡 Suggested fix
-    if write_manifest:
-        generate_manifest(input_path, output_path)
+    _save_checkpoint(checkpoint_path, pages_written, input_identity)
+    if write_manifest:
+        generate_manifest(input_path, output_path)
     checkpoint_path.unlink(missing_ok=True)



Development

Successfully merging this pull request may close these issues.

[FEATURE]: Add Unit Tests for create_tokenizer and load_merkle_proof

3 participants