100 changes: 100 additions & 0 deletions experiments/README.md
@@ -0,0 +1,100 @@
## Experiments

⚠️ Potential issue | 🟡 Minor

Use a top-level heading (h1) as the first line.

The static analysis tool flagged that the first line should be a top-level heading. Change ## Experiments to # Experiments.

📝 Proposed fix
-## Experiments
+# Experiments
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 1-1: First line in a file should be a top-level heading

(MD041, first-line-heading, first-line-h1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/README.md` at line 1, Change the first line heading from a
level-2 to a level-1 heading by replacing "## Experiments" with "# Experiments"
so the README starts with a top-level h1; update the header text at the top of
the file (the "## Experiments" line) to "# Experiments".


This directory contains small reproducible experiments used to validate assumptions behind the **OpenVerifiableLLM deterministic training pipeline**.

The goal of these experiments is to verify that:

- preprocessing produces deterministic outputs
- dataset tampering can be detected using Merkle roots
- small reproducible datasets can be used for testing the pipeline

These experiments are **not part of the main pipeline**. They are intended for testing ideas and validating reproducibility guarantees.

---

## Directory Structure

experiments/
├── data_subset/
│   ├── sample_wiki_generate.py
│   ├── sample_wiki.xml.bz2
│   └── tampered_sample_wiki.xml.bz2
├── preprocessing_determinism/
│   └── test_preprocessing.py
├── merkle_verification/
│   └── test_merkle.py
└── README.md

---

## Included Experiments

### 1. Preprocessing Determinism

Verifies that running the preprocessing pipeline multiple times on the same dataset produces identical outputs.

The experiment compares:

- `processed_sha256`
- `processed_merkle_root`
- `environment_hash`

If these values match across runs, the preprocessing step is deterministic.

Run:

```bash
python -m experiments.preprocessing_determinism.test_preprocessing experiments/data_subset/sample_wiki.xml.bz2
```

**Expected Results** -

```bash
Run 1 hash: ...
Run 2 hash: ...

Deterministic preprocessing confirmed 🎉
```
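
As a minimal illustration of what this comparison checks, the sketch below hashes the output of a stand-in preprocessing function twice and compares the digests. The `preprocess` and `fingerprint` helpers are hypothetical and not part of OpenVerifiableLLM; the real experiment reads `processed_sha256` from the generated manifest instead.

```python
import hashlib


def preprocess(text: str) -> str:
    """Hypothetical stand-in for the real preprocessing step: collapse whitespace."""
    return " ".join(text.split())


def fingerprint(text: str) -> str:
    """Return the SHA-256 hex digest of the preprocessed text."""
    return hashlib.sha256(preprocess(text).encode("utf-8")).hexdigest()


sample = "Hello   world.\nThis is  a test."
run1 = fingerprint(sample)
run2 = fingerprint(sample)

# Identical digests across runs indicate the step is deterministic for this input.
print("Deterministic:", run1 == run2)
```

A step that embedded a timestamp or depended on unordered iteration would produce differing digests here, which is the kind of failure the experiment is meant to surface.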

### 2. Merkle Root Tamper Detection

Tests whether dataset tampering is detected by comparing Merkle roots.

Two datasets are used:

sample_wiki.xml.bz2 (original)

tampered_sample_wiki.xml.bz2 (modified)

The experiment compares:

raw_merkle_root

processed_merkle_root

Comment on lines +69 to +78

⚠️ Potential issue | 🟡 Minor

Format list items consistently using Markdown syntax.

Lines 69-78 are missing Markdown list formatting. They should use - or * prefixes for consistency with the rest of the document.

📝 Proposed fix
 Two datasets are used:
 
-sample_wiki.xml.bz2 (original)
+- `sample_wiki.xml.bz2` (original)
 
-tampered_sample_wiki.xml.bz2 (modified)
+- `tampered_sample_wiki.xml.bz2` (modified)
 
 The experiment compares:
 
-raw_merkle_root
+- `raw_merkle_root`
 
-processed_merkle_root
+- `processed_merkle_root`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/README.md` around lines 69 - 78, Convert the plain text items
into a proper Markdown list by prefixing each entry with a consistent list
marker (e.g., "- "): update "sample_wiki.xml.bz2 (original)",
"tampered_sample_wiki.xml.bz2 (modified)", and the two comparison keys
"raw_merkle_root" and "processed_merkle_root" so they appear as list items under
the sentence "The experiment compares:". Ensure the same marker is used for all
four lines to match the document's Markdown style.

If either root differs, the tampering is successfully detected.

Run:

```bash
python -m experiments.merkle_verification.test_merkle --path1 experiments/data_subset/sample_wiki.xml.bz2 --path2 experiments/data_subset/tampered_sample_wiki.xml.bz2
```

**Expected Results** -

```bash
Run 1 RAW Merkle root: ...
Run 2 RAW Merkle root: ...

Tampering detected 🎉
```
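
To show why a one-character edit changes the root, here is a small self-contained sketch of a Merkle root over fixed-size chunks. The chunk size, the use of SHA-256, and the rule of duplicating the last node on odd levels are assumptions made for illustration; the project's own Merkle construction is not shown in this PR and may differ.

```python
import hashlib


def merkle_root(data: bytes, chunk_size: int = 4) -> str:
    """Compute a Merkle root (hex) over fixed-size chunks of `data`."""
    # Split into chunks; an empty input still yields a single empty leaf.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    level = [hashlib.sha256(chunk).digest() for chunk in chunks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


original = b"Wikipedia is a free online encyclopedia."
tampered = b"Wikipedia is a free onlin encyclopedia."

print("Roots differ:", merkle_root(original) != merkle_root(tampered))
```

Because every parent hash depends on its children, a change in any leaf propagates to the root, so comparing roots is enough to detect the edit.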

### 3. Dataset Subset

The data_subset directory contains a minimal Wikipedia XML example used for quick experimentation without downloading full dumps.

This allows experiments to run quickly while still exercising the preprocessing pipeline.
20 changes: 20 additions & 0 deletions experiments/data_subset/sample_wiki_generate.py
@@ -0,0 +1,20 @@
import bz2

# To make this tampered I deleted e of online

⚠️ Potential issue | 🟡 Minor

Minor typo in comment.

📝 Proposed fix
-# To make this tampered I deleted e of online
+# To make this tampered, I deleted 'e' from "online"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/data_subset/sample_wiki_generate.py` at line 3, Fix the minor
typo in the inline comment in sample_wiki_generate.py: replace the current
comment "# To make this tampered I deleted e of online" with a clearer phrasing
such as "# To make this tampered I deleted the 'e' in 'online'." to correct
grammar and clarify intent; update the comment near the tampering logic or the
top of the file where this sentence appears.


xml_content = """<?xml version="1.0" encoding="UTF-8"?>
<mediawiki>
<page>
<revision>
<text>
Hello <ref>citation</ref> world.
This is [[Python|programming language]]
{{Wikipedia }}is a free onlin encyclopedia.
</text>
</revision>
</page>
</mediawiki>
"""

with bz2.open("experiments/data_subset/tampered_sample_wiki.xml.bz2", "wt", encoding="utf-8") as f:
    f.write(xml_content)
Comment on lines +19 to +20

🧹 Nitpick | 🔵 Trivial

Wrap file-writing logic in an if __name__ == "__main__" guard.

The script executes file I/O at module import time. This can cause unintended side effects if the module is imported elsewhere. Adding a main guard is a best practice for executable scripts.

♻️ Proposed fix
-with bz2.open("experiments/data_subset/tampered_sample_wiki.xml.bz2", "wt", encoding="utf-8") as f:
-    f.write(xml_content)
+if __name__ == "__main__":
+    with bz2.open("experiments/data_subset/tampered_sample_wiki.xml.bz2", "wt", encoding="utf-8") as f:
+        f.write(xml_content)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/data_subset/sample_wiki_generate.py` around lines 19 - 20, The
file currently performs file I/O (the with bz2.open(...) block that writes
xml_content to "experiments/data_subset/tampered_sample_wiki.xml.bz2") at import
time; move that write block into an if __name__ == "__main__": guard so the
write only runs when the script is executed directly, ensuring xml_content is
still defined or constructed above the guard (or by a helper function you call
from the guard) and preventing side effects on import.
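
One way to follow the "helper function" variant mentioned above is sketched here. The `write_sample` function and the shortened `XML_CONTENT` string are hypothetical names introduced for illustration; only `bz2.open` and the output path come from the script under review.

```python
import bz2
from pathlib import Path

# Shortened placeholder content; the real script embeds a fuller MediaWiki sample.
XML_CONTENT = (
    "<mediawiki><page><revision><text>Hello world.</text></revision></page></mediawiki>\n"
)


def write_sample(path: Path, content: str = XML_CONTENT) -> Path:
    """Write `content` to `path` as bz2-compressed UTF-8 text and return the path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with bz2.open(path, "wt", encoding="utf-8") as f:
        f.write(content)
    return path


if __name__ == "__main__":
    # File I/O now happens only when the script is executed directly.
    write_sample(Path("experiments/data_subset/tampered_sample_wiki.xml.bz2"))
```

Keeping the write in a function also lets a test call it with a temporary path instead of touching the checked-in sample files.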

63 changes: 63 additions & 0 deletions experiments/merkle_verification/test_merkle.py
@@ -0,0 +1,63 @@
import argparse
import json
import logging
from pathlib import Path

from openverifiablellm.utils import extract_text_from_xml

logger = logging.getLogger(__name__)

"""
Experiment: Tamper Detection via Merkle Root Comparison

Run with:
python -m experiments.merkle_verification.test_merkle --path1 experiments/data_subset/sample_wiki.xml.bz2 --path2 experiments/data_subset/tampered_sample_wiki.xml.bz2

"""
MANIFEST_PATH = Path("data/dataset_manifest.json")

def run(path1):
    """Run preprocessing and return processed Merkle root."""
    extract_text_from_xml(path1)

    #read genertaed manifest

⚠️ Potential issue | 🟡 Minor

Fix typo: "genertaed" → "generated".

📝 Proposed fix
-    #read genertaed manifest
+    # Read generated manifest
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/merkle_verification/test_merkle.py` at line 22, Update the
comment in experiments/merkle_verification/test_merkle.py that currently reads
"#read genertaed manifest" to correct the typo to "#read generated manifest" so
the inline comment in the test function (around the manifest-read logic)
accurately reflects the operation.

    with MANIFEST_PATH.open() as f:
        manifest = json.load(f)

    return {
        "raw_merkle_root": manifest["raw_merkle_root"],
        "processed_merkle_root": manifest["processed_merkle_root"]
    }

if __name__ == "__main__":

    parser= argparse.ArgumentParser(
        description= "Test tamper detection using Merkle root"
    )

    parser.add_argument("--path1",required=True,help="Original dataset")
    parser.add_argument("--path2",required=True,help="Tampered dataset")

    args= parser.parse_args()

    logging.basicConfig(
        level= logging.INFO,
        format="%(levelname)s - %(message)s"
    )
Comment on lines +34 to +46

🧹 Nitpick | 🔵 Trivial

Inconsistent spacing around = operators.

Several lines have inconsistent spacing around = which reduces readability.

✏️ Proposed fix
-    parser= argparse.ArgumentParser(
-        description= "Test tamper detection using Merkle root"
+    parser = argparse.ArgumentParser(
+        description="Test tamper detection using Merkle root"
     )
     
-    parser.add_argument("--path1",required=True,help="Original dataset")
-    parser.add_argument("--path2",required=True,help="Tampered dataset")
+    parser.add_argument("--path1", required=True, help="Original dataset")
+    parser.add_argument("--path2", required=True, help="Tampered dataset")
     
-    args= parser.parse_args()
+    args = parser.parse_args()
     
     logging.basicConfig(
-        level= logging.INFO,
+        level=logging.INFO,
         format="%(levelname)s - %(message)s"
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/merkle_verification/test_merkle.py` around lines 34 - 46, The
code has inconsistent spacing around assignment and keyword-argument equals
(e.g., "parser= argparse.ArgumentParser", "args= parser.parse_args", and the
dict-like kwargs in logging.basicConfig). Update these lines so spacing is
consistent: use a single space before and after assignment operators (parser =
argparse.ArgumentParser, args = parser.parse_args) and standard spacing for
keyword arguments in function calls (description="...", level=logging.INFO,
format="..."), touching the symbols parser, argparse.ArgumentParser,
parser.add_argument, args, parser.parse_args, and logging.basicConfig to locate
and fix them.


    root1 = run(args.path1)
    root2 = run(args.path2)

    print(f"\nRun 1 RAW Merkle root: {root1['raw_merkle_root']}")
    print(f"Run 2 RAW Merkle root: {root2['raw_merkle_root']}")

    print(f"\nRun 1 processed Merkle root: {root1['processed_merkle_root']}")
    print(f"Run 2 processed Merkle root: {root2['processed_merkle_root']}")

    if (
        root1["raw_merkle_root"] != root2["raw_merkle_root"]
        or root1["processed_merkle_root"] != root2["processed_merkle_root"]
    ):
        print("\nTampering detected 🎉 (Merkle roots differ)")
    else:
        print("\nUnexpected result ❌ (Merkle roots identical)")
57 changes: 57 additions & 0 deletions experiments/preprocessing_determinism/test_preprocessing.py
@@ -0,0 +1,57 @@
import json
import logging
import sys
from pathlib import Path

from openverifiablellm.utils import extract_text_from_xml

logger = logging.getLogger(__name__)

"""
Experiment to test Deterministic preprocessing, by compairing generated hash on 2 runs.

⚠️ Potential issue | 🟡 Minor

Fix typo: "compairing" → "comparing".

📝 Proposed fix
-Experiment to test Deterministic preprocessing, by compairing generated hash on 2 runs.
+Experiment to test deterministic preprocessing by comparing generated hash on 2 runs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/preprocessing_determinism/test_preprocessing.py` at line 10,
Update the descriptive comment in
experiments/preprocessing_determinism/test_preprocessing.py by fixing the typo:
replace "compairing" with "comparing" in the sentence "Experiment to test
Deterministic preprocessing, by compairing generated hash on 2 runs." so it
reads "Experiment to test Deterministic preprocessing, by comparing generated
hash on 2 runs." Ensure you edit the same comment or docstring where that phrase
appears.


Run with:
python -m experiments.preprocessing_determinism.test_preprocessing experiments/data_subset/sample_wiki.xml.bz2
"""
MANIFEST_PATH = Path("data/dataset_manifest.json")

def run(input_path):
    # Run preprocessing
    extract_text_from_xml(input_path)

    #read genertaed manifest

⚠️ Potential issue | 🟡 Minor

Fix typo: "genertaed" → "generated".

📝 Proposed fix
-    #read genertaed manifest
+    # Read generated manifest
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/preprocessing_determinism/test_preprocessing.py` at line 21, Fix
the typo in the inline comment "#read genertaed manifest" by changing it to "#
read generated manifest" (or similar correct phrasing) so the comment
referencing the manifest read is clear; locate the comment in
experiments/preprocessing_determinism/test_preprocessing.py (look for the exact
string "read genertaed manifest") and update it accordingly.

    with MANIFEST_PATH.open() as f:
        manifest = json.load(f)

    return {
        "processed_sha256": manifest["processed_sha256"],
        "processed_merkle_root": manifest["processed_merkle_root"],
        "environment_hash": manifest["environment_hash"],
    }
Comment on lines +19 to +30

⚠️ Potential issue | 🔴 Critical

Critical: Manifest is never written—experiment will fail or use stale data.

extract_text_from_xml defaults to write_manifest=False (see openverifiablellm/utils.py:159). Since no write_manifest=True is passed, the manifest file is never created/updated, and the subsequent read on line 23-24 will either raise FileNotFoundError or read stale data from a previous unrelated run.

🐛 Proposed fix
 def run(input_path):
     # Run preprocessing
-    extract_text_from_xml(input_path)
+    extract_text_from_xml(input_path, write_manifest=True)
     
-    #read genertaed manifest
+    # Read generated manifest
     with MANIFEST_PATH.open() as f:
         manifest = json.load(f)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/preprocessing_determinism/test_preprocessing.py` around lines 19
- 30, The test calls extract_text_from_xml(input_path) but that function
defaults to write_manifest=False so no manifest is produced; change the call to
pass write_manifest=True (i.e., extract_text_from_xml(input_path,
write_manifest=True)) so MANIFEST_PATH is created/updated before opening it,
ensuring the subsequent json.load uses the manifest just written (keep existing
MANIFEST_PATH usage and existing return keys).
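
Beyond passing `write_manifest=True`, the stale-data risk described above could be guarded against explicitly. The sketch below is one possible approach, not the project's implementation: `load_fresh_manifest` is a hypothetical helper that deletes any existing manifest, runs a caller-supplied producer, and fails loudly if no new manifest appears.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_fresh_manifest(manifest_path: Path, producer: Callable[[], object]) -> dict:
    """Remove any old manifest, run `producer`, then load the manifest it wrote."""
    manifest_path.unlink(missing_ok=True)  # drop stale data from earlier runs
    producer()
    if not manifest_path.exists():
        raise FileNotFoundError(
            f"{manifest_path} was not written; was write_manifest=True passed?"
        )
    with manifest_path.open() as f:
        return json.load(f)


# Demonstration with a stand-in producer that writes a tiny manifest.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset_manifest.json"
    path.write_text(json.dumps({"processed_sha256": "stale"}))
    manifest = load_fresh_manifest(
        path,
        lambda: path.write_text(json.dumps({"processed_sha256": "fresh"})),
    )
    print(manifest["processed_sha256"])  # prints: fresh
```

In the experiment this would presumably be called as `load_fresh_manifest(MANIFEST_PATH, lambda: extract_text_from_xml(input_path, write_manifest=True))`. The `run` helper in `test_merkle.py` reads the same manifest path after the same call, so the same guard would likely apply there.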



if __name__ == "__main__":

    if len(sys.argv) < 2:
        print("Usage: python -m experiments.preprocessing_determinism.test_preprocessing <input_dump>")
        sys.exit(1)

    logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s - %(message)s"
    )
Comment on lines +39 to +42

🧹 Nitpick | 🔵 Trivial

Fix inconsistent indentation.

The arguments to logging.basicConfig have inconsistent indentation relative to the function call.

📝 Proposed fix
     logging.basicConfig(
-    level=logging.INFO,
-    format="%(levelname)s - %(message)s"
+        level=logging.INFO,
+        format="%(levelname)s - %(message)s"
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/preprocessing_determinism/test_preprocessing.py` around lines 38
- 41, The call to logging.basicConfig has inconsistent indentation for its
arguments; align the keyword arguments (level=logging.INFO,
format="%(levelname)s - %(message)s") vertically under the function call so
indentation is consistent with surrounding code (e.g., indent the two argument
lines to match the opening parenthesis of logging.basicConfig). Locate the
logging.basicConfig call in test_preprocessing.py and adjust the argument
indentation so the block is consistently formatted.


    result1= run(sys.argv[1])
    result2= run(sys.argv[1])
Comment on lines +44 to +45

🧹 Nitpick | 🔵 Trivial

Add spaces around assignment operators for consistency.

Minor style: PEP 8 recommends spaces around = in assignments.

✏️ Proposed fix
-    result1= run(sys.argv[1])
-    result2= run(sys.argv[1])
+    result1 = run(sys.argv[1])
+    result2 = run(sys.argv[1])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/preprocessing_determinism/test_preprocessing.py` around lines 44
- 45, The two assignment statements use no spaces around the "="; update the
assignments to follow PEP8 by adding spaces around the operator for both
variables (result1 and result2) when calling run(sys.argv[1]). Locate the two
lines that invoke run and assign to result1 and result2 and insert spaces around
the "=" so the assignments read with standard spacing.


    print(f"\nRun 1 hash: {result1['processed_sha256']}")
    print(f"Run 2 hash: {result2['processed_sha256']}")

    if (
        result1["processed_sha256"] == result2["processed_sha256"]
        and result1["processed_merkle_root"] == result2["processed_merkle_root"]
        and result1["environment_hash"] == result2["environment_hash"]
    ):
        print("\nDeterministic preprocessing confirmed🎉")
    else:
        print("Hash didn't match❌")