refactor: deduplicate _canonical_json into shared _hashing module #81
chaitanyamedidar wants to merge 1 commit into AOSSIE-Org:main from
Conversation
Walkthrough

This PR introduces a new evaluation framework for LLM assessment with abstract and concrete evaluator classes, adds deterministic perplexity computation, and refactors canonical JSON hashing into a centralized, dependency-free module for reuse across components.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/Test
    participant PE as PerplexityEvaluator
    participant Tokenizer
    participant Model
    participant Metrics as Metric Computation
    User->>PE: evaluate(model, tokenizer)
    PE->>Tokenizer: tokenize(text)
    Tokenizer-->>PE: token_ids
    PE->>PE: _sliding_window_nll(token_ids, model, ...)
    loop For each overlapping window
        PE->>Model: model(window_tokens)
        Model-->>PE: [log_prob_1, log_prob_2, ...]
        PE->>PE: accumulate NLL for new tokens
    end
    PE->>Metrics: compute mean_nll, perplexity, bits_per_byte
    Metrics-->>PE: Dict[str, float] metrics
    PE-->>User: {perplexity, nll_bits_per_byte, n_tokens, n_bytes}
```
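The flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's implementation: the function names `sliding_window_nll` and `perplexity_metrics`, the `window`/`stride` parameters, and the callable `model` interface (returning one log-probability per token, as in the diagram) are all invented for the example.

```python
import math

def sliding_window_nll(token_ids, model, window=4, stride=2):
    """Accumulate negative log-likelihood, scoring each token exactly once.

    `model(ids)` is assumed to return one log-prob per token in `ids`.
    Overlapping windows provide context; only tokens past the overlap
    with the previous window add to the running total.
    """
    total_nll, n_scored, start = 0.0, 0, 0
    while start < len(token_ids):
        log_probs = model(token_ids[start:start + window])
        # The first (window - stride) tokens of a non-initial window were
        # already scored by the previous window, so skip them here.
        new_from = 0 if start == 0 else window - stride
        for lp in log_probs[new_from:]:
            total_nll += -lp
            n_scored += 1
        if start + window >= len(token_ids):
            break
        start += stride
    return total_nll, n_scored

def perplexity_metrics(total_nll, n_scored, n_bytes):
    # perplexity = exp(mean NLL in nats); bits-per-byte converts the total
    # NLL to base 2 and normalizes by the corpus size in bytes.
    mean_nll = total_nll / n_scored
    return {
        "perplexity": math.exp(mean_nll),
        "nll_bits_per_byte": total_nll / (math.log(2) * n_bytes),
        "n_tokens": n_scored,
        "n_bytes": n_bytes,
    }
```

One nice property of this setup, which the test suite exploits, is that a uniform model over a vocabulary of size V assigns every token probability 1/V, so the perplexity comes out to exactly V in closed form.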
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/eval/perplexity.py`:
- Around line 210-217: The no-scored-token branch returns
n_tokens=len(token_ids) which contradicts the method contract expecting the
number of scored tokens (n_scored); update the returned dict in that branch to
set "n_tokens": n_scored (i.e., 0) instead of len(token_ids) and keep other
fields unchanged so callers receive the scored-token count consistently; locate
the branch using the variables n_scored and token_ids in the perplexity
computation function in perplexity.py to make this change.
In `@tests/test_eval.py`:
- Around line 208-211: Replace the duplicated uniform log-prob calculation in
TestSlidingWindowNll._uniform_model with a call to the existing module helper
uniform_log_probs: change the body of _uniform_model to return
uniform_log_probs(vocab_size) (keeping the same signature) so the test reuses
the shared helper instead of recreating lp logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ab3f7132-a41e-46ad-bc37-c7791dfb061b
📒 Files selected for processing (9)
- openverifiablellm/_hashing.py
- openverifiablellm/environment.py
- openverifiablellm/eval/__init__.py
- openverifiablellm/eval/base.py
- openverifiablellm/eval/benchmarks.py
- openverifiablellm/eval/bias.py
- openverifiablellm/eval/perplexity.py
- openverifiablellm/manifest_chain.py
- tests/test_eval.py
```python
if n_scored == 0:
    # Edge case: single-token corpus — nothing to score.
    logger.warning("No tokens were scored (corpus too short); returning perplexity=1.0")
    return {
        "perplexity": 1.0,
        "nll_bits_per_byte": 0.0,
        "n_tokens": len(token_ids),
        "n_bytes": len(self.text.encode("utf-8")),
```
n_tokens is inconsistent in the no-scored-token edge case.
At Line 216, this branch reports input token count instead of scored token count, which conflicts with the method contract/docstring.
💡 Proposed fix

```diff
 if n_scored == 0:
     # Edge case: single-token corpus — nothing to score.
     logger.warning("No tokens were scored (corpus too short); returning perplexity=1.0")
     return {
         "perplexity": 1.0,
         "nll_bits_per_byte": 0.0,
-        "n_tokens": len(token_ids),
+        "n_tokens": 0,
         "n_bytes": len(self.text.encode("utf-8")),
     }
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if n_scored == 0:
    # Edge case: single-token corpus — nothing to score.
    logger.warning("No tokens were scored (corpus too short); returning perplexity=1.0")
    return {
        "perplexity": 1.0,
        "nll_bits_per_byte": 0.0,
        "n_tokens": 0,
        "n_bytes": len(self.text.encode("utf-8")),
```
```python
class TestSlidingWindowNll:
    def _uniform_model(self, vocab_size=VOCAB_SIZE):
        lp = math.log(1.0 / vocab_size)
        return lambda ids: [lp] * len(ids)
```
🧹 Nitpick | 🔵 Trivial
Consider reusing the module-level uniform_log_probs helper.
The _uniform_model method duplicates the logic of uniform_log_probs defined at lines 43-50. You could simplify by reusing the existing helper:
```python
def _uniform_model(self, vocab_size=VOCAB_SIZE):
    return uniform_log_probs(vocab_size)
```
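For context, the module-level helper the reviewer refers to presumably looks something like the sketch below. Its exact body and the `VOCAB_SIZE` value are assumptions inferred from the duplicated test code shown above, not the repository's actual source.

```python
import math

VOCAB_SIZE = 50  # assumed test-module constant

def uniform_log_probs(vocab_size=VOCAB_SIZE):
    # Stand-in "model": assigns every token the same probability
    # 1/vocab_size, so the expected perplexity is exactly vocab_size.
    lp = math.log(1.0 / vocab_size)
    return lambda ids: [lp] * len(ids)
```

With a helper like this, `_uniform_model` reduces to a one-line delegation, which is exactly the deduplication the nitpick suggests.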
Addressed Issues:
proactive refactor to eliminate code duplication
identified during codebase audit
Screenshots/Recordings:
Additional Notes:
`_canonical_json` was defined identically in both `environment.py` and `manifest_chain.py`. This PR moves it into a new shared internal module `openverifiablellm/_hashing.py` and updates both files to import from there.

No logic, signatures, or behavior has changed. The function remains importable from its original locations via namespace, so no external callers are broken.
Pre-existing failing test:
`test_report_stores_input_dump_path` fails on Windows due to a short-path name issue. This failure exists on main without these changes and is unrelated to this PR.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
- `BaseEvaluator` interface
- `PerplexityEvaluator` for computing language model perplexity metrics
- `BiasEvaluator` for bias benchmarks (WinoBias, BBQ)
- `BenchmarkEvaluator` for standardized benchmarks (MMLU, TriviaQA)

Refactor
Tests