
refactor: deduplicate _canonical_json into shared _hashing module#81

Open
chaitanyamedidar wants to merge 1 commit into AOSSIE-Org:main from chaitanyamedidar:fix/deduplicate-canonical-json

Conversation


@chaitanyamedidar chaitanyamedidar commented Mar 21, 2026

Addressed Issues:

Proactive refactor to eliminate code duplication, identified during a codebase audit.

Screenshots/Recordings:

Additional Notes:

_canonical_json was defined identically in both environment.py and
manifest_chain.py. This PR moves it into a new shared internal module,
openverifiablellm/_hashing.py, and updates both files to import it from there.

No logic, signatures, or behavior has changed. The function remains
importable from its original locations (both modules re-export it), so no
external callers are broken.

Pre-existing failing test: test_report_stores_input_dump_path fails on
Windows due to a short-path-name issue. This failure exists on main
without these changes and is unrelated to this PR.

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Introduced evaluation framework with abstract BaseEvaluator interface
    • Added PerplexityEvaluator for computing language model perplexity metrics
    • Added BiasEvaluator for bias benchmarks (WinoBias, BBQ)
    • Added BenchmarkEvaluator for standardized benchmarks (MMLU, TriviaQA)
  • Refactor

    • Extracted canonical JSON serialization into dedicated utility module
  • Tests

    • Added comprehensive test suite for evaluation framework


coderabbitai bot commented Mar 21, 2026

Walkthrough

This PR introduces a new evaluation framework for LLM assessment with abstract and concrete evaluator classes, adds deterministic perplexity computation, and refactors canonical JSON hashing into a centralized dependency-free module for reuse across components.

Changes

  • Hashing Module Centralization (openverifiablellm/_hashing.py, openverifiablellm/environment.py, openverifiablellm/manifest_chain.py): Extracted the _canonical_json helper into a new standalone module; updated environment.py and manifest_chain.py to import and use it instead of maintaining local implementations.
  • Evaluation Framework Core (openverifiablellm/eval/__init__.py, openverifiablellm/eval/base.py): Created a new eval package with a BaseEvaluator abstract base class and public re-exports of the concrete evaluators for use by consumers.
  • Perplexity Evaluator (openverifiablellm/eval/perplexity.py): Implemented sliding-window negative log-likelihood computation and a PerplexityEvaluator class with token-level metrics (perplexity, NLL, bits-per-byte) and a deterministic uniform model for testing.
  • Bias & Benchmark Evaluators (openverifiablellm/eval/bias.py, openverifiablellm/eval/benchmarks.py): Added BiasEvaluator and BenchmarkEvaluator stub classes with configuration validation; evaluate() methods raise NotImplementedError pending future integration.
  • Evaluation Test Suite (tests/test_eval.py): Comprehensive test coverage for evaluator interfaces, constructor validation, metric computation, edge cases, and abstract-base-class enforcement.
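The evaluator interface and stub pattern summarized above can be sketched as follows. Only the class names, the abstract evaluate() method, constructor validation, and the NotImplementedError behavior come from the summary; the constructor argument and the accepted benchmark names are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict


class BaseEvaluator(ABC):
    """Abstract interface every concrete evaluator implements."""

    @abstractmethod
    def evaluate(self, model, tokenizer) -> Dict[str, float]:
        """Run the evaluation and return a metric-name -> value mapping."""


class BiasEvaluator(BaseEvaluator):
    """Stub: validates configuration now, defers actual scoring to future work."""

    # Hypothetical argument/benchmark names -- the real ones may differ.
    def __init__(self, benchmark: str):
        if benchmark not in {"winobias", "bbq"}:
            raise ValueError(f"unknown bias benchmark: {benchmark}")
        self.benchmark = benchmark

    def evaluate(self, model, tokenizer) -> Dict[str, float]:
        raise NotImplementedError("bias benchmark integration pending")
```

Because evaluate() is abstract, BaseEvaluator itself cannot be instantiated, which is presumably what the test suite's "abstract base class enforcement" cases check.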

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Test
    participant PE as PerplexityEvaluator
    participant Tokenizer
    participant Model
    participant Metrics as Metric Computation

    User->>PE: evaluate(model, tokenizer)
    PE->>Tokenizer: tokenize(text)
    Tokenizer-->>PE: token_ids
    
    PE->>PE: _sliding_window_nll(token_ids, model, ...)
    
    loop For each overlapping window
        PE->>Model: model(window_tokens)
        Model-->>PE: [log_prob_1, log_prob_2, ...]
        PE->>PE: accumulate NLL for new tokens
    end
    
    PE->>Metrics: compute mean_nll, perplexity, bits_per_byte
    Metrics-->>PE: Dict[str, float] metrics
    PE-->>User: {perplexity, nll_bits_per_byte, n_tokens, n_bytes}
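The loop in the diagram can be sketched as below, assuming the model is a plain callable returning one log-probability per position (the real signatures in perplexity.py may differ). Window and stride values are illustrative; each window only accumulates NLL for positions not already scored, and the first token is never scored because it has no left context.

```python
import math
from typing import Callable, Dict, List


def sliding_window_nll(
    token_ids: List[int],
    model: Callable[[List[int]], List[float]],
    window: int = 4,
    stride: int = 2,
) -> Dict[str, float]:
    total_nll = 0.0
    n_scored = 0
    scored_upto = 1  # token 0 has no left context, so it is never scored
    for start in range(0, max(1, len(token_ids) - 1), stride):
        chunk = token_ids[start : start + window]
        log_probs = model(chunk)  # one log-prob per position in the window
        for offset in range(len(chunk)):
            pos = start + offset
            if pos >= scored_upto:  # skip positions a previous window covered
                total_nll -= log_probs[offset]
                n_scored += 1
        scored_upto = max(scored_upto, start + len(chunk))
        if start + window >= len(token_ids):
            break
    mean_nll = total_nll / n_scored if n_scored else 0.0
    return {
        "perplexity": math.exp(mean_nll),
        "mean_nll": mean_nll,
        "n_tokens": n_scored,  # count of scored tokens
    }
```

With the deterministic uniform model (every token gets log-probability log(1/V)), the mean NLL is log(V) and the perplexity is exactly the vocabulary size, which makes the metric easy to assert in tests.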

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 A hash centralized, now shared with glee,
Evaluators sprouting like carrots in rows,
Perplexity sliding through tokens so free,
While bias and benchmarks prepare their shows,
The framework now grows! 📊✨

🚥 Pre-merge checks | ✅ 2 passed

  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately describes the main refactoring, deduplicating the _canonical_json function into a shared _hashing module.

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 21, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/eval/perplexity.py`:
- Around line 210-217: The no-scored-token branch returns
n_tokens=len(token_ids) which contradicts the method contract expecting the
number of scored tokens (n_scored); update the returned dict in that branch to
set "n_tokens": n_scored (i.e., 0) instead of len(token_ids) and keep other
fields unchanged so callers receive the scored-token count consistently; locate
the branch using the variables n_scored and token_ids in the perplexity
computation function in perplexity.py to make this change.

In `@tests/test_eval.py`:
- Around line 208-211: Replace the duplicated uniform log-prob calculation in
TestSlidingWindowNll._uniform_model with a call to the existing module helper
uniform_log_probs: change the body of _uniform_model to return
uniform_log_probs(vocab_size) (keeping the same signature) so the test reuses
the shared helper instead of recreating lp logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ab3f7132-a41e-46ad-bc37-c7791dfb061b

📥 Commits

Reviewing files that changed from the base of the PR and between 578bc79 and 0d73682.

📒 Files selected for processing (9)
  • openverifiablellm/_hashing.py
  • openverifiablellm/environment.py
  • openverifiablellm/eval/__init__.py
  • openverifiablellm/eval/base.py
  • openverifiablellm/eval/benchmarks.py
  • openverifiablellm/eval/bias.py
  • openverifiablellm/eval/perplexity.py
  • openverifiablellm/manifest_chain.py
  • tests/test_eval.py

Comment on lines +210 to +217

        if n_scored == 0:
            # Edge case: single-token corpus — nothing to score.
            logger.warning("No tokens were scored (corpus too short); returning perplexity=1.0")
            return {
                "perplexity": 1.0,
                "nll_bits_per_byte": 0.0,
                "n_tokens": len(token_ids),
                "n_bytes": len(self.text.encode("utf-8")),

⚠️ Potential issue | 🟡 Minor

n_tokens is inconsistent in the no-scored-token edge case.

At Line 216, this branch reports input token count instead of scored token count, which conflicts with the method contract/docstring.

💡 Proposed fix
         if n_scored == 0:
             # Edge case: single-token corpus — nothing to score.
             logger.warning("No tokens were scored (corpus too short); returning perplexity=1.0")
             return {
                 "perplexity": 1.0,
                 "nll_bits_per_byte": 0.0,
-                "n_tokens": len(token_ids),
+                "n_tokens": 0,
                 "n_bytes": len(self.text.encode("utf-8")),
             }

Comment on lines +208 to +211

    class TestSlidingWindowNll:
        def _uniform_model(self, vocab_size=VOCAB_SIZE):
            lp = math.log(1.0 / vocab_size)
            return lambda ids: [lp] * len(ids)

🧹 Nitpick | 🔵 Trivial

Consider reusing the module-level uniform_log_probs helper.

The _uniform_model method duplicates the logic of uniform_log_probs defined at lines 43-50. You could simplify by reusing the existing helper:

def _uniform_model(self, vocab_size=VOCAB_SIZE):
    return uniform_log_probs(vocab_size)
