refactor: deduplicate _canonical_json into shared _hashing module #81
chaitanyamedidar wants to merge 1 commit into AOSSIE-Org:main from
Conversation
Walkthrough

This PR introduces a new evaluation framework for LLM assessment with abstract and concrete evaluator classes, adds deterministic perplexity computation, and refactors canonical JSON hashing into a centralized, dependency-free module for reuse across components.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/Test
    participant PE as PerplexityEvaluator
    participant Tokenizer
    participant Model
    participant Metrics as Metric Computation
    User->>PE: evaluate(model, tokenizer)
    PE->>Tokenizer: tokenize(text)
    Tokenizer-->>PE: token_ids
    PE->>PE: _sliding_window_nll(token_ids, model, ...)
    loop For each overlapping window
        PE->>Model: model(window_tokens)
        Model-->>PE: [log_prob_1, log_prob_2, ...]
        PE->>PE: accumulate NLL for new tokens
    end
    PE->>Metrics: compute mean_nll, perplexity, bits_per_byte
    Metrics-->>PE: Dict[str, float] metrics
    PE-->>User: {perplexity, nll_bits_per_byte, n_tokens, n_bytes}
```
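The flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's implementation: the function names `sliding_window_nll` and `perplexity_metrics`, the `window`/`stride` parameters, and the callable `model` interface (returning one log-probability per token, as in the diagram) are all invented for the example.

```python
import math

def sliding_window_nll(token_ids, model, window=4, stride=2):
    """Accumulate negative log-likelihood, scoring each token exactly once.

    `model(ids)` is assumed to return one log-prob per token in `ids`.
    Overlapping windows provide context; only tokens past the overlap
    with the previous window add to the running total.
    """
    total_nll, n_scored, start = 0.0, 0, 0
    while start < len(token_ids):
        log_probs = model(token_ids[start:start + window])
        # The first (window - stride) tokens of a non-initial window were
        # already scored by the previous window, so skip them here.
        new_from = 0 if start == 0 else window - stride
        for lp in log_probs[new_from:]:
            total_nll += -lp
            n_scored += 1
        if start + window >= len(token_ids):
            break
        start += stride
    return total_nll, n_scored

def perplexity_metrics(total_nll, n_scored, n_bytes):
    # perplexity = exp(mean NLL in nats); bits-per-byte converts the total
    # NLL to base 2 and normalizes by the corpus size in bytes.
    mean_nll = total_nll / n_scored
    return {
        "perplexity": math.exp(mean_nll),
        "nll_bits_per_byte": total_nll / (math.log(2) * n_bytes),
        "n_tokens": n_scored,
        "n_bytes": n_bytes,
    }
```

One nice property of this setup, which the test suite exploits, is that a uniform model over a vocabulary of size V assigns every token probability 1/V, so the perplexity comes out to exactly V in closed form.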
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/eval/perplexity.py`:
- Around line 210-217: The no-scored-token branch returns
n_tokens=len(token_ids) which contradicts the method contract expecting the
number of scored tokens (n_scored); update the returned dict in that branch to
set "n_tokens": n_scored (i.e., 0) instead of len(token_ids) and keep other
fields unchanged so callers receive the scored-token count consistently; locate
the branch using the variables n_scored and token_ids in the perplexity
computation function in perplexity.py to make this change.
In `@tests/test_eval.py`:
- Around line 208-211: Replace the duplicated uniform log-prob calculation in
TestSlidingWindowNll._uniform_model with a call to the existing module helper
uniform_log_probs: change the body of _uniform_model to return
uniform_log_probs(vocab_size) (keeping the same signature) so the test reuses
the shared helper instead of recreating lp logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ab3f7132-a41e-46ad-bc37-c7791dfb061b
📒 Files selected for processing (9)
- openverifiablellm/_hashing.py
- openverifiablellm/environment.py
- openverifiablellm/eval/__init__.py
- openverifiablellm/eval/base.py
- openverifiablellm/eval/benchmarks.py
- openverifiablellm/eval/bias.py
- openverifiablellm/eval/perplexity.py
- openverifiablellm/manifest_chain.py
- tests/test_eval.py
```python
if n_scored == 0:
    # Edge case: single-token corpus — nothing to score.
    logger.warning("No tokens were scored (corpus too short); returning perplexity=1.0")
    return {
        "perplexity": 1.0,
        "nll_bits_per_byte": 0.0,
        "n_tokens": len(token_ids),
        "n_bytes": len(self.text.encode("utf-8")),
```
n_tokens is inconsistent in the no-scored-token edge case.
At Line 216, this branch reports input token count instead of scored token count, which conflicts with the method contract/docstring.
💡 Proposed fix

```diff
 if n_scored == 0:
     # Edge case: single-token corpus — nothing to score.
     logger.warning("No tokens were scored (corpus too short); returning perplexity=1.0")
     return {
         "perplexity": 1.0,
         "nll_bits_per_byte": 0.0,
-        "n_tokens": len(token_ids),
+        "n_tokens": 0,
         "n_bytes": len(self.text.encode("utf-8")),
     }
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if n_scored == 0:
    # Edge case: single-token corpus — nothing to score.
    logger.warning("No tokens were scored (corpus too short); returning perplexity=1.0")
    return {
        "perplexity": 1.0,
        "nll_bits_per_byte": 0.0,
        "n_tokens": 0,
        "n_bytes": len(self.text.encode("utf-8")),
```
```python
class TestSlidingWindowNll:
    def _uniform_model(self, vocab_size=VOCAB_SIZE):
        lp = math.log(1.0 / vocab_size)
        return lambda ids: [lp] * len(ids)
```
🧹 Nitpick | 🔵 Trivial
Consider reusing the module-level uniform_log_probs helper.
The _uniform_model method duplicates the logic of uniform_log_probs defined at lines 43-50. You could simplify by reusing the existing helper:
```python
def _uniform_model(self, vocab_size=VOCAB_SIZE):
    return uniform_log_probs(vocab_size)
```
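For context, the module-level helper the reviewer refers to presumably looks something like the sketch below. Its exact body and the `VOCAB_SIZE` value are assumptions inferred from the duplicated test code shown above, not the repository's actual source.

```python
import math

VOCAB_SIZE = 50  # assumed test-module constant

def uniform_log_probs(vocab_size=VOCAB_SIZE):
    # Stand-in "model": assigns every token the same probability
    # 1/vocab_size, so the expected perplexity is exactly vocab_size.
    lp = math.log(1.0 / vocab_size)
    return lambda ids: [lp] * len(ids)
```

With a helper like this, `_uniform_model` reduces to a one-line delegation, which is exactly the deduplication the nitpick suggests.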
Addressed Issues:
proactive refactor to eliminate code duplication
identified during codebase audit
Screenshots/Recordings:
Additional Notes:
`_canonical_json` was defined identically in both `environment.py` and `manifest_chain.py`. This PR moves it into a new shared internal module `openverifiablellm/_hashing.py` and updates both files to import from there.

No logic, signatures, or behavior has changed. The function remains importable from its original locations via namespace, so no external callers are broken.
Pre-existing failing test:
`test_report_stores_input_dump_path` fails on Windows due to a short-path name issue. This failure exists on main without these changes and is unrelated to this PR.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
- `BaseEvaluator` interface
- `PerplexityEvaluator` for computing language model perplexity metrics
- `BiasEvaluator` for bias benchmarks (WinoBias, BBQ)
- `BenchmarkEvaluator` for standardized benchmarks (MMLU, TriviaQA)

Refactor
Tests