Feat/wikipedia factual evaluator #84
chaitanyamedidar wants to merge 7 commits into AOSSIE-Org:main from
Conversation
Walkthrough
Adds an evaluation framework: a BaseEvaluator protocol and two implementations — PerplexityEvaluator (dataset streaming, perplexity calculations) and WikipediaFactualEvaluator (counterfactual generation from wiki text and perplexity-based scoring). Exports are updated, a dependency is added, and tests are included.
Sequence Diagram(s)

sequenceDiagram
participant Client
participant PerplexityEvaluator
participant Dataset as "HuggingFace Dataset"
participant Tokenizer
participant Model
Client->>PerplexityEvaluator: evaluate(model, tokenizer)
PerplexityEvaluator->>Dataset: load(benchmark, split, streaming=True)
Dataset-->>PerplexityEvaluator: stream rows
loop up to n_samples
PerplexityEvaluator->>Tokenizer: encode(text)
Tokenizer-->>PerplexityEvaluator: token_ids
PerplexityEvaluator->>Model: forward(token_ids windowed)
Model-->>PerplexityEvaluator: logits
PerplexityEvaluator->>PerplexityEvaluator: compute log-probs, aggregate NLL
end
PerplexityEvaluator-->>Client: {"perplexity": mean_ppl}
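The "compute log-probs, aggregate NLL" step in the diagram can be sketched in plain Python. The function name `sentence_perplexity` and the list-of-lists model interface are illustrative assumptions for this sketch, not the PR's actual API:

```python
import math
from typing import Callable, List

def sentence_perplexity(
    model: Callable[[List[int]], List[List[float]]],
    token_ids: List[int],
) -> float:
    """Perplexity = exp(mean negative log-likelihood of each next token)."""
    if len(token_ids) < 2:
        return float("inf")  # nothing to predict
    logits = model(token_ids)  # one row of raw scores per input position
    nll_sum = 0.0
    for pos in range(len(token_ids) - 1):
        row = logits[pos]
        # log-softmax via the log-sum-exp trick for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        nll_sum += log_z - row[token_ids[pos + 1]]
    return math.exp(nll_sum / (len(token_ids) - 1))
```

A sanity check: a model that is uniform over a vocabulary of size V has perplexity exactly V.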
sequenceDiagram
participant Client
participant WikiEvaluator as "WikipediaFactualEvaluator"
participant FS as "Wiki Text File"
participant Tokenizer
participant Perplexity as "PerplexityEvaluator"
participant Model
Client->>WikiEvaluator: evaluate(model, tokenizer)
WikiEvaluator->>FS: read(wiki_clean.txt)
FS-->>WikiEvaluator: raw text
WikiEvaluator->>WikiEvaluator: _extract_passages() -> [{original, counterfactual}, ...]
loop for each pair
WikiEvaluator->>Tokenizer: encode(original)
Tokenizer-->>WikiEvaluator: ids_orig
WikiEvaluator->>Tokenizer: encode(counterfactual)
Tokenizer-->>WikiEvaluator: ids_cf
WikiEvaluator->>Perplexity: compute_sentence_perplexity(model, ids_orig)
Perplexity->>Model: forward(ids_orig)
Model-->>Perplexity: logits
Perplexity-->>WikiEvaluator: factual_ppl
WikiEvaluator->>Perplexity: compute_sentence_perplexity(model, ids_cf)
Perplexity->>Model: forward(ids_cf)
Model-->>Perplexity: logits
Perplexity-->>WikiEvaluator: counterfactual_ppl
end
WikiEvaluator-->>Client: {"factual_perplexity": fp, "counterfactual_perplexity": cp, "factual_score": cp-fp}
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/eval/base.py`:
- Around line 7-24: Add typing.Protocol-based type hints for the evaluate method
to improve IDE/static analysis: define Protocols (e.g., Model and Tokenizer)
describing the callable model (__call__(input_ids: list[int]) ->
list[list[float]]) and the tokenizer (encode(text: str) -> list[int]) and import
required types from typing/typing_extensions; then update the abstractmethod
signature of evaluate to accept model: Model and tokenizer: Tokenizer and return
dict. Ensure the Protocol definitions and imports are placed near the top of
openverifiablellm/eval/base.py and referenced in the evaluate method signature.
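A sketch of the suggested Protocol-based typing. The names `Model` and `Tokenizer` come from the comment above; the `runtime_checkable` decorator is an optional extra (not asked for by the review) so the protocols can also be checked with isinstance:

```python
from abc import ABC, abstractmethod
from typing import List, Protocol, runtime_checkable

@runtime_checkable
class Model(Protocol):
    """Anything callable as model(input_ids) -> (seq_len, vocab) scores."""
    def __call__(self, input_ids: List[int]) -> List[List[float]]: ...

@runtime_checkable
class Tokenizer(Protocol):
    def encode(self, text: str) -> List[int]: ...

class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(self, model: Model, tokenizer: Tokenizer) -> dict:
        """Run the benchmark and return a metrics dict."""
        raise NotImplementedError
```

Because these are structural types, existing callers need no changes: any object with a matching `encode` method already satisfies `Tokenizer`.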
In `@openverifiablellm/eval/factual/factual_consistency.py`:
- Around line 85-95: The replacement can accidentally match substrings; instead
of str.replace(found_entity, substitute, 1) use a regex substitution anchored to
word boundaries so only the whole found_entity token is replaced: build a
pattern using re.escape(found_entity) wrapped with r'\b...\b' and call
re.sub(pattern, substitute, sentence, count=1). Update the block that uses
_ENTITY_RE, matches, found_entity, substitute, and candidate_entities to perform
this regex-based replacement (ensure re is imported where this module uses it).
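The substring hazard and the boundary-anchored fix can be demonstrated directly; the helper name and the entity strings below are made up for illustration:

```python
import re

def replace_entity(sentence: str, found_entity: str, substitute: str) -> str:
    """Replace only a whole-word occurrence of found_entity (first match).
    re.escape protects entities that contain regex metacharacters."""
    pattern = r"\b" + re.escape(found_entity) + r"\b"
    return re.sub(pattern, substitute, sentence, count=1)
```

With plain `str.replace`, replacing "India" in a sentence that begins with "Indiana" corrupts the longer word; the anchored pattern skips it and hits the standalone token.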
- Around line 217-235: The score calculation can produce inf/nan when
compute_sentence_perplexity returns non-finite values; in the loop over pairs
(referencing pairs, tokenizer.encode,
PerplexityEvaluator.compute_sentence_perplexity and the lists factual_ppls,
counterfactual_ppls, score_diffs) check math.isfinite(factual_ppl) and
math.isfinite(cf_ppl) before appending or computing cf_ppl - factual_ppl and
skip that pair if either is non-finite; after the loop handle the case n==0 by
returning sane defaults (e.g., float("nan") for each metric) to avoid division
by zero and propagation of nan.
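The filtering described above can be sketched as a pure function over (factual, counterfactual) perplexity pairs; the helper name is hypothetical:

```python
import math
from typing import List, Tuple

def aggregate_scores(ppl_pairs: List[Tuple[float, float]]) -> dict:
    """Skip pairs with non-finite perplexities; return NaN metrics when
    nothing survives, instead of dividing by zero or propagating inf."""
    factual, counterfactual, diffs = [], [], []
    for f_ppl, cf_ppl in ppl_pairs:
        if not (math.isfinite(f_ppl) and math.isfinite(cf_ppl)):
            continue  # e.g. inf from a sentence shorter than 2 tokens
        factual.append(f_ppl)
        counterfactual.append(cf_ppl)
        diffs.append(cf_ppl - f_ppl)
    n = len(diffs)
    if n == 0:
        nan = float("nan")
        return {"factual_perplexity": nan,
                "counterfactual_perplexity": nan,
                "factual_score": nan}
    return {"factual_perplexity": sum(factual) / n,
            "counterfactual_perplexity": sum(counterfactual) / n,
            "factual_score": sum(diffs) / n}
```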
In `@openverifiablellm/eval/perplexity.py`:
- Around line 201-215: The code currently hardcodes load_dataset(self.benchmark,
split="test", streaming=True) which can fail for benchmarks without a "test"
split; update the logic in perplexity.py to accept a configurable split (e.g.,
self.split) and/or implement a safe fallback sequence (try "test", then
"validation", then "train") when calling hf_datasets.load_dataset, and raise a
clear error if none exist; modify the call site where ds is created and ensure
downstream logic using ds (the loop that uses tokenizer.encode and
self.n_samples and compute_sequence_perplexity) remains unchanged.
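One way to sketch the suggested fallback. The loader is injected as a callable so the logic stays independent of hf_datasets; `load_with_split_fallback` is a hypothetical helper name:

```python
from typing import Callable, Iterable, Optional, Sequence

def load_with_split_fallback(
    load: Callable[[str], Iterable],
    split: Optional[str] = None,
    fallbacks: Sequence[str] = ("test", "validation", "train"),
) -> Iterable:
    """Use the configured split if given; otherwise try the fallbacks in
    order and raise a clear error when none of them exists."""
    candidates = [split] if split else list(fallbacks)
    errors = []
    for name in candidates:
        try:
            return load(name)
        except Exception as exc:  # unknown splits surface as exceptions
            errors.append(f"{name}: {exc}")
    raise ValueError("No usable split found. Tried " + "; ".join(errors))
```

At the call site, `load` would wrap the existing call, e.g. a lambda over `hf_datasets.load_dataset(self.benchmark, split=s, streaming=True)`, leaving the downstream loop unchanged.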
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4348df36-40bf-47e6-a433-89ccbef3f87e
📒 Files selected for processing (7)
- openverifiablellm/eval/__init__.py
- openverifiablellm/eval/base.py
- openverifiablellm/eval/factual/__init__.py
- openverifiablellm/eval/factual/factual_consistency.py
- openverifiablellm/eval/perplexity.py
- pyproject.toml
- tests/test_factual_eval.py
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/eval/factual/factual_consistency.py`:
- Around line 205-206: The evaluate() function currently calls random.seed(42)
which mutates global RNG; instead create a local RNG like rng =
random.Random(42), remove the global seed call, and pass this rng into
_extract_passages and any downstream substitution logic so all random operations
use rng (e.g., replace random.sample/random.choice calls inside
_extract_passages, substitution functions, or helper methods with
rng.sample/rng.choice). Update signatures for _extract_passages and any helper
functions to accept an rng parameter (defaulting to None or Random() if needed)
and thread it through to ensure determinism without touching global state.
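The local-RNG pattern is easy to verify: a seeded `random.Random` instance gives reproducible draws without ever touching the module-level generator. The helper below is a simplified stand-in for the substitution logic:

```python
import random
from typing import List, Optional

def pick_substitutes(
    entities: List[str], n: int, rng: Optional[random.Random] = None
) -> List[str]:
    """Sample candidate entities deterministically when given a seeded local
    RNG; the process-wide random module state is never mutated."""
    rng = rng or random.Random()
    return rng.sample(entities, n)
```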
In `@openverifiablellm/eval/perplexity.py`:
- Around line 125-161: The function compute_sequence_perplexity uses stride as
the loop step but does not validate it; add an early check at the top of
compute_sequence_perplexity to ensure stride is an integer >= 1 (e.g., if not
isinstance(stride, int) or stride < 1: raise ValueError("stride must be an
integer >= 1")), so zero or negative values (and non-integers) are rejected
before the for start in range(...) is executed.
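A minimal sketch of that guard; without it, stride=0 makes `range()` raise a confusing error deep inside the loop and negative strides silently produce invalid windows. The function name is hypothetical:

```python
def validate_stride(stride: int) -> int:
    """Reject zero, negative, and non-integer strides before they reach
    range(start, stop, stride), which raises for 0 and misbehaves for < 0."""
    if not isinstance(stride, int) or stride < 1:
        raise ValueError("stride must be an integer >= 1")
    return stride
```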
- Around line 221-224: The loop currently hardcodes row.get("text") so add a
configurable parameter text_field (default "text") to the surrounding function
(the perplexity evaluator function that contains the loop over ds) and use
row.get(text_field, "") instead; validate after iterating a sample or the
dataset that at least one non-empty text was found and raise a clear
configuration error (or log and return) if none are present; propagate the new
text_field parameter through any public wrapper/CLI callers so benchmarks using
"content", "body", "article", etc. work correctly and update any related tests
to pass the alternative field name.
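The field-extraction-with-validation idea can be sketched as a standalone helper (the name `extract_texts` is an assumption, not the evaluator's actual structure):

```python
from typing import Iterable, List

def extract_texts(rows: Iterable[dict], text_field: str = "text") -> List[str]:
    """Pull non-empty strings from the configured field; fail loudly when
    the field name does not match the dataset instead of silently yielding
    nothing and letting the evaluator report inf."""
    texts = [t for t in (row.get(text_field, "") for row in rows) if t.strip()]
    if not texts:
        raise ValueError(
            f"No non-empty '{text_field}' values found; check the dataset's "
            "column name (e.g. 'content', 'body', 'article')."
        )
    return texts
```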
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0e4a5eda-8185-4706-ab84-9727eeedaeeb
📒 Files selected for processing (3)
- openverifiablellm/eval/base.py
- openverifiablellm/eval/factual/factual_consistency.py
- openverifiablellm/eval/perplexity.py
random.seed(42)
pairs = self._extract_passages(self.wiki_text_path, self.n_samples)
Avoid mutating global RNG state in evaluate().
random.seed(42) resets process-wide randomness and can affect unrelated code paths. Use a local random.Random(42) instance and thread it through substitution logic.
Proposed fix

-@staticmethod
-def _substitute_entity(sentence: str, candidate_entities: List[str]) -> Optional[str]:
+@staticmethod
+def _substitute_entity(
+    sentence: str, candidate_entities: List[str], rng: random.Random
+) -> Optional[str]:
@@
-    substitute = random.choice(alternatives)
+    substitute = rng.choice(alternatives)
@@
 def _extract_passages(
     wiki_text_path: Union[str, Path],
     n_samples: Optional[int],
+    rng: random.Random,
 ) -> List[dict]:
@@
-    counterfactual = WikipediaFactualEvaluator._substitute_entity(
-        sentence, all_entities
-    )
+    counterfactual = WikipediaFactualEvaluator._substitute_entity(
+        sentence, all_entities, rng
+    )
@@
-    random.seed(42)
-    pairs = self._extract_passages(self.wiki_text_path, self.n_samples)
+    rng = random.Random(42)
+    pairs = self._extract_passages(self.wiki_text_path, self.n_samples, rng)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/eval/factual/factual_consistency.py` around lines 205 -
206, The evaluate() function currently calls random.seed(42) which mutates
global RNG; instead create a local RNG like rng = random.Random(42), remove the
global seed call, and pass this rng into _extract_passages and any downstream
substitution logic so all random operations use rng (e.g., replace
random.sample/random.choice calls inside _extract_passages, substitution
functions, or helper methods with rng.sample/rng.choice). Update signatures for
_extract_passages and any helper functions to accept an rng parameter
(defaulting to None or Random() if needed) and thread it through to ensure
determinism without touching global state.
def compute_sequence_perplexity(model, token_ids: List[int], stride: int = 512) -> float:
    """
    Compute perplexity over a (possibly long) sequence using non-overlapping
    stride-sized windows.

    The sequence is partitioned into windows of *stride* tokens. Each
    window contributes its token predictions to a pooled NLL. The final
    perplexity is ``exp(total_NLL / total_scored_tokens)``.

    For sequences shorter than *stride* + 1 tokens the result is
    identical to :meth:`compute_sentence_perplexity`.

    Parameters
    ----------
    model : callable
        ``model(input_ids) -> 2-D sequence`` of shape
        ``(len(input_ids), vocab_size)``.
    token_ids : list[int]
        Tokenised sequence.
    stride : int
        Number of tokens scored per window. Default ``512``.

    Returns
    -------
    float
        Perplexity (≥ 1). Returns ``float("inf")`` for sequences
        shorter than 2 tokens.
    """
    if len(token_ids) < 2:
        return float("inf")

    nll_sum = 0.0
    n_scored = 0
    n = len(token_ids)

    for start in range(0, n - 1, stride):
        end = min(start + stride + 1, n)
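To check the windowing invariant (each of the n - 1 next-token targets is scored exactly once), the spans produced by this loop can be enumerated; `window_bounds` is a hypothetical helper that mirrors the loop above:

```python
from typing import List, Tuple

def window_bounds(n_tokens: int, stride: int) -> List[Tuple[int, int]]:
    """(start, end) spans used by the strided loop. Window (start, end)
    scores the targets token_ids[start+1:end], i.e. end - start - 1 tokens,
    so consecutive windows never score the same target twice."""
    return [(s, min(s + stride + 1, n_tokens))
            for s in range(0, n_tokens - 1, stride)]
```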
Validate stride before using it as a window step.
stride is user-configurable; stride=0 raises at runtime and negative values yield invalid scoring behavior. Guard it early.
Proposed fix

 def __init__(
     self,
     benchmark: str = "wikitext",
     n_samples: Optional[int] = 50,
     stride: int = 512,
     split: Optional[str] = None,
 ):
+    if stride <= 0:
+        raise ValueError("stride must be a positive integer.")
     self.benchmark = benchmark
     self.n_samples = n_samples
     self.stride = stride
     self.split = split
@@
 @staticmethod
 def compute_sequence_perplexity(model, token_ids: List[int], stride: int = 512) -> float:
+    if stride <= 0:
+        raise ValueError("stride must be a positive integer.")
     if len(token_ids) < 2:
         return float("inf")

🤖 Prompt for AI Agents
return float("inf")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/eval/perplexity.py` around lines 125 - 161, The function
compute_sequence_perplexity uses stride as the loop step but does not validate
it; add an early check at the top of compute_sequence_perplexity to ensure
stride is an integer >= 1 (e.g., if not isinstance(stride, int) or stride < 1:
raise ValueError("stride must be an integer >= 1")), so zero or negative values
(and non-integers) are rejected before the for start in range(...) is executed.
for row in ds:
    text = row.get("text", "")
    if not text.strip():
        continue
🧩 Analysis chain

🏁 Script executed:
#!/bin/bash
# Verify actual row keys for the configured benchmark/split(s).
# Update BENCHMARK as needed.
python - <<'PY'
import datasets
BENCHMARK = "wikitext"
splits = ["test", "validation", "train"]
for s in splits:
    try:
        ds = datasets.load_dataset(BENCHMARK, split=s, streaming=True)
        row = next(iter(ds))
        print(f"{s}: keys={list(row.keys())}")
    except Exception as e:
        print(f"{s}: unavailable ({type(e).__name__}: {e})")
PY
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 192

🏁 Script executed:
# First, locate the file and examine it
git ls-files | grep perplexity.py
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 108

🏁 Script executed:
# Read the perplexity.py file to see the implementation
wc -l openverifiablellm/eval/perplexity.py
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 112

🏁 Script executed:
# Read the entire file with line numbers
cat -n openverifiablellm/eval/perplexity.py
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 9490

🏁 Script executed:
# Search for test files related to perplexity
find . -name "*test*" -type f | grep -i perplex
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 54

🏁 Script executed:
# Search for usage of PerplexityEvaluator
rg "PerplexityEvaluator" -n
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 1158

🏁 Script executed:
# Check the test file to see how PerplexityEvaluator is being used
cat -n tests/test_factual_eval.py | head -100
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 3445

🏁 Script executed:
# Check factual_consistency.py for how evaluate() is used
cat -n openverifiablellm/eval/factual/factual_consistency.py | head -250
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 11115

🏁 Script executed:
# Search for calls to .evaluate() method that might use the "text" field
rg "\.evaluate\(" -n | grep -i perplex
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 54

🏁 Script executed:
# Search for dataset loading patterns and benchmark usage
rg "load_dataset" -n -A 2 -B 2
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 895

🏁 Script executed:
# Check if there's any documentation or README about supported benchmarks
find . -name "README*" -o -name "*.md" | head -20
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 136

🏁 Script executed:
# Check README for information about supported benchmarks
cat -n README.md | head -150
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 5896

🏁 Script executed:
# Search for any test cases or examples that use different benchmarks
rg "benchmark" -n openverifiablellm/eval/ | head -30
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 884

🏁 Script executed:
# Check if there are any other references to field names or text extraction
rg "get\(" openverifiablellm/eval/perplexity.py
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 110

🏁 Script executed:
# Search for any comments or documentation about dataset field names
rg "field" -n openverifiablellm/eval/
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 54
Add configurable text_field parameter to support benchmarks with different field names.
Line 222 hardcodes the "text" field name when extracting text from dataset rows. For Hugging Face datasets that use different field names (e.g., content, body, article), all rows are silently skipped and the evaluator returns inf instead of surfacing the misconfiguration.
Proposed fix

 def __init__(
     self,
     benchmark: str = "wikitext",
     n_samples: Optional[int] = 50,
     stride: int = 512,
     split: Optional[str] = None,
+    text_field: str = "text",
 ):
     self.benchmark = benchmark
     self.n_samples = n_samples
     self.stride = stride
     self.split = split
+    self.text_field = text_field
@@
-    text = row.get("text", "")
+    text = row.get(self.text_field, "")
     if not text.strip():
         continue

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Existing code:

for row in ds:
    text = row.get("text", "")
    if not text.strip():
        continue

Suggested code:

def __init__(
    self,
    benchmark: str = "wikitext",
    n_samples: Optional[int] = 50,
    stride: int = 512,
    split: Optional[str] = None,
    text_field: str = "text",
):
    self.benchmark = benchmark
    self.n_samples = n_samples
    self.stride = stride
    self.split = split
    self.text_field = text_field

# ... in the evaluate() method:
for row in ds:
    text = row.get(self.text_field, "")
    if not text.strip():
        continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/eval/perplexity.py` around lines 221 - 224, The loop
currently hardcodes row.get("text") so add a configurable parameter text_field
(default "text") to the surrounding function (the perplexity evaluator function
that contains the loop over ds) and use row.get(text_field, "") instead;
validate after iterating a sample or the dataset that at least one non-empty
text was found and raise a clear configuration error (or log and return) if none
are present; propagate the new text_field parameter through any public
wrapper/CLI callers so benchmarks using "content", "body", "article", etc. work
correctly and update any related tests to pass the alternative field name.
Addressed Issues:
Implement Wikipedia-Based Factual Consistency Evaluator
implements the factual accuracy evaluation metric listed as a success criterion in the project specification
Screenshots/Recordings:
Additional Notes:
The project specification requires evaluating trained models on factual accuracy. Unlike existing factual benchmarks, this evaluator uses Wikipedia itself as the source of truth, directly consistent with the project's core requirement that Wikipedia is the sole trusted data source.
How it works:
- Reads wiki_clean.txt produced by the existing pipeline

Why this approach over existing benchmarks:
Determinism:
random.seed(42) is applied inside evaluate() before any entity selection, ensuring fully reproducible evaluation results, consistent with the project's core reproducibility requirement.

Files changed:
- openverifiablellm/eval/factual/factual_consistency.py: WikipediaFactualEvaluator
- openverifiablellm/eval/factual/__init__.py: exports WikipediaFactualEvaluator
- openverifiablellm/eval/base.py: abstract BaseEvaluator
- openverifiablellm/eval/perplexity.py: PerplexityEvaluator with shared compute_sentence_perplexity static method
- openverifiablellm/eval/__init__.py: exports all evaluators
- tests/test_factual_eval.py: 10 tests, all passing, no network calls, includes determinism test
- pyproject.toml: added datasets as runtime dependency

Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit