Feat/wikipedia factual evaluator #84
chaitanyamedidar wants to merge 7 commits into AOSSIE-Org:main from
Conversation
Walkthrough
Adds an evaluation framework: a BaseEvaluator protocol and two implementations — PerplexityEvaluator (dataset streaming, perplexity calculations) and WikipediaFactualEvaluator (counterfactual generation from wiki text and perplexity-based scoring). Exports are updated, a dependency is added, and tests are included.
Sequence Diagram(s)

sequenceDiagram
participant Client
participant PerplexityEvaluator
participant Dataset as "HuggingFace Dataset"
participant Tokenizer
participant Model
Client->>PerplexityEvaluator: evaluate(model, tokenizer)
PerplexityEvaluator->>Dataset: load(benchmark, split, streaming=True)
Dataset-->>PerplexityEvaluator: stream rows
loop up to n_samples
PerplexityEvaluator->>Tokenizer: encode(text)
Tokenizer-->>PerplexityEvaluator: token_ids
PerplexityEvaluator->>Model: forward(token_ids windowed)
Model-->>PerplexityEvaluator: logits
PerplexityEvaluator->>PerplexityEvaluator: compute log-probs, aggregate NLL
end
PerplexityEvaluator-->>Client: {"perplexity": mean_ppl}
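The "compute log-probs, aggregate NLL" step in the diagram can be sketched in plain Python. The function name `sentence_perplexity` and the list-of-lists model interface are illustrative assumptions for this sketch, not the PR's actual API:

```python
import math
from typing import Callable, List

def sentence_perplexity(
    model: Callable[[List[int]], List[List[float]]],
    token_ids: List[int],
) -> float:
    """Perplexity = exp(mean negative log-likelihood of each next token)."""
    if len(token_ids) < 2:
        return float("inf")  # nothing to predict
    logits = model(token_ids)  # one row of raw scores per input position
    nll_sum = 0.0
    for pos in range(len(token_ids) - 1):
        row = logits[pos]
        # log-softmax via the log-sum-exp trick for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        nll_sum += log_z - row[token_ids[pos + 1]]
    return math.exp(nll_sum / (len(token_ids) - 1))
```

A sanity check: a model that is uniform over a vocabulary of size V has perplexity exactly V.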
sequenceDiagram
participant Client
participant WikiEvaluator as "WikipediaFactualEvaluator"
participant FS as "Wiki Text File"
participant Tokenizer
participant Perplexity as "PerplexityEvaluator"
participant Model
Client->>WikiEvaluator: evaluate(model, tokenizer)
WikiEvaluator->>FS: read(wiki_clean.txt)
FS-->>WikiEvaluator: raw text
WikiEvaluator->>WikiEvaluator: _extract_passages() -> [{original, counterfactual}, ...]
loop for each pair
WikiEvaluator->>Tokenizer: encode(original)
Tokenizer-->>WikiEvaluator: ids_orig
WikiEvaluator->>Tokenizer: encode(counterfactual)
Tokenizer-->>WikiEvaluator: ids_cf
WikiEvaluator->>Perplexity: compute_sentence_perplexity(model, ids_orig)
Perplexity->>Model: forward(ids_orig)
Model-->>Perplexity: logits
Perplexity-->>WikiEvaluator: factual_ppl
WikiEvaluator->>Perplexity: compute_sentence_perplexity(model, ids_cf)
Perplexity->>Model: forward(ids_cf)
Model-->>Perplexity: logits
Perplexity-->>WikiEvaluator: counterfactual_ppl
end
WikiEvaluator-->>Client: {"factual_perplexity": fp, "counterfactual_perplexity": cp, "factual_score": cp-fp}
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/eval/base.py`:
- Around line 7-24: Add typing.Protocol-based type hints for the evaluate method
to improve IDE/static analysis: define Protocols (e.g., Model and Tokenizer)
describing the callable model (__call__(input_ids: list[int]) ->
list[list[float]]) and the tokenizer (encode(text: str) -> list[int]) and import
required types from typing/typing_extensions; then update the abstractmethod
signature of evaluate to accept model: Model and tokenizer: Tokenizer and return
dict. Ensure the Protocol definitions and imports are placed near the top of
openverifiablellm/eval/base.py and referenced in the evaluate method signature.
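A sketch of the suggested Protocol-based typing. The names `Model` and `Tokenizer` come from the comment above; the `runtime_checkable` decorator is an optional extra (not asked for by the review) so the protocols can also be checked with isinstance:

```python
from abc import ABC, abstractmethod
from typing import List, Protocol, runtime_checkable

@runtime_checkable
class Model(Protocol):
    """Anything callable as model(input_ids) -> (seq_len, vocab) scores."""
    def __call__(self, input_ids: List[int]) -> List[List[float]]: ...

@runtime_checkable
class Tokenizer(Protocol):
    def encode(self, text: str) -> List[int]: ...

class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(self, model: Model, tokenizer: Tokenizer) -> dict:
        """Run the benchmark and return a metrics dict."""
        raise NotImplementedError
```

Because these are structural types, existing callers need no changes: any object with a matching `encode` method already satisfies `Tokenizer`.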
In `@openverifiablellm/eval/factual/factual_consistency.py`:
- Around line 85-95: The replacement can accidentally match substrings; instead
of str.replace(found_entity, substitute, 1) use a regex substitution anchored to
word boundaries so only the whole found_entity token is replaced: build a
pattern using re.escape(found_entity) wrapped with r'\b...\b' and call
re.sub(pattern, substitute, sentence, count=1). Update the block that uses
_ENTITY_RE, matches, found_entity, substitute, and candidate_entities to perform
this regex-based replacement (ensure re is imported where this module uses it).
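The substring hazard and the boundary-anchored fix can be demonstrated directly; the helper name and the entity strings below are made up for illustration:

```python
import re

def replace_entity(sentence: str, found_entity: str, substitute: str) -> str:
    """Replace only a whole-word occurrence of found_entity (first match).
    re.escape protects entities that contain regex metacharacters."""
    pattern = r"\b" + re.escape(found_entity) + r"\b"
    return re.sub(pattern, substitute, sentence, count=1)
```

With plain `str.replace`, replacing "India" in a sentence that begins with "Indiana" corrupts the longer word; the anchored pattern skips it and hits the standalone token.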
- Around line 217-235: The score calculation can produce inf/nan when
compute_sentence_perplexity returns non-finite values; in the loop over pairs
(referencing pairs, tokenizer.encode,
PerplexityEvaluator.compute_sentence_perplexity and the lists factual_ppls,
counterfactual_ppls, score_diffs) check math.isfinite(factual_ppl) and
math.isfinite(cf_ppl) before appending or computing cf_ppl - factual_ppl and
skip that pair if either is non-finite; after the loop handle the case n==0 by
returning sane defaults (e.g., float("nan") for each metric) to avoid division
by zero and propagation of nan.
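The filtering described above can be sketched as a pure function over (factual, counterfactual) perplexity pairs; the helper name is hypothetical:

```python
import math
from typing import List, Tuple

def aggregate_scores(ppl_pairs: List[Tuple[float, float]]) -> dict:
    """Skip pairs with non-finite perplexities; return NaN metrics when
    nothing survives, instead of dividing by zero or propagating inf."""
    factual, counterfactual, diffs = [], [], []
    for f_ppl, cf_ppl in ppl_pairs:
        if not (math.isfinite(f_ppl) and math.isfinite(cf_ppl)):
            continue  # e.g. inf from a sentence shorter than 2 tokens
        factual.append(f_ppl)
        counterfactual.append(cf_ppl)
        diffs.append(cf_ppl - f_ppl)
    n = len(diffs)
    if n == 0:
        nan = float("nan")
        return {"factual_perplexity": nan,
                "counterfactual_perplexity": nan,
                "factual_score": nan}
    return {"factual_perplexity": sum(factual) / n,
            "counterfactual_perplexity": sum(counterfactual) / n,
            "factual_score": sum(diffs) / n}
```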
In `@openverifiablellm/eval/perplexity.py`:
- Around line 201-215: The code currently hardcodes load_dataset(self.benchmark,
split="test", streaming=True) which can fail for benchmarks without a "test"
split; update the logic in perplexity.py to accept a configurable split (e.g.,
self.split) and/or implement a safe fallback sequence (try "test", then
"validation", then "train") when calling hf_datasets.load_dataset, and raise a
clear error if none exist; modify the call site where ds is created and ensure
downstream logic using ds (the loop that uses tokenizer.encode and
self.n_samples and compute_sequence_perplexity) remains unchanged.
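One way to sketch the suggested fallback. The loader is injected as a callable so the logic stays independent of hf_datasets; `load_with_split_fallback` is a hypothetical helper name:

```python
from typing import Callable, Iterable, Optional, Sequence

def load_with_split_fallback(
    load: Callable[[str], Iterable],
    split: Optional[str] = None,
    fallbacks: Sequence[str] = ("test", "validation", "train"),
) -> Iterable:
    """Use the configured split if given; otherwise try the fallbacks in
    order and raise a clear error when none of them exists."""
    candidates = [split] if split else list(fallbacks)
    errors = []
    for name in candidates:
        try:
            return load(name)
        except Exception as exc:  # unknown splits surface as exceptions
            errors.append(f"{name}: {exc}")
    raise ValueError("No usable split found. Tried " + "; ".join(errors))
```

At the call site, `load` would wrap the existing call, e.g. a lambda over `hf_datasets.load_dataset(self.benchmark, split=s, streaming=True)`, leaving the downstream loop unchanged.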
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4348df36-40bf-47e6-a433-89ccbef3f87e
📒 Files selected for processing (7)
- openverifiablellm/eval/__init__.py
- openverifiablellm/eval/base.py
- openverifiablellm/eval/factual/__init__.py
- openverifiablellm/eval/factual/factual_consistency.py
- openverifiablellm/eval/perplexity.py
- pyproject.toml
- tests/test_factual_eval.py
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/eval/factual/factual_consistency.py`:
- Around line 205-206: The evaluate() function currently calls random.seed(42)
which mutates global RNG; instead create a local RNG like rng =
random.Random(42), remove the global seed call, and pass this rng into
_extract_passages and any downstream substitution logic so all random operations
use rng (e.g., replace random.sample/random.choice calls inside
_extract_passages, substitution functions, or helper methods with
rng.sample/rng.choice). Update signatures for _extract_passages and any helper
functions to accept an rng parameter (defaulting to None or Random() if needed)
and thread it through to ensure determinism without touching global state.
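The local-RNG pattern is easy to verify: a seeded `random.Random` instance gives reproducible draws without ever touching the module-level generator. The helper below is a simplified stand-in for the substitution logic:

```python
import random
from typing import List, Optional

def pick_substitutes(
    entities: List[str], n: int, rng: Optional[random.Random] = None
) -> List[str]:
    """Sample candidate entities deterministically when given a seeded local
    RNG; the process-wide random module state is never mutated."""
    rng = rng or random.Random()
    return rng.sample(entities, n)
```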
In `@openverifiablellm/eval/perplexity.py`:
- Around line 125-161: The function compute_sequence_perplexity uses stride as
the loop step but does not validate it; add an early check at the top of
compute_sequence_perplexity to ensure stride is an integer >= 1 (e.g., if not
isinstance(stride, int) or stride < 1: raise ValueError("stride must be an
integer >= 1")), so zero or negative values (and non-integers) are rejected
before the for start in range(...) is executed.
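A minimal sketch of that guard; without it, stride=0 makes `range()` raise a confusing error deep inside the loop and negative strides silently produce invalid windows. The function name is hypothetical:

```python
def validate_stride(stride: int) -> int:
    """Reject zero, negative, and non-integer strides before they reach
    range(start, stop, stride), which raises for 0 and misbehaves for < 0."""
    if not isinstance(stride, int) or stride < 1:
        raise ValueError("stride must be an integer >= 1")
    return stride
```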
- Around line 221-224: The loop currently hardcodes row.get("text") so add a
configurable parameter text_field (default "text") to the surrounding function
(the perplexity evaluator function that contains the loop over ds) and use
row.get(text_field, "") instead; validate after iterating a sample or the
dataset that at least one non-empty text was found and raise a clear
configuration error (or log and return) if none are present; propagate the new
text_field parameter through any public wrapper/CLI callers so benchmarks using
"content", "body", "article", etc. work correctly and update any related tests
to pass the alternative field name.
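The field-extraction-with-validation idea can be sketched as a standalone helper (the name `extract_texts` is an assumption, not the evaluator's actual structure):

```python
from typing import Iterable, List

def extract_texts(rows: Iterable[dict], text_field: str = "text") -> List[str]:
    """Pull non-empty strings from the configured field; fail loudly when
    the field name does not match the dataset instead of silently yielding
    nothing and letting the evaluator report inf."""
    texts = [t for t in (row.get(text_field, "") for row in rows) if t.strip()]
    if not texts:
        raise ValueError(
            f"No non-empty '{text_field}' values found; check the dataset's "
            "column name (e.g. 'content', 'body', 'article')."
        )
    return texts
```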
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0e4a5eda-8185-4706-ab84-9727eeedaeeb
📒 Files selected for processing (3)
- openverifiablellm/eval/base.py
- openverifiablellm/eval/factual/factual_consistency.py
- openverifiablellm/eval/perplexity.py
random.seed(42)
pairs = self._extract_passages(self.wiki_text_path, self.n_samples)
Avoid mutating global RNG state in evaluate().
random.seed(42) resets process-wide randomness and can affect unrelated code paths. Use a local random.Random(42) instance and thread it through substitution logic.
Proposed fix

-@staticmethod
-def _substitute_entity(sentence: str, candidate_entities: List[str]) -> Optional[str]:
+@staticmethod
+def _substitute_entity(
+    sentence: str, candidate_entities: List[str], rng: random.Random
+) -> Optional[str]:
@@
-    substitute = random.choice(alternatives)
+    substitute = rng.choice(alternatives)
@@
 def _extract_passages(
     wiki_text_path: Union[str, Path],
     n_samples: Optional[int],
+    rng: random.Random,
 ) -> List[dict]:
@@
-    counterfactual = WikipediaFactualEvaluator._substitute_entity(
-        sentence, all_entities
-    )
+    counterfactual = WikipediaFactualEvaluator._substitute_entity(
+        sentence, all_entities, rng
+    )
@@
-    random.seed(42)
-    pairs = self._extract_passages(self.wiki_text_path, self.n_samples)
+    rng = random.Random(42)
+    pairs = self._extract_passages(self.wiki_text_path, self.n_samples, rng)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/eval/factual/factual_consistency.py` around lines 205 -
206, The evaluate() function currently calls random.seed(42) which mutates
global RNG; instead create a local RNG like rng = random.Random(42), remove the
global seed call, and pass this rng into _extract_passages and any downstream
substitution logic so all random operations use rng (e.g., replace
random.sample/random.choice calls inside _extract_passages, substitution
functions, or helper methods with rng.sample/rng.choice). Update signatures for
_extract_passages and any helper functions to accept an rng parameter
(defaulting to None or Random() if needed) and thread it through to ensure
determinism without touching global state.
def compute_sequence_perplexity(model, token_ids: List[int], stride: int = 512) -> float:
    """
    Compute perplexity over a (possibly long) sequence using non-overlapping
    stride-sized windows.

    The sequence is partitioned into windows of *stride* tokens. Each
    window contributes its token predictions to a pooled NLL. The final
    perplexity is ``exp(total_NLL / total_scored_tokens)``.

    For sequences shorter than *stride* + 1 tokens the result is
    identical to :meth:`compute_sentence_perplexity`.

    Parameters
    ----------
    model : callable
        ``model(input_ids) -> 2-D sequence`` of shape
        ``(len(input_ids), vocab_size)``.
    token_ids : list[int]
        Tokenised sequence.
    stride : int
        Number of tokens scored per window. Default ``512``.

    Returns
    -------
    float
        Perplexity (≥ 1). Returns ``float("inf")`` for sequences
        shorter than 2 tokens.
    """
    if len(token_ids) < 2:
        return float("inf")

    nll_sum = 0.0
    n_scored = 0
    n = len(token_ids)

    for start in range(0, n - 1, stride):
        end = min(start + stride + 1, n)
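To check the windowing invariant (each of the n - 1 next-token targets is scored exactly once), the spans produced by this loop can be enumerated; `window_bounds` is a hypothetical helper that mirrors the loop above:

```python
from typing import List, Tuple

def window_bounds(n_tokens: int, stride: int) -> List[Tuple[int, int]]:
    """(start, end) spans used by the strided loop. Window (start, end)
    scores the targets token_ids[start+1:end], i.e. end - start - 1 tokens,
    so consecutive windows never score the same target twice."""
    return [(s, min(s + stride + 1, n_tokens))
            for s in range(0, n_tokens - 1, stride)]
```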
Validate stride before using it as a window step.
stride is user-configurable; stride=0 raises at runtime and negative values yield invalid scoring behavior. Guard it early.
Proposed fix

 def __init__(
     self,
     benchmark: str = "wikitext",
     n_samples: Optional[int] = 50,
     stride: int = 512,
     split: Optional[str] = None,
 ):
+    if stride <= 0:
+        raise ValueError("stride must be a positive integer.")
     self.benchmark = benchmark
     self.n_samples = n_samples
     self.stride = stride
     self.split = split
@@
 @staticmethod
 def compute_sequence_perplexity(model, token_ids: List[int], stride: int = 512) -> float:
+    if stride <= 0:
+        raise ValueError("stride must be a positive integer.")
     if len(token_ids) < 2:
         return float("inf")

🤖 Prompt for AI Agents
return float("inf")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/eval/perplexity.py` around lines 125 - 161, The function
compute_sequence_perplexity uses stride as the loop step but does not validate
it; add an early check at the top of compute_sequence_perplexity to ensure
stride is an integer >= 1 (e.g., if not isinstance(stride, int) or stride < 1:
raise ValueError("stride must be an integer >= 1")), so zero or negative values
(and non-integers) are rejected before the for start in range(...) is executed.
for row in ds:
    text = row.get("text", "")
    if not text.strip():
        continue
🧩 Analysis chain

🏁 Script executed:
#!/bin/bash
# Verify actual row keys for the configured benchmark/split(s).
# Update BENCHMARK as needed.
python - <<'PY'
import datasets
BENCHMARK = "wikitext"
splits = ["test", "validation", "train"]
for s in splits:
    try:
        ds = datasets.load_dataset(BENCHMARK, split=s, streaming=True)
        row = next(iter(ds))
        print(f"{s}: keys={list(row.keys())}")
    except Exception as e:
        print(f"{s}: unavailable ({type(e).__name__}: {e})")
PY
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 192

🏁 Script executed:
# First, locate the file and examine it
git ls-files | grep perplexity.py
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 108

🏁 Script executed:
# Read the perplexity.py file to see the implementation
wc -l openverifiablellm/eval/perplexity.py
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 112

🏁 Script executed:
# Read the entire file with line numbers
cat -n openverifiablellm/eval/perplexity.py
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 9490

🏁 Script executed:
# Search for test files related to perplexity
find . -name "*test*" -type f | grep -i perplex
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 54

🏁 Script executed:
# Search for usage of PerplexityEvaluator
rg "PerplexityEvaluator" -n
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 1158

🏁 Script executed:
# Check the test file to see how PerplexityEvaluator is being used
cat -n tests/test_factual_eval.py | head -100
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 3445

🏁 Script executed:
# Check factual_consistency.py for how evaluate() is used
cat -n openverifiablellm/eval/factual/factual_consistency.py | head -250
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 11115

🏁 Script executed:
# Search for calls to .evaluate() method that might use the "text" field
rg "\.evaluate\(" -n | grep -i perplex
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 54

🏁 Script executed:
# Search for dataset loading patterns and benchmark usage
rg "load_dataset" -n -A 2 -B 2
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 895

🏁 Script executed:
# Check if there's any documentation or README about supported benchmarks
find . -name "README*" -o -name "*.md" | head -20
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 136

🏁 Script executed:
# Check README for information about supported benchmarks
cat -n README.md | head -150
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 5896

🏁 Script executed:
# Search for any test cases or examples that use different benchmarks
rg "benchmark" -n openverifiablellm/eval/ | head -30
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 884

🏁 Script executed:
# Check if there are any other references to field names or text extraction
rg "get\(" openverifiablellm/eval/perplexity.py
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 110

🏁 Script executed:
# Search for any comments or documentation about dataset field names
rg "field" -n openverifiablellm/eval/
Repository: AOSSIE-Org/OpenVerifiableLLM — Length of output: 54
Add configurable text_field parameter to support benchmarks with different field names.
Line 222 hardcodes the "text" field name when extracting text from dataset rows. For Hugging Face datasets that use different field names (e.g., content, body, article), all rows are silently skipped and the evaluator returns inf instead of surfacing the misconfiguration.
Proposed fix

 def __init__(
     self,
     benchmark: str = "wikitext",
     n_samples: Optional[int] = 50,
     stride: int = 512,
     split: Optional[str] = None,
+    text_field: str = "text",
 ):
     self.benchmark = benchmark
     self.n_samples = n_samples
     self.stride = stride
     self.split = split
+    self.text_field = text_field
@@
-    text = row.get("text", "")
+    text = row.get(self.text_field, "")
     if not text.strip():
         continue

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Existing code:

for row in ds:
    text = row.get("text", "")
    if not text.strip():
        continue

Suggested code:

def __init__(
    self,
    benchmark: str = "wikitext",
    n_samples: Optional[int] = 50,
    stride: int = 512,
    split: Optional[str] = None,
    text_field: str = "text",
):
    self.benchmark = benchmark
    self.n_samples = n_samples
    self.stride = stride
    self.split = split
    self.text_field = text_field

# ... in the evaluate() method:
for row in ds:
    text = row.get(self.text_field, "")
    if not text.strip():
        continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/eval/perplexity.py` around lines 221 - 224, The loop
currently hardcodes row.get("text") so add a configurable parameter text_field
(default "text") to the surrounding function (the perplexity evaluator function
that contains the loop over ds) and use row.get(text_field, "") instead;
validate after iterating a sample or the dataset that at least one non-empty
text was found and raise a clear configuration error (or log and return) if none
are present; propagate the new text_field parameter through any public
wrapper/CLI callers so benchmarks using "content", "body", "article", etc. work
correctly and update any related tests to pass the alternative field name.
Addressed Issues:
Implement Wikipedia-Based Factual Consistency Evaluator
implements the factual accuracy evaluation metric listed as a success criterion in the project specification
Screenshots/Recordings:
Additional Notes:
The project specification requires evaluating trained models on factual accuracy. Unlike existing factual benchmarks, this evaluator uses Wikipedia itself as the source of truth, directly consistent with the project's core requirement that Wikipedia is the sole trusted data source.
How it works:
- Reads wiki_clean.txt produced by the existing pipeline

Why this approach over existing benchmarks:
Determinism:
random.seed(42) is applied inside evaluate() before any entity selection, ensuring fully reproducible evaluation results, consistent with the project's core reproducibility requirement.

Files changed:
- openverifiablellm/eval/factual/factual_consistency.py: WikipediaFactualEvaluator
- openverifiablellm/eval/factual/__init__.py: exports WikipediaFactualEvaluator
- openverifiablellm/eval/base.py: abstract BaseEvaluator
- openverifiablellm/eval/perplexity.py: PerplexityEvaluator with shared compute_sentence_perplexity static method
- openverifiablellm/eval/__init__.py: exports all evaluators
- tests/test_factual_eval.py: 10 tests, all passing, no network calls, includes determinism test
- pyproject.toml: added datasets as runtime dependency

Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit