
Commit f19beb9

feat(openlibrary): add 3 engagement & comparison templates (#13)
* refactor(openlibrary): extract author-search helpers to common.py
  Move normalize_author_fragment, extract_author_filter, and find_author_search_entry from author_editions.py class methods into common.py as module-level functions. This eliminates duplication for upcoming author-based templates that need the same lookup logic.

* feat(openlibrary): add author_engagement_extrema template (ID 96)
  Find the book with the highest/lowest engagement metric among an author's top N search results. Uses confirmed-visible fields only: want_to_read_count, already_read_count, ratings_count. Variant space: 70 authors × 2 extrema × 3 metrics × 4 counts = 1,680.

* feat(openlibrary): add author_comparison template (ID 97)
  Compare aggregate engagement metrics between two authors' top N search results. Requires two separate author searches and cross-page comparison. Variant space: C(70,2) × 3 metrics × 2 counts = 14,490.

* feat(openlibrary): add reading_stats_filter template (ID 98)
  Count books in an author's catalog meeting an engagement threshold. Requires scanning each book's metric against a threshold — cannot be solved by sorting a single column. Variant space: 70 authors × 3 metrics × 4 thresholds × 2 counts = 1,680.

* test(openlibrary): add tests for engagement & comparison templates
  56 tests covering:
  - Template registration and generation invariants
  - author_engagement_extrema GT: highest/lowest, tie-breaking, missing data
  - author_comparison GT: higher total, reverse winner, tie, missing author
  - reading_stats_filter GT: threshold counting, zero matches, exact boundary
  - Task registry wiring (IDs 96, 97, 98, Version 7)
  - Shared helper refactoring (common.py functions)
  - Cross-template consistency (serialization, GT source, cache source)

* fix: accept plain-text author queries in find_author_search_entry

* fix(openlibrary): reduce live GT not_collected for author templates

* docs(pr): update description

* fix: address PR #13 review — remove broken authors, drop already_read_count, clean up
  BLOCKING fixes:
  - Remove 9 authors from AUTHOR_POOL: 4 broken on OL API (<10 results: Dostoevsky, Murakami, Chekhov, Octavia Butler) and 5 with sparse ratings_count (<50% present in top 10: Bronte, Tolstoy, Whitman, Dickinson, Tagore). Pool: 70 → 61.
  - Remove already_read_count from EngagementMetric, AuthorMetric, and ReaderMetric enums — not visible on search results page (only want_to_read and ratings counts are rendered).
  NON-BLOCKING fixes:
  - Add comment in author_editions.py documenting allow_unsorted_fallback asymmetry between existing and new templates.
  - Remove pr_description.md from repository.
  Tests updated to reflect metric and pool changes. 106 passed.

* fix: treat missing engagement metrics as 0 instead of hard-failing
  The OL API omits count fields (ratings_count, want_to_read_count) when the value is zero, rather than returning 0. Previously the GT methods returned GroundTruthResult.fail() for missing fields, causing hard failures for works that simply haven't been rated yet. Now treats absent metrics as 0.0, which is semantically correct and consistent with how the OL API represents zero-count data. This prevents GT failures for individual works missing ratings_count even among authors that generally have good data coverage. Also fixes _make_search_entry type hint (sort: Optional[str]) and removes unused title variables flagged by ruff.

* fix: handle non-numeric metric values without TypeError
  If a metric field contains a non-numeric string like 'N/A', parse_numeric() returns None. Previously this None was passed to int(value) or numeric comparisons, causing a TypeError at runtime. Now the fallback chain is: raw → parse_numeric(raw) → 0.0 if None. This covers both absent fields (raw is None) and non-numeric strings (parse_numeric returns None). Adds regression test for 'N/A' metric values.

* refactor: extract safe_metric_value helper to reduce duplication
  The 3-line metric normalization pattern (raw → parse_numeric → fallback to 0.0) was duplicated across all 3 new templates. Extracted to safe_metric_value() in common.py, reducing each call site to a single line and ensuring consistent handling of absent/non-numeric fields.

* fix: drop ratings_count from all templates, fail on non-numeric data
  BLOCKING: ratings_count is missing for 56% of authors in the OL API, causing wrong GT for extrema-lowest queries (missing-as-zero always wins). Dropped ratings_count from EngagementMetric, AuthorMetric, and ReaderMetric — all templates now use only want_to_read_count. Expanded RESULT_COUNTS to keep variant space above 500 minimum:
  - T96 (engagement_extrema): [3,5,7,10,15] → 61×2×1×5 = 610
  - T97 (comparison): unchanged [3,5] → C(61,2)×1×2 = 3,660
  - T98 (reading_stats_filter): [5,10,15] → 61×1×4×3 = 732
  NON-BLOCKING: safe_metric_value now raises ValueError on non-null non-numeric values (e.g. 'N/A') instead of silently treating them as 0. Missing (None) values still default to 0. Callers catch ValueError and surface it as GroundTruthResult.fail().

* fix: docstring drift and add non-numeric regression tests for comparison/filter
  - Fix docstrings in author_engagement_extrema.py and reading_stats_filter.py that still mentioned 'ratings' after ratings_count was dropped.
  - Add non-numeric metric regression tests for comparison and filter templates to match the existing extrema test, ensuring all 3 safe_metric_value call sites are explicitly tested for ValueError handling.

* fix: restore ratings_count with targeted exclusions for anti-memorization
  BLOCKING: With a single metric (want_to_read_count), the entire answer space was enumerable from 61 API calls (~5,000 entries). Restoring ratings_count as a second metric dimension breaks trivial enumeration.
  Changes:
  - Remove 5 authors with worst ratings_count coverage (Emerson, Joyce, Melville, Hawthorne, P.K. Dick). Pool: 61 → 56.
  - Restore ratings_count to EngagementMetric, AuthorMetric, ReaderMetric.
  - T96: exclude ratings_count from extrema=lowest only (where missing-as-zero would always win). Highest/comparison/filter are unaffected by the bias.
  - T96 RESULT_COUNTS expanded to [3,5,7,10,12,15] (6 values).
  - Restore THRESHOLDS for ratings_count in T98.
  Variant spaces (all >1000):
  - T96: 56 × (highest×2 + lowest×1) × 6 = 1,008
  - T97: C(56,2) × 2 × 2 = 6,160
  - T98: 56 × 2 × 4 × 3 = 1,344
  Adds test_extrema_lowest_excludes_ratings_count to verify the per-extrema metric filtering. 364 tests pass.
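The variant-space figures quoted above follow directly from the stated pool sizes and dimensions; a quick throwaway check (not code from the PR):

from math import comb

print(56 * 3 * 6)           # T96: 56 authors x (2 highest metrics + 1 lowest metric) x 6 counts = 1008
print(comb(56, 2) * 2 * 2)  # T97: C(56,2) author pairs x 2 metrics x 2 counts = 6160
print(56 * 2 * 4 * 3)       # T98: 56 authors x 2 metrics x 4 thresholds x 3 counts = 1344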
* fix(openlibrary): expand AUTHOR_POOL and RESULT_COUNTS for T96 variant space
  - Add 25 authors to AUTHOR_POOL (56→81) for anti-memorization
  - Change T96 RESULT_COUNTS from [3,5,7,10,12,15] to [3,5,7,10,15,20,25] to increase lowest-extrema differentiation
  - Effective variant space: ~583 (16.6% margin above 500 threshold)
  - Update docstrings: T96=1,701 T97=12,960 T98=1,944 variants
  - Fix AUTHOR_POOL section comments to reflect actual counts
  - Split test file (481+490 lines, both <500)
  - Remove unused get_registered_templates import
  - Add tests: pool size=81, no duplicates, ratings_count GT

* fix(openlibrary): raise search fetch limit to 25 for T96 work_count=25
  The collector hardcoded limit=20 but RESULT_COUNTS includes 25, causing guaranteed GT failure for 1/7 of T96 variants. Raise limit to match. Add regression test: test_extrema_gt_succeeds_with_25_works.

* fix(openlibrary): separate ENGAGEMENT_AUTHOR_POOL, cap lowest RESULT_COUNTS
  Address PR review #8:
  1. BLOCKING: Restore original AUTHOR_POOL (70 authors) exactly as on main to preserve author_editions reproducibility. Create separate ENGAGEMENT_AUTHOR_POOL (81 authors) for T96/T97/T98.
  2. BLOCKING: Add _LOWEST_RESULT_COUNTS=[3,5,7] for lowest extrema to avoid missing-as-zero domination of want_to_read_count at high work_counts (41% of authors affected at work_count=25).
  3. NON-BLOCKING: Add comment explaining limit=25 in openlibrary.py.
  Variant space update: T96 = 81 × (2×7 + 1×3) = 1,377 nominal variants.

* fix(openlibrary): address PR #13 review — deterministic GT, numeric T97, strict metrics
  BLOCKING fixes:
  - Remove allow_unsorted_fallback=True from all 3 templates (T96/T97/T98). GT now strictly requires sort=editions data, matching the question text. If the agent doesn't visit the sorted page, GT correctly returns not_collected instead of silently using wrong-order results.
  - Make safe_metric_value fail on missing ratings_count instead of defaulting to 0. Only want_to_read_count (high API coverage) defaults to 0 when absent. ratings_count absence raises ValueError → GT fail, preventing semantically wrong answers from sparse data.
  - Redesign T97 (author_comparison) from binary "which author has more?" (50% random baseline) to numeric "what is the absolute difference?" (near-0% random baseline). GT returns str(abs(sum_a - sum_b)).
  - Add Version 7 coordination comment for PR #14 (IDs 99-101 → Version 8).
  NON-BLOCKING fixes:
  - Derive ENGAGEMENT_AUTHOR_POOL from AUTHOR_POOL via exclusion set + additions list, eliminating 56-entry duplication and preventing drift. AUTHOR_POOL itself is unchanged (author_editions reproducibility).
  - Remove stale allow_unsorted_fallback asymmetry comment from author_editions.py (all templates now consistently use strict sort).
  Tests: 372 passed (118 OpenLibrary, 254 other).

* fix(openlibrary): cap ratings_count variants to low N to reduce GT-fail from sparse OL data
  ratings_count is missing for 20-40% of authors at N≥7. Restrict ratings_count variants to N∈{3,5} (T96) and N=5 (T98) where coverage is highest, cutting estimated GT-fail exposure from ~14%/~26% to ~4%/~11%. T97 already at [3,5] — unchanged.

* test(openlibrary): verify GT computation with real OL API data
  Fetch live data (March 26, 2026) for Agatha Christie, Stephen King, and Neil Gaiman via sort=editions search API. Inject into GT collector and verify all three templates (T96/T97/T98) return concrete values with both want_to_read_count and ratings_count metrics. 12 tests cover: highest/lowest extrema, cross-author numeric difference, and threshold counting — satisfying CLAUDE.md §5 item 1.

---------

Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
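The commits above converge on a specific contract for safe_metric_value in common.py: want_to_read_count defaults to 0 when the OL API omits it, a missing ratings_count raises ValueError (surfaced as a GT failure), and non-null non-numeric values such as 'N/A' always raise. A minimal sketch of that behavior, assuming a parse_numeric helper that returns None for non-numeric input; the repository's actual implementation may differ:

# Sketch of the behavior described in the commit messages; not the repository's
# actual code. parse_numeric is assumed to return a float for numeric input
# and None otherwise.
from typing import Any, Dict, Optional


def parse_numeric(raw: Any) -> Optional[float]:
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def safe_metric_value(work: Dict[str, Any], metric: str) -> float:
    """Normalize an engagement metric read from a collected search result.

    want_to_read_count: the OL API omits zero counts, so absence means 0.
    ratings_count: coverage is sparse, so absence raises and GT fails.
    Non-null, non-numeric values (e.g. 'N/A') always raise.
    """
    raw = work.get(metric)
    if raw is None:
        if metric == "want_to_read_count":
            return 0.0
        raise ValueError(f"Missing {metric} for work {work.get('key', '<unknown>')}")
    value = parse_numeric(raw)
    if value is None:
        raise ValueError(f"Non-numeric {metric} value: {raw!r}")
    return value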
1 parent d96dcf9 commit f19beb9

12 files changed

Lines changed: 2294 additions & 76 deletions

liveweb_arena/core/task_registry.py

Lines changed: 8 additions & 0 deletions
@@ -153,6 +153,11 @@ class TaskRegistry:
         92: ("arxiv", "arxiv_category_comparison"),
         94: ("arxiv", "arxiv_multi_author_filter"),
         95: ("arxiv", "arxiv_title_length_extrema"),
+
+        # Open Library templates — engagement & comparison
+        96: ("openlibrary", "openlibrary_author_engagement_extrema"),
+        97: ("openlibrary", "openlibrary_author_comparison"),
+        98: ("openlibrary", "openlibrary_reading_stats_filter"),
     }

     # Template versions - each version's combinations come AFTER all previous versions
@@ -181,6 +186,9 @@ class TaskRegistry:
         [85, 86, 87, 88],
         # Version 6: ArXiv templates
         [90, 91, 92, 94, 95],
+        # Version 7: Open Library engagement & comparison templates (PR #13)
+        # NOTE: PR #14 (openmeteo IDs 99-101) must use Version 8.
+        [96, 97, 98],
     ]

     # Combination registry: list of template ID tuples
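For context, an illustration (not repository code) of the ordering constraint stated in the comment above: combinations are enumerated version by version, so adding IDs 96-98 as a new Version 7 block, and PR #14's IDs as a later Version 8, only appends to the sequence and leaves earlier positions unchanged.

# Illustration only, with a trimmed version list mirroring the diff above.
VERSIONS = [
    [90, 91, 92, 94, 95],  # Version 6: ArXiv templates
    [96, 97, 98],          # Version 7: Open Library engagement & comparison
]

flattened = [template_id for version in VERSIONS for template_id in version]
print(flattened)  # Version 6 IDs keep their indices; Version 7 IDs only append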

liveweb_arena/plugins/openlibrary/openlibrary.py

Lines changed: 2 additions & 1 deletion
@@ -60,7 +60,8 @@ async def fetch_api_data(self, url: str) -> Dict[str, Any]:
         sort = parse_qs(parsed.query).get("sort", [None])[0]
         mode = parse_qs(parsed.query).get("mode", [None])[0]
         if query:
-            return await fetch_search_api_data(query, limit=20, sort=sort, mode=mode)
+            # limit=25 to support T96 RESULT_COUNTS up to work_count=25
+            return await fetch_search_api_data(query, limit=25, sort=sort, mode=mode)
         return {}

         # Work detail page: /works/OL...W or /works/OL...W/Title

liveweb_arena/plugins/openlibrary/templates/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -10,10 +10,16 @@
 from .book_comparison import OpenLibraryBookComparisonTemplate
 from .author_editions import OpenLibraryAuthorEditionsTemplate
 from .subject_multi_condition import OpenLibrarySubjectMultiConditionTemplate
+from .author_engagement_extrema import OpenLibraryAuthorEngagementExtremaTemplate
+from .author_comparison import OpenLibraryAuthorComparisonTemplate
+from .reading_stats_filter import OpenLibraryReadingStatsFilterTemplate

 __all__ = [
     "OpenLibraryBookStatsTemplate",
     "OpenLibraryBookComparisonTemplate",
     "OpenLibraryAuthorEditionsTemplate",
     "OpenLibrarySubjectMultiConditionTemplate",
+    "OpenLibraryAuthorEngagementExtremaTemplate",
+    "OpenLibraryAuthorComparisonTemplate",
+    "OpenLibraryReadingStatsFilterTemplate",
 ]
liveweb_arena/plugins/openlibrary/templates/author_comparison.py

Lines changed: 256 additions & 0 deletions
@@ -0,0 +1,256 @@
"""Author comparison template for Open Library - MEDIUM/HARD DIFFICULTY.

RL-friendly design:
- Requires TWO separate author searches and cross-page comparison
- Dynamic data: engagement metrics change continuously as users interact
- Large entity pool: C(81,2)×2 metrics×2 result counts = 12,960 variants
- Computation required: sum metric across N books for each author, compute difference
- Numeric answer (absolute difference) avoids 50% random baseline of binary choice
"""

import random
from enum import Enum
from typing import Any, Dict, Optional
from urllib.parse import quote_plus

from liveweb_arena.core.ground_truth_trigger import (
    GroundTruthResult,
    TriggerConfig,
    UrlPatternTrigger,
)
from liveweb_arena.core.gt_collector import GTSourceType
from liveweb_arena.core.validators.base import (
    GeneratedQuestion,
    QuestionTemplate,
    ValidationResult,
    register_template,
)
from .author_editions import ENGAGEMENT_AUTHOR_POOL
from .common import find_author_search_entry, get_collected_data, safe_metric_value


class AuthorMetric(Enum):
    """Engagement metrics for cross-author comparison."""
    WANT_TO_READ = ("want_to_read_count", "total want-to-read count")
    RATINGS_COUNT = ("ratings_count", "total number of ratings")


RESULT_COUNTS = [3, 5]

PATTERNS = [
    (
        'On Open Library, search for books by "{author_a}" and "{author_b}", '
        "both sorted by most editions. What is the absolute difference in "
        "{metric_label} between the first {n} results for each author? "
        "Answer with just the number."
    ),
    (
        'Compare "{author_a}" and "{author_b}" on Open Library. For each author, '
        "look at the top {n} books (sorted by most editions) and sum their "
        "{metric_label}. What is the absolute difference between the two totals? "
        "Reply with just a number."
    ),
    (
        'Search Open Library for books by "{author_a}" and by "{author_b}" '
        "(most editions). Sum the {metric_label} across each author's top {n} "
        "results. What is the absolute difference? Answer with the number only."
    ),
]


@register_template("openlibrary_author_comparison")
class OpenLibraryAuthorComparisonTemplate(QuestionTemplate):
    """Compare aggregate engagement metrics between two authors' top works.

    MEDIUM/HARD difficulty: requires two separate author searches, summing
    a metric across top N results for each, then comparing the totals.
    """

    GT_SOURCE = GTSourceType.PAGE_ONLY

    def __init__(self):
        super().__init__("openlibrary_author_comparison")

    def generate(self, seed: int, variant: Optional[int] = None) -> GeneratedQuestion:
        rng = random.Random(seed)

        metrics = list(AuthorMetric)
        metric = (
            metrics[variant % len(metrics)]
            if variant is not None
            else rng.choice(metrics)
        )

        (name_a, query_a), (name_b, query_b) = rng.sample(ENGAGEMENT_AUTHOR_POOL, 2)

        # Randomly swap order to prevent position bias
        if rng.random() > 0.5:
            name_a, query_a, name_b, query_b = name_b, query_b, name_a, query_a

        count = rng.choice(RESULT_COUNTS)
        search_query_a = f'author:"{query_a}"'
        search_query_b = f'author:"{query_b}"'

        pattern = rng.choice(PATTERNS)
        question_text = pattern.format(
            author_a=name_a,
            author_b=name_b,
            n=count,
            metric_label=metric.value[1],
        )

        query_encoded_a = quote_plus(search_query_a)
        start_url = (
            f"https://openlibrary.org/search?q={query_encoded_a}&sort=editions"
        )

        return GeneratedQuestion(
            question_text=question_text,
            start_url=start_url,
            variables={
                "author_a": name_a,
                "author_b": name_b,
                "metric": metric.value[0],
                "work_count": count,
            },
            validation_info={
                "author_a_name": name_a,
                "author_a_query": query_a,
                "search_query_a": search_query_a,
                "author_b_name": name_b,
                "author_b_query": query_b,
                "search_query_b": search_query_b,
                "sort": "editions",
                "work_count": count,
                "metric": metric.value[0],
                "metric_label": metric.value[1],
            },
            template_name=self.name,
            expected_steps=12,
        )

    def get_validation_rules(self, validation_info: Dict[str, Any]) -> str:
        author_a = validation_info.get("author_a_name", "")
        author_b = validation_info.get("author_b_name", "")
        count = validation_info.get("work_count", "")
        metric_label = validation_info.get("metric_label", "")
        return f"""Task-Specific Rules (Open Library Author Comparison):
- Compare: "{author_a}" vs "{author_b}"
- Metric: {metric_label} summed across top {count} results
- Answer: absolute difference between the two totals (a single number)
- Score 1.0: Exact difference
- Score 0.5: Within ±10% of correct difference
- Score 0.0: Wrong value or no answer"""

    async def get_ground_truth(self, validation_info: Dict[str, Any]) -> GroundTruthResult:
        collected = get_collected_data()
        if not collected:
            return GroundTruthResult.fail("No Open Library data collected")

        author_a_name = validation_info.get("author_a_name")
        author_b_name = validation_info.get("author_b_name")
        search_query_a = validation_info.get("search_query_a")
        search_query_b = validation_info.get("search_query_b")
        sort = validation_info.get("sort")
        work_count = validation_info.get("work_count")
        metric = validation_info.get("metric")

        if (
            not isinstance(author_a_name, str)
            or not isinstance(author_b_name, str)
            or not isinstance(search_query_a, str)
            or not isinstance(search_query_b, str)
            or not isinstance(sort, str)
            or not isinstance(work_count, int)
            or not isinstance(metric, str)
        ):
            return GroundTruthResult.fail("Missing or invalid comparison inputs")
        if work_count <= 0:
            return GroundTruthResult.fail(f"Invalid work_count: {work_count}")

        sum_a = self._sum_metric(
            collected, author_a_name, search_query_a, sort, work_count, metric,
        )
        if isinstance(sum_a, GroundTruthResult):
            return sum_a

        sum_b = self._sum_metric(
            collected, author_b_name, search_query_b, sort, work_count, metric,
        )
        if isinstance(sum_b, GroundTruthResult):
            return sum_b

        return GroundTruthResult.ok(str(abs(sum_a - sum_b)))

    @staticmethod
    def _sum_metric(
        collected: Dict[str, Dict[str, Any]],
        author_name: str,
        search_query: str,
        sort: str,
        work_count: int,
        metric: str,
    ) -> "int | GroundTruthResult":
        """Sum a metric across an author's top N search results.

        Returns the integer sum on success, or a GroundTruthResult on failure.
        """
        data = find_author_search_entry(
            collected,
            search_query=search_query,
            sort=sort,
        )
        if data is None:
            ol_keys = [k for k in collected if k.startswith("ol:")][:5]
            return GroundTruthResult.not_collected(
                f"Did not collect search data for author '{author_name}' "
                f"sorted by '{sort}'. Collected OL keys: {ol_keys}"
            )

        works_dict = data.get("works")
        if not isinstance(works_dict, dict):
            return GroundTruthResult.fail(
                f"Collected data for '{author_name}' missing works dictionary"
            )
        if len(works_dict) < work_count:
            return GroundTruthResult.fail(
                f"Only {len(works_dict)} works collected for '{author_name}', "
                f"need {work_count}"
            )

        ranked = sorted(works_dict.values(), key=lambda w: w.get("rank", 999))
        top_n = ranked[:work_count]

        total = 0
        for work in top_n:
            try:
                value = safe_metric_value(work, metric)
            except ValueError as exc:
                return GroundTruthResult.fail(str(exc))
            total += int(value)

        return total

    async def validate_answer(
        self,
        answer: str,
        validation_info: Dict[str, Any],
    ) -> ValidationResult:
        return ValidationResult(
            score=0.0,
            is_correct=False,
            expected=None,
            actual=answer,
            details="Use LLM validation",
        )

    def get_ground_truth_trigger(self, validation_info: dict) -> TriggerConfig:
        trigger = UrlPatternTrigger(domains=["openlibrary.org"])
        return TriggerConfig(trigger=trigger)

    @classmethod
    def get_cache_source(cls) -> str:
        return "openlibrary"

    def get_gt_source(self) -> GTSourceType:
        return self.GT_SOURCE
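For orientation, a hedged usage sketch of the template added above; it assumes GeneratedQuestion exposes its constructor arguments as attributes and that the package layout matches the file paths in this commit.

# Hedged usage sketch, not part of this commit.
from liveweb_arena.plugins.openlibrary.templates.author_comparison import (
    OpenLibraryAuthorComparisonTemplate,
)

template = OpenLibraryAuthorComparisonTemplate()
question = template.generate(seed=2024)

print(question.question_text)                   # one of the three PATTERNS, filled in
print(question.start_url)                       # author A search URL with sort=editions
print(question.validation_info["metric"])       # "want_to_read_count" or "ratings_count"
print(question.validation_info["work_count"])   # 3 or 5 (RESULT_COUNTS)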
