Commit f19beb9
authored
feat(openlibrary): add 3 engagement & comparison templates (#13)
* refactor(openlibrary): extract author-search helpers to common.py
Move normalize_author_fragment, extract_author_filter, and
find_author_search_entry from author_editions.py class methods into
common.py as module-level functions. This eliminates duplication for
upcoming author-based templates that need the same lookup logic.
* feat(openlibrary): add author_engagement_extrema template (ID 96)
Find the book with the highest/lowest engagement metric among an
author's top N search results. Uses confirmed-visible fields only:
want_to_read_count, already_read_count, ratings_count.
Variant space: 70 authors × 2 extrema × 3 metrics × 4 counts = 1,680.
* feat(openlibrary): add author_comparison template (ID 97)
Compare aggregate engagement metrics between two authors' top N search
results. Requires two separate author searches and cross-page comparison.
Variant space: C(70,2) × 3 metrics × 2 counts = 14,490.
* feat(openlibrary): add reading_stats_filter template (ID 98)
Count books in an author's catalog meeting an engagement threshold.
Requires scanning each book's metric against a threshold — cannot be
solved by sorting a single column.
Variant space: 70 authors × 3 metrics × 4 thresholds × 2 counts = 1,680.
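The three variant-space figures above can be checked with a few lines of arithmetic. This is only a sketch of the counting: the constant name is illustrative and not taken from the repository.

```python
from math import comb

AUTHORS = 70  # pool size stated in the commit message at this point

# T96: authors x extrema x metrics x result counts
t96 = AUTHORS * 2 * 3 * 4
# T97: unordered author pairs x metrics x result counts
t97 = comb(AUTHORS, 2) * 3 * 2
# T98: authors x metrics x thresholds x result counts
t98 = AUTHORS * 3 * 4 * 2

print(t96, t97, t98)  # 1680 14490 1680
```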
* test(openlibrary): add tests for engagement & comparison templates
56 tests covering:
- Template registration and generation invariants
- author_engagement_extrema GT: highest/lowest, tie-breaking, missing data
- author_comparison GT: higher total, reverse winner, tie, missing author
- reading_stats_filter GT: threshold counting, zero matches, exact boundary
- Task registry wiring (IDs 96, 97, 98, Version 7)
- Shared helper refactoring (common.py functions)
- Cross-template consistency (serialization, GT source, cache source)
* fix: accept plain-text author queries in find_author_search_entry
* fix(openlibrary): reduce live GT not_collected for author templates
* docs(pr): update description
* fix: address PR #13 review — remove broken authors, drop already_read_count, clean up
BLOCKING fixes:
- Remove 9 authors from AUTHOR_POOL: 4 broken on OL API (<10 results:
Dostoevsky, Murakami, Chekhov, Octavia Butler) and 5 with sparse
ratings_count (<50% present in top 10: Bronte, Tolstoy, Whitman,
Dickinson, Tagore). Pool: 70 → 61.
- Remove already_read_count from EngagementMetric, AuthorMetric, and
ReaderMetric enums — not visible on search results page (only
want_to_read and ratings counts are rendered).
NON-BLOCKING fixes:
- Add comment in author_editions.py documenting allow_unsorted_fallback
asymmetry between existing and new templates.
- Remove pr_description.md from repository.
Tests updated to reflect metric and pool changes. 106 passed.
* fix: treat missing engagement metrics as 0 instead of hard-failing
The OL API omits count fields (ratings_count, want_to_read_count) when
the value is zero, rather than returning 0. Previously the GT methods
returned GroundTruthResult.fail() for missing fields, causing hard
failures for works that simply haven't been rated yet.
Now treats absent metrics as 0.0, which is semantically correct and
consistent with how the OL API represents zero-count data. This
prevents GT failures for individual works missing ratings_count even
among authors that generally have good data coverage.
Also fixes _make_search_entry type hint (sort: Optional[str]) and
removes unused title variables flagged by ruff.
* fix: handle non-numeric metric values without TypeError
If a metric field contains a non-numeric string like 'N/A',
parse_numeric() returns None. Previously this None was passed to
int(value) or numeric comparisons, causing a TypeError at runtime.
Now the fallback chain is: raw → parse_numeric(raw) → 0.0 if None.
This covers both absent fields (raw is None) and non-numeric strings
(parse_numeric returns None).
Adds regression test for 'N/A' metric values.
* refactor: extract safe_metric_value helper to reduce duplication
The 3-line metric normalization pattern (raw → parse_numeric → fallback
to 0.0) was duplicated across all 3 new templates. Extracted to
safe_metric_value() in common.py, reducing each call site to a single
line and ensuring consistent handling of absent/non-numeric fields.
* fix: drop ratings_count from all templates, fail on non-numeric data
BLOCKING: ratings_count is missing for 56% of authors in the OL API,
causing wrong GT for extrema-lowest queries (missing-as-zero always
wins). Dropped ratings_count from EngagementMetric, AuthorMetric, and
ReaderMetric — all templates now use only want_to_read_count.
Expanded RESULT_COUNTS to keep variant space above 500 minimum:
- T96 (engagement_extrema): [3,5,7,10,15] → 61×2×1×5 = 610
- T97 (comparison): unchanged [3,5] → C(61,2)×1×2 = 3,660
- T98 (reading_stats_filter): [5,10,15] → 61×1×4×3 = 732
NON-BLOCKING: safe_metric_value now raises ValueError on non-null
non-numeric values (e.g. 'N/A') instead of silently treating them
as 0. Missing (None) values still default to 0. Callers catch
ValueError and surface it as GroundTruthResult.fail().
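A minimal sketch of the behavior described in this fix, assuming a dict-shaped work record. The real signatures of safe_metric_value and parse_numeric in common.py are not shown in this commit message, so both functions below are hypothetical stand-ins.

```python
from typing import Any, Optional


def parse_numeric(raw: Any) -> Optional[float]:
    """Hypothetical stand-in: return a float, or None if raw is not numeric."""
    if isinstance(raw, bool):
        return None
    if isinstance(raw, (int, float)):
        return float(raw)
    if isinstance(raw, str):
        try:
            return float(raw.replace(",", "").strip())
        except ValueError:
            return None
    return None


def safe_metric_value(work: dict, metric: str) -> float:
    """Missing field -> 0.0; present but non-numeric -> ValueError."""
    raw = work.get(metric)
    if raw is None:
        # The OL API omits zero counts, so absence is read as zero.
        return 0.0
    value = parse_numeric(raw)
    if value is None:
        # Callers are expected to catch this and report a GT failure.
        raise ValueError(f"non-numeric value for {metric}: {raw!r}")
    return value
```

A caller would wrap the call in try/except ValueError and convert the exception into a failed ground-truth result rather than letting it propagate.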
* fix: docstring drift and add non-numeric regression tests for comparison/filter
- Fix docstrings in author_engagement_extrema.py and reading_stats_filter.py
that still mentioned 'ratings' after ratings_count was dropped.
- Add non-numeric metric regression tests for comparison and filter templates
to match the existing extrema test, ensuring all 3 safe_metric_value
call sites are explicitly tested for ValueError handling.
* fix: restore ratings_count with targeted exclusions for anti-memorization
BLOCKING: With a single metric (want_to_read_count), the entire answer
space was enumerable from 61 API calls (~5,000 entries). Restoring
ratings_count as a second metric dimension breaks trivial enumeration.
Changes:
- Remove 5 authors with worst ratings_count coverage (Emerson, Joyce,
Melville, Hawthorne, P.K. Dick). Pool: 61 → 56.
- Restore ratings_count to EngagementMetric, AuthorMetric, ReaderMetric.
- T96: exclude ratings_count from extrema=lowest only (where
missing-as-zero would always win). Highest/comparison/filter are
unaffected by the bias.
- T96 RESULT_COUNTS expanded to [3,5,7,10,12,15] (6 values).
- Restore THRESHOLDS for ratings_count in T98.
Variant spaces (all >1000):
- T96: 56 × (highest×2 + lowest×1) × 6 = 1,008
- T97: C(56,2) × 2 × 2 = 6,160
- T98: 56 × 2 × 4 × 3 = 1,344
Adds test_extrema_lowest_excludes_ratings_count to verify the
per-extrema metric filtering. 364 tests pass.
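The per-extrema metric filtering and the resulting T96 count can be illustrated by enumerating the variants. This is a sketch under assumed names (metrics_for and the constants are illustrative), with author indices standing in for the real pool.

```python
from typing import List

AUTHOR_COUNT = 56
RESULT_COUNTS = [3, 5, 7, 10, 12, 15]
METRICS = ["want_to_read_count", "ratings_count"]


def metrics_for(extrema: str) -> List[str]:
    """Return the metrics allowed for a given extrema direction."""
    if extrema == "lowest":
        # Missing-as-zero would always win a "lowest" query on ratings_count.
        return [m for m in METRICS if m != "ratings_count"]
    return list(METRICS)


variants = [
    (author, extrema, metric, n)
    for author in range(AUTHOR_COUNT)
    for extrema in ("highest", "lowest")
    for metric in metrics_for(extrema)
    for n in RESULT_COUNTS
]

print(len(variants))  # 1008
```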
* fix(openlibrary): expand AUTHOR_POOL and RESULT_COUNTS for T96 variant space
- Add 25 authors to AUTHOR_POOL (56→81) for anti-memorization
- Change T96 RESULT_COUNTS from [3,5,7,10,12,15] to [3,5,7,10,15,20,25]
to increase lowest-extrema differentiation
- Effective variant space: ~583 (16.6% margin above 500 threshold)
- Update docstrings: T96=1,701 T97=12,960 T98=1,944 variants
- Fix AUTHOR_POOL section comments to reflect actual counts
- Split test file (481+490 lines, both <500)
- Remove unused get_registered_templates import
- Add tests: pool size=81, no duplicates, ratings_count GT
* fix(openlibrary): raise search fetch limit to 25 for T96 work_count=25
The collector hardcoded limit=20 but RESULT_COUNTS includes 25, causing
guaranteed GT failure for 1/7 of T96 variants. Raise limit to match.
Add regression test: test_extrema_gt_succeeds_with_25_works
* fix(openlibrary): separate ENGAGEMENT_AUTHOR_POOL, cap lowest RESULT_COUNTS
Address PR review #8:
1. BLOCKING: Restore original AUTHOR_POOL (70 authors) exactly as on main
to preserve author_editions reproducibility. Create separate
ENGAGEMENT_AUTHOR_POOL (81 authors) for T96/T97/T98.
2. BLOCKING: Add _LOWEST_RESULT_COUNTS=[3,5,7] for lowest extrema to
avoid missing-as-zero domination of want_to_read_count at high
work_counts (41% of authors affected at work_count=25).
3. NON-BLOCKING: Add comment explaining limit=25 in openlibrary.py.
Variant space update: T96 = 81 × (2×7 + 1×3) = 1,377 nominal variants.
* fix(openlibrary): address PR #13 review — deterministic GT, numeric T97, strict metrics
BLOCKING fixes:
- Remove allow_unsorted_fallback=True from all 3 templates (T96/T97/T98).
GT now strictly requires sort=editions data, matching the question text.
If the agent doesn't visit the sorted page, GT correctly returns
not_collected instead of silently using wrong-order results.
- Make safe_metric_value fail on missing ratings_count instead of
defaulting to 0. Only want_to_read_count (high API coverage) defaults
to 0 when absent. ratings_count absence raises ValueError → GT fail,
preventing semantically wrong answers from sparse data.
- Redesign T97 (author_comparison) from binary "which author has more?"
(50% random baseline) to numeric "what is the absolute difference?"
(near-0% random baseline). GT returns str(abs(sum_a - sum_b)).
- Add Version 7 coordination comment for PR #14 (IDs 99-101 → Version 8).
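A rough sketch of the redesigned T97 ground truth combined with the strict metric rule, assuming each author's search results arrive as a list of dicts. The helper names and the data shape are assumptions, not the repository's API.

```python
from typing import Dict, List

# Only this metric is treated as zero when absent (high API coverage).
ZERO_DEFAULT_METRICS = {"want_to_read_count"}


def metric_or_fail(work: Dict, metric: str) -> int:
    """Return the metric value, raising ValueError when it cannot be trusted."""
    raw = work.get(metric)
    if raw is None:
        if metric in ZERO_DEFAULT_METRICS:
            return 0
        raise ValueError(f"{metric} missing")
    return int(raw)


def absolute_difference(
    works_a: List[Dict], works_b: List[Dict], metric: str, n: int
) -> str:
    """Sum the metric over each author's top n works and return |a - b| as a string."""
    sum_a = sum(metric_or_fail(w, metric) for w in works_a[:n])
    sum_b = sum(metric_or_fail(w, metric) for w in works_b[:n])
    return str(abs(sum_a - sum_b))


a = [{"want_to_read_count": 10}, {"want_to_read_count": 5}, {}]
b = [{"want_to_read_count": 30}, {"want_to_read_count": 1}, {"want_to_read_count": 2}]
print(absolute_difference(a, b, "want_to_read_count", 3))  # 18
```

Returning a number instead of a winner means a guess is very unlikely to match, which is the motivation stated for the redesign.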
NON-BLOCKING fixes:
- Derive ENGAGEMENT_AUTHOR_POOL from AUTHOR_POOL via exclusion set +
additions list, eliminating 56-entry duplication and preventing drift.
AUTHOR_POOL itself is unchanged (author_editions reproducibility).
- Remove stale allow_unsorted_fallback asymmetry comment from
author_editions.py (all templates now consistently use strict sort).
Tests: 372 passed (118 OpenLibrary, 254 other).
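Deriving one pool from the other, as the non-blocking fix describes, might look like the following. The author names are placeholders and the helper constants are hypothetical; only AUTHOR_POOL and ENGAGEMENT_AUTHOR_POOL are names taken from the commit message.

```python
# Placeholder entries; the real AUTHOR_POOL has 70 authors and stays untouched.
AUTHOR_POOL = ["Author A", "Author B", "Author C", "Author D"]

# Hypothetical names for the exclusion set and additions list.
_ENGAGEMENT_EXCLUDED = {"Author B"}
_ENGAGEMENT_ADDITIONS = ["Author E", "Author F"]

# Keep the base order, drop exclusions, then append the additions.
ENGAGEMENT_AUTHOR_POOL = [
    author for author in AUTHOR_POOL if author not in _ENGAGEMENT_EXCLUDED
] + _ENGAGEMENT_ADDITIONS

# Guard against an addition that duplicates a base entry.
assert len(set(ENGAGEMENT_AUTHOR_POOL)) == len(ENGAGEMENT_AUTHOR_POOL)

print(ENGAGEMENT_AUTHOR_POOL)
```

Because the derived list is computed at import time, any later change to the base pool flows through automatically, which is how the duplication and drift mentioned above are avoided.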
* fix(openlibrary): cap ratings_count variants to low N to reduce GT-fail from sparse OL data
ratings_count is missing for 20-40% of authors at N≥7. Restrict
ratings_count variants to N∈{3,5} (T96) and N=5 (T98) where
coverage is highest, cutting estimated GT-fail exposure from
~14%/~26% to ~4%/~11%. T97 already at [3,5] — unchanged.
* test(openlibrary): verify GT computation with real OL API data
Fetch live data (March 26, 2026) for Agatha Christie, Stephen King,
and Neil Gaiman via sort=editions search API. Inject into GT collector
and verify all three templates (T96/T97/T98) return concrete values
with both want_to_read_count and ratings_count metrics.
12 tests cover: highest/lowest extrema, cross-author numeric difference,
and threshold counting — satisfying CLAUDE.md §5 item 1.
---------
Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
1 parent d96dcf9 commit f19beb9
12 files changed
Lines changed: 2294 additions & 76 deletions
File tree
- liveweb_arena
- core
- plugins/openlibrary
- templates
- tests/plugins/openlibrary