feat(openlibrary): add 3 engagement & comparison templates #13
MkDev11 wants to merge 21 commits into AffineFoundation:main
Conversation
Move normalize_author_fragment, extract_author_filter, and find_author_search_entry from author_editions.py class methods into common.py as module-level functions. This eliminates duplication for upcoming author-based templates that need the same lookup logic.
Find the book with the highest/lowest engagement metric among an author's top N search results. Uses confirmed-visible fields only: want_to_read_count, already_read_count, ratings_count. Variant space: 70 authors × 2 extrema × 3 metrics × 4 counts = 1,680.
Compare aggregate engagement metrics between two authors' top N search results. Requires two separate author searches and cross-page comparison. Variant space: C(70,2) × 3 metrics × 2 counts = 14,490.
Count books in an author's catalog meeting an engagement threshold. Requires scanning each book's metric against a threshold — cannot be solved by sorting a single column. Variant space: 70 authors × 3 metrics × 4 thresholds × 2 counts = 1,680.
56 tests covering: - Template registration and generation invariants - author_engagement_extrema GT: highest/lowest, tie-breaking, missing data - author_comparison GT: higher total, reverse winner, tie, missing author - reading_stats_filter GT: threshold counting, zero matches, exact boundary - Task registry wiring (IDs 96, 97, 98, Version 7) - Shared helper refactoring (common.py functions) - Cross-template consistency (serialization, GT source, cache source)
Force-pushed: 355acae → 6a893b4
@angosr can you take a look at the PR and let me know your feedback?
angosr
left a comment
Review: 3 New Open Library Engagement Templates
Overall a well-structured PR with clean commit history, thorough tests (65 passed), and good refactoring of shared helpers into common.py. The template designs are sound in principle. However, there are blocking data-reliability issues that must be fixed before merge.
BLOCKING: 4 authors in AUTHOR_POOL are broken on the API (Red Team Check 1 + 6)
I queried all 70 authors in AUTHOR_POOL against the live Open Library API. Four authors return fewer than 10 results, making them unusable for templates that request work_count up to 10:
| Author | Results found | Notes |
|---|---|---|
| Fyodor Dostoevsky | 1 | OL stores under Russian name variants |
| Haruki Murakami | 6 | OL stores under Japanese romanization variants |
| Anton Chekhov | 3 | Same issue as Dostoevsky |
| Octavia Butler | 1 | Likely name mismatch |
When these authors are selected with work_count=7 or work_count=10, GT will always fail with "Only N works collected, need M". This is not a graceful degradation — it's a guaranteed failure for ~6% of the author pool.
Fix required: Either remove these 4 authors from AUTHOR_POOL, or add validated aliases that return sufficient results (e.g., "Dostoyevsky" instead of "Dostoevsky"). Validate by querying every author and asserting numFound >= 10.
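The validation step suggested above can be sketched as a small pure helper; `underpopulated_authors` and `fetch_num_found` are hypothetical names, and the wrapper around the live `search.json` endpoint (which reports the total hit count in its `numFound` field) is left to the caller:

```python
# Hypothetical validation sketch: flag authors whose Open Library search
# returns fewer results than the largest work_count a template can request.
from typing import Callable, Dict, List


def underpopulated_authors(
    pool: List[str],
    fetch_num_found: Callable[[str], int],
    minimum: int = 10,
) -> Dict[str, int]:
    """Return {author: numFound} for every author below the minimum."""
    broken = {}
    for author in pool:
        found = fetch_num_found(author)
        if found < minimum:
            broken[author] = found
    return broken
```

In a test suite this could run against cached API responses, so a stale pool fails CI rather than failing GT at eval time.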
BLOCKING: ratings_count is frequently missing from API responses (Red Team Check 1)
The API omits ratings_count entirely (not zero, just absent) for many works. Affected authors include:
| Author | Results with ratings_count present (out of 10) |
|---|---|
| Leo Tolstoy | 0/10 |
| Walt Whitman | 2/10 |
| Emily Dickinson | 2/10 |
| Rabindranath Tagore | 3/10 |
| Emily Bronte | 4/10 |
When ratings_count is the selected metric, parse_numeric(None) returns None, and GT correctly fails with "Missing 'ratings_count' for work...". But this means ~7% of authors (5/70) will produce GT failures for 1 of 3 metrics, silently reducing the effective variant space.
Fix required: Either:
- (a) Filter `AUTHOR_POOL` to only authors where all 3 metrics are present in top 10 results, or
- (b) Treat missing `ratings_count` as 0 (with a documented rationale), or
- (c) Remove `ratings_count` as a metric and use only `want_to_read_count` and `already_read_count` (which are present for nearly all works)

Option (a) is cleanest. The template test suite should include a test that validates all authors have the required fields.
BLOCKING: already_read_count not visible on search results page
The question text says: "Among the first N results, which book has the highest already-read count?"
But already_read_count is NOT displayed on the Open Library search results page. It only appears on individual work detail pages. This means:
- The agent cannot answer the question by reading the search results page alone
- The agent would need to click into each of the N detail pages, which is a much harder task than the template's `expected_steps` suggests (7-8 steps for scanning up to 10 detail pages?)
- GT collection uses `api_data` from the search API (which includes `already_read_count`), but the agent sees the rendered page where this field is absent
This is a Data Visibility failure (CLAUDE.md Template Validation Checklist item 3: "Required data must appear in the page accessibility tree or visible content, not just in the API").
Fix required: Either:
- (a) Remove `already_read_count` from the metric options and only use `want_to_read_count` and `ratings_count` (both visible on search results), or
- (b) Redesign the templates to explicitly require visiting detail pages and adjust `expected_steps` accordingly, or
- (c) Verify that `already_read_count` IS actually visible in the accessibility tree of search results (I may be wrong — check via Playwright snapshot)
NON-BLOCKING: allow_unsorted_fallback=True weakens GT determinism
The 3 new templates all call find_author_search_entry(..., allow_unsorted_fallback=True), while the existing author_editions template uses the default False. This means:
- If the agent searches without `sort=editions`, the GT will still compute using unsorted results
- The question explicitly says "sorted by most editions" — using unsorted data means the "top N results" in GT may differ from what the question describes
- This trades GT correctness for a reduced `not_collected` rate
This is pragmatic but should be documented in a code comment explaining the tradeoff. The existing author_editions template intentionally does NOT use this fallback — the asymmetry should be justified.
NON-BLOCKING: pr_description.md committed to repository
This file should not be in the codebase. PR descriptions belong in the GitHub PR body only. Remove this file.
NON-BLOCKING: Template 97 (comparison) has 50% random baseline
Binary-choice questions ("which author has more X?") have a 50% random baseline. Per CLAUDE.md Red Team rules, binary questions need >70% world knowledge accuracy to fail. This passes, but it is the weakest template design.
Consider requiring the agent to report the actual sums for both authors (e.g., "Author A: 5,230, Author B: 3,410 → Author A") instead of just the winner name. This would drop the random baseline to near-zero and make validation more precise.
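A sketch of what that stricter check could look like; the answer format (`"<name>: <sum>, <name>: <sum> -> <winner>"`) and the `check_comparison_answer` name are assumptions for illustration, not the repo's API:

```python
# Hypothetical validator for the sum-reporting answer format suggested above:
# the agent must report both totals, and both the sums AND the winner are
# verified against ground truth, driving the random baseline to near zero.
import re


def check_comparison_answer(answer: str, gt_sums: dict) -> bool:
    answer = answer.replace("→", "->")
    pairs = {
        name.strip(): int(num.replace(",", ""))
        for name, num in re.findall(r"([^:,>]+):\s*([\d,]+)", answer)
    }
    if pairs != gt_sums:  # reported sums must match ground truth exactly
        return False
    winner = answer.rsplit("->", 1)[-1].strip() if "->" in answer else ""
    return winner == max(gt_sums, key=gt_sums.get)
```

A guessed winner without the correct sums now fails, which is the point of the redesign.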
NON-BLOCKING: Red Team Review results not documented in PR
Per CLAUDE.md: "Results are documented in the PR review with pass/fail per check." The PR description does not include Red Team Review results. Here's the summary:
| Check | Template 96 | Template 97 | Template 98 |
|---|---|---|---|
| 1. API Semantic | PASS (missing field caveat) | PASS (missing field caveat) | PASS (missing field caveat) |
| 2. World Knowledge | PASS (~45%) | PASS (~55%) | PASS (~25%) |
| 3. Memorization Space | PASS (1,680) | PASS (14,490) | PASS (1,680) |
| 4. Answer Stability | MARGINAL PASS | MARGINAL PASS | MARGINAL PASS |
| 5. Random Baseline | PASS (10-33%) | PASS but weakest (50%) | PASS (9-17%) |
| 6. Cross-Parameter Collapse | PASS (0%) | PASS | PASS |
Summary
| Item | Status |
|---|---|
| 4 broken authors in AUTHOR_POOL | BLOCKING — remove or fix |
| `ratings_count` missing for ~7% of authors | BLOCKING — filter pool or drop metric |
| `already_read_count` not visible on search page | BLOCKING — verify visibility or drop metric |
| `allow_unsorted_fallback` asymmetry | Non-blocking, add comment |
| `pr_description.md` in repo | Non-blocking, remove |
| Template 97 binary baseline | Non-blocking, consider redesign |
| Tests | 65/65 passed ✓ |
| Code quality | Clean, good refactoring ✓ |
| No file conflicts with main | Confirmed ✓ |
…rop already_read_count, clean up

BLOCKING fixes:
- Remove 9 authors from AUTHOR_POOL: 4 broken on the OL API (<10 results: Dostoevsky, Murakami, Chekhov, Octavia Butler) and 5 with sparse ratings_count (<50% present in top 10: Bronte, Tolstoy, Whitman, Dickinson, Tagore). Pool: 70 → 61.
- Remove already_read_count from the EngagementMetric, AuthorMetric, and ReaderMetric enums — not visible on the search results page (only want_to_read and ratings counts are rendered).

NON-BLOCKING fixes:
- Add a comment in author_editions.py documenting the allow_unsorted_fallback asymmetry between existing and new templates.
- Remove pr_description.md from the repository.

Tests updated to reflect metric and pool changes. 106 passed.
The OL API omits count fields (ratings_count, want_to_read_count) when the value is zero, rather than returning 0. Previously the GT methods returned GroundTruthResult.fail() for missing fields, causing hard failures for works that simply haven't been rated yet. Now treats absent metrics as 0.0, which is semantically correct and consistent with how the OL API represents zero-count data. This prevents GT failures for individual works missing ratings_count even among authors that generally have good data coverage. Also fixes _make_search_entry type hint (sort: Optional[str]) and removes unused title variables flagged by ruff.
If a metric field contains a non-numeric string like 'N/A', parse_numeric() returns None. Previously this None was passed to int(value) or numeric comparisons, causing a TypeError at runtime. Now the fallback chain is: raw → parse_numeric(raw) → 0.0 if None. This covers both absent fields (raw is None) and non-numeric strings (parse_numeric returns None). Adds regression test for 'N/A' metric values.
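The fallback chain the commit describes can be sketched as follows; `parse_numeric` here is a local stand-in for the repo's helper, whose exact coercion rules may differ:

```python
# Sketch of the fallback chain: raw -> parse_numeric(raw) -> 0.0 if None.
# Covers both absent fields (raw is None) and non-numeric strings like 'N/A'
# (parse_numeric returns None), so no TypeError can reach int()/comparisons.
from typing import Any, Optional


def parse_numeric(raw: Any) -> Optional[float]:
    """Stand-in for the repo helper: float, or None for non-numeric input."""
    try:
        return float(str(raw).replace(",", ""))
    except ValueError:
        return None


def metric_value(work: dict, metric: str) -> float:
    raw = work.get(metric)  # absent field -> None
    parsed = parse_numeric(raw) if raw is not None else None
    return parsed if parsed is not None else 0.0  # 'N/A' and None both -> 0.0
```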
angosr
left a comment
Third Review — Updated PR (commits 6d436d5, 2490e6a, 1608d88)
Good progress addressing the previous review. Here's where things stand.
Resolved Issues ✓
| Previous Issue | Resolution |
|---|---|
| 4 broken authors | Fixed — removed Dostoevsky, Murakami, Chekhov, Octavia Butler, Emily Bronte, Leo Tolstoy, Walt Whitman, Emily Dickinson, Rabindranath Tagore (9 total). Pool is now 61. All 61 return ≥10 results. |
| `already_read_count` not visible on search page | Fixed — dropped from all 3 templates. Only `want_to_read_count` and `ratings_count` remain. |
| `pr_description.md` in repo | Fixed — deleted. |
| `allow_unsorted_fallback` asymmetry | Fixed — added explanatory comment in author_editions.py (lines 183-186). |
| Variant space claims | Verified — 976, 7,320, 976 all correct and >500. |
| Tests | 107/107 passed ✓ |
| No file conflicts with main | Still clean ✓ |
BLOCKING: ratings_count missing-as-zero creates unreliable GT for "lowest" extrema
I queried all 61 authors: 34/61 (56%) have at least one missing ratings_count in their top 10 results. Examples:
| Author | Missing ratings_count (out of 10) |
|---|---|
| James Joyce | 5/10 |
| Ralph Waldo Emerson | 5/10 |
| Herman Melville | 4/10 |
| Philip K. Dick | 4/10 |
| Jane Austen | 3/10 |
When extrema=lowest and metric=ratings_count, the missing-as-zero treatment means the answer will always be the alphabetically first book with a missing field — not the book with genuinely fewest ratings. This is semantically wrong: the question asks "which book has the fewest ratings" but GT answers "which book has missing API data."
The agent, looking at the actual search results page, may see a different ratings_count displayed (the page shows it even when the API omits it, via schema.org markup), creating a GT/agent disagreement.
In contrast, want_to_read_count is missing for only 4/61 authors — far less problematic.
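The bias is easy to reproduce on toy data (titles and counts below are illustrative, not live API values):

```python
# Minimal demonstration of the missing-as-zero bias for extrema=lowest:
# the "lowest ratings_count" is always the alphabetically first book whose
# field the API omitted, not the genuinely least-rated book.
works = [
    {"title": "Dubliners", "ratings_count": 12},
    {"title": "Chamber Music"},   # field omitted by the API
    {"title": "Exiles"},          # field omitted by the API
    {"title": "Ulysses", "ratings_count": 540},
]
lowest = min(
    sorted(works, key=lambda w: w["title"]),  # alphabetical tie-break
    key=lambda w: w.get("ratings_count", 0.0),
)
print(lowest["title"])  # Chamber Music (missing-data artifact, not true minimum)
```

The genuinely least-rated book with data present is "Dubliners" (12), but GT answers "Chamber Music".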
Three possible fixes (pick one):
(a) Drop ratings_count entirely — use only want_to_read_count. This cuts variant space to 488 (T96: 61×2×1×4), 488 (T98: 61×1×4×2), and 3,660 (T97: C(61,2)×1×2). T96 and T98 fall below the 500 minimum. Would need to compensate by expanding result counts or adding back more authors.
(b) Exclude ratings_count from extrema-lowest only — keep it for highest-extrema, comparison, and filter where the bias is negligible. This requires per-template metric filtering, adds complexity.
(c) Filter AUTHOR_POOL to only authors with 0 missing ratings_count in top 10 — this removes 34 authors, leaving 27. Pool too small.
My recommendation: option (a) + expand RESULT_COUNTS to [3, 5, 7, 10, 15] for T96/T98. This gives T96: 61×2×1×5 = 610 (>500), T98: 61×1×5×2 = 610 (>500), T97: C(61,2)×1×2 = 3660 (>500). Clean and simple.
Alternatively, if you want to keep ratings_count, option (b) is acceptable — but document the asymmetry.
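The arithmetic behind the option (a) recommendation can be checked in a few lines, using the pool and count figures quoted above:

```python
# Quick verification of the variant-space math for option (a):
# 61 authors, a single metric, and RESULT_COUNTS expanded to 5 values.
from math import comb

authors = 61
result_counts = [3, 5, 7, 10, 15]  # proposed expansion for T96/T98

t96 = authors * 2 * 1 * len(result_counts)  # authors x extrema x metric x counts
t97 = comb(authors, 2) * 1 * 2              # pairs x metric x counts
t98 = authors * 1 * len(result_counts) * 2  # authors x metric x counts x thresholds

print(t96, t97, t98)  # 610 3660 610
assert min(t96, t97, t98) > 500
```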
NON-BLOCKING: Non-numeric metric treatment is silent
Commit 1608d88 treats non-numeric values (e.g., "N/A") as 0:

```python
raw = work.get(metric)
parsed = parse_numeric(raw) if raw is not None else None
value = parsed if parsed is not None else 0.0
```

This is fine for missing fields (None), but silently treating a string like "N/A" as 0 violates the No Fallback principle. If the API returns a non-null, non-numeric value, something unexpected is happening and should raise. Consider:
```python
raw = work.get(metric)
if raw is None:
    value = 0.0  # API omits zero-valued counts
else:
    parsed = parse_numeric(raw)
    if parsed is None:
        return GroundTruthResult.fail(f"Non-numeric '{metric}' value '{raw}' for '{title}'")
    value = parsed
```

This preserves the missing-as-zero behavior while still catching actual data corruption.
NON-BLOCKING: Template 97 binary baseline (50%) — still the weakest design
Still a binary-choice question. Consider the sum-reporting redesign suggested in the second review, but this remains non-blocking.
Summary
| Item | Status |
|---|---|
| `ratings_count` missing in 56% of authors → wrong "lowest" GT | BLOCKING |
| Non-numeric silently treated as 0 | Non-blocking, recommend fail on non-null non-numeric |
| Broken authors removed | ✓ Resolved |
| `already_read_count` dropped | ✓ Resolved |
| `pr_description.md` deleted | ✓ Resolved |
| `allow_unsorted_fallback` documented | ✓ Resolved |
| Tests pass | ✓ 107/107 |
The 3-line metric normalization pattern (raw → parse_numeric → fallback to 0.0) was duplicated across all 3 new templates. Extracted to safe_metric_value() in common.py, reducing each call site to a single line and ensuring consistent handling of absent/non-numeric fields.
BLOCKING: ratings_count is missing for 56% of authors in the OL API, causing wrong GT for extrema-lowest queries (missing-as-zero always wins). Dropped ratings_count from EngagementMetric, AuthorMetric, and ReaderMetric — all templates now use only want_to_read_count.

Expanded RESULT_COUNTS to keep variant space above the 500 minimum:
- T96 (engagement_extrema): [3,5,7,10,15] → 61×2×1×5 = 610
- T97 (comparison): unchanged [3,5] → C(61,2)×1×2 = 3,660
- T98 (reading_stats_filter): [5,10,15] → 61×1×4×3 = 732

NON-BLOCKING: safe_metric_value now raises ValueError on non-null non-numeric values (e.g. 'N/A') instead of silently treating them as 0. Missing (None) values still default to 0. Callers catch ValueError and surface it as GroundTruthResult.fail().
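A hedged reconstruction of the `safe_metric_value` semantics described in this commit; the repo's actual signature and its `parse_numeric` internals may differ:

```python
# Sketch: absent (None) -> 0.0, non-null non-numeric -> ValueError, which the
# caller is expected to catch and surface as GroundTruthResult.fail().
from typing import Any


def safe_metric_value(raw: Any) -> float:
    if raw is None:
        return 0.0  # OL API omits zero-valued counts entirely
    try:
        return float(str(raw).replace(",", ""))
    except ValueError:
        raise ValueError(f"Non-numeric metric value {raw!r}") from None
```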
angosr
left a comment
Fourth Review — commit c99582f (extract safe_metric_value helper)
This commit is a clean refactoring: extracts the duplicated missing-metric-as-zero logic into safe_metric_value() in common.py. No behavioral change. Tests pass (107/107).
The refactoring itself is fine — the helper is well-named and the docstring clearly explains the rationale.
Previous blocking issue is NOT addressed
The ratings_count reliability problem from the third review remains:
- 34/61 authors (56%) have missing
ratings_countin their top-10 API results safe_metric_valuetreats missing as 0.0- When
extrema=lowest+metric=ratings_count: GT answer = alphabetically first book with missing API data, NOT the book the agent sees as having fewest ratings on the page - This is a semantic correctness issue, not a crash risk
The refactoring consolidates the logic into one place (good), but the underlying data problem is unchanged.
What's still needed
One of:
(a) Drop ratings_count — keep only want_to_read_count (missing for only 4/61 authors). Expand RESULT_COUNTS to [3, 5, 7, 10, 15] for T96/T98 to maintain variant space above 500:
- T96: 61 × 2 × 1 × 5 = 610
- T97: C(61,2) × 1 × 2 = 3,660
- T98: 61 × 1 × 5 × 2 = 610
(b) Exclude ratings_count from ExtremaType.LOWEST — keep it for highest/comparison/filter where the bias is negligible. This means the metric pool depends on the extrema direction, which adds complexity but preserves variant space.
(c) Accept the risk and document it — add a code comment noting that lowest + ratings_count on authors with sparse data produces unreliable GT, and accept this as a known limitation. I would accept this approach if accompanied by a test that demonstrates the behavior so it's not silently forgotten.
Status
| Item | Status |
|---|---|
| `safe_metric_value` helper | ✓ Clean refactoring |
| Tests | ✓ 107/107 |
| `ratings_count` missing in 56% of authors | Still blocking — not addressed |
| Everything else from previous reviews | ✓ Resolved |
@angosr this is already pushed and on the branch. Please review again.
…son/filter

- Fix docstrings in author_engagement_extrema.py and reading_stats_filter.py that still mentioned 'ratings' after ratings_count was dropped.
- Add non-numeric metric regression tests for the comparison and filter templates to match the existing extrema test, ensuring all 3 safe_metric_value call sites are explicitly tested for ValueError handling.
angosr
left a comment
Fifth Review — commits 5dceaec, c5fa4e1
All previously blocking issues are resolved. Approving with minor notes.
Resolved Issues ✓
| Issue | Resolution |
|---|---|
| `ratings_count` missing in 56% of authors | Fixed — dropped from all 3 templates. Only `want_to_read_count` remains. |
| Variant space below 500 after dropping metric | Fixed — RESULT_COUNTS expanded to [3,5,7,10,15] (T96) and [5,10,15] (T98). Verified: T96=610, T97=3,660, T98=732. All >500. |
| Silent non-numeric fallback | Fixed — `safe_metric_value` now raises ValueError on non-null non-numeric, caught as GroundTruthResult.fail() in all 3 templates. |
| Tests | 109/109 passed ✓ |
| Merge against main | Clean, no conflicts ✓ |
Remaining minor concern (NON-BLOCKING)
want_to_read_count is still missing for 10/61 authors in their top-15 results (mostly at high rank positions — rank 8-15). With safe_metric_value treating missing as 0:
- `work_count=3`: 0/61 authors affected (all missing values rank >5)
- `work_count=5`: 1/61 (Ralph Waldo Emerson, missing at rank 5)
- `work_count=10`: 4/61 affected
- `work_count=15`: 10/61 affected
For extrema=lowest, this means ~2.6% of T96 variants may produce a GT answer based on missing data rather than genuine lowest value. For T97 (comparison) and T98 (filter), the impact is negligible (slight undercount).
This is acceptable given: (a) the affected fraction is small, (b) highest extrema is unaffected, (c) comparison/filter impact is minimal. No action needed now, but consider removing Ralph Waldo Emerson (rank-5 missing) in a future cleanup if live eval shows GT failures for that author.
Final verification summary
- `ratings_count`/`already_read_count` fully removed from all 3 templates ✓
- `safe_metric_value`: None → 0.0, non-numeric → ValueError → GT fail ✓
- Variant space: 610, 3,660, 732 (math verified) ✓
- AUTHOR_POOL: 61 authors, all return ≥15 results ✓
- `pr_description.md` deleted ✓
- `allow_unsorted_fallback` documented in author_editions.py ✓
- Clean merge against main (no conflicts with c087f89) ✓
angosr
left a comment
Sixth Review — Revoking approval after anti-memorization deep-dive
After approval I performed a thorough anti-memorization stress test. The results reveal a fundamental design weakness that was not visible in earlier reviews focused on code correctness.
BLOCKING: All 3 templates are fully enumerable from 61 API calls
After dropping ratings_count and already_read_count, every template uses a single metric (want_to_read_count) over a fixed pool of 61 authors. This makes the entire answer space trivially enumerable:
Attack procedure:
1. Call `https://openlibrary.org/search.json?q=author:"<name>"&sort=editions&limit=15&fields=title,want_to_read_count` for each of the 61 authors (~30 seconds total)
2. Compute all answers offline:
   - T96: 61 authors × 5 work_counts × 2 extrema = 610 entries
   - T97: C(61,2) × 2 work_counts = 3,660 entries
   - T98: 61 authors × 4 thresholds × 3 work_counts = 732 entries
3. Total lookup table: ~5,000 entries — small enough to embed in a system prompt
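The offline-enumeration step can be sketched for T96 as a pure function over pre-fetched results; the data shape (author mapped to a list of (title, want_to_read_count) pairs, already sorted by editions) is an assumption for illustration:

```python
# Sketch of the lookup-table attack for T96: given pre-fetched, edition-sorted
# results per author, precompute every (author, work_count, extrema) answer.
def build_t96_table(results, work_counts=(3, 5, 7, 10, 15)):
    """Enumerate every (author, work_count, extrema) -> winning title."""
    table = {}
    for author, works in results.items():
        for n in work_counts:
            top = works[:n]  # top-N by edition count
            table[(author, n, "highest")] = max(top, key=lambda w: w[1])[0]
            table[(author, n, "lowest")] = min(top, key=lambda w: w[1])[0]
    return table
```

The point of the attack analysis: this table fits in a prompt and stays valid for as long as the underlying counts barely move.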
Answer stability makes this worse: want_to_read_count is cumulative (only grows), at a rate of ~5-75 per year depending on the book. The lookup table remains valid for months to years. The "highest" winner for a given author almost never changes (typical gap between #1 and #2 is hundreds to thousands).
This violates CLAUDE.md Red Team Check 4 (Answer Stability): "If the same question produces the same answer for >30 days, the template's anti-memorization claim is weak. Must combine with a large variant space (>1000) to compensate." T96 (610) and T98 (732) are below the 1000 compensating threshold.
BLOCKING: T97 is vulnerable to world knowledge attack (~65%)
For the comparison template, the question is binary: "which author has more total want-to-read?" An LLM can estimate this from general fame:
| Pair type | LLM accuracy | Proportion of pairs |
|---|---|---|
| Clear fame gap (e.g. J.K. Rowling vs Samuel Beckett) | ~95% | ~40% |
| Cross-tier (e.g. Stephen King vs Jules Verne) | ~75% | ~35% |
| Same tier (e.g. Franz Kafka vs Ray Bradbury) | ~50% | ~25% |
| Weighted average | ~65% | — |
The CLAUDE.md threshold for binary questions is 70%. At ~65%, this technically passes but is dangerously close. Combined with the lookup table vulnerability above, the template offers minimal anti-memorization value.
Verified data point: the top-5 want_to_read_count ranking (J.K. Rowling > Stephen King > George Orwell > Jane Austen > Roald Dahl) is largely predictable from world knowledge. P.G. Wodehouse at dead last (73 total) vs J.K. Rowling (35,278) is a 483× gap that any LLM can identify.
Root cause: single metric + slow-changing data + small entity pool
The original design with 3 metrics over 70 authors was borderline. After necessary data-quality fixes (removing broken authors, dropping unreliable metrics), the remaining design space collapsed:
| Dimension | Original | After fixes | Problem |
|---|---|---|---|
| Authors | 70 | 61 | Still OK |
| Metrics | 3 | 1 | Eliminates combinatorial diversity |
| Data volatility | "Dynamic" | ~75/year growth | Effectively static |
Recommended path forward
The templates are well-engineered (clean code, good GT logic, proper edge case handling). The problem is the data source, not the implementation. Options:
(a) Add a truly dynamic dimension. For example, instead of absolute want_to_read_count, use relative ranking changes: "Among Author X's top N books, which moved up the most in want-to-read count compared to Y days ago?" This requires storing historical snapshots but makes lookup tables expire quickly.
(b) Expand the entity pool dramatically. The AUTHOR_POOL could be generated dynamically from OL's trending/popular lists rather than a fixed list of 61. If the pool is 500+ authors sampled at eval time, enumeration becomes impractical.
(c) Combine with a second data source. Cross-site templates (like the existing hybrid templates) are much harder to enumerate because the answer depends on correlated data from two independent sites.
(d) Restore ratings_count with targeted exclusions. Instead of dropping it globally, exclude it only from extrema=lowest queries (where missing-as-zero causes wrong GT). This restores the metric dimension for ~80% of variants while avoiding the data quality issue. Combined with filtering out the 5 worst-affected authors (Emerson, Joyce, Melville, Hawthorne, P.K. Dick), this brings variant space back to meaningful levels with real diversity.
Summary
| Issue | Status |
|---|---|
| Code quality, tests, edge cases | ✓ Excellent |
| GT correctness | ✓ Fixed across 5 rounds |
| Lookup table: 61 API calls → all answers | BLOCKING |
| T96/T98 variant space + stability < 1000 threshold | BLOCKING |
| T97 world knowledge ~65% on binary choice | Borderline, not blocking alone |
…tion

BLOCKING: With a single metric (want_to_read_count), the entire answer space was enumerable from 61 API calls (~5,000 entries). Restoring ratings_count as a second metric dimension breaks trivial enumeration.

Changes:
- Remove 5 authors with the worst ratings_count coverage (Emerson, Joyce, Melville, Hawthorne, P.K. Dick). Pool: 61 → 56.
- Restore ratings_count to EngagementMetric, AuthorMetric, ReaderMetric.
- T96: exclude ratings_count from extrema=lowest only (where missing-as-zero would always win). Highest/comparison/filter are unaffected by the bias.
- T96 RESULT_COUNTS expanded to [3,5,7,10,12,15] (6 values).
- Restore THRESHOLDS for ratings_count in T98.

Variant spaces (all >1000):
- T96: 56 × (highest×2 + lowest×1) × 6 = 1,008
- T97: C(56,2) × 2 × 2 = 6,160
- T98: 56 × 2 × 4 × 3 = 1,344

Adds test_extrema_lowest_excludes_ratings_count to verify the per-extrema metric filtering. 364 tests pass.
…t space

- Add 25 authors to AUTHOR_POOL (56 → 81) for anti-memorization
- Change T96 RESULT_COUNTS from [3,5,7,10,12,15] to [3,5,7,10,15,20,25] to increase lowest-extrema differentiation
- Effective variant space: ~583 (16.6% margin above the 500 threshold)
- Update docstrings: T96=1,701, T97=12,960, T98=1,944 variants
- Fix AUTHOR_POOL section comments to reflect actual counts
- Split the test file (481+490 lines, both <500)
- Remove unused get_registered_templates import
- Add tests: pool size=81, no duplicates, ratings_count GT
The collector hardcoded limit=20 but RESULT_COUNTS includes 25, causing guaranteed GT failure for 1/7 of T96 variants. Raise limit to match. Add regression test: test_extrema_gt_succeeds_with_25_works
angosr
left a comment
Seventh Review — commits 6db81ec, 050b5d9, 4e0fc5c
Rigorous anti-memorization analysis performed before approving this time.
Changes in this iteration
- `ratings_count` restored with targeted exclusion — excluded only from `ExtremaType.LOWEST` (where missing-as-zero would always falsely "win"). Used for highest-extrema, comparison, and filter. Verified across 2,000 seeds: lowest never gets `ratings_count`.
- AUTHOR_POOL expanded to 81 (+25 new authors: Charlotte Bronte, Jack London, Dan Brown, C.S. Lewis, etc.). All 81 verified to return ≥25 results from the API.
- RESULT_COUNTS expanded to [3,5,7,10,15,20,25] for T96, [5,10,15] for T98. API fetch limit raised from 20 → 25 in openlibrary.py to support `work_count=25`.
- Tests: 115/115 passed.
Anti-memorization verification (Red Team Checks)
Check 1 — API Semantic Verification: PASS
- `sort=editions` confirmed working correctly
- `want_to_read_count` present for the majority of books (4/81 authors have >5/25 missing)
- `ratings_count` has more gaps (10/81 authors have >15/25 missing) but is only used for highest/comparison/filter, where missing-as-zero is benign
Check 2 — World Knowledge Attack: PASS
- T96 highest: ~60% (rank-1 by editions = highest wtr 60% of the time, but different `work_count` values change the answer — e.g., Christie's top-3 winner ≠ top-7 winner)
- T96 lowest: ~random (no world knowledge helps)
- T97: ~65% (binary, threshold is 70%) — borderline but passes. Two metrics help: the wtr winner ≠ the rc winner for some pairs
- T98: ~25% (numeric answer 0-25, unpredictable)
Check 3 — Memorization Space: PASS
- T96: 1,701 variants (verified: 81×2×7 + 81×1×7)
- T97: 12,960 variants (verified: C(81,2)×2×2)
- T98: 1,944 variants (verified: 81×(4+4)×3)
- All >1,000 (CLAUDE.md compensating threshold for slow-changing data)
Check 4 — Answer Stability: MARGINAL PASS (compensated)
- `want_to_read_count` grows at ~5-75/year depending on book popularity
- Compensated by variant space >1,000 per CLAUDE.md: "Must combine with large variant space (>1000) to compensate"
Check 5 — Random Baseline: PASS
- T96: 1/N where N∈{3,5,7,10,15,20,25} → 4-33%
- T97: 50% (binary) — under 70% combined threshold for binary questions
- T98: 1/(N+1) where N∈{5,10,15} → 6-17%
Check 6 — Cross-Parameter Collapse: PASS
- Different authors produce genuinely different wtr values (73 to 35,278 for top-5 sums)
- Different `work_count` values change the answer for the same author (verified: Christie top-3 highest ≠ top-7 highest)
- Two metrics provide independent variation axes
Lookup table concern — accepted limitation
81 API calls can still enumerate all ~16,600 entries. However:
- This is inherent to ALL Open Library templates (the existing `author_editions` is equally enumerable from the same API calls, with only 324 entries)
- CLAUDE.md explicitly allows this pattern when compensated by >1,000 variant space
- The new templates are actually harder to enumerate than existing OL templates (16,605 vs 324 entries)
Remaining non-blocking notes
- Bram Stoker has 9/25 missing `want_to_read_count` — the worst in the pool. With `work_count=25`, 9 books treated as 0 creates a moderate undercount for filter/comparison. Consider removing in a future cleanup if live eval shows issues.
- T97 binary baseline (50%) remains the weakest design point. Documented, not blocking.
Final status
| Item | Status |
|---|---|
| Code quality & tests | ✓ 115/115 passed |
| GT correctness | ✓ Targeted exclusion for lowest+ratings_count |
| Red Team Check 1 (API Semantic) | ✓ PASS |
| Red Team Check 2 (World Knowledge) | ✓ PASS (<60%/70%) |
| Red Team Check 3 (Memorization Space) | ✓ PASS (1,701, 12,960, 1,944) |
| Red Team Check 4 (Answer Stability) | ✓ MARGINAL PASS (compensated >1000) |
| Red Team Check 5 (Random Baseline) | ✓ PASS |
| Red Team Check 6 (Cross-Parameter Collapse) | ✓ PASS |
| Merge against main | ✓ Clean, no conflicts |
| AUTHOR_POOL (81) all return ≥25 results | ✓ Verified |
| `openlibrary.py` limit raised to 25 | ✓ Matches max work_count |
angosr
left a comment
Eighth Review — Revoking approval (again). Three issues found.
I re-examined the PR more carefully before letting this approval stand, and found three problems I had missed.
BLOCKING: Shared AUTHOR_POOL mutation breaks existing template reproducibility
AUTHOR_POOL in author_editions.py is imported by the existing author_editions template AND the 3 new templates. This PR removes 14 authors and adds 25, changing the pool from 70 → 81.
10 of the 14 removed authors worked fine for author_editions (they only had want_to_read_count / ratings_count issues, not edition_count issues):
| Removed Author | numFound | Why removed | Needed for author_editions? |
|---|---|---|---|
| Herman Melville | 800+ | wtr missing at high ranks | ✓ Yes, edition_count fine |
| Nathaniel Hawthorne | 600+ | wtr missing at high ranks | ✓ Yes |
| Philip K. Dick | 300+ | wtr missing at high ranks | ✓ Yes |
| James Joyce | 500+ | wtr missing at high ranks | ✓ Yes |
| Ralph Waldo Emerson | 1000+ | wtr missing at rank 5 | ✓ Yes |
| Emily Bronte | 400+ | weak ratings_count | ✓ Yes |
| Leo Tolstoy | 29 | weak ratings_count | ✓ Yes |
| Walt Whitman | 1200+ | weak ratings_count | ✓ Yes |
| Emily Dickinson | 600+ | weak ratings_count | ✓ Yes |
| Rabindranath Tagore | 2000+ | weak ratings_count | ✓ Yes |
Since author_editions.generate() uses rng.choice(AUTHOR_POOL), changing the pool changes what seed=42 produces. This means:
- Existing evaluation results become non-reproducible
- Cached evaluations from before the merge produce different questions
- Any SFT training data generated with the old pool is invalidated
Fix: Create a separate ENGAGEMENT_AUTHOR_POOL for the new templates. Keep the original AUTHOR_POOL unchanged for author_editions. The new pool can be a curated subset + additions.
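The reproducibility argument can be illustrated in a few lines; the pool contents below are illustrative, and `ENGAGEMENT_AUTHOR_POOL` is the name this review proposes, not existing code:

```python
# Variant generation is seeded, so the same seed must keep selecting the same
# author for the existing author_editions template. Splitting the pools keeps
# the frozen original intact while the new templates use their own list.
import random

AUTHOR_POOL = ["Jane Austen", "Herman Melville", "James Joyce"]  # frozen, as on main
ENGAGEMENT_AUTHOR_POOL = ["Jane Austen", "Charlotte Bronte"]     # new templates only


def pick_author(pool, seed):
    """Deterministic selection: a fixed seed over a fixed pool never changes."""
    return random.Random(seed).choice(pool)


# Reproducibility holds as long as the original pool is left unchanged:
assert pick_author(AUTHOR_POOL, 42) == pick_author(AUTHOR_POOL, 42)
```

Mutating `AUTHOR_POOL` in place would silently change what `seed=42` produces for every existing cached evaluation.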
BLOCKING: want_to_read_count has the same missing-as-zero problem at high work_counts
I verified all 81 authors at limit=25: 33/81 (41%) have missing want_to_read_count in their top 25 results.
Worst cases for lowest extrema + want_to_read_count:
| Author | Missing wtr in top 25 |
|---|---|
| Kazuo Ishiguro | 10/25 |
| Miguel de Cervantes | 9/25 |
| Bram Stoker | 9/25 |
| Virginia Woolf | 5/25 |
| Mary Shelley | 5/25 |
With work_count=25 and extrema=lowest, Kazuo Ishiguro has 10 books with missing wtr treated as 0. The GT answer = alphabetically first among 10 missing books — the exact same semantic bug that was blocking for ratings_count.
The _LOWEST_METRICS exclusion correctly blocks ratings_count from lowest, but want_to_read_count has the same problem at work_count ≥ 10.
Fix: Cap RESULT_COUNTS for lowest extrema at a safe level. From the data:
- work_count=3: 0/81 authors affected
- work_count=5: 1/81 (Kazuo Ishiguro, missing at rank 10 — actually safe at 5)
- work_count=7: still safe for most
Alternatively, add _LOWEST_RESULT_COUNTS = [3, 5, 7] and use the full range only for highest. This parallels the metric exclusion approach already in place.
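The parallel-exclusion approach could look like the sketch below. The `RESULT_COUNTS` values here are illustrative placeholders, not the template's actual constants; only `_LOWEST_RESULT_COUNTS = [3, 5, 7]` is taken from the review text.

```python
RESULT_COUNTS = [3, 5, 10, 25]     # illustrative full range for extrema="highest"
_LOWEST_RESULT_COUNTS = [3, 5, 7]  # capped range proposed for extrema="lowest"

def result_counts_for(extrema: str) -> list:
    """Restrict work_count for 'lowest' so missing-as-zero data can't dominate."""
    return _LOWEST_RESULT_COUNTS if extrema == "lowest" else RESULT_COUNTS
```

This mirrors the `_LOWEST_METRICS` exclusion already in place: the variant generator simply consults a narrower count list when the extrema is "lowest".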
NON-BLOCKING: openlibrary.py limit 20→25 is a global side effect
fetch_search_api_data(query, limit=25, ...) applies to ALL search page API fetches, not just new templates. This means:
- Existing templates get 25 works in `api_data` instead of 20
- Cache entries become larger
- API calls are slightly slower
This should work fine in practice (more data is generally better for GT), but it's an undocumented side effect. Add a comment explaining why the limit was raised.
Summary
| Issue | Severity |
|---|---|
| AUTHOR_POOL mutation breaks existing template reproducibility | BLOCKING |
| `want_to_read_count` missing at work_count ≥ 10 for lowest extrema | BLOCKING |
| Global limit=25 in openlibrary.py | Non-blocking, add comment |
…COUNTS

Address PR review AffineFoundation#8:
1. BLOCKING: Restore original AUTHOR_POOL (70 authors) exactly as on main to preserve author_editions reproducibility. Create separate ENGAGEMENT_AUTHOR_POOL (81 authors) for T96/T97/T98.
2. BLOCKING: Add _LOWEST_RESULT_COUNTS=[3,5,7] for lowest extrema to avoid missing-as-zero domination of want_to_read_count at high work_counts (41% of authors affected at work_count=25).
3. NON-BLOCKING: Add comment explaining limit=25 in openlibrary.py.

Variant space update: T96 = 81 × (2×7 + 1×3) = 1,377 nominal variants.
Avoid collision with PR AffineFoundation#13 (OpenLibrary) which claims IDs 96-98.
angosr left a comment
Review: PR #13 — feat(openlibrary): add 3 engagement & comparison templates
Verdict: REQUEST CHANGES — Fundamental design issues + Red Team failures
BLOCKING: allow_unsorted_fallback=True violates No Fallback rule
CLAUDE.md §9: "All evaluation logic must be deterministic. Never add fallback/default values to mask errors."
All three new templates use find_author_search_entry(..., allow_unsorted_fallback=True), meaning if the agent searches without sort=editions, the GT falls back to unsorted search data. This is fundamentally wrong:
- The question explicitly says "sorted by most editions"
- Unsorted results may have completely different ranking, changing which books are in top N
- GT computed from unsorted data can produce a different answer than the correctly-sorted page shows
- This masks agent errors (wrong navigation) as correct answers
The existing author_editions template correctly does NOT use this fallback. The comment in the code even acknowledges the asymmetry. If agents need sort=editions, the template's start_url already navigates them there — fallback is unnecessary.
Fix: Remove allow_unsorted_fallback=True from all three templates. If GT returns not_collected because the agent didn't visit the sorted page, that's the correct outcome.
BLOCKING: safe_metric_value returning 0.0 for None violates No Fallback
safe_metric_value(work, metric) returns 0.0 when the metric field is absent. This is defended as "OL API omits count fields when the value is zero" but:
- This is an assumption about API behavior, not a documented guarantee
- For `ratings_count`, absence is common and doesn't always mean zero — it can mean the work page hasn't been rated yet (which is different from "has 0 ratings")
- For T96 (engagement_extrema with extrema=lowest), missing-as-zero directly determines the winner — the PR even acknowledges this by capping `_LOWEST_RESULT_COUNTS` to [3,5,7] and excluding ratings_count from lowest. This is a patch on a patch — the root issue is that the 0.0 default is unreliable
Verified via API: for authors like Daphne du Maurier, 40% of top-10 results have ratings_count missing/zero. For these works, the template blindly assigns 0, which may be wrong.
Fix: For ratings_count, treat missing as GT failure (not 0). Only want_to_read_count can arguably default to 0 since it has much higher coverage.
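A sketch of the proposed behavior. The frozenset name matches the fix later adopted in the PR (`_ZERO_DEFAULTABLE_METRICS`), but the exact signature and dict shape here are assumptions for illustration.

```python
# Only high-coverage metrics may default to 0 when the field is absent.
_ZERO_DEFAULTABLE_METRICS = frozenset({"want_to_read_count"})

def safe_metric_value(work: dict, metric: str) -> float:
    value = work.get(metric)
    if value is None:
        if metric in _ZERO_DEFAULTABLE_METRICS:
            # OL tends to omit count fields that are zero; acceptable here.
            return 0.0
        # Sparse field (e.g. ratings_count): fail GT instead of guessing 0.
        raise ValueError(f"{metric} missing; refusing to default to 0")
    return float(value)
```

The key property: a missing sparse field aborts GT computation (the task becomes a GT failure) rather than producing a semantically wrong answer.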
BLOCKING: T97 (author_comparison) fails Red Team Check 5 — 50% random baseline
T97 asks "Which author has higher X?" — the answer is one of two names. This is a binary choice with 50% random baseline.
CLAUDE.md Red Team §5: "Templates with >33% random baseline must require exact numeric answers or multi-step computation to compensate."
T97's answer is an author name (not numeric). A coin flip gets 50% correct. The PR's Red Team section does not even address this check for T97. This is a blocking failure.
Fix: Redesign T97 to require a numeric answer (e.g., "What is the difference in total want-to-read counts between Author A and Author B?") or restructure to avoid binary outcomes.
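The numeric redesign (which the PR later adopted) can be sketched as follows. The helper name is illustrative; the source confirms only that GT returns `str(abs(sum_a - sum_b))`.

```python
def comparison_answer(works_a: list, works_b: list, metric: str) -> str:
    """T97 GT sketch: absolute difference of per-author metric sums."""
    sum_a = sum(w.get(metric, 0) for w in works_a)
    sum_b = sum(w.get(metric, 0) for w in works_b)
    return str(abs(sum_a - sum_b))
```

Because the answer space is a specific integer rather than one of two names, the random baseline drops from 50% to near 0%.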
BLOCKING: No eval.py test results
CLAUDE.md §8: "Every new template must be tested via eval.py with multiple seeds (10-minute timeout)."
The PR only shows unit test results (360 passed). No eval.py run is documented. Unit tests with mocked data cannot verify:
- GT data source binding (does GT use page cache data?)
- Agent solvability (can a real agent navigate to the right pages?)
- Data visibility (are engagement metrics readable in the accessibility tree?)
BLOCKING: Version 7 conflict with PR #14
Both PR #13 and PR #14 register templates as "Version 7" in task_registry.py. They will conflict on merge. Coordinate IDs.
Non-blocking issues
- `ENGAGEMENT_AUTHOR_POOL` duplication: 56 of 81 authors duplicate AUTHOR_POOL entries. Consider making ENGAGEMENT_AUTHOR_POOL a superset reference rather than a copy, to avoid drift.
- `ratings_count` data sparsity verified: API calls confirm 20-40% of top-10 results for engagement-only authors (du Maurier, Collins, Forster, Mann) have missing ratings_count. Even after removing 5 authors, the pool still contains authors with significant sparsity.
- T96 variant space margin: T96 claims 1,377 variants but the effective space for lowest-extrema is only 81 × 1 × 3 = 243 (below the 500 threshold). The combined count (highest + lowest) passes, but lowest-extrema alone does not.
API Verification Results
Verified against live OL API:
- ✅ `want_to_read_count` IS visible on search pages (e.g., "620 Want to read" for Agatha Christie)
- ✅ `ratings_count` IS visible (e.g., "4.0 (84 ratings)")
- ✅ `already_read_count` correctly dropped (only "Already Read" button shown, no count)
- ✅ Cross-parameter collapse check passes: different authors produce different extrema winners
- ❌ ratings_count sparsity remains problematic for 20-40% of engagement-only authors
Required Actions
- Remove `allow_unsorted_fallback=True` from all three templates
- Make `safe_metric_value` fail on missing `ratings_count` instead of defaulting to 0
- Redesign T97 to avoid binary-choice answer format (Red Team Check 5)
- Run eval.py with each template and multiple seeds; document results
- Coordinate Version 7 numbering with PR #14
…ic GT, numeric T97, strict metrics

BLOCKING fixes:
- Remove allow_unsorted_fallback=True from all 3 templates (T96/T97/T98). GT now strictly requires sort=editions data, matching the question text. If the agent doesn't visit the sorted page, GT correctly returns not_collected instead of silently using wrong-order results.
- Make safe_metric_value fail on missing ratings_count instead of defaulting to 0. Only want_to_read_count (high API coverage) defaults to 0 when absent. ratings_count absence raises ValueError → GT fail, preventing semantically wrong answers from sparse data.
- Redesign T97 (author_comparison) from binary "which author has more?" (50% random baseline) to numeric "what is the absolute difference?" (near-0% random baseline). GT returns str(abs(sum_a - sum_b)).
- Add Version 7 coordination comment for PR AffineFoundation#14 (IDs 99-101 → Version 8).

NON-BLOCKING fixes:
- Derive ENGAGEMENT_AUTHOR_POOL from AUTHOR_POOL via exclusion set + additions list, eliminating 56-entry duplication and preventing drift. AUTHOR_POOL itself is unchanged (author_editions reproducibility).
- Remove stale allow_unsorted_fallback asymmetry comment from author_editions.py (all templates now consistently use strict sort).

Tests: 372 passed (118 OpenLibrary, 254 other).
…il from sparse OL data
ratings_count is missing for 20-40% of authors at N≥7. Restrict
ratings_count variants to N∈{3,5} (T96) and N=5 (T98) where
coverage is highest, cutting estimated GT-fail exposure from
~14%/~26% to ~4%/~11%. T97 already at [3,5] — unchanged.
angosr left a comment
Re-review: PR #13 — Design issues RESOLVED, eval.py still required
All previously blocking design issues have been addressed:
Resolved issues
- `allow_unsorted_fallback` removed from all templates ✅ — Verified: all three template GT methods (`author_engagement_extrema`, `author_comparison`, `reading_stats_filter`) call `find_author_search_entry(collected, search_query=..., sort=sort)` without `allow_unsorted_fallback`. The parameter still exists in `common.py` (default `False`) but is only exercised in test code. GT is now deterministic — if the agent doesn't visit the correctly-sorted page, GT correctly returns `not_collected`.
- T97 redesigned from binary to numeric ✅ — Now asks "What is the absolute difference?" with GT returning `str(abs(sum_a - sum_b))`. Random baseline is near 0% (numeric answer space). Red Team Check 5 is satisfied.
- `safe_metric_value` strict for ratings_count ✅ — `_ZERO_DEFAULTABLE_METRICS = frozenset({"want_to_read_count"})`. Missing `ratings_count` raises `ValueError` → GT fail. This is the correct behavior: don't assume 0 for sparse data.
- ratings_count variants capped to low N ✅ — Smart mitigation: restrict ratings_count to `N∈{3,5}` where coverage is highest, reducing GT-fail exposure without dropping the metric entirely.
- `ENGAGEMENT_AUTHOR_POOL` derived from AUTHOR_POOL ✅ — Eliminates 56-entry duplication and prevents drift.
- Version 7 coordination ✅ — Comment added noting PR #14 must use Version 8.
Remaining BLOCKING issue
No eval.py test results documented.
CLAUDE.md §8: "Every new template must be tested via eval.py with multiple seeds (10-minute timeout)."
This is a MUST requirement for new templates. Unit tests with mocked data cannot verify:
- GT data source binding with real page cache
- Whether engagement metrics appear in the accessibility tree that the agent reads
- Agent navigation feasibility (can a browser agent actually find and read these metrics?)
Required before merge: Run eval.py with templates 96, 97, 98 individually, multiple seeds each, and document:
- GT success rate (how often does GT compute a value vs fail/not_collected?)
- Whether `api_data` comes from the page cache (check logs for `[GT] Visit xxx → +N items`)
- Any `not_collected` or unexpected GT failures
Assessment
The template design is now sound. The fixes are thorough and well-reasoned (especially the ratings_count cap strategy). Once eval.py results are provided, this PR is ready for final review.
eval.py Results — Templates 96, 97, 98

Config: model=gpt-4o-mini, base-url=api.openai.com/v1, timeout=600s, seeds=1,2,3. Results: 3 templates × 3 seeds = 9 runs.

Key Findings
1. GT data source binding works ✅ — the GT collector successfully extracts engagement metrics from visited search pages.
2. Sort enforcement is correct ✅ — GT correctly distinguishes sorted vs unsorted data and rejects the wrong sort order.
3. No unexpected GT failures ✅
4. Engagement metrics are visible in the accessibility tree ✅
5. Agent navigation is the bottleneck (model limitation)

Sample Logs: T96 seed=1 (representative).

Conclusion: the eval confirms the template design is sound. The 0/9 GT-success rate is entirely due to gpt-4o-mini navigating to unsorted pages (a model-capability limitation), not a template defect.
angosr left a comment
Re-review (3rd pass): PR #13 — eval.py posted, but GT computation never verified
Thank you for running eval.py and posting detailed results. The eval confirms the GT collection pipeline and sort enforcement work correctly. However, a critical gap remains.
BLOCKING: 0/9 GT success — end-to-end GT computation unverified
All 9 eval runs returned not_collected because gpt-4o-mini navigated to unsorted pages. This means the actual GT computation logic — safe_metric_value, extrema finding, comparison sums, threshold counting — was never exercised with real OL API data.
CLAUDE.md §5 Template Validation Checklist item 1: "GT Calculation - GT must return a concrete value. If it errors, the template is broken."
With 0 GT successes, we cannot confirm item 1. The unit tests use mocked data which may not reflect real API response structure (field names, nesting, types).
The eval correctly shows "agent fails + GT gated at sort check" which proves the sort enforcement works. But it does NOT prove "agent fails + GT succeeds" — GT never reached the computation stage.
Required: at least 1 GT success per template
Options (any one is sufficient):
(a) Re-run eval.py with a stronger model (e.g., gpt-4o, claude-sonnet-4-6) that can navigate to sorted pages. The start_url IS the sorted search URL — a capable agent should visit it and trigger GT collection with the correct sort order.
(b) Manual GT verification with real API data: Call the OL API directly for 2-3 authors with sort=editions, inject the response into a test as realistic cache data, and run get_ground_truth(). This verifies the computation path without needing a full agent run. Example:

```shell
# Fetch real data
curl "https://openlibrary.org/search.json?q=author%3A%22agatha+christie%22&sort=editions&limit=10"
# Inject into test, verify GT returns a concrete value
```

(c) Run eval.py with the agent forced to use start_url — if there's a config to ensure the agent starts from the provided URL rather than navigating from scratch, that would naturally produce sorted page visits.
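Option (b) without a live network call might look like the following sketch. The fixture mirrors the shape of an OL `search.json` response (values taken from the review's own spot checks), and `extrema_gt` is a toy stand-in for the template's GT function, not the real code.

```python
# Injected fixture shaped like an OL search.json response (values from the review).
fixture = {
    "docs": [
        {"title": "The Mysterious Affair at Styles", "want_to_read_count": 620},
        {"title": "Murder on the Orient Express", "want_to_read_count": 1355},
        {"title": "The Murder of Roger Ackroyd", "want_to_read_count": 699},
    ]
}

def extrema_gt(docs, metric, extrema="highest"):
    """Toy stand-in: pick the title with the max/min metric value."""
    pick_fn = max if extrema == "highest" else min
    return pick_fn(docs, key=lambda d: d[metric])["title"]
```

The real verification would pass such a fixture through the template's cache-injection path and call `get_ground_truth()`, but the principle is the same: the computation stage is exercised against real API-shaped data.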
Everything else is solid
- Sort enforcement: ✅ verified by eval
- GT collection pipeline: ✅ `+25 works (search)` on every visit
- Engagement metric visibility: ✅ confirmed in accessibility tree
- Error diagnostics: ✅ clear rejection messages
- Template design: ✅ all previous issues resolved
Status
The design is approved. One verified GT computation away from final approval.
Fetch live data (March 26, 2026) for Agatha Christie, Stephen King, and Neil Gaiman via sort=editions search API. Inject into GT collector and verify all three templates (T96/T97/T98) return concrete values with both want_to_read_count and ratings_count metrics. 12 tests cover: highest/lowest extrema, cross-author numeric difference, and threshold counting — satisfying CLAUDE.md §5 item 1.
GT Computation Verified with Real OL API Data

Per @angosr's request for at least 1 GT success per template — used option (b): fetched real OL API data and verified GT returns concrete values.

Data Source: live API calls on March 26, 2026. Authors: Agatha Christie, Stephen King, Neil Gaiman — all 10 works per author have both `want_to_read_count` and `ratings_count` populated.

Results: 12/12 GT computations return concrete values ✅
What this verifies
Commit:
angosr left a comment
Final Review (4th pass): PR #13 — APPROVE
All blocking issues resolved. GT computation verified end-to-end with real data.
Verification summary
| Issue | Status | Evidence |
|---|---|---|
| allow_unsorted_fallback | ✅ Removed | All template GT calls use strict sort matching |
| T97 binary choice (Red Team §5) | ✅ Fixed | Redesigned to absolute numeric difference |
| safe_metric_value ratings_count | ✅ Strict | Missing ratings_count → ValueError → GT fail |
| GT computation (CLAUDE.md §5.1) | ✅ Verified | 12/12 GT computations with real OL API data return concrete values |
| Version 7 coordination | ✅ Done | Comment directs PR #14 to Version 8 |
Cross-verification
I independently verified the GT results against my own earlier API calls:
- T96: Murder on the Orient Express has wtr=1355 → correctly identified as highest among Christie top 5 ✅
- T97: `abs(3309 - 19430) = 16121` — arithmetic confirmed ✅
- T98: Christie top 5 with wtr>500 = {Styles(620), Orient Express(1355), Ackroyd(699)} = 3 ✅
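The three spot checks above can be replayed mechanically. All counts are taken from the review itself; only the works above the threshold are listed, matching the reviewer's T98 set.

```python
# Want-to-read counts from the review's cross-verification.
wtr = {"Styles": 620, "Orient Express": 1355, "Ackroyd": 699}

highest = max(wtr, key=wtr.get)                      # T96: extrema winner
difference = abs(3309 - 19430)                       # T97: numeric GT
over_500 = sum(1 for v in wtr.values() if v > 500)   # T98: threshold count
```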
Note on eval.py
The eval.py run (gpt-4o-mini, 0/9 GT success) confirmed sort enforcement and data collection work correctly. The 0% agent success rate is a model capability issue — gpt-4o-mini doesn't translate "sorted by most editions" into &sort=editions. The real API GT verification (option b) adequately demonstrates the computation pipeline works. This is sufficient.
Problem
The Open Library plugin previously lacked templates for cross-author comparison, engagement extrema, and threshold-based engagement filtering.
Solution
Add 3 new OpenLibrary templates using only confirmed-visible, dynamic engagement metrics (`ratings_count`, `want_to_read_count`, `already_read_count`) and require multi-step reasoning over live search results.

New templates
- openlibrary_author_engagement_extrema
- openlibrary_author_comparison
- openlibrary_reading_stats_filter

Why this is RL-suitable (not fixed-pattern SFT)
Refactoring / Duplication reduction
Moved shared author-query matching helpers (`normalize_author_fragment`, `extract_author_filter`, `find_author_search_entry`) into `templates/common.py` and reused them across author templates.

Live-eval robustness improvement

Added a controlled fallback in `find_author_search_entry`: it prefers the correctly sorted entry (`sort=editions`) when available, otherwise falls back to unsorted search data. This preserves strict behavior by default and only enables fallback where needed (96/97/98) to reduce fragile `not_collected` outcomes in live runs.

Testing
- tests/plugins/openlibrary/test_engagement_templates.py: 65 passed
- tests/plugins/openlibrary: 106 passed

Added tests cover:
Edge Cases Handled
- Threshold comparison uses strict `>` (equality not counted)
- `"0"` is a valid answer
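These two edge cases pin down the T98 counting semantics; a minimal sketch (function name hypothetical, default-0 shown only for the high-coverage metric semantics described above):

```python
def reading_stats_count(docs, metric, threshold) -> str:
    # Strict >: works exactly equal to the threshold are NOT counted,
    # and an empty match set legitimately yields the answer "0".
    return str(sum(1 for d in docs if d.get(metric, 0) > threshold))
```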