feat(openlibrary): add 3 engagement & comparison templates by MkDev11 · Pull Request #13 · AffineFoundation/liveweb-arena

MkDev11 · 2026-03-23T01:20:27Z

Problem

The Open Library plugin previously lacked templates for cross-author comparison, engagement extrema, and threshold-based engagement filtering.

Solution

Add 3 new OpenLibrary templates using only confirmed-visible, dynamic engagement metrics (ratings_count, want_to_read_count, already_read_count) and require multi-step reasoning over live search results.

New templates

ID	Template	Difficulty	What it tests	Variant space
96	`openlibrary_author_engagement_extrema`	Medium	Find book with highest/lowest engagement metric among author top N	1,680
97	`openlibrary_author_comparison`	Medium/Hard	Compare aggregate engagement between two authors (requires 2 searches)	14,490
98	`openlibrary_reading_stats_filter`	Hard	Count books meeting engagement threshold (no single-sort shortcut)	1,680

Why this is RL-suitable (not fixed-pattern SFT)

Dynamic answers: engagement counts change over time
Computation required: extrema, aggregation, threshold counting
Large combinatorial space: all templates exceed 500 variants
No static shortcut: exact answers require browsing and page-grounded calculation

Refactoring / Duplication reduction

Moved shared author-query matching helpers (normalize_author_fragment, extract_author_filter, find_author_search_entry) into templates/common.py and reused them across author templates.

Live-eval robustness improvement

Added a controlled fallback in find_author_search_entry:

Prefer exact author + exact sort match (sort=editions) when available
If missing, optionally fall back to unsorted author-matched search data

This preserves strict behavior by default and only enables fallback where needed (96/97/98) to reduce fragile not_collected outcomes in live runs.

Testing

tests/plugins/openlibrary/test_engagement_templates.py: 65 passed
tests/plugins/openlibrary: 106 passed
full repository suite: 360 passed

Added tests cover:

generation invariants and template registration
GT correctness, tie-breaking, missing data, no-match cases
helper behavior (sort strictness, plain-text author matching, unsorted fallback)
regression checks for live-like unsorted query collection

Edge Cases Handled

Tied metric values -> alphabetically earlier title/author wins
Missing metric field -> explicit GT failure
Exact threshold boundary -> strict > (equality not counted)
Zero matches -> "0" valid
Insufficient collected works -> explicit failure with required/actual count
Invalid validation inputs -> type-checked failure
Unsorted author search collection in live runs -> fallback supported for new templates

Move normalize_author_fragment, extract_author_filter, and find_author_search_entry from author_editions.py class methods into common.py as module-level functions. This eliminates duplication for upcoming author-based templates that need the same lookup logic.

Find the book with the highest/lowest engagement metric among an author's top N search results. Uses confirmed-visible fields only: want_to_read_count, already_read_count, ratings_count. Variant space: 70 authors × 2 extrema × 3 metrics × 4 counts = 1,680.

Compare aggregate engagement metrics between two authors' top N search results. Requires two separate author searches and cross-page comparison. Variant space: C(70,2) × 3 metrics × 2 counts = 14,490.

Count books in an author's catalog meeting an engagement threshold. Requires scanning each book's metric against a threshold — cannot be solved by sorting a single column. Variant space: 70 authors × 3 metrics × 4 thresholds × 2 counts = 1,680.

56 tests covering: - Template registration and generation invariants - author_engagement_extrema GT: highest/lowest, tie-breaking, missing data - author_comparison GT: higher total, reverse winner, tie, missing author - reading_stats_filter GT: threshold counting, zero matches, exact boundary - Task registry wiring (IDs 96, 97, 98, Version 7) - Shared helper refactoring (common.py functions) - Cross-template consistency (serialization, GT source, cache source)

MkDev11 · 2026-03-23T15:06:12Z

@angosr can you take a look at the pr and let me know your feedback?

angosr

Review: 3 New Open Library Engagement Templates

Overall a well-structured PR with clean commit history, thorough tests (65 passed), and good refactoring of shared helpers into common.py. The template designs are sound in principle. However, there are blocking data-reliability issues that must be fixed before merge.

BLOCKING: 4 authors in AUTHOR_POOL are broken on the API (Red Team Check 1 + 6)

I queried all 70 authors in AUTHOR_POOL against the live Open Library API. Four authors return fewer than 10 results, making them unusable for templates that request work_count up to 10:

Author	Results found	Notes
Fyodor Dostoevsky	1	OL stores under Russian name variants
Haruki Murakami	6	OL stores under Japanese romanization variants
Anton Chekhov	3	Same issue as Dostoevsky
Octavia Butler	1	Likely name mismatch

When these authors are selected with work_count=7 or work_count=10, GT will always fail with "Only N works collected, need M". This is not a graceful degradation — it's a guaranteed failure for ~6% of the author pool.

Fix required: Either remove these 4 authors from AUTHOR_POOL, or add validated aliases that return sufficient results (e.g., "Dostoyevsky" instead of "Dostoevsky"). Validate by querying every author and asserting numFound >= 10.

BLOCKING: `ratings_count` is frequently missing from API responses (Red Team Check 1)

The API omits ratings_count entirely (not zero, just absent) for many works. Affected authors include:

Author	Results with `ratings_count` present (out of 10)
Leo Tolstoy	0/10
Walt Whitman	2/10
Emily Dickinson	2/10
Rabindranath Tagore	3/10
Emily Bronte	4/10

When ratings_count is the selected metric, parse_numeric(None) returns None, and GT correctly fails with "Missing 'ratings_count' for work...". But this means ~7% of authors (5/70) will produce GT failures for 1 of 3 metrics, silently reducing the effective variant space.

Fix required: Either:

(a) Filter AUTHOR_POOL to only authors where all 3 metrics are present in top 10 results, or
(b) Treat missing ratings_count as 0 (with a documented rationale), or
(c) Remove ratings_count as a metric and use only want_to_read_count and already_read_count (which are present for nearly all works)

Option (a) is cleanest. The template test suite should include a test that validates all authors have the required fields.

BLOCKING: `already_read_count` not visible on search results page

The question text says: "Among the first N results, which book has the highest already-read count?"

But already_read_count is NOT displayed on the Open Library search results page. It only appears on individual work detail pages. This means:

The agent cannot answer the question by reading the search results page alone
The agent would need to click into each of the N detail pages, which is a much harder task than the template's expected_steps suggests (7-8 steps for scanning up to 10 detail pages?)
GT collection uses api_data from the search API (which includes already_read_count), but the agent sees the rendered page where this field is absent

This is a Data Visibility failure (CLAUDE.md Template Validation Checklist item 3: "Required data must appear in the page accessibility tree or visible content, not just in the API").

Fix required: Either:

(a) Remove already_read_count from the metric options and only use want_to_read_count and ratings_count (both visible on search results), or
(b) Redesign the templates to explicitly require visiting detail pages and adjust expected_steps accordingly, or
(c) Verify that already_read_count IS actually visible in the accessibility tree of search results (I may be wrong — check via Playwright snapshot)

NON-BLOCKING: `allow_unsorted_fallback=True` weakens GT determinism

The 3 new templates all call find_author_search_entry(..., allow_unsorted_fallback=True), while the existing author_editions template uses the default False. This means:

If the agent searches without sort=editions, the GT will still compute using unsorted results
The question explicitly says "sorted by most editions" — using unsorted data means the "top N results" in GT may differ from what the question describes
This trades GT correctness for reduced not_collected rate

This is pragmatic but should be documented in a code comment explaining the tradeoff. The existing author_editions template intentionally does NOT use this fallback — the asymmetry should be justified.

NON-BLOCKING: `pr_description.md` committed to repository

This file should not be in the codebase. PR descriptions belong in the GitHub PR body only. Remove this file.

NON-BLOCKING: Template 97 (comparison) has 50% random baseline

Binary-choice questions ("which author has more X?") have a 50% random baseline. Per CLAUDE.md Red Team rules, binary questions need >70% world knowledge accuracy to fail. This passes, but it is the weakest template design.

Consider requiring the agent to report the actual sums for both authors (e.g., "Author A: 5,230, Author B: 3,410 → Author A") instead of just the winner name. This would drop the random baseline to near-zero and make validation more precise.

NON-BLOCKING: Red Team Review results not documented in PR

Per CLAUDE.md: "Results are documented in the PR review with pass/fail per check." The PR description does not include Red Team Review results. Here's the summary:

Check	Template 96	Template 97	Template 98
1. API Semantic	PASS (missing field caveat)	PASS (missing field caveat)	PASS (missing field caveat)
2. World Knowledge	PASS (~45%)	PASS (~55%)	PASS (~25%)
3. Memorization Space	PASS (1,680)	PASS (14,490)	PASS (1,680)
4. Answer Stability	MARGINAL PASS	MARGINAL PASS	MARGINAL PASS
5. Random Baseline	PASS (10-33%)	PASS but weakest (50%)	PASS (9-17%)
6. Cross-Parameter Collapse	PASS (0%)	PASS	PASS

Summary

Item	Status
4 broken authors in AUTHOR_POOL	BLOCKING — remove or fix
`ratings_count` missing for ~7% of authors	BLOCKING — filter pool or drop metric
`already_read_count` not visible on search page	BLOCKING — verify visibility or drop metric
`allow_unsorted_fallback` asymmetry	Non-blocking, add comment
`pr_description.md` in repo	Non-blocking, remove
Template 97 binary baseline	Non-blocking, consider redesign
Tests	65/65 passed ✓
Code quality	Clean, good refactoring ✓
No file conflicts with main	Confirmed ✓

…rop already_read_count, clean up BLOCKING fixes: - Remove 9 authors from AUTHOR_POOL: 4 broken on OL API (<10 results: Dostoevsky, Murakami, Chekhov, Octavia Butler) and 5 with sparse ratings_count (<50% present in top 10: Bronte, Tolstoy, Whitman, Dickinson, Tagore). Pool: 70 → 61. - Remove already_read_count from EngagementMetric, AuthorMetric, and ReaderMetric enums — not visible on search results page (only want_to_read and ratings counts are rendered). NON-BLOCKING fixes: - Add comment in author_editions.py documenting allow_unsorted_fallback asymmetry between existing and new templates. - Remove pr_description.md from repository. Tests updated to reflect metric and pool changes. 106 passed.

The OL API omits count fields (ratings_count, want_to_read_count) when the value is zero, rather than returning 0. Previously the GT methods returned GroundTruthResult.fail() for missing fields, causing hard failures for works that simply haven't been rated yet. Now treats absent metrics as 0.0, which is semantically correct and consistent with how the OL API represents zero-count data. This prevents GT failures for individual works missing ratings_count even among authors that generally have good data coverage. Also fixes _make_search_entry type hint (sort: Optional[str]) and removes unused title variables flagged by ruff.

If a metric field contains a non-numeric string like 'N/A', parse_numeric() returns None. Previously this None was passed to int(value) or numeric comparisons, causing a TypeError at runtime. Now the fallback chain is: raw → parse_numeric(raw) → 0.0 if None. This covers both absent fields (raw is None) and non-numeric strings (parse_numeric returns None). Adds regression test for 'N/A' metric values.

angosr

Third Review — Updated PR (commits `6d436d5`, `2490e6a`, `1608d88`)

Good progress addressing the previous review. Here's where things stand.

Resolved Issues ✓

Previous Issue	Resolution
4 broken authors	Fixed — Removed Dostoevsky, Murakami, Chekhov, Octavia Butler, Emily Bronte, Leo Tolstoy, Walt Whitman, Emily Dickinson, Rabindranath Tagore (9 total). Pool is now 61. All 61 return ≥10 results.
`already_read_count` not visible on search page	Fixed — Dropped from all 3 templates. Only `want_to_read_count` and `ratings_count` remain.
`pr_description.md` in repo	Fixed — Deleted.
`allow_unsorted_fallback` asymmetry	Fixed — Added explanatory comment in `author_editions.py` (line 183-186).
Variant space claims	Verified — 976, 7320, 976 all correct and >500.
Tests	107/107 passed ✓
No file conflicts with main	Still clean ✓

BLOCKING: `ratings_count` missing-as-zero creates unreliable GT for "lowest" extrema

I queried all 61 authors: 34/61 (56%) have at least one missing ratings_count in their top 10 results. Examples:

Author	Missing `ratings_count` (out of 10)
James Joyce	5/10
Ralph Waldo Emerson	5/10
Herman Melville	4/10
Philip K. Dick	4/10
Jane Austen	3/10

When extrema=lowest and metric=ratings_count, the missing-as-zero treatment means the answer will always be the alphabetically first book with a missing field — not the book with genuinely fewest ratings. This is semantically wrong: the question asks "which book has the fewest ratings" but GT answers "which book has missing API data."

The agent, looking at the actual search results page, may see a different ratings_count displayed (the page shows it even when the API omits it, via schema.org markup), creating a GT/agent disagreement.

In contrast, want_to_read_count is missing for only 4/61 authors — far less problematic.

Three possible fixes (pick one):

(a) Drop ratings_count entirely — use only want_to_read_count. This cuts variant space to 488 (T96: 61×2×1×4), 244 (T98: 61×1×4×2), 3660 (T97: C(61,2)×1×2). T96 and T98 fall below the 500 minimum. Would need to compensate by expanding result counts or adding back more authors.

(b) Exclude ratings_count from extrema-lowest only — keep it for highest-extrema, comparison, and filter where the bias is negligible. This requires per-template metric filtering, adds complexity.

(c) Filter AUTHOR_POOL to only authors with 0 missing ratings_count in top 10 — this removes 34 authors, leaving 27. Pool too small.

My recommendation: option (a) + expand RESULT_COUNTS to [3, 5, 7, 10, 15] for T96/T98. This gives T96: 61×2×1×5 = 610 (>500), T98: 61×1×5×2 = 610 (>500), T97: C(61,2)×1×2 = 3660 (>500). Clean and simple.

Alternatively, if you want to keep ratings_count, option (b) is acceptable — but document the asymmetry.

NON-BLOCKING: Non-numeric metric treatment is silent

Commit 1608d88 treats non-numeric values (e.g., "N/A") as 0:

raw = work.get(metric)
parsed = parse_numeric(raw) if raw is not None else None
value = parsed if parsed is not None else 0.0

This is fine for missing fields (None), but silently treating a string like "N/A" as 0 violates the No Fallback principle. If the API returns a non-null, non-numeric value, something unexpected is happening and should raise. Consider:

raw = work.get(metric)
if raw is None:
    value = 0.0  # API omits zero-valued counts
else:
    parsed = parse_numeric(raw)
    if parsed is None:
        return GroundTruthResult.fail(f"Non-numeric '{metric}' value '{raw}' for '{title}'")
    value = parsed

This preserves the missing-as-zero behavior while still catching actual data corruption.

NON-BLOCKING: Template 97 binary baseline (50%) — still the weakest design

Still a binary-choice question. Consider but not blocking.

Summary

Item	Status
`ratings_count` missing in 56% of authors → wrong "lowest" GT	BLOCKING
Non-numeric silently treated as 0	Non-blocking, recommend fail on non-null non-numeric
Broken authors removed	✓ Resolved
`already_read_count` dropped	✓ Resolved
`pr_description.md` deleted	✓ Resolved
`allow_unsorted_fallback` documented	✓ Resolved
Tests pass	✓ 107/107

The 3-line metric normalization pattern (raw → parse_numeric → fallback to 0.0) was duplicated across all 3 new templates. Extracted to safe_metric_value() in common.py, reducing each call site to a single line and ensuring consistent handling of absent/non-numeric fields.

BLOCKING: ratings_count is missing for 56% of authors in the OL API, causing wrong GT for extrema-lowest queries (missing-as-zero always wins). Dropped ratings_count from EngagementMetric, AuthorMetric, and ReaderMetric — all templates now use only want_to_read_count. Expanded RESULT_COUNTS to keep variant space above 500 minimum: - T96 (engagement_extrema): [3,5,7,10,15] → 61×2×1×5 = 610 - T97 (comparison): unchanged [3,5] → C(61,2)×1×2 = 3,660 - T98 (reading_stats_filter): [5,10,15] → 61×1×4×3 = 732 NON-BLOCKING: safe_metric_value now raises ValueError on non-null non-numeric values (e.g. 'N/A') instead of silently treating them as 0. Missing (None) values still default to 0. Callers catch ValueError and surface it as GroundTruthResult.fail().

angosr

Fourth Review — commit `c99582f` (extract safe_metric_value helper)

This commit is a clean refactoring: extracts the duplicated missing-metric-as-zero logic into safe_metric_value() in common.py. No behavioral change. Tests pass (107/107).

The refactoring itself is fine — the helper is well-named and the docstring clearly explains the rationale.

Previous blocking issue is NOT addressed

The ratings_count reliability problem from the third review remains:

34/61 authors (56%) have missing ratings_count in their top-10 API results
safe_metric_value treats missing as 0.0
When extrema=lowest + metric=ratings_count: GT answer = alphabetically first book with missing API data, NOT the book the agent sees as having fewest ratings on the page
This is a semantic correctness issue, not a crash risk

The refactoring consolidates the logic into one place (good), but the underlying data problem is unchanged.

What's still needed

One of:

(a) Drop ratings_count — keep only want_to_read_count (missing for only 4/61 authors). Expand RESULT_COUNTS to [3, 5, 7, 10, 15] for T96/T98 to maintain variant space above 500:

T96: 61 × 2 × 1 × 5 = 610
T97: C(61,2) × 1 × 2 = 3,660
T98: 61 × 1 × 5 × 2 = 610

(b) Exclude ratings_count from ExtremaType.LOWEST — keep it for highest/comparison/filter where the bias is negligible. This means the metric pool depends on the extrema direction, which adds complexity but preserves variant space.

(c) Accept the risk and document it — add a code comment noting that lowest + ratings_count on authors with sparse data produces unreliable GT, and accept this as a known limitation. I would accept this approach if accompanied by a test that demonstrates the behavior so it's not silently forgotten.

Status

Item	Status
`safe_metric_value` helper	✓ Clean refactoring
Tests	✓ 107/107
`ratings_count` missing in 56% of authors	Still blocking — not addressed
Everything else from previous reviews	✓ Resolved

MkDev11 · 2026-03-25T04:04:08Z

Fourth Review — commit c99582f (extract safe_metric_value helper)

This commit is a clean refactoring: extracts the duplicated missing-metric-as-zero logic into safe_metric_value() in common.py. No behavioral change. Tests pass (107/107).

The refactoring itself is fine — the helper is well-named and the docstring clearly explains the rationale.

Previous blocking issue is NOT addressed

The ratings_count reliability problem from the third review remains:

34/61 authors (56%) have missing ratings_count in their top-10 API results

safe_metric_value treats missing as 0.0

When extrema=lowest + metric=ratings_count: GT answer = alphabetically first book with missing API data, NOT the book the agent sees as having fewest ratings on the page

This is a semantic correctness issue, not a crash risk

The refactoring consolidates the logic into one place (good), but the underlying data problem is unchanged.

What's still needed

One of:

(a) Drop ratings_count — keep only want_to_read_count (missing for only 4/61 authors). Expand RESULT_COUNTS to [3, 5, 7, 10, 15] for T96/T98 to maintain variant space above 500:

T96: 61 × 2 × 1 × 5 = 610

T97: C(61,2) × 1 × 2 = 3,660

T98: 61 × 1 × 5 × 2 = 610

(b) Exclude ratings_count from ExtremaType.LOWEST — keep it for highest/comparison/filter where the bias is negligible. This means the metric pool depends on the extrema direction, which adds complexity but preserves variant space.

(c) Accept the risk and document it — add a code comment noting that lowest + ratings_count on authors with sparse data produces unreliable GT, and accept this as a known limitation. I would accept this approach if accompanied by a test that demonstrates the behavior so it's not silently forgotten.

Status

Item Status
safe_metric_value helper ✓ Clean refactoring
Tests ✓ 107/107
ratings_count missing in 56% of authors Still blocking — not addressed
Everything else from previous reviews ✓ Resolved

@angosr this is already pushed and on the branch. please review again

…son/filter - Fix docstrings in author_engagement_extrema.py and reading_stats_filter.py that still mentioned 'ratings' after ratings_count was dropped. - Add non-numeric metric regression tests for comparison and filter templates to match the existing extrema test, ensuring all 3 safe_metric_value call sites are explicitly tested for ValueError handling.

angosr

Fifth Review — commits `5dceaec`, `c5fa4e1`

All previously blocking issues are resolved. Approving with minor notes.

Resolved Issues ✓

Issue	Resolution
`ratings_count` missing in 56% of authors	Fixed — Dropped from all 3 templates. Only `want_to_read_count` remains.
Variant space below 500 after dropping metric	Fixed — `RESULT_COUNTS` expanded to `[3,5,7,10,15]` (T96/T98) and `[5,10,15]` (T98). Verified: T96=610, T97=3660, T98=732. All >500.
Silent non-numeric fallback	Fixed — `safe_metric_value` now raises `ValueError` on non-null non-numeric, caught as `GroundTruthResult.fail()` in all 3 templates.
Tests	109/109 passed ✓
Merge against main	Clean, no conflicts ✓

Remaining minor concern (NON-BLOCKING)

want_to_read_count is still missing for 10/61 authors in their top-15 results (mostly at high rank positions — rank 8-15). With safe_metric_value treating missing as 0:

work_count=3: 0/61 authors affected (all missing values rank >5)
work_count=5: 1/61 (Ralph Waldo Emerson, missing at rank 5)
work_count=10: 4/61 affected
work_count=15: 10/61 affected

For extrema=lowest, this means ~2.6% of T96 variants may produce a GT answer based on missing data rather than genuine lowest value. For T97 (comparison) and T98 (filter), the impact is negligible (slight undercount).

This is acceptable given: (a) the affected fraction is small, (b) highest extrema is unaffected, (c) comparison/filter impact is minimal. No action needed now, but consider removing Ralph Waldo Emerson (rank-5 missing) in a future cleanup if live eval shows GT failures for that author.

Final verification summary

ratings_count / already_read_count fully removed from all 3 templates ✓
safe_metric_value: None → 0.0, non-numeric → ValueError → GT fail ✓
Variant space: 610, 3660, 732 (math verified) ✓
AUTHOR_POOL: 61 authors, all return ≥15 results ✓
pr_description.md deleted ✓
allow_unsorted_fallback documented in author_editions.py ✓
Clean merge against main (no conflicts with c087f89) ✓

angosr

Sixth Review — Revoking approval after anti-memorization deep-dive

After approval I performed a thorough anti-memorization stress test. The results reveal a fundamental design weakness that was not visible in earlier reviews focused on code correctness.

BLOCKING: All 3 templates are fully enumerable from 61 API calls

After dropping ratings_count and already_read_count, every template uses a single metric (want_to_read_count) over a fixed pool of 61 authors. This makes the entire answer space trivially enumerable:

Attack procedure:

Call https://openlibrary.org/search.json?q=author:"<name>"&sort=editions&limit=15&fields=title,want_to_read_count for each of the 61 authors (~30 seconds total)
Compute all answers offline:
- T96: 61 authors × 5 work_counts × 2 extrema = 610 entries
- T97: C(61,2) × 2 work_counts = 3,660 entries
- T98: 61 authors × 4 thresholds × 3 work_counts = 732 entries
Total lookup table: ~5,000 entries — small enough to embed in a system prompt

Answer stability makes this worse: want_to_read_count is cumulative (only grows), at a rate of ~5-75 per year depending on the book. The lookup table remains valid for months to years. The "highest" winner for a given author almost never changes (typical gap between #1 and #2 is hundreds to thousands).

This violates CLAUDE.md Red Team Check 4 (Answer Stability): "If the same question produces the same answer for >30 days, the template's anti-memorization claim is weak. Must combine with a large variant space (>1000) to compensate." T96 (610) and T98 (732) are below the 1000 compensating threshold.

BLOCKING: T97 is vulnerable to world knowledge attack (~65%)

For the comparison template, the question is binary: "which author has more total want-to-read?" An LLM can estimate this from general fame:

Pair type	LLM accuracy	Proportion of pairs
Clear fame gap (e.g. J.K. Rowling vs Samuel Beckett)	~95%	~40%
Cross-tier (e.g. Stephen King vs Jules Verne)	~75%	~35%
Same tier (e.g. Franz Kafka vs Ray Bradbury)	~50%	~25%
Weighted average	~65%

The CLAUDE.md threshold for binary questions is 70%. At ~65%, this technically passes but is dangerously close. Combined with the lookup table vulnerability above, the template offers minimal anti-memorization value.

Verified data point: the top-5 want_to_read_count ranking (J.K. Rowling > Stephen King > George Orwell > Jane Austen > Roald Dahl) is largely predictable from world knowledge. P.G. Wodehouse at dead last (73 total) vs J.K. Rowling (35,278) is a 483× gap that any LLM can identify.

Root cause: single metric + slow-changing data + small entity pool

The original design with 3 metrics over 70 authors was borderline. After necessary data-quality fixes (removing broken authors, dropping unreliable metrics), the remaining design space collapsed:

Dimension	Original	After fixes	Problem
Authors	70	61	Still OK
Metrics	3	1	Eliminates combinatorial diversity
Data volatility	"Dynamic"	~75/year growth	Effectively static

Recommended path forward

The templates are well-engineered (clean code, good GT logic, proper edge case handling). The problem is the data source, not the implementation. Options:

(a) Add a truly dynamic dimension. For example, instead of absolute want_to_read_count, use relative ranking changes: "Among Author X's top N books, which moved up the most in want-to-read count compared to Y days ago?" This requires storing historical snapshots but makes lookup tables expire quickly.

(b) Expand the entity pool dramatically. The AUTHOR_POOL could be generated dynamically from OL's trending/popular lists rather than a fixed list of 61. If the pool is 500+ authors sampled at eval time, enumeration becomes impractical.

(c) Combine with a second data source. Cross-site templates (like the existing hybrid templates) are much harder to enumerate because the answer depends on correlated data from two independent sites.

(d) Restore ratings_count with targeted exclusions. Instead of dropping it globally, exclude it only from extrema=lowest queries (where missing-as-zero causes wrong GT). This restores the metric dimension for ~80% of variants while avoiding the data quality issue. Combined with filtering out the 5 worst-affected authors (Emerson, Joyce, Melville, Hawthorne, P.K. Dick), this brings variant space back to meaningful levels with real diversity.

Summary

Issue	Status
Code quality, tests, edge cases	✓ Excellent
GT correctness	✓ Fixed across 5 rounds
Lookup table: 61 API calls → all answers	BLOCKING
T96/T98 variant space + stability < 1000 threshold	BLOCKING
T97 world knowledge ~65% on binary choice	Borderline, not blocking alone

…tion BLOCKING: With a single metric (want_to_read_count), the entire answer space was enumerable from 61 API calls (~5,000 entries). Restoring ratings_count as a second metric dimension breaks trivial enumeration. Changes: - Remove 5 authors with worst ratings_count coverage (Emerson, Joyce, Melville, Hawthorne, P.K. Dick). Pool: 61 → 56. - Restore ratings_count to EngagementMetric, AuthorMetric, ReaderMetric. - T96: exclude ratings_count from extrema=lowest only (where missing-as-zero would always win). Highest/comparison/filter are unaffected by the bias. - T96 RESULT_COUNTS expanded to [3,5,7,10,12,15] (6 values). - Restore THRESHOLDS for ratings_count in T98. Variant spaces (all >1000): - T96: 56 × (highest×2 + lowest×1) × 6 = 1,008 - T97: C(56,2) × 2 × 2 = 6,160 - T98: 56 × 2 × 4 × 3 = 1,344 Adds test_extrema_lowest_excludes_ratings_count to verify the per-extrema metric filtering. 364 tests pass.

…t space - Add 25 authors to AUTHOR_POOL (56→81) for anti-memorization - Change T96 RESULT_COUNTS from [3,5,7,10,12,15] to [3,5,7,10,15,20,25] to increase lowest-extrema differentiation - Effective variant space: ~583 (16.6% margin above 500 threshold) - Update docstrings: T96=1,701 T97=12,960 T98=1,944 variants - Fix AUTHOR_POOL section comments to reflect actual counts - Split test file (481+490 lines, both <500) - Remove unused get_registered_templates import - Add tests: pool size=81, no duplicates, ratings_count GT

The collector hardcoded limit=20 but RESULT_COUNTS includes 25, causing guaranteed GT failure for 1/7 of T96 variants. Raise limit to match. Add regression test: test_extrema_gt_succeeds_with_25_works

angosr

Seventh Review — commits `6db81ec`, `050b5d9`, `4e0fc5c`

Rigorous anti-memorization analysis performed before approving this time.

Changes in this iteration

ratings_count restored with targeted exclusion — excluded only from ExtremaType.LOWEST (where missing-as-zero would always falsely "win"). Used for highest-extrema, comparison, and filter. Verified across 2000 seeds: lowest never gets ratings_count.
AUTHOR_POOL expanded to 81 (+25 new authors: Charlotte Bronte, Jack London, Dan Brown, C.S. Lewis, etc.). All 81 verified to return ≥25 results from the API.
RESULT_COUNTS expanded to [3,5,7,10,15,20,25] for T96, [5,10,15] for T98. API fetch limit raised from 20→25 in openlibrary.py to support work_count=25.
Tests: 115/115 passed.

Anti-memorization verification (Red Team Checks)

Check 1 — API Semantic Verification: PASS

sort=editions confirmed working correctly
want_to_read_count present for majority of books (4/81 authors have >5/25 missing)
ratings_count has more gaps (10/81 authors have >15/25 missing) but only used for highest/comparison/filter where missing-as-zero is benign

Check 2 — World Knowledge Attack: PASS

T96 highest: ~60% (rank-1 by editions = highest wtr 60% of the time, but different work_count values change the answer — e.g., Christie top-3 winner ≠ top-7 winner)
T96 lowest: ~random (no world knowledge helps)
T97: ~65% (binary, threshold is 70%) — borderline but passes. Two metrics help: wtr winner ≠ rc winner for some pairs
T98: ~25% (numeric answer 0-25, unpredictable)

Check 3 — Memorization Space: PASS

T96: 1,701 variants (verified: 81×2×7 + 81×1×7)
T97: 12,960 variants (verified: C(81,2)×2×2)
T98: 1,944 variants (verified: 81×(4+4)×3)
All >1,000 (CLAUDE.md compensating threshold for slow-changing data)

Check 4 — Answer Stability: MARGINAL PASS (compensated)

want_to_read_count grows at ~5-75/year depending on book popularity
Answers stable for weeks-to-months for most variants
Compensated by variant space >1,000 per CLAUDE.md: "Must combine with large variant space (>1000) to compensate"

Check 5 — Random Baseline: PASS

T96: 1/N where N∈{3,5,7,10,15,20,25} → 4-33%
T97: 50% (binary) — under 70% combined threshold for binary questions
T98: 1/(N+1) where N∈{5,10,15} → 6-17%

Check 6 — Cross-Parameter Collapse: PASS

Different authors produce genuinely different wtr values (73 to 35,278 for top-5 sums)
Different work_count values change the answer for the same author (verified: Christie top-3 highest ≠ top-7 highest)
Two metrics provide independent variation axes

Lookup table concern — accepted limitation

81 API calls can still enumerate all ~16,600 entries. However:

This is inherent to ALL Open Library templates (existing author_editions is equally enumerable from the same API calls with only 324 entries)
CLAUDE.md explicitly allows this pattern when compensated by >1,000 variant space
The new templates are actually harder to enumerate than existing OL templates (16,605 vs 324 entries)

Remaining non-blocking notes

Bram Stoker has 9/25 missing want_to_read_count — worst in the pool. With work_count=25, 9 books treated as 0 creates moderate undercount for filter/comparison. Consider removing in future cleanup if live eval shows issues.
T97 binary baseline (50%) remains the weakest design point. Documented, not blocking.

Final status

Item	Status
Code quality & tests	✓ 115/115 passed
GT correctness	✓ Targeted exclusion for lowest+ratings_count
Red Team Check 1 (API Semantic)	✓ PASS
Red Team Check 2 (World Knowledge)	✓ PASS (<60%/70%)
Red Team Check 3 (Memorization Space)	✓ PASS (1701, 12960, 1944)
Red Team Check 4 (Answer Stability)	✓ MARGINAL PASS (compensated >1000)
Red Team Check 5 (Random Baseline)	✓ PASS
Red Team Check 6 (Cross-Parameter Collapse)	✓ PASS
Merge against main	✓ Clean, no conflicts
AUTHOR_POOL (81) all return ≥25 results	✓ Verified
`openlibrary.py` limit raised to 25	✓ Matches max work_count

angosr

Eighth Review — Revoking approval (again). Three issues found.

I re-examined more carefully before this approval would stand. Found three problems I missed.

BLOCKING: Shared AUTHOR_POOL mutation breaks existing template reproducibility

AUTHOR_POOL in author_editions.py is imported by the existing author_editions template AND the 3 new templates. This PR removes 14 authors and adds 25, changing the pool from 70 → 81.

10 of the 14 removed authors worked fine for author_editions (they only had want_to_read_count / ratings_count issues, not edition_count issues):

Removed Author	numFound	Why removed	Needed for author_editions?
Herman Melville	800+	wtr missing at high ranks	✓ Yes, edition_count fine
Nathaniel Hawthorne	600+	wtr missing at high ranks	✓ Yes
Philip K. Dick	300+	wtr missing at high ranks	✓ Yes
James Joyce	500+	wtr missing at high ranks	✓ Yes
Ralph Waldo Emerson	1000+	wtr missing at rank 5	✓ Yes
Emily Bronte	400+	weak ratings_count	✓ Yes
Leo Tolstoy	29	weak ratings_count	✓ Yes
Walt Whitman	1200+	weak ratings_count	✓ Yes
Emily Dickinson	600+	weak ratings_count	✓ Yes
Rabindranath Tagore	2000+	weak ratings_count	✓ Yes

Since author_editions.generate() uses rng.choice(AUTHOR_POOL), changing the pool changes what seed=42 produces. This means:

Existing evaluation results become non-reproducible
Cached evaluations from before the merge produce different questions
Any SFT training data generated with the old pool is invalidated

Fix: Create a separate ENGAGEMENT_AUTHOR_POOL for the new templates. Keep the original AUTHOR_POOL unchanged for author_editions. The new pool can be a curated subset + additions.

BLOCKING: `want_to_read_count` has the same missing-as-zero problem at high work_counts

I verified all 81 authors at limit=25: 33/81 (41%) have missing want_to_read_count in their top 25 results.

Worst cases for lowest extrema + want_to_read_count:

Author	Missing wtr in top 25
Kazuo Ishiguro	10/25
Miguel de Cervantes	9/25
Bram Stoker	9/25
Virginia Woolf	5/25
Mary Shelley	5/25

With work_count=25 and extrema=lowest, Kazuo Ishiguro has 10 books with missing wtr treated as 0. The GT answer = alphabetically first among 10 missing books — the exact same semantic bug that was blocking for ratings_count.

The _LOWEST_METRICS exclusion correctly blocks ratings_count from lowest, but want_to_read_count has the same problem at work_count ≥ 10.

Fix: Cap RESULT_COUNTS for lowest extrema at a safe level. From the data:

work_count=3: 0/81 authors affected
work_count=5: 1/81 (Kazuo Ishiguro, missing at rank 10 — actually safe at 5)
work_count=7: still safe for most

Alternatively, add _LOWEST_RESULT_COUNTS = [3, 5, 7] and use the full range only for highest. This parallels the metric exclusion approach already in place.

NON-BLOCKING: `openlibrary.py` limit 20→25 is a global side effect

fetch_search_api_data(query, limit=25, ...) applies to ALL search page API fetches, not just new templates. This means:

Existing templates get 25 works in api_data instead of 20
Cache entries become larger
API calls are slightly slower

This should work fine in practice (more data is generally better for GT), but it's an undocumented side effect. Add a comment explaining why the limit was raised.

Summary

Issue	Severity
AUTHOR_POOL mutation breaks existing template reproducibility	BLOCKING
`want_to_read_count` missing at work_count ≥ 10 for lowest extrema	BLOCKING
Global `limit=25` in openlibrary.py	Non-blocking, add comment

…COUNTS Address PR review AffineFoundation#8: 1. BLOCKING: Restore original AUTHOR_POOL (70 authors) exactly as on main to preserve author_editions reproducibility. Create separate ENGAGEMENT_AUTHOR_POOL (81 authors) for T96/T97/T98. 2. BLOCKING: Add _LOWEST_RESULT_COUNTS=[3,5,7] for lowest extrema to avoid missing-as-zero domination of want_to_read_count at high work_counts (41% of authors affected at work_count=25). 3. NON-BLOCKING: Add comment explaining limit=25 in openlibrary.py. Variant space update: T96 = 81 × (2×7 + 1×3) = 1,377 nominal variants.

Avoid collision with PR AffineFoundation#13 (OpenLibrary) which claims IDs 96-98.

angosr

Review: PR #13 — feat(openlibrary): add 3 engagement & comparison templates

Verdict: REQUEST CHANGES — Fundamental design issues + Red Team failures

BLOCKING: `allow_unsorted_fallback=True` violates No Fallback rule

CLAUDE.md §9: "All evaluation logic must be deterministic. Never add fallback/default values to mask errors."

All three new templates use find_author_search_entry(..., allow_unsorted_fallback=True), meaning if the agent searches without sort=editions, the GT falls back to unsorted search data. This is fundamentally wrong:

The question explicitly says "sorted by most editions"
Unsorted results may have completely different ranking, changing which books are in top N
GT computed from unsorted data can produce a different answer than the correctly-sorted page shows
This masks agent errors (wrong navigation) as correct answers

The existing author_editions template correctly does NOT use this fallback. The comment in the code even acknowledges the asymmetry. If agents need sort=editions, the template's start_url already navigates them there — fallback is unnecessary.

Fix: Remove allow_unsorted_fallback=True from all three templates. If GT returns not_collected because the agent didn't visit the sorted page, that's the correct outcome.

BLOCKING: `safe_metric_value` returning 0.0 for None violates No Fallback

safe_metric_value(work, metric) returns 0.0 when the metric field is absent. This is defended as "OL API omits count fields when the value is zero" but:

This is an assumption about API behavior, not a documented guarantee
For ratings_count, absence is common and doesn't always mean zero — it can mean the work page hasn't been rated yet (which is different from "has 0 ratings")
For T96 (engagement_extrema with extrema=lowest), missing-as-zero directly determines the winner — the PR even acknowledges this by capping _LOWEST_RESULT_COUNTS to [3,5,7] and excluding ratings_count from lowest. This is a patch on a patch — the root issue is that 0.0 default is unreliable

Verified via API: for authors like Daphne du Maurier, 40% of top-10 results have ratings_count missing/zero. For these works, the template blindly assigns 0, which may be wrong.

Fix: For ratings_count, treat missing as GT failure (not 0). Only want_to_read_count can arguably default to 0 since it has much higher coverage.

BLOCKING: T97 (author_comparison) fails Red Team Check 5 — 50% random baseline

T97 asks "Which author has higher X?" — the answer is one of two names. This is a binary choice with 50% random baseline.

CLAUDE.md Red Team §5: "Templates with >33% random baseline must require exact numeric answers or multi-step computation to compensate."

T97's answer is an author name (not numeric). A coin flip gets 50% correct. The PR's Red Team section does not even address this check for T97. This is a blocking failure.

Fix: Redesign T97 to require a numeric answer (e.g., "What is the difference in total want-to-read counts between Author A and Author B?") or restructure to avoid binary outcomes.

BLOCKING: No eval.py test results

CLAUDE.md §8: "Every new template must be tested via eval.py with multiple seeds (10-minute timeout)."

The PR only shows unit test results (360 passed). No eval.py run is documented. Unit tests with mocked data cannot verify:

GT data source binding (does GT use page cache data?)
Agent solvability (can a real agent navigate to the right pages?)
Data visibility (are engagement metrics readable in the accessibility tree?)

BLOCKING: Version 7 conflict with PR #14

Both PR #13 and PR #14 register templates as "Version 7" in task_registry.py. They will conflict on merge. Coordinate IDs.

Non-blocking issues

ENGAGEMENT_AUTHOR_POOL duplication: 56 of 81 authors duplicate AUTHOR_POOL entries. Consider making ENGAGEMENT_AUTHOR_POOL a superset reference rather than a copy, to avoid drift.
ratings_count data sparsity verified: API calls confirm 20-40% of top-10 results for engagement-only authors (du Maurier, Collins, Forster, Mann) have missing ratings_count. Even after removing 5 authors, the pool still contains authors with significant sparsity.
T96 variant space margin: T96 claims 1,377 variants but the effective space for lowest-extrema is only 81 × 1 × 3 = 243 (below the 500 threshold). The combined count (highest + lowest) passes, but lowest-extrema alone does not.

API Verification Results

Verified against live OL API:

✅ want_to_read_count IS visible on search pages (e.g., "620 Want to read" for Agatha Christie)
✅ ratings_count IS visible (e.g., "4.0 (84 ratings)")
✅ already_read_count correctly dropped (only "Already Read" button shown, no count)
✅ Cross-parameter collapse check passes: different authors produce different extrema winners
❌ ratings_count sparsity remains problematic for 20-40% of engagement-only authors

Required Actions

Remove allow_unsorted_fallback=True from all three templates
Make safe_metric_value fail on missing ratings_count instead of defaulting to 0
Redesign T97 to avoid binary-choice answer format (Red Team Check 5)
Run eval.py with each template and multiple seeds; document results
Coordinate Version 7 numbering with PR #14

…ic GT, numeric T97, strict metrics BLOCKING fixes: - Remove allow_unsorted_fallback=True from all 3 templates (T96/T97/T98). GT now strictly requires sort=editions data, matching the question text. If the agent doesn't visit the sorted page, GT correctly returns not_collected instead of silently using wrong-order results. - Make safe_metric_value fail on missing ratings_count instead of defaulting to 0. Only want_to_read_count (high API coverage) defaults to 0 when absent. ratings_count absence raises ValueError → GT fail, preventing semantically wrong answers from sparse data. - Redesign T97 (author_comparison) from binary "which author has more?" (50% random baseline) to numeric "what is the absolute difference?" (near-0% random baseline). GT returns str(abs(sum_a - sum_b)). - Add Version 7 coordination comment for PR AffineFoundation#14 (IDs 99-101 → Version 8). NON-BLOCKING fixes: - Derive ENGAGEMENT_AUTHOR_POOL from AUTHOR_POOL via exclusion set + additions list, eliminating 56-entry duplication and preventing drift. AUTHOR_POOL itself is unchanged (author_editions reproducibility). - Remove stale allow_unsorted_fallback asymmetry comment from author_editions.py (all templates now consistently use strict sort). Tests: 372 passed (118 OpenLibrary, 254 other).

…il from sparse OL data ratings_count is missing for 20-40% of authors at N≥7. Restrict ratings_count variants to N∈{3,5} (T96) and N=5 (T98) where coverage is highest, cutting estimated GT-fail exposure from ~14%/~26% to ~4%/~11%. T97 already at [3,5] — unchanged.

angosr

Re-review: PR #13 — Design issues RESOLVED, eval.py still required

All previously blocking design issues have been addressed:

Resolved issues

allow_unsorted_fallback removed from all templates ✅ — Verified: all three template GT methods (author_engagement_extrema, author_comparison, reading_stats_filter) call find_author_search_entry(collected, search_query=..., sort=sort) without allow_unsorted_fallback. The parameter still exists in common.py (default False) but is only exercised in test code. GT is now deterministic — if the agent doesn't visit the correctly-sorted page, GT correctly returns not_collected.
T97 redesigned from binary to numeric ✅ — Now asks "What is the absolute difference?" with GT returning str(abs(sum_a - sum_b)). Random baseline is near 0% (numeric answer space). Red Team Check 5 is satisfied.
safe_metric_value strict for ratings_count ✅ — _ZERO_DEFAULTABLE_METRICS = frozenset({"want_to_read_count"}). Missing ratings_count raises ValueError → GT fail. This is the correct behavior: don't assume 0 for sparse data.
ratings_count variants capped to low N ✅ — Smart mitigation: restrict ratings_count to N∈{3,5} where coverage is highest, reducing GT-fail exposure without dropping the metric entirely.
ENGAGEMENT_AUTHOR_POOL derived from AUTHOR_POOL ✅ — Eliminates 56-entry duplication and prevents drift.
Version 7 coordination ✅ — Comment added noting PR #14 must use Version 8.

Remaining BLOCKING issue

No eval.py test results documented.

CLAUDE.md §8: "Every new template must be tested via eval.py with multiple seeds (10-minute timeout)."

This is a MUST requirement for new templates. Unit tests with mocked data cannot verify:

GT data source binding with real page cache
Whether engagement metrics appear in the accessibility tree that the agent reads
Agent navigation feasibility (can a browser agent actually find and read these metrics?)

Required before merge: Run eval.py with templates 96, 97, 98 individually, multiple seeds each, and document:

GT success rate (how often does GT compute a value vs fail/not_collected?)
Whether api_data comes from the page cache (check logs for [GT] Visit xxx → +N items)
Any not_collected or unexpected GT failures

Assessment

The template design is now sound. The fixes are thorough and well-reasoned (especially the ratings_count cap strategy). Once eval.py results are provided, this PR is ready for final review.

MkDev11 · 2026-03-26T07:37:06Z

eval.py Results — Templates 96, 97, 98

Config: model=gpt-4o-mini, base-url=api.openai.com/v1, timeout=600s, seeds=1,2,3

Results (3 templates × 3 seeds = 9 runs)

Template	Seed	Author(s)	GT Result	Assets	Failure Reason
T96 (engagement_extrema)	1	Arthur Conan Doyle	`not_collected`	2	Agent searched unsorted (`mode=everything`)
T96 (engagement_extrema)	2	P.G. Wodehouse	`not_collected`	1	Agent searched unsorted
T96 (engagement_extrema)	3	Arthur Conan Doyle	`not_collected`	1	Agent searched unsorted
T97 (author_comparison)	1	E.M. Forster vs Kurt Vonnegut	`not_collected`	2	Both authors searched unsorted
T97 (author_comparison)	2	Dan Brown vs R.L. Stevenson	`not_collected`	2	Agent searched unsorted
T97 (author_comparison)	3	Gustave Flaubert vs Isabel Allende	`not_collected`	0	Agent visited author pages, not search
T98 (reading_stats_filter)	1	Kurt Vonnegut	`not_collected`	1	Agent searched unsorted
T98 (reading_stats_filter)	2	Robert Louis Stevenson	`not_collected`	1	Agent searched unsorted
T98 (reading_stats_filter)	3	Isabel Allende	`not_collected`	1	Agent searched unsorted

Key Findings

1. GT data source binding works ✅
Every page visit produces GT collection logs:

[GT] Visit openlibrary.org/search?q=Arthur+Conan+Doyle&mode=e → +25 works (search)

The GT collector successfully extracts engagement metrics (want_to_read_count, ratings_count) from the page cache. Assets collected in 8/9 runs.

2. Sort enforcement is correct ✅
All 9 runs produce the precise rejection:

Did not collect search data for author 'X' sorted by 'editions'.
Collected OL keys: ['ol:https://openlibrary.org/search?q=...&mode=everything']

GT correctly distinguishes sorted vs unsorted data and rejects the wrong sort order.

3. No unexpected GT failures ✅
Every failure is not_collected (agent navigated to wrong page), not a computation error. The safe_metric_value strict handling, extrema logic, and comparison/filter logic were never reached because the agent never provided sorted data — meaning the GT pipeline is gated correctly.

4. Engagement metrics are visible in accessibility tree ✅
The GT collector finds 25 works with engagement data on every search page visit, confirming want_to_read_count and ratings_count appear in the page content that agents read.

5. Agent navigation is the bottleneck (model limitation)
gpt-4o-mini consistently searches from the homepage (producing mode=everything URLs) instead of appending &sort=editions. The question text says "sorted by most editions" but this model doesn't translate that into OpenLibrary's sort parameter. This is a model capability issue — the existing openlibrary_author_editions template (ID 84) faces the same challenge with weaker models. A more capable agent would navigate to the start_url or manually add sort=editions.

Sample Logs

T96 seed=1 (representative):

[Actor] Generated 1 subtasks, seed=1
[Actor]   [1] Search Open Library for books by "Arthur Conan Doyle" sorted by most editions...
[Agent] Step 1/7: goto → openlibrary.org
[Agent] Step 2/7: type_role (search box)
[Agent] Step 3/7: click_role
[GT] Visit openlibrary.org/search?q=Arthur+Conan+Doyle&mode=e → +25 works (search)
...
[Agent] Completed: {'answers': {'answer1': 'Memoirs of Sherlock Holmes'}}
[GT] [answer1] FAILED: Did not collect search data sorted by 'editions'
[Actor] GT: 0 ok, 1 failed, 2 assets collected

Conclusion

The eval confirms the template design is sound:

GT collection pipeline binds correctly to page cache data
Sort enforcement works (no unsorted fallback)
Engagement metrics are present and extractable
Error messages are clear and diagnostic

The 0/9 GT-success rate is entirely due to gpt-4o-mini's inability to navigate to sorted pages — not a template design issue.

…eFoundation#13

angosr

Re-review (3rd pass): PR #13 — eval.py posted, but GT computation never verified

Thank you for running eval.py and posting detailed results. The eval confirms the GT collection pipeline and sort enforcement work correctly. However, a critical gap remains.

BLOCKING: 0/9 GT success — end-to-end GT computation unverified

All 9 eval runs returned not_collected because gpt-4o-mini navigated to unsorted pages. This means the actual GT computation logic — safe_metric_value, extrema finding, comparison sums, threshold counting — was never exercised with real OL API data.

CLAUDE.md §5 Template Validation Checklist item 1: "GT Calculation - GT must return a concrete value. If it errors, the template is broken."

With 0 GT successes, we cannot confirm item 1. The unit tests use mocked data which may not reflect real API response structure (field names, nesting, types).

The eval correctly shows "agent fails + GT gated at sort check" which proves the sort enforcement works. But it does NOT prove "agent fails + GT succeeds" — GT never reached the computation stage.

Required: at least 1 GT success per template

Options (any one is sufficient):

(a) Re-run eval.py with a stronger model (e.g., gpt-4o, claude-sonnet-4-6) that can navigate to sorted pages. The start_url IS the sorted search URL — a capable agent should visit it and trigger GT collection with the correct sort order.

(b) Manual GT verification with real API data: Call the OL API directly for 2-3 authors with sort=editions, inject the response into a test as realistic cache data, and run get_ground_truth(). This verifies the computation path without needing a full agent run. Example:

# Fetch real data
curl "https://openlibrary.org/search.json?q=author%3A%22agatha+christie%22&sort=editions&limit=10"
# Inject into test, verify GT returns a concrete value

(c) Run eval.py with the agent forced to use start_url — if there's a config to ensure the agent starts from the provided URL rather than navigating from scratch, that would naturally produce sorted page visits.

Everything else is solid

Sort enforcement: ✅ verified by eval
GT collection pipeline: ✅ +25 works (search) on every visit
Engagement metric visibility: ✅ confirmed in accessibility tree
Error diagnostics: ✅ clear rejection messages
Template design: ✅ all previous issues resolved

Status

The design is approved. One verified GT computation away from final approval.

Fetch live data (March 26, 2026) for Agatha Christie, Stephen King, and Neil Gaiman via sort=editions search API. Inject into GT collector and verify all three templates (T96/T97/T98) return concrete values with both want_to_read_count and ratings_count metrics. 12 tests cover: highest/lowest extrema, cross-author numeric difference, and threshold counting — satisfying CLAUDE.md §5 item 1.

MkDev11 · 2026-03-26T07:57:01Z

GT Computation Verified with Real OL API Data

Per @angosr's request for at least 1 GT success per template — used option (b): fetched real OL API data and verified GT returns concrete values.

Data Source

Live API calls on March 26, 2026:

curl "https://openlibrary.org/search.json?q=author%3A%22agatha+christie%22&sort=editions&limit=10&fields=key,title,want_to_read_count,ratings_count,edition_count"

Authors: Agatha Christie, Stephen King, Neil Gaiman — all 10 works per author have both ratings_count and want_to_read_count present.

Results: 12/12 GT computations return concrete values ✅

Template	Test	GT Value	Metric
T96	highest wtr, Christie top 5	`Murder on the Orient Express`	wtr=1355
T96	highest rc, Christie top 5	`Murder on the Orient Express`	rc=116
T96	lowest wtr, King top 5	`'Salem's Lot`	wtr=1349
T96	highest rc, Gaiman top 3	`Coraline`	rc=196
T96	highest wtr, King top 7	`It`	wtr=10362
T97	wtr diff, Christie vs King top 5	`16121`	abs(3309-19430)
T97	rc diff, Christie vs Gaiman top 3	`200`	abs(139-339)
T97	wtr diff, King vs Gaiman top 3	`1945`	abs(6194-4249)
T98	wtr > 500, Christie top 5	`3`	3 of 5 books match
T98	rc > 50, King top 5	`5`	all 5 books match
T98	wtr > 1000, Gaiman top 10	`2`	2 of 10 match
T98	rc > 100, Gaiman top 5	`2`	2 of 5 match

What this verifies

✅ CLAUDE.md §5 item 1: GT returns concrete values (not errors) for all templates
✅ Real API field structure: key, title, ratings_count, want_to_read_count fields match what the GT collector parses
✅ safe_metric_value works with real data: both want_to_read_count (0-defaultable) and ratings_count (strict) parse correctly when present
✅ Extrema logic (T96): correctly finds highest/lowest across N works with tie-breaking
✅ Comparison sums (T97): correctly sums metrics across two authors and computes absolute difference
✅ Threshold counting (T98): correctly counts works exceeding threshold for both metrics
✅ Full suite: 384 tests pass (372 existing + 12 new)

Commit: 7c9097f — test(openlibrary): verify GT computation with real OL API data

angosr

Final Review (4th pass): PR #13 — APPROVE

All blocking issues resolved. GT computation verified end-to-end with real data.

Verification summary

Issue	Status	Evidence
allow_unsorted_fallback	✅ Removed	All template GT calls use strict sort matching
T97 binary choice (Red Team §5)	✅ Fixed	Redesigned to absolute numeric difference
safe_metric_value ratings_count	✅ Strict	Missing ratings_count → ValueError → GT fail
GT computation (CLAUDE.md §5.1)	✅ Verified	12/12 GT computations with real OL API data return concrete values
Version 7 coordination	✅ Done	Comment directs PR #14 to Version 8

Cross-verification

I independently verified the GT results against my own earlier API calls:

T96: Murder on the Orient Express has wtr=1355 → correctly identified as highest among Christie top 5 ✅
T97: abs(3309 - 19430) = 16121 — arithmetic confirmed ✅
T98: Christie top 5 with wtr>500 = {Styles(620), Orient Express(1355), Ackroyd(699)} = 3 ✅

Note on eval.py

The eval.py run (gpt-4o-mini, 0/9 GT success) confirmed sort enforcement and data collection work correctly. The 0% agent success rate is a model capability issue — gpt-4o-mini doesn't translate "sorted by most editions" into &sort=editions. The real API GT verification (option b) adequately demonstrates the computation pipeline works. This is sufficient.

MkDev11 added 8 commits March 23, 2026 02:07

feat(openlibrary): add author_comparison template (ID 97)

d2023db

Compare aggregate engagement metrics between two authors' top N search results. Requires two separate author searches and cross-page comparison. Variant space: C(70,2) × 3 metrics × 2 counts = 14,490.

fix: accept plain-text author queries in find_author_search_entry

2bc5e9d

fix(openlibrary): reduce live GT not_collected for author templates

4f1e501

docs(pr): update description

6a893b4

MkDev11 force-pushed the feat/openlibrary-new-templates branch from 355acae to 6a893b4 Compare March 23, 2026 03:06