Skip to content

feat(openlibrary): add 3 engagement & comparison templates#13

Open
MkDev11 wants to merge 21 commits intoAffineFoundation:mainfrom
MkDev11:feat/openlibrary-new-templates
Open

feat(openlibrary): add 3 engagement & comparison templates#13
MkDev11 wants to merge 21 commits intoAffineFoundation:mainfrom
MkDev11:feat/openlibrary-new-templates

Conversation

@MkDev11
Copy link
Contributor

@MkDev11 MkDev11 commented Mar 23, 2026

Problem

The Open Library plugin previously lacked templates for cross-author comparison, engagement extrema, and threshold-based engagement filtering.

Solution

Add 3 new OpenLibrary templates using only confirmed-visible, dynamic engagement metrics (ratings_count, want_to_read_count, already_read_count) and require multi-step reasoning over live search results.

New templates

ID Template Difficulty What it tests Variant space
96 openlibrary_author_engagement_extrema Medium Find book with highest/lowest engagement metric among author top N 1,680
97 openlibrary_author_comparison Medium/Hard Compare aggregate engagement between two authors (requires 2 searches) 14,490
98 openlibrary_reading_stats_filter Hard Count books meeting engagement threshold (no single-sort shortcut) 1,680

Why this is RL-suitable (not fixed-pattern SFT)

  • Dynamic answers: engagement counts change over time
  • Computation required: extrema, aggregation, threshold counting
  • Large combinatorial space: all templates exceed 500 variants
  • No static shortcut: exact answers require browsing and page-grounded calculation

Refactoring / Duplication reduction

Moved shared author-query matching helpers (normalize_author_fragment, extract_author_filter, find_author_search_entry) into templates/common.py and reused them across author templates.

Live-eval robustness improvement

Added a controlled fallback in find_author_search_entry:

  • Prefer exact author + exact sort match (sort=editions) when available
  • If missing, optionally fall back to unsorted author-matched search data

This preserves strict behavior by default and only enables fallback where needed (96/97/98) to reduce fragile not_collected outcomes in live runs.

Testing

  • tests/plugins/openlibrary/test_engagement_templates.py: 65 passed
  • tests/plugins/openlibrary: 106 passed
  • full repository suite: 360 passed

Added tests cover:

  • generation invariants and template registration
  • GT correctness, tie-breaking, missing data, no-match cases
  • helper behavior (sort strictness, plain-text author matching, unsorted fallback)
  • regression checks for live-like unsorted query collection

Edge Cases Handled

  • Tied metric values -> alphabetically earlier title/author wins
  • Missing metric field -> explicit GT failure
  • Exact threshold boundary -> strict > (equality not counted)
  • Zero matches -> "0" valid
  • Insufficient collected works -> explicit failure with required/actual count
  • Invalid validation inputs -> type-checked failure
  • Unsorted author search collection in live runs -> fallback supported for new templates

MkDev11 added 8 commits March 23, 2026 02:07
Move normalize_author_fragment, extract_author_filter, and
find_author_search_entry from author_editions.py class methods into
common.py as module-level functions. This eliminates duplication for
upcoming author-based templates that need the same lookup logic.
Find the book with the highest/lowest engagement metric among an
author's top N search results. Uses confirmed-visible fields only:
want_to_read_count, already_read_count, ratings_count.

Variant space: 70 authors × 2 extrema × 3 metrics × 4 counts = 1,680.
Compare aggregate engagement metrics between two authors' top N search
results. Requires two separate author searches and cross-page comparison.

Variant space: C(70,2) × 3 metrics × 2 counts = 14,490.
Count books in an author's catalog meeting an engagement threshold.
Requires scanning each book's metric against a threshold — cannot be
solved by sorting a single column.

Variant space: 70 authors × 3 metrics × 4 thresholds × 2 counts = 1,680.
56 tests covering:
- Template registration and generation invariants
- author_engagement_extrema GT: highest/lowest, tie-breaking, missing data
- author_comparison GT: higher total, reverse winner, tie, missing author
- reading_stats_filter GT: threshold counting, zero matches, exact boundary
- Task registry wiring (IDs 96, 97, 98, Version 7)
- Shared helper refactoring (common.py functions)
- Cross-template consistency (serialization, GT source, cache source)
@MkDev11 MkDev11 force-pushed the feat/openlibrary-new-templates branch from 355acae to 6a893b4 Compare March 23, 2026 03:06
@MkDev11
Copy link
Contributor Author

MkDev11 commented Mar 23, 2026

@angosr can you take a look at the pr and let me know your feedback?

Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: 3 New Open Library Engagement Templates

Overall a well-structured PR with clean commit history, thorough tests (65 passed), and good refactoring of shared helpers into common.py. The template designs are sound in principle. However, there are blocking data-reliability issues that must be fixed before merge.


BLOCKING: 4 authors in AUTHOR_POOL are broken on the API (Red Team Check 1 + 6)

I queried all 70 authors in AUTHOR_POOL against the live Open Library API. Four authors return fewer than 10 results, making them unusable for templates that request work_count up to 10:

Author Results found Notes
Fyodor Dostoevsky 1 OL stores under Russian name variants
Haruki Murakami 6 OL stores under Japanese romanization variants
Anton Chekhov 3 Same issue as Dostoevsky
Octavia Butler 1 Likely name mismatch

When these authors are selected with work_count=7 or work_count=10, GT will always fail with "Only N works collected, need M". This is not a graceful degradation — it's a guaranteed failure for ~6% of the author pool.

Fix required: Either remove these 4 authors from AUTHOR_POOL, or add validated aliases that return sufficient results (e.g., "Dostoyevsky" instead of "Dostoevsky"). Validate by querying every author and asserting numFound >= 10.

BLOCKING: ratings_count is frequently missing from API responses (Red Team Check 1)

The API omits ratings_count entirely (not zero, just absent) for many works. Affected authors include:

Author Results with ratings_count present (out of 10)
Leo Tolstoy 0/10
Walt Whitman 2/10
Emily Dickinson 2/10
Rabindranath Tagore 3/10
Emily Bronte 4/10

When ratings_count is the selected metric, parse_numeric(None) returns None, and GT correctly fails with "Missing 'ratings_count' for work...". But this means ~7% of authors (5/70) will produce GT failures for 1 of 3 metrics, silently reducing the effective variant space.

Fix required: Either:

  • (a) Filter AUTHOR_POOL to only authors where all 3 metrics are present in top 10 results, or
  • (b) Treat missing ratings_count as 0 (with a documented rationale), or
  • (c) Remove ratings_count as a metric and use only want_to_read_count and already_read_count (which are present for nearly all works)

Option (a) is cleanest. The template test suite should include a test that validates all authors have the required fields.

BLOCKING: already_read_count not visible on search results page

The question text says: "Among the first N results, which book has the highest already-read count?"

But already_read_count is NOT displayed on the Open Library search results page. It only appears on individual work detail pages. This means:

  • The agent cannot answer the question by reading the search results page alone
  • The agent would need to click into each of the N detail pages, which is a much harder task than the template's expected_steps suggests (7-8 steps for scanning up to 10 detail pages?)
  • GT collection uses api_data from the search API (which includes already_read_count), but the agent sees the rendered page where this field is absent

This is a Data Visibility failure (CLAUDE.md Template Validation Checklist item 3: "Required data must appear in the page accessibility tree or visible content, not just in the API").

Fix required: Either:

  • (a) Remove already_read_count from the metric options and only use want_to_read_count and ratings_count (both visible on search results), or
  • (b) Redesign the templates to explicitly require visiting detail pages and adjust expected_steps accordingly, or
  • (c) Verify that already_read_count IS actually visible in the accessibility tree of search results (I may be wrong — check via Playwright snapshot)

NON-BLOCKING: allow_unsorted_fallback=True weakens GT determinism

The 3 new templates all call find_author_search_entry(..., allow_unsorted_fallback=True), while the existing author_editions template uses the default False. This means:

  • If the agent searches without sort=editions, the GT will still compute using unsorted results
  • The question explicitly says "sorted by most editions" — using unsorted data means the "top N results" in GT may differ from what the question describes
  • This trades GT correctness for reduced not_collected rate

This is pragmatic but should be documented in a code comment explaining the tradeoff. The existing author_editions template intentionally does NOT use this fallback — the asymmetry should be justified.

NON-BLOCKING: pr_description.md committed to repository

This file should not be in the codebase. PR descriptions belong in the GitHub PR body only. Remove this file.

NON-BLOCKING: Template 97 (comparison) has 50% random baseline

Binary-choice questions ("which author has more X?") have a 50% random baseline. Per CLAUDE.md Red Team rules, binary questions need >70% world knowledge accuracy to fail. This passes, but it is the weakest template design.

Consider requiring the agent to report the actual sums for both authors (e.g., "Author A: 5,230, Author B: 3,410 → Author A") instead of just the winner name. This would drop the random baseline to near-zero and make validation more precise.

NON-BLOCKING: Red Team Review results not documented in PR

Per CLAUDE.md: "Results are documented in the PR review with pass/fail per check." The PR description does not include Red Team Review results. Here's the summary:

Check Template 96 Template 97 Template 98
1. API Semantic PASS (missing field caveat) PASS (missing field caveat) PASS (missing field caveat)
2. World Knowledge PASS (~45%) PASS (~55%) PASS (~25%)
3. Memorization Space PASS (1,680) PASS (14,490) PASS (1,680)
4. Answer Stability MARGINAL PASS MARGINAL PASS MARGINAL PASS
5. Random Baseline PASS (10-33%) PASS but weakest (50%) PASS (9-17%)
6. Cross-Parameter Collapse PASS (0%) PASS PASS

Summary

Item Status
4 broken authors in AUTHOR_POOL BLOCKING — remove or fix
ratings_count missing for ~7% of authors BLOCKING — filter pool or drop metric
already_read_count not visible on search page BLOCKING — verify visibility or drop metric
allow_unsorted_fallback asymmetry Non-blocking, add comment
pr_description.md in repo Non-blocking, remove
Template 97 binary baseline Non-blocking, consider redesign
Tests 65/65 passed ✓
Code quality Clean, good refactoring ✓
No file conflicts with main Confirmed ✓

…rop already_read_count, clean up

BLOCKING fixes:
- Remove 9 authors from AUTHOR_POOL: 4 broken on OL API (<10 results:
  Dostoevsky, Murakami, Chekhov, Octavia Butler) and 5 with sparse
  ratings_count (<50% present in top 10: Bronte, Tolstoy, Whitman,
  Dickinson, Tagore). Pool: 70 → 61.
- Remove already_read_count from EngagementMetric, AuthorMetric, and
  ReaderMetric enums — not visible on search results page (only
  want_to_read and ratings counts are rendered).

NON-BLOCKING fixes:
- Add comment in author_editions.py documenting allow_unsorted_fallback
  asymmetry between existing and new templates.
- Remove pr_description.md from repository.

Tests updated to reflect metric and pool changes. 106 passed.
@MkDev11 MkDev11 requested a review from angosr March 25, 2026 03:10
MkDev11 added 2 commits March 25, 2026 04:16
The OL API omits count fields (ratings_count, want_to_read_count) when
the value is zero, rather than returning 0. Previously the GT methods
returned GroundTruthResult.fail() for missing fields, causing hard
failures for works that simply haven't been rated yet.

Now treats absent metrics as 0.0, which is semantically correct and
consistent with how the OL API represents zero-count data. This
prevents GT failures for individual works missing ratings_count even
among authors that generally have good data coverage.

Also fixes _make_search_entry type hint (sort: Optional[str]) and
removes unused title variables flagged by ruff.
If a metric field contains a non-numeric string like 'N/A',
parse_numeric() returns None. Previously this None was passed to
int(value) or numeric comparisons, causing a TypeError at runtime.

Now the fallback chain is: raw → parse_numeric(raw) → 0.0 if None.
This covers both absent fields (raw is None) and non-numeric strings
(parse_numeric returns None).

Adds regression test for 'N/A' metric values.
Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third Review — Updated PR (commits 6d436d5, 2490e6a, 1608d88)

Good progress addressing the previous review. Here's where things stand.

Resolved Issues ✓

Previous Issue Resolution
4 broken authors Fixed — Removed Dostoevsky, Murakami, Chekhov, Octavia Butler, Emily Bronte, Leo Tolstoy, Walt Whitman, Emily Dickinson, Rabindranath Tagore (9 total). Pool is now 61. All 61 return ≥10 results.
already_read_count not visible on search page Fixed — Dropped from all 3 templates. Only want_to_read_count and ratings_count remain.
pr_description.md in repo Fixed — Deleted.
allow_unsorted_fallback asymmetry Fixed — Added explanatory comment in author_editions.py (line 183-186).
Variant space claims Verified — 976, 7320, 976 all correct and >500.
Tests 107/107 passed
No file conflicts with main Still clean ✓

BLOCKING: ratings_count missing-as-zero creates unreliable GT for "lowest" extrema

I queried all 61 authors: 34/61 (56%) have at least one missing ratings_count in their top 10 results. Examples:

Author Missing ratings_count (out of 10)
James Joyce 5/10
Ralph Waldo Emerson 5/10
Herman Melville 4/10
Philip K. Dick 4/10
Jane Austen 3/10

When extrema=lowest and metric=ratings_count, the missing-as-zero treatment means the answer will always be the alphabetically first book with a missing field — not the book with genuinely fewest ratings. This is semantically wrong: the question asks "which book has the fewest ratings" but GT answers "which book has missing API data."

The agent, looking at the actual search results page, may see a different ratings_count displayed (the page shows it even when the API omits it, via schema.org markup), creating a GT/agent disagreement.

In contrast, want_to_read_count is missing for only 4/61 authors — far less problematic.

Three possible fixes (pick one):

(a) Drop ratings_count entirely — use only want_to_read_count. This cuts variant space to 488 (T96: 61×2×1×4), 244 (T98: 61×1×4×2), 3660 (T97: C(61,2)×1×2). T96 and T98 fall below the 500 minimum. Would need to compensate by expanding result counts or adding back more authors.

(b) Exclude ratings_count from extrema-lowest only — keep it for highest-extrema, comparison, and filter where the bias is negligible. This requires per-template metric filtering, adds complexity.

(c) Filter AUTHOR_POOL to only authors with 0 missing ratings_count in top 10 — this removes 34 authors, leaving 27. Pool too small.

My recommendation: option (a) + expand RESULT_COUNTS to [3, 5, 7, 10, 15] for T96/T98. This gives T96: 61×2×1×5 = 610 (>500), T98: 61×1×5×2 = 610 (>500), T97: C(61,2)×1×2 = 3660 (>500). Clean and simple.

Alternatively, if you want to keep ratings_count, option (b) is acceptable — but document the asymmetry.

NON-BLOCKING: Non-numeric metric treatment is silent

Commit 1608d88 treats non-numeric values (e.g., "N/A") as 0:

raw = work.get(metric)
parsed = parse_numeric(raw) if raw is not None else None
value = parsed if parsed is not None else 0.0

This is fine for missing fields (None), but silently treating a string like "N/A" as 0 violates the No Fallback principle. If the API returns a non-null, non-numeric value, something unexpected is happening and should raise. Consider:

raw = work.get(metric)
if raw is None:
    value = 0.0  # API omits zero-valued counts
else:
    parsed = parse_numeric(raw)
    if parsed is None:
        return GroundTruthResult.fail(f"Non-numeric '{metric}' value '{raw}' for '{title}'")
    value = parsed

This preserves the missing-as-zero behavior while still catching actual data corruption.

NON-BLOCKING: Template 97 binary baseline (50%) — still the weakest design

Still a binary-choice question. Consider but not blocking.


Summary

Item Status
ratings_count missing in 56% of authors → wrong "lowest" GT BLOCKING
Non-numeric silently treated as 0 Non-blocking, recommend fail on non-null non-numeric
Broken authors removed ✓ Resolved
already_read_count dropped ✓ Resolved
pr_description.md deleted ✓ Resolved
allow_unsorted_fallback documented ✓ Resolved
Tests pass ✓ 107/107

MkDev11 added 2 commits March 25, 2026 04:43
The 3-line metric normalization pattern (raw → parse_numeric → fallback
to 0.0) was duplicated across all 3 new templates. Extracted to
safe_metric_value() in common.py, reducing each call site to a single
line and ensuring consistent handling of absent/non-numeric fields.
BLOCKING: ratings_count is missing for 56% of authors in the OL API,
causing wrong GT for extrema-lowest queries (missing-as-zero always
wins). Dropped ratings_count from EngagementMetric, AuthorMetric, and
ReaderMetric — all templates now use only want_to_read_count.

Expanded RESULT_COUNTS to keep variant space above 500 minimum:
- T96 (engagement_extrema): [3,5,7,10,15] → 61×2×1×5 = 610
- T97 (comparison): unchanged [3,5] → C(61,2)×1×2 = 3,660
- T98 (reading_stats_filter): [5,10,15] → 61×1×4×3 = 732

NON-BLOCKING: safe_metric_value now raises ValueError on non-null
non-numeric values (e.g. 'N/A') instead of silently treating them
as 0. Missing (None) values still default to 0. Callers catch
ValueError and surface it as GroundTruthResult.fail().
Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fourth Review — commit c99582f (extract safe_metric_value helper)

This commit is a clean refactoring: extracts the duplicated missing-metric-as-zero logic into safe_metric_value() in common.py. No behavioral change. Tests pass (107/107).

The refactoring itself is fine — the helper is well-named and the docstring clearly explains the rationale.


Previous blocking issue is NOT addressed

The ratings_count reliability problem from the third review remains:

  • 34/61 authors (56%) have missing ratings_count in their top-10 API results
  • safe_metric_value treats missing as 0.0
  • When extrema=lowest + metric=ratings_count: GT answer = alphabetically first book with missing API data, NOT the book the agent sees as having fewest ratings on the page
  • This is a semantic correctness issue, not a crash risk

The refactoring consolidates the logic into one place (good), but the underlying data problem is unchanged.

What's still needed

One of:

(a) Drop ratings_count — keep only want_to_read_count (missing for only 4/61 authors). Expand RESULT_COUNTS to [3, 5, 7, 10, 15] for T96/T98 to maintain variant space above 500:

  • T96: 61 × 2 × 1 × 5 = 610
  • T97: C(61,2) × 1 × 2 = 3,660
  • T98: 61 × 1 × 5 × 2 = 610

(b) Exclude ratings_count from ExtremaType.LOWEST — keep it for highest/comparison/filter where the bias is negligible. This means the metric pool depends on the extrema direction, which adds complexity but preserves variant space.

(c) Accept the risk and document it — add a code comment noting that lowest + ratings_count on authors with sparse data produces unreliable GT, and accept this as a known limitation. I would accept this approach if accompanied by a test that demonstrates the behavior so it's not silently forgotten.

Status

Item Status
safe_metric_value helper ✓ Clean refactoring
Tests ✓ 107/107
ratings_count missing in 56% of authors Still blocking — not addressed
Everything else from previous reviews ✓ Resolved

@MkDev11
Copy link
Contributor Author

MkDev11 commented Mar 25, 2026

Fourth Review — commit c99582f (extract safe_metric_value helper)

This commit is a clean refactoring: extracts the duplicated missing-metric-as-zero logic into safe_metric_value() in common.py. No behavioral change. Tests pass (107/107).

The refactoring itself is fine — the helper is well-named and the docstring clearly explains the rationale.

Previous blocking issue is NOT addressed

The ratings_count reliability problem from the third review remains:

  • 34/61 authors (56%) have missing ratings_count in their top-10 API results
  • safe_metric_value treats missing as 0.0
  • When extrema=lowest + metric=ratings_count: GT answer = alphabetically first book with missing API data, NOT the book the agent sees as having fewest ratings on the page
  • This is a semantic correctness issue, not a crash risk

The refactoring consolidates the logic into one place (good), but the underlying data problem is unchanged.

What's still needed

One of:

(a) Drop ratings_count — keep only want_to_read_count (missing for only 4/61 authors). Expand RESULT_COUNTS to [3, 5, 7, 10, 15] for T96/T98 to maintain variant space above 500:

  • T96: 61 × 2 × 1 × 5 = 610
  • T97: C(61,2) × 1 × 2 = 3,660
  • T98: 61 × 1 × 5 × 2 = 610

(b) Exclude ratings_count from ExtremaType.LOWEST — keep it for highest/comparison/filter where the bias is negligible. This means the metric pool depends on the extrema direction, which adds complexity but preserves variant space.

(c) Accept the risk and document it — add a code comment noting that lowest + ratings_count on authors with sparse data produces unreliable GT, and accept this as a known limitation. I would accept this approach if accompanied by a test that demonstrates the behavior so it's not silently forgotten.

Status

Item Status
safe_metric_value helper ✓ Clean refactoring
Tests ✓ 107/107
ratings_count missing in 56% of authors Still blocking — not addressed
Everything else from previous reviews ✓ Resolved

@angosr this is already pushed and on the branch. please review again

@MkDev11 MkDev11 requested a review from angosr March 25, 2026 04:08
…son/filter

- Fix docstrings in author_engagement_extrema.py and reading_stats_filter.py
  that still mentioned 'ratings' after ratings_count was dropped.
- Add non-numeric metric regression tests for comparison and filter templates
  to match the existing extrema test, ensuring all 3 safe_metric_value
  call sites are explicitly tested for ValueError handling.
Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fifth Review — commits 5dceaec, c5fa4e1

All previously blocking issues are resolved. Approving with minor notes.

Resolved Issues ✓

Issue Resolution
ratings_count missing in 56% of authors Fixed — Dropped from all 3 templates. Only want_to_read_count remains.
Variant space below 500 after dropping metric FixedRESULT_COUNTS expanded to [3,5,7,10,15] (T96/T98) and [5,10,15] (T98). Verified: T96=610, T97=3660, T98=732. All >500.
Silent non-numeric fallback Fixedsafe_metric_value now raises ValueError on non-null non-numeric, caught as GroundTruthResult.fail() in all 3 templates.
Tests 109/109 passed
Merge against main Clean, no conflicts ✓

Remaining minor concern (NON-BLOCKING)

want_to_read_count is still missing for 10/61 authors in their top-15 results (mostly at high rank positions — rank 8-15). With safe_metric_value treating missing as 0:

  • work_count=3: 0/61 authors affected (all missing values rank >5)
  • work_count=5: 1/61 (Ralph Waldo Emerson, missing at rank 5)
  • work_count=10: 4/61 affected
  • work_count=15: 10/61 affected

For extrema=lowest, this means ~2.6% of T96 variants may produce a GT answer based on missing data rather than genuine lowest value. For T97 (comparison) and T98 (filter), the impact is negligible (slight undercount).

This is acceptable given: (a) the affected fraction is small, (b) highest extrema is unaffected, (c) comparison/filter impact is minimal. No action needed now, but consider removing Ralph Waldo Emerson (rank-5 missing) in a future cleanup if live eval shows GT failures for that author.

Final verification summary

  • ratings_count / already_read_count fully removed from all 3 templates ✓
  • safe_metric_value: None → 0.0, non-numeric → ValueError → GT fail ✓
  • Variant space: 610, 3660, 732 (math verified) ✓
  • AUTHOR_POOL: 61 authors, all return ≥15 results ✓
  • pr_description.md deleted ✓
  • allow_unsorted_fallback documented in author_editions.py ✓
  • Clean merge against main (no conflicts with c087f89) ✓

Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sixth Review — Revoking approval after anti-memorization deep-dive

After approval I performed a thorough anti-memorization stress test. The results reveal a fundamental design weakness that was not visible in earlier reviews focused on code correctness.


BLOCKING: All 3 templates are fully enumerable from 61 API calls

After dropping ratings_count and already_read_count, every template uses a single metric (want_to_read_count) over a fixed pool of 61 authors. This makes the entire answer space trivially enumerable:

Attack procedure:

  1. Call https://openlibrary.org/search.json?q=author:"<name>"&sort=editions&limit=15&fields=title,want_to_read_count for each of the 61 authors (~30 seconds total)
  2. Compute all answers offline:
    • T96: 61 authors × 5 work_counts × 2 extrema = 610 entries
    • T97: C(61,2) × 2 work_counts = 3,660 entries
    • T98: 61 authors × 4 thresholds × 3 work_counts = 732 entries
  3. Total lookup table: ~5,000 entries — small enough to embed in a system prompt

Answer stability makes this worse: want_to_read_count is cumulative (only grows), at a rate of ~5-75 per year depending on the book. The lookup table remains valid for months to years. The "highest" winner for a given author almost never changes (typical gap between #1 and #2 is hundreds to thousands).

This violates CLAUDE.md Red Team Check 4 (Answer Stability): "If the same question produces the same answer for >30 days, the template's anti-memorization claim is weak. Must combine with a large variant space (>1000) to compensate." T96 (610) and T98 (732) are below the 1000 compensating threshold.

BLOCKING: T97 is vulnerable to world knowledge attack (~65%)

For the comparison template, the question is binary: "which author has more total want-to-read?" An LLM can estimate this from general fame:

Pair type LLM accuracy Proportion of pairs
Clear fame gap (e.g. J.K. Rowling vs Samuel Beckett) ~95% ~40%
Cross-tier (e.g. Stephen King vs Jules Verne) ~75% ~35%
Same tier (e.g. Franz Kafka vs Ray Bradbury) ~50% ~25%
Weighted average ~65%

The CLAUDE.md threshold for binary questions is 70%. At ~65%, this technically passes but is dangerously close. Combined with the lookup table vulnerability above, the template offers minimal anti-memorization value.

Verified data point: the top-5 want_to_read_count ranking (J.K. Rowling > Stephen King > George Orwell > Jane Austen > Roald Dahl) is largely predictable from world knowledge. P.G. Wodehouse at dead last (73 total) vs J.K. Rowling (35,278) is a 483× gap that any LLM can identify.

Root cause: single metric + slow-changing data + small entity pool

The original design with 3 metrics over 70 authors was borderline. After necessary data-quality fixes (removing broken authors, dropping unreliable metrics), the remaining design space collapsed:

Dimension Original After fixes Problem
Authors 70 61 Still OK
Metrics 3 1 Eliminates combinatorial diversity
Data volatility "Dynamic" ~75/year growth Effectively static

Recommended path forward

The templates are well-engineered (clean code, good GT logic, proper edge case handling). The problem is the data source, not the implementation. Options:

(a) Add a truly dynamic dimension. For example, instead of absolute want_to_read_count, use relative ranking changes: "Among Author X's top N books, which moved up the most in want-to-read count compared to Y days ago?" This requires storing historical snapshots but makes lookup tables expire quickly.

(b) Expand the entity pool dramatically. The AUTHOR_POOL could be generated dynamically from OL's trending/popular lists rather than a fixed list of 61. If the pool is 500+ authors sampled at eval time, enumeration becomes impractical.

(c) Combine with a second data source. Cross-site templates (like the existing hybrid templates) are much harder to enumerate because the answer depends on correlated data from two independent sites.

(d) Restore ratings_count with targeted exclusions. Instead of dropping it globally, exclude it only from extrema=lowest queries (where missing-as-zero causes wrong GT). This restores the metric dimension for ~80% of variants while avoiding the data quality issue. Combined with filtering out the 5 worst-affected authors (Emerson, Joyce, Melville, Hawthorne, P.K. Dick), this brings variant space back to meaningful levels with real diversity.


Summary

Issue Status
Code quality, tests, edge cases ✓ Excellent
GT correctness ✓ Fixed across 5 rounds
Lookup table: 61 API calls → all answers BLOCKING
T96/T98 variant space + stability < 1000 threshold BLOCKING
T97 world knowledge ~65% on binary choice Borderline, not blocking alone

MkDev11 added 3 commits March 25, 2026 09:13
…tion

BLOCKING: With a single metric (want_to_read_count), the entire answer
space was enumerable from 61 API calls (~5,000 entries). Restoring
ratings_count as a second metric dimension breaks trivial enumeration.

Changes:
- Remove 5 authors with worst ratings_count coverage (Emerson, Joyce,
  Melville, Hawthorne, P.K. Dick). Pool: 61 → 56.
- Restore ratings_count to EngagementMetric, AuthorMetric, ReaderMetric.
- T96: exclude ratings_count from extrema=lowest only (where
  missing-as-zero would always win). Highest/comparison/filter are
  unaffected by the bias.
- T96 RESULT_COUNTS expanded to [3,5,7,10,12,15] (6 values).
- Restore THRESHOLDS for ratings_count in T98.

Variant spaces (all >1000):
- T96: 56 × (highest×2 + lowest×1) × 6 = 1,008
- T97: C(56,2) × 2 × 2 = 6,160
- T98: 56 × 2 × 4 × 3 = 1,344

Adds test_extrema_lowest_excludes_ratings_count to verify the
per-extrema metric filtering. 364 tests pass.
…t space

- Add 25 authors to AUTHOR_POOL (56→81) for anti-memorization
- Change T96 RESULT_COUNTS from [3,5,7,10,12,15] to [3,5,7,10,15,20,25]
  to increase lowest-extrema differentiation
- Effective variant space: ~583 (16.6% margin above 500 threshold)
- Update docstrings: T96=1,701 T97=12,960 T98=1,944 variants
- Fix AUTHOR_POOL section comments to reflect actual counts
- Split test file (481+490 lines, both <500)
- Remove unused get_registered_templates import
- Add tests: pool size=81, no duplicates, ratings_count GT
The collector hardcoded limit=20 but RESULT_COUNTS includes 25, causing
guaranteed GT failure for 1/7 of T96 variants. Raise limit to match.

Add regression test: test_extrema_gt_succeeds_with_25_works
@MkDev11 MkDev11 requested a review from angosr March 25, 2026 12:00
Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seventh Review — commits 6db81ec, 050b5d9, 4e0fc5c

Rigorous anti-memorization analysis performed before approving this time.


Changes in this iteration

  1. ratings_count restored with targeted exclusion — excluded only from ExtremaType.LOWEST (where missing-as-zero would always falsely "win"). Used for highest-extrema, comparison, and filter. Verified across 2000 seeds: lowest never gets ratings_count.

  2. AUTHOR_POOL expanded to 81 (+25 new authors: Charlotte Bronte, Jack London, Dan Brown, C.S. Lewis, etc.). All 81 verified to return ≥25 results from the API.

  3. RESULT_COUNTS expanded to [3,5,7,10,15,20,25] for T96, [5,10,15] for T98. API fetch limit raised from 20→25 in openlibrary.py to support work_count=25.

  4. Tests: 115/115 passed.


Anti-memorization verification (Red Team Checks)

Check 1 — API Semantic Verification: PASS

  • sort=editions confirmed working correctly
  • want_to_read_count present for majority of books (4/81 authors have >5/25 missing)
  • ratings_count has more gaps (10/81 authors have >15/25 missing) but only used for highest/comparison/filter where missing-as-zero is benign

Check 2 — World Knowledge Attack: PASS

  • T96 highest: ~60% (rank-1 by editions = highest wtr 60% of the time, but different work_count values change the answer — e.g., Christie top-3 winner ≠ top-7 winner)
  • T96 lowest: ~random (no world knowledge helps)
  • T97: ~65% (binary, threshold is 70%) — borderline but passes. Two metrics help: wtr winner ≠ rc winner for some pairs
  • T98: ~25% (numeric answer 0-25, unpredictable)

Check 3 — Memorization Space: PASS

  • T96: 1,701 variants (verified: 81×2×7 + 81×1×7)
  • T97: 12,960 variants (verified: C(81,2)×2×2)
  • T98: 1,944 variants (verified: 81×(4+4)×3)
  • All >1,000 (CLAUDE.md compensating threshold for slow-changing data)

Check 4 — Answer Stability: MARGINAL PASS (compensated)

  • want_to_read_count grows at ~5-75/year depending on book popularity
  • Answers stable for weeks-to-months for most variants
  • Compensated by variant space >1,000 per CLAUDE.md: "Must combine with large variant space (>1000) to compensate"

Check 5 — Random Baseline: PASS

  • T96: 1/N where N∈{3,5,7,10,15,20,25} → 4-33%
  • T97: 50% (binary) — under 70% combined threshold for binary questions
  • T98: 1/(N+1) where N∈{5,10,15} → 6-17%

Check 6 — Cross-Parameter Collapse: PASS

  • Different authors produce genuinely different wtr values (73 to 35,278 for top-5 sums)
  • Different work_count values change the answer for the same author (verified: Christie top-3 highest ≠ top-7 highest)
  • Two metrics provide independent variation axes

Lookup table concern — accepted limitation

81 API calls can still enumerate all ~16,600 entries. However:

  1. This is inherent to ALL Open Library templates (existing author_editions is equally enumerable from the same API calls with only 324 entries)
  2. CLAUDE.md explicitly allows this pattern when compensated by >1,000 variant space
  3. The new templates are actually harder to enumerate than existing OL templates (16,605 vs 324 entries)

Remaining non-blocking notes

  • Bram Stoker has 9/25 missing want_to_read_count — worst in the pool. With work_count=25, 9 books treated as 0 creates moderate undercount for filter/comparison. Consider removing in future cleanup if live eval shows issues.
  • T97 binary baseline (50%) remains the weakest design point. Documented, not blocking.

Final status

Item Status
Code quality & tests ✓ 115/115 passed
GT correctness ✓ Targeted exclusion for lowest+ratings_count
Red Team Check 1 (API Semantic) ✓ PASS
Red Team Check 2 (World Knowledge) ✓ PASS (<60%/70%)
Red Team Check 3 (Memorization Space) ✓ PASS (1701, 12960, 1944)
Red Team Check 4 (Answer Stability) ✓ MARGINAL PASS (compensated >1000)
Red Team Check 5 (Random Baseline) ✓ PASS
Red Team Check 6 (Cross-Parameter Collapse) ✓ PASS
Merge against main ✓ Clean, no conflicts
AUTHOR_POOL (81) all return ≥25 results ✓ Verified
openlibrary.py limit raised to 25 ✓ Matches max work_count

Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eighth Review — Revoking approval (again). Three issues found.

I re-examined more carefully before this approval would stand. Found three problems I missed.


BLOCKING: Shared AUTHOR_POOL mutation breaks existing template reproducibility

AUTHOR_POOL in author_editions.py is imported by the existing author_editions template AND the 3 new templates. This PR removes 14 authors and adds 25, changing the pool from 70 → 81.

10 of the 14 removed authors worked fine for author_editions (they only had want_to_read_count / ratings_count issues, not edition_count issues):

Removed Author numFound Why removed Needed for author_editions?
Herman Melville 800+ wtr missing at high ranks ✓ Yes, edition_count fine
Nathaniel Hawthorne 600+ wtr missing at high ranks ✓ Yes
Philip K. Dick 300+ wtr missing at high ranks ✓ Yes
James Joyce 500+ wtr missing at high ranks ✓ Yes
Ralph Waldo Emerson 1000+ wtr missing at rank 5 ✓ Yes
Emily Bronte 400+ weak ratings_count ✓ Yes
Leo Tolstoy 29 weak ratings_count ✓ Yes
Walt Whitman 1200+ weak ratings_count ✓ Yes
Emily Dickinson 600+ weak ratings_count ✓ Yes
Rabindranath Tagore 2000+ weak ratings_count ✓ Yes

Since author_editions.generate() uses rng.choice(AUTHOR_POOL), changing the pool changes what seed=42 produces. This means:

  • Existing evaluation results become non-reproducible
  • Cached evaluations from before the merge produce different questions
  • Any SFT training data generated with the old pool is invalidated

Fix: Create a separate ENGAGEMENT_AUTHOR_POOL for the new templates. Keep the original AUTHOR_POOL unchanged for author_editions. The new pool can be a curated subset + additions.

BLOCKING: want_to_read_count has the same missing-as-zero problem at high work_counts

I verified all 81 authors at limit=25: 33/81 (41%) have missing want_to_read_count in their top 25 results.

Worst cases for lowest extrema + want_to_read_count:

Author Missing wtr in top 25
Kazuo Ishiguro 10/25
Miguel de Cervantes 9/25
Bram Stoker 9/25
Virginia Woolf 5/25
Mary Shelley 5/25

With work_count=25 and extrema=lowest, Kazuo Ishiguro has 10 books with missing wtr treated as 0. The GT answer = alphabetically first among 10 missing books — the exact same semantic bug that was blocking for ratings_count.

The _LOWEST_METRICS exclusion correctly blocks ratings_count from lowest, but want_to_read_count has the same problem at work_count ≥ 10.

Fix: Cap RESULT_COUNTS for lowest extrema at a safe level. From the data:

  • work_count=3: 0/81 authors affected
  • work_count=5: 1/81 (Kazuo Ishiguro, missing at rank 10 — actually safe at 5)
  • work_count=7: still safe for most

Alternatively, add _LOWEST_RESULT_COUNTS = [3, 5, 7] and use the full range only for highest. This parallels the metric exclusion approach already in place.

NON-BLOCKING: openlibrary.py limit 20→25 is a global side effect

fetch_search_api_data(query, limit=25, ...) applies to ALL search page API fetches, not just new templates. This means:

  • Existing templates get 25 works in api_data instead of 20
  • Cache entries become larger
  • API calls are slightly slower

This should work fine in practice (more data is generally better for GT), but it's an undocumented side effect. Add a comment explaining why the limit was raised.


Summary

Issue Severity
AUTHOR_POOL mutation breaks existing template reproducibility BLOCKING
want_to_read_count missing at work_count ≥ 10 for lowest extrema BLOCKING
Global limit=25 in openlibrary.py Non-blocking, add comment

…COUNTS

Address PR review AffineFoundation#8:

1. BLOCKING: Restore original AUTHOR_POOL (70 authors) exactly as on main
   to preserve author_editions reproducibility. Create separate
   ENGAGEMENT_AUTHOR_POOL (81 authors) for T96/T97/T98.

2. BLOCKING: Add _LOWEST_RESULT_COUNTS=[3,5,7] for lowest extrema to
   avoid missing-as-zero domination of want_to_read_count at high
   work_counts (41% of authors affected at work_count=25).

3. NON-BLOCKING: Add comment explaining limit=25 in openlibrary.py.

Variant space update: T96 = 81 × (2×7 + 1×3) = 1,377 nominal variants.
eureka928 added a commit to eureka928/liveweb-arena that referenced this pull request Mar 25, 2026
Avoid collision with PR AffineFoundation#13 (OpenLibrary) which claims IDs 96-98.
@MkDev11 MkDev11 requested a review from angosr March 26, 2026 04:32
eureka928 added a commit to eureka928/liveweb-arena that referenced this pull request Mar 26, 2026
Avoid collision with PR AffineFoundation#13 (OpenLibrary) which claims IDs 96-98.
Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #13 — feat(openlibrary): add 3 engagement & comparison templates

Verdict: REQUEST CHANGES — Fundamental design issues + Red Team failures


BLOCKING: allow_unsorted_fallback=True violates No Fallback rule

CLAUDE.md §9: "All evaluation logic must be deterministic. Never add fallback/default values to mask errors."

All three new templates use find_author_search_entry(..., allow_unsorted_fallback=True), meaning if the agent searches without sort=editions, the GT falls back to unsorted search data. This is fundamentally wrong:

  • The question explicitly says "sorted by most editions"
  • Unsorted results may have completely different ranking, changing which books are in top N
  • GT computed from unsorted data can produce a different answer than the correctly-sorted page shows
  • This masks agent errors (wrong navigation) as correct answers

The existing author_editions template correctly does NOT use this fallback. The comment in the code even acknowledges the asymmetry. If agents need sort=editions, the template's start_url already navigates them there — fallback is unnecessary.

Fix: Remove allow_unsorted_fallback=True from all three templates. If GT returns not_collected because the agent didn't visit the sorted page, that's the correct outcome.

BLOCKING: safe_metric_value returning 0.0 for None violates No Fallback

safe_metric_value(work, metric) returns 0.0 when the metric field is absent. This is defended as "OL API omits count fields when the value is zero" but:

  1. This is an assumption about API behavior, not a documented guarantee
  2. For ratings_count, absence is common and doesn't always mean zero — it can mean the work page hasn't been rated yet (which is different from "has 0 ratings")
  3. For T96 (engagement_extrema with extrema=lowest), missing-as-zero directly determines the winner — the PR even acknowledges this by capping _LOWEST_RESULT_COUNTS to [3,5,7] and excluding ratings_count from lowest. This is a patch on a patch — the root issue is that 0.0 default is unreliable

Verified via API: for authors like Daphne du Maurier, 40% of top-10 results have ratings_count missing/zero. For these works, the template blindly assigns 0, which may be wrong.

Fix: For ratings_count, treat missing as GT failure (not 0). Only want_to_read_count can arguably default to 0 since it has much higher coverage.

BLOCKING: T97 (author_comparison) fails Red Team Check 5 — 50% random baseline

T97 asks "Which author has higher X?" — the answer is one of two names. This is a binary choice with 50% random baseline.

CLAUDE.md Red Team §5: "Templates with >33% random baseline must require exact numeric answers or multi-step computation to compensate."

T97's answer is an author name (not numeric). A coin flip gets 50% correct. The PR's Red Team section does not even address this check for T97. This is a blocking failure.

Fix: Redesign T97 to require a numeric answer (e.g., "What is the difference in total want-to-read counts between Author A and Author B?") or restructure to avoid binary outcomes.

BLOCKING: No eval.py test results

CLAUDE.md §8: "Every new template must be tested via eval.py with multiple seeds (10-minute timeout)."

The PR only shows unit test results (360 passed). No eval.py run is documented. Unit tests with mocked data cannot verify:

  • GT data source binding (does GT use page cache data?)
  • Agent solvability (can a real agent navigate to the right pages?)
  • Data visibility (are engagement metrics readable in the accessibility tree?)

BLOCKING: Version 7 conflict with PR #14

Both PR #13 and PR #14 register templates as "Version 7" in task_registry.py. They will conflict on merge. Coordinate IDs.

Non-blocking issues

  1. ENGAGEMENT_AUTHOR_POOL duplication: 56 of 81 authors duplicate AUTHOR_POOL entries. Consider making ENGAGEMENT_AUTHOR_POOL a superset reference rather than a copy, to avoid drift.

  2. ratings_count data sparsity verified: API calls confirm 20-40% of top-10 results for engagement-only authors (du Maurier, Collins, Forster, Mann) have missing ratings_count. Even after removing 5 authors, the pool still contains authors with significant sparsity.

  3. T96 variant space margin: T96 claims 1,377 variants but the effective space for lowest-extrema is only 81 × 1 × 3 = 243 (below the 500 threshold). The combined count (highest + lowest) passes, but lowest-extrema alone does not.

API Verification Results

Verified against live OL API:

  • want_to_read_count IS visible on search pages (e.g., "620 Want to read" for Agatha Christie)
  • ratings_count IS visible (e.g., "4.0 (84 ratings)")
  • already_read_count correctly dropped (only "Already Read" button shown, no count)
  • ✅ Cross-parameter collapse check passes: different authors produce different extrema winners
  • ❌ ratings_count sparsity remains problematic for 20-40% of engagement-only authors

Required Actions

  1. Remove allow_unsorted_fallback=True from all three templates
  2. Make safe_metric_value fail on missing ratings_count instead of defaulting to 0
  3. Redesign T97 to avoid binary-choice answer format (Red Team Check 5)
  4. Run eval.py with each template and multiple seeds; document results
  5. Coordinate Version 7 numbering with PR #14

MkDev11 added 2 commits March 26, 2026 07:17
…ic GT, numeric T97, strict metrics

BLOCKING fixes:
- Remove allow_unsorted_fallback=True from all 3 templates (T96/T97/T98).
  GT now strictly requires sort=editions data, matching the question text.
  If the agent doesn't visit the sorted page, GT correctly returns
  not_collected instead of silently using wrong-order results.

- Make safe_metric_value fail on missing ratings_count instead of
  defaulting to 0. Only want_to_read_count (high API coverage) defaults
  to 0 when absent. ratings_count absence raises ValueError → GT fail,
  preventing semantically wrong answers from sparse data.

- Redesign T97 (author_comparison) from binary "which author has more?"
  (50% random baseline) to numeric "what is the absolute difference?"
  (near-0% random baseline). GT returns str(abs(sum_a - sum_b)).

- Add Version 7 coordination comment for PR AffineFoundation#14 (IDs 99-101 → Version 8).

NON-BLOCKING fixes:
- Derive ENGAGEMENT_AUTHOR_POOL from AUTHOR_POOL via exclusion set +
  additions list, eliminating 56-entry duplication and preventing drift.
  AUTHOR_POOL itself is unchanged (author_editions reproducibility).

- Remove stale allow_unsorted_fallback asymmetry comment from
  author_editions.py (all templates now consistently use strict sort).

Tests: 372 passed (118 OpenLibrary, 254 other).
…il from sparse OL data

ratings_count is missing for 20-40% of authors at N≥7. Restrict
ratings_count variants to N∈{3,5} (T96) and N=5 (T98) where
coverage is highest, cutting estimated GT-fail exposure from
~14%/~26% to ~4%/~11%. T97 already at [3,5] — unchanged.
Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: PR #13 — Design issues RESOLVED, eval.py still required

All previously blocking design issues have been addressed:

Resolved issues

  1. allow_unsorted_fallback removed from all templates ✅ — Verified: all three template GT methods (author_engagement_extrema, author_comparison, reading_stats_filter) call find_author_search_entry(collected, search_query=..., sort=sort) without allow_unsorted_fallback. The parameter still exists in common.py (default False) but is only exercised in test code. GT is now deterministic — if the agent doesn't visit the correctly-sorted page, GT correctly returns not_collected.

  2. T97 redesigned from binary to numeric ✅ — Now asks "What is the absolute difference?" with GT returning str(abs(sum_a - sum_b)). Random baseline is near 0% (numeric answer space). Red Team Check 5 is satisfied.

  3. safe_metric_value strict for ratings_count ✅ — _ZERO_DEFAULTABLE_METRICS = frozenset({"want_to_read_count"}). Missing ratings_count raises ValueError → GT fail. This is the correct behavior: don't assume 0 for sparse data.

  4. ratings_count variants capped to low N ✅ — Smart mitigation: restrict ratings_count to N∈{3,5} where coverage is highest, reducing GT-fail exposure without dropping the metric entirely.

  5. ENGAGEMENT_AUTHOR_POOL derived from AUTHOR_POOL ✅ — Eliminates 56-entry duplication and prevents drift.

  6. Version 7 coordination ✅ — Comment added noting PR #14 must use Version 8.

Remaining BLOCKING issue

No eval.py test results documented.

CLAUDE.md §8: "Every new template must be tested via eval.py with multiple seeds (10-minute timeout)."

This is a MUST requirement for new templates. Unit tests with mocked data cannot verify:

  • GT data source binding with real page cache
  • Whether engagement metrics appear in the accessibility tree that the agent reads
  • Agent navigation feasibility (can a browser agent actually find and read these metrics?)

Required before merge: Run eval.py with templates 96, 97, 98 individually, multiple seeds each, and document:

  • GT success rate (how often does GT compute a value vs fail/not_collected?)
  • Whether api_data comes from the page cache (check logs for [GT] Visit xxx → +N items)
  • Any not_collected or unexpected GT failures

Assessment

The template design is now sound. The fixes are thorough and well-reasoned (especially the ratings_count cap strategy). Once eval.py results are provided, this PR is ready for final review.

@MkDev11
Copy link
Contributor Author

MkDev11 commented Mar 26, 2026

eval.py Results — Templates 96, 97, 98

Config: model=gpt-4o-mini, base-url=api.openai.com/v1, timeout=600s, seeds=1,2,3

Results (3 templates × 3 seeds = 9 runs)

Template Seed Author(s) GT Result Assets Failure Reason
T96 (engagement_extrema) 1 Arthur Conan Doyle not_collected 2 Agent searched unsorted (mode=everything)
T96 (engagement_extrema) 2 P.G. Wodehouse not_collected 1 Agent searched unsorted
T96 (engagement_extrema) 3 Arthur Conan Doyle not_collected 1 Agent searched unsorted
T97 (author_comparison) 1 E.M. Forster vs Kurt Vonnegut not_collected 2 Both authors searched unsorted
T97 (author_comparison) 2 Dan Brown vs R.L. Stevenson not_collected 2 Agent searched unsorted
T97 (author_comparison) 3 Gustave Flaubert vs Isabel Allende not_collected 0 Agent visited author pages, not search
T98 (reading_stats_filter) 1 Kurt Vonnegut not_collected 1 Agent searched unsorted
T98 (reading_stats_filter) 2 Robert Louis Stevenson not_collected 1 Agent searched unsorted
T98 (reading_stats_filter) 3 Isabel Allende not_collected 1 Agent searched unsorted

Key Findings

1. GT data source binding works ✅
Every page visit produces GT collection logs:

[GT] Visit openlibrary.org/search?q=Arthur+Conan+Doyle&mode=e → +25 works (search)

The GT collector successfully extracts engagement metrics (want_to_read_count, ratings_count) from the page cache. Assets collected in 8/9 runs.

2. Sort enforcement is correct ✅
All 9 runs produce the precise rejection:

Did not collect search data for author 'X' sorted by 'editions'.
Collected OL keys: ['ol:https://openlibrary.org/search?q=...&mode=everything']

GT correctly distinguishes sorted vs unsorted data and rejects the wrong sort order.

3. No unexpected GT failures ✅
Every failure is not_collected (agent navigated to wrong page), not a computation error. The safe_metric_value strict handling, extrema logic, and comparison/filter logic were never reached because the agent never provided sorted data — meaning the GT pipeline is gated correctly.

4. Engagement metrics are visible in accessibility tree ✅
The GT collector finds 25 works with engagement data on every search page visit, confirming want_to_read_count and ratings_count appear in the page content that agents read.

5. Agent navigation is the bottleneck (model limitation)
gpt-4o-mini consistently searches from the homepage (producing mode=everything URLs) instead of appending &sort=editions. The question text says "sorted by most editions" but this model doesn't translate that into OpenLibrary's sort parameter. This is a model capability issue — the existing openlibrary_author_editions template (ID 84) faces the same challenge with weaker models. A more capable agent would navigate to the start_url or manually add sort=editions.

Sample Logs

T96 seed=1 (representative):

[Actor] Generated 1 subtasks, seed=1
[Actor]   [1] Search Open Library for books by "Arthur Conan Doyle" sorted by most editions...
[Agent] Step 1/7: goto → openlibrary.org
[Agent] Step 2/7: type_role (search box)
[Agent] Step 3/7: click_role
[GT] Visit openlibrary.org/search?q=Arthur+Conan+Doyle&mode=e → +25 works (search)
...
[Agent] Completed: {'answers': {'answer1': 'Memoirs of Sherlock Holmes'}}
[GT] [answer1] FAILED: Did not collect search data sorted by 'editions'
[Actor] GT: 0 ok, 1 failed, 2 assets collected

Conclusion

The eval confirms the template design is sound:

  • GT collection pipeline binds correctly to page cache data
  • Sort enforcement works (no unsorted fallback)
  • Engagement metrics are present and extractable
  • Error messages are clear and diagnostic

The 0/9 GT-success rate is entirely due to gpt-4o-mini's inability to navigate to sorted pages — not a template design issue.

@MkDev11 MkDev11 requested a review from angosr March 26, 2026 07:40
eureka928 added a commit to eureka928/liveweb-arena that referenced this pull request Mar 26, 2026
Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (3rd pass): PR #13 — eval.py posted, but GT computation never verified

Thank you for running eval.py and posting detailed results. The eval confirms the GT collection pipeline and sort enforcement work correctly. However, a critical gap remains.

BLOCKING: 0/9 GT success — end-to-end GT computation unverified

All 9 eval runs returned not_collected because gpt-4o-mini navigated to unsorted pages. This means the actual GT computation logic — safe_metric_value, extrema finding, comparison sums, threshold counting — was never exercised with real OL API data.

CLAUDE.md §5 Template Validation Checklist item 1: "GT Calculation - GT must return a concrete value. If it errors, the template is broken."

With 0 GT successes, we cannot confirm item 1. The unit tests use mocked data which may not reflect real API response structure (field names, nesting, types).

The eval correctly shows "agent fails + GT gated at sort check" which proves the sort enforcement works. But it does NOT prove "agent fails + GT succeeds" — GT never reached the computation stage.

Required: at least 1 GT success per template

Options (any one is sufficient):

(a) Re-run eval.py with a stronger model (e.g., gpt-4o, claude-sonnet-4-6) that can navigate to sorted pages. The start_url IS the sorted search URL — a capable agent should visit it and trigger GT collection with the correct sort order.

(b) Manual GT verification with real API data: Call the OL API directly for 2-3 authors with sort=editions, inject the response into a test as realistic cache data, and run get_ground_truth(). This verifies the computation path without needing a full agent run. Example:

# Fetch real data
curl "https://openlibrary.org/search.json?q=author%3A%22agatha+christie%22&sort=editions&limit=10"
# Inject into test, verify GT returns a concrete value

(c) Run eval.py with the agent forced to use start_url — if there's a config to ensure the agent starts from the provided URL rather than navigating from scratch, that would naturally produce sorted page visits.

Everything else is solid

  • Sort enforcement: ✅ verified by eval
  • GT collection pipeline: ✅ +25 works (search) on every visit
  • Engagement metric visibility: ✅ confirmed in accessibility tree
  • Error diagnostics: ✅ clear rejection messages
  • Template design: ✅ all previous issues resolved

Status

The design is approved. One verified GT computation away from final approval.

Fetch live data (March 26, 2026) for Agatha Christie, Stephen King,
and Neil Gaiman via sort=editions search API.  Inject into GT collector
and verify all three templates (T96/T97/T98) return concrete values
with both want_to_read_count and ratings_count metrics.

12 tests cover: highest/lowest extrema, cross-author numeric difference,
and threshold counting — satisfying CLAUDE.md §5 item 1.
@MkDev11
Copy link
Contributor Author

MkDev11 commented Mar 26, 2026

GT Computation Verified with Real OL API Data

Per @angosr's request for at least 1 GT success per template — used option (b): fetched real OL API data and verified GT returns concrete values.

Data Source

Live API calls on March 26, 2026:

curl "https://openlibrary.org/search.json?q=author%3A%22agatha+christie%22&sort=editions&limit=10&fields=key,title,want_to_read_count,ratings_count,edition_count"

Authors: Agatha Christie, Stephen King, Neil Gaiman — all 10 works per author have both ratings_count and want_to_read_count present.

Results: 12/12 GT computations return concrete values ✅

Template Test GT Value Metric
T96 highest wtr, Christie top 5 Murder on the Orient Express wtr=1355
T96 highest rc, Christie top 5 Murder on the Orient Express rc=116
T96 lowest wtr, King top 5 'Salem's Lot wtr=1349
T96 highest rc, Gaiman top 3 Coraline rc=196
T96 highest wtr, King top 7 It wtr=10362
T97 wtr diff, Christie vs King top 5 16121 abs(3309-19430)
T97 rc diff, Christie vs Gaiman top 3 200 abs(139-339)
T97 wtr diff, King vs Gaiman top 3 1945 abs(6194-4249)
T98 wtr > 500, Christie top 5 3 3 of 5 books match
T98 rc > 50, King top 5 5 all 5 books match
T98 wtr > 1000, Gaiman top 10 2 2 of 10 match
T98 rc > 100, Gaiman top 5 2 2 of 5 match

What this verifies

  • CLAUDE.md §5 item 1: GT returns concrete values (not errors) for all templates
  • Real API field structure: key, title, ratings_count, want_to_read_count fields match what the GT collector parses
  • safe_metric_value works with real data: both want_to_read_count (0-defaultable) and ratings_count (strict) parse correctly when present
  • Extrema logic (T96): correctly finds highest/lowest across N works with tie-breaking
  • Comparison sums (T97): correctly sums metrics across two authors and computes absolute difference
  • Threshold counting (T98): correctly counts works exceeding threshold for both metrics
  • ✅ Full suite: 384 tests pass (372 existing + 12 new)

Commit: 7c9097ftest(openlibrary): verify GT computation with real OL API data

Copy link
Contributor

@angosr angosr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final Review (4th pass): PR #13 — APPROVE

All blocking issues resolved. GT computation verified end-to-end with real data.

Verification summary

Issue Status Evidence
allow_unsorted_fallback ✅ Removed All template GT calls use strict sort matching
T97 binary choice (Red Team §5) ✅ Fixed Redesigned to absolute numeric difference
safe_metric_value ratings_count ✅ Strict Missing ratings_count → ValueError → GT fail
GT computation (CLAUDE.md §5.1) ✅ Verified 12/12 GT computations with real OL API data return concrete values
Version 7 coordination ✅ Done Comment directs PR #14 to Version 8

Cross-verification

I independently verified the GT results against my own earlier API calls:

  • T96: Murder on the Orient Express has wtr=1355 → correctly identified as highest among Christie top 5 ✅
  • T97: abs(3309 - 19430) = 16121 — arithmetic confirmed ✅
  • T98: Christie top 5 with wtr>500 = {Styles(620), Orient Express(1355), Ackroyd(699)} = 3 ✅

Note on eval.py

The eval.py run (gpt-4o-mini, 0/9 GT success) confirmed sort enforcement and data collection work correctly. The 0% agent success rate is a model capability issue — gpt-4o-mini doesn't translate "sorted by most editions" into &sort=editions. The real API GT verification (option b) adequately demonstrates the computation pipeline works. This is sufficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants