feat(v9): T110–T113 OpenMeteo + Open Library + arXiv templates, task_registry, RED_TEAM + integration tests, cache expiry file delete (#12)
Conversation
This adds richer forecast reasoning tasks, shared GT extraction helpers, registry integration, and broad deterministic test coverage to increase evaluation depth and template diversity. Made-with: Cursor
Hi, maintainers.
angosr
left a comment
Review: PR #12 — REJECT on direction (Significance Gate failure)
This PR fails the Significance Gate — do not iterate on details.
1. Duplicates existing capability dimensions, fills zero gaps
CLAUDE.md Evaluation Value table identifies these gaps:
- Time-sensitive events ❌
- Nested structure navigation ❌
- Search-driven interaction ❌
- User-generated content ❌
This PR adds 6 more templates to OpenMeteo (weather numerical computation) — a dimension already fully covered by templates 85-88, with PR #14 adding 3 more. None of the 6 templates address any gap:
| PR #12 Template | Capability | Already covered by |
|---|---|---|
| daily_range (max−min temp) | Single-page arithmetic | T87 (hourly_extrema), T88 (forecast_trend) |
| precip_window_count (sliding window) | Threshold counting | PR #14 T99 (hourly_threshold) |
| humidity_band_hours (count hours in band) | Threshold counting | PR #14 T99 (hourly_threshold) — almost identical |
| wind_shift (max consecutive Δ) | Hourly scan + arithmetic | T87 (hourly_extrema) variant |
| city_pair_forecast_gap (two-city compare) | Cross-city comparison | T86 (comparison) — same dimension |
| comfort_index (formula from 3 fields) | Derived metric computation | See issue #2 below |
Adding 6 templates in a covered dimension while 4 gap dimensions remain empty is the wrong priority.
2. comfort_index has a fundamental design flaw
The template asks the agent to compute CI = T - 0.2W - 0.05H — a formula that does NOT exist on the Open-Meteo website. The agent must:
- Read temperature, wind speed, humidity from the page
- Apply an arbitrary formula the question defines
This tests arithmetic ability, not web interaction ability. An LLM that reads the three values from the question + makes up plausible numbers could score well. The "comfort index" is not a real metric on Open-Meteo — it's a synthetic computation injected by the template.
CLAUDE.md Template Design §2 (Verifiability): "API response and website display must share the same data source." The comfort index has no data source — it's computed by the template.
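To make the objection concrete, the entire "web interaction" content of this template reduces to a three-term arithmetic expression. A minimal sketch (the function name is illustrative; the formula is the one the question itself defines):

```python
def comfort_index(temp_c: float, wind_kmh: float, humidity_pct: float) -> float:
    """CI = T - 0.2W - 0.05H, as defined by the question itself.

    This metric does not exist anywhere on Open-Meteo; the template injects
    it, so computing it exercises arithmetic, not web interaction.
    """
    return temp_c - 0.2 * wind_kmh - 0.05 * humidity_pct

print(comfort_index(20.0, 10.0, 50.0))  # ≈ 15.5
```

Once the three values are read (or guessed), no further page interaction is required — which is exactly the verifiability problem.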
3. Template ID conflict
IDs 96-101 conflict with both PR #13 (96-98, OpenLibrary) and PR #14 (99-101, OpenMeteo). This PR was created before either, but the IDs must be coordinated.
4. No Red Team Review, no eval.py
- Zero Red Team analysis for any of the 6 templates
- No eval.py results
- Single Cursor-generated commit with 1808 lines — no evidence of iterative design or manual verification
5. Unrelated scope change: lazy-loading BrowserEngine
The __init__.py change (lazy-load BrowserEngine/BrowserSession) is unrelated to the templates and should not be bundled.
Recommendation
Close this PR. If the author wants to contribute OpenMeteo templates, focus on a capability dimension that is NOT already covered — or contribute templates for an entirely different website that fills a gap (time-sensitive events, nested navigation, search-driven interaction, user-generated content).
Hi, @angosr
angosr
left a comment
Re-review: PR #12 — Direction improved, but all 4 templates fail Red Team Check 3
Significance Gate: PASS
The pivot from OpenMeteo duplication to HackerNews gap-filling templates is the right direction. The 4 targeted capability gaps (time-sensitive events, nested navigation, search interaction, user-generated content) align with CLAUDE.md's evaluation value table.
However, all 4 templates fail the mandatory >500 variant space requirement (Red Team Check 3), and several have additional design issues.
BLOCKING: All templates fail Red Team Check 3 — Memorization Space
CLAUDE.md Red Team §3: "Effective variant space must be >500."
| Template | Parameters | Effective Variants | Minimum Required |
|---|---|---|---|
| T110 (burst_count) | 4 windows × 3 story_counts | 12 | 500 |
| T111 (comment_tree) | 5 ranks | 5 | 500 |
| T112 (keyword_scan) | 8 keywords × 3 spans | 24 | 500 |
| T113 (karma_gap) | 4 rank pairs | 4 | 500 |
These are 1-2 orders of magnitude below the threshold. An SFT model can enumerate all Q&A pairs for T111 (5 variants) and T113 (4 variants) trivially.
Fix: Expand parameter pools significantly. For example:
- T110: Add more window sizes + use story_count as a range (5-50) + add keyword filters → hundreds of combos
- T111: Expand rank range to 1-30, add metric dimension (top-level comments, total descendants, score)
- T112: Expand keyword list to 50+ terms, add case-sensitivity/partial-match variants
- T113: Use any two ranks from 1-30 → C(30,2) = 435 pairs; add metric choices beyond karma (created date diff, story count comparison) to exceed 500
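The T113 suggestion can be sanity-checked with a few lines (metric names here are illustrative placeholders, not the template's actual parameters):

```python
from itertools import combinations

# Hypothetical expanded pools for T113:
ranks = range(1, 31)  # front-page ranks 1-30
metrics = ["karma_gap", "created_date_diff", "story_count_diff"]

rank_pairs = list(combinations(ranks, 2))  # unordered pairs: C(30, 2) = 435
variant_space = len(rank_pairs) * len(metrics)
print(variant_space)  # 435 * 3 = 1305, clearing the >500 threshold
```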
BLOCKING: T112 (keyword_scan) doesn't test "search-driven interaction"
The template scans titles on the /newest page for a keyword. This is title string matching on a list page, not search-driven interaction. The HN website has an Algolia-powered search (hn.algolia.com/search?q=...). A true search-driven template would require the agent to use the search functionality.
Verified via live API: keyword "rust" matches 0/30 newest titles, "python" matches 0/30, "cloud" matches 0/30. For 3 of 8 keywords, the answer is likely always "NONE" — violating Red Team Check 6 (cross-parameter collapse).
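The cross-parameter collapse check above can be reproduced offline with a simple hit counter over fetched titles (a sketch; in the live check, `sample` would be the 30 titles from the /newest API):

```python
def keyword_hits(titles: list[str], keyword: str) -> int:
    """Count titles containing the keyword, case-insensitively."""
    kw = keyword.lower()
    return sum(1 for title in titles if kw in title.lower())

sample = [
    "Show HN: A Rust-based parser",
    "Why I left cloud computing",
    "Ask HN: Career advice",
]
print(keyword_hits(sample, "rust"))    # 1
print(keyword_hits(sample, "python"))  # 0 — a keyword that collapses to "NONE"
```

A keyword whose hit count is almost always zero makes every variant share the same answer, which is the collapse Check 6 forbids.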
BLOCKING: T111 (comment_tree) is EASY difficulty, not "nested navigation"
T111 asks "how many top-level comments" for a story. The agent:
- Visits /newest
- Clicks a story
- Counts visible comments
This is a single-hop, single-value read — EASY difficulty per CLAUDE.md §4. The "nested structure" in HN comments (replies, threads, depth) is not tested. A genuine nested navigation template would require traversing comment depth, finding deeply nested replies, or comparing subtree metrics.
BLOCKING: Version 7 conflict
Both this PR ([110-113] as Version 7) and PR #13 ([96-98] as Version 7, already approved) claim Version 7. With PR #14 using Version 8, this should use Version 9 or higher.
BLOCKING: No Red Team Review, no eval.py, no real API GT verification
CLAUDE.md requires all 6 Red Team checks documented with concrete data, plus eval.py or real API GT verification (as PR #13 and #14 demonstrated).
What's good
- Direction: Targeting gap dimensions is correct and addresses the original rejection reason.
- `fetch_newest_api_data`: Clean implementation, properly routes `/newest` separately from the homepage.
- GT logic: The GT methods are well-structured (proper error handling, `not_collected` vs `fail` distinction).
- Test coverage: 256-line test file with good coverage of the new helpers.
Required Actions
- Expand variant spaces to >500 for all 4 templates (see suggestions above)
- Redesign T112 to use actual HN search (Algolia), not title scanning
- Redesign T111 to require actual nested structure traversal, not single-value comment count
- Fix Version to avoid conflict with merged/approved PRs
- Document Red Team 6-check review with concrete data
- Add real API GT verification tests (following PR #13/14 pattern)
Resolve task registry version conflict by preserving upstream Version 7/8 entries and moving HackerNews gap templates to Version 9. Made-with: Cursor
Thanks for the re-review — I fixed all blocking points:
angosr
left a comment
Re-review (3rd pass): PR #12 — Significant improvement, one remaining issue
Resolved issues
- Variant spaces expanded ✅
  - T110: 10×10×5×2 = 1,000 (was 12)
  - T111: 30×5×4 = 600 (was 5)
  - T112: 52×20×4×4 = 16,640 (was 24)
  - T113: C(30,2)×3 = 1,305 (was 4)

  All now exceed the 500 minimum.
- T112 redesigned to use Algolia search ✅ — Now queries `hn.algolia.com` with configurable queries, rank extraction, field selection, and point filtering. Genuinely tests search-driven interaction.
- T111 redesigned for nested traversal ✅ — GT now walks the comment tree with a depth threshold and computes `nodes`, `leaf_nodes`, `branch_nodes`, and `max_depth`. This is real nested structure navigation.
- Red Team Review documented ✅ — All 6 checks with concrete data in `RED_TEAM_REVIEW_GAP_TEMPLATES.md`.
- Real API GT verification ✅ — `test_gap_templates_real_api_data.py` added.
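The depth-based traversal described for T111 can be sketched as a small recursion. This assumes each collected comment is a dict with a `kids` list of child comments; the real GT item shape may differ:

```python
def walk_tree(comment: dict, depth: int = 0):
    """Return (nodes, leaf_nodes, branch_nodes, max_depth) for a comment subtree."""
    kids = comment.get("kids", [])
    if not kids:
        return 1, 1, 0, depth
    nodes, leaves, branches, max_depth = 1, 0, 1, depth
    for kid in kids:
        n, lf, br, d = walk_tree(kid, depth + 1)
        nodes += n
        leaves += lf
        branches += br
        max_depth = max(max_depth, d)
    return nodes, leaves, branches, max_depth

# Story with two top-level comments, one of which has a reply:
story = {"kids": [{"kids": []}, {"kids": [{"kids": []}]}]}
print(walk_tree(story))  # (4, 2, 2, 2)
```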
Remaining BLOCKING: Version 7 conflict
This PR registers [110-113] as "Version 7". But PR #13 (already approved and likely merging soon) also claims Version 7 for [96-98], and PR #14 claims Version 8 for [99-101].
Fix: Use Version 9 or higher for [110-113].
Non-blocking: T111 GT requires item-level API data for comment children
The comment tree traversal calls get_item_story(collected, cid) for each child comment. This requires the agent to have visited each comment's item page for the data to be collected. In practice, the GT collector may only have the top-level story's kids list — not the full recursive tree. Verify this works with eval.py or document the expected agent navigation path.
Assessment
The directional pivot and template redesigns are substantial and well-executed. One version conflict fix away from approval.
Thanks for the 3rd-pass review — I’ve addressed the remaining points.
angosr
left a comment
Final Review (4th pass): PR #12 — APPROVE
All blocking issues resolved.
Resolved
- Version 9 ✅ — `# Version 9: Hacker News gap-filling templates` confirmed in diff. No conflict with PR #13 (V7) or PR #14 (V8).
- Comment tree GT collection ✅ — `gt_collector.py` now merges `_comment_items` from item detail pages, making child comment data available to T111's depth-based tree traversal. The item merge condition was also relaxed from `"id" in api_data and "title" in api_data` to `"id" in api_data` so that comments (which may lack titles) are included.
- Algolia search data collection ✅ — New `hn_search:{query}:{page}` key pattern in the GT collector for T112's search-based template.
Summary of what was verified across all review rounds
| Check | T110 (burst) | T111 (tree) | T112 (search) | T113 (karma) |
|---|---|---|---|---|
| Variant space >500 | 1,000 ✅ | 600 ✅ | 16,640 ✅ | 1,305 ✅ |
| Gap dimension | Time-sensitive ✅ | Nested nav ✅ | Search-driven ✅ | User content ✅ |
| Red Team 6 checks | Documented ✅ | Documented ✅ | Documented ✅ | Documented ✅ |
| Real API GT tests | Present ✅ | Present ✅ | Present ✅ | Present ✅ |
| Version conflict | V9 ✅ | V9 ✅ | V9 ✅ | V9 ✅ |
| GT collector support | Existing ✅ | _comment_items ✅ | hn_search key ✅ | user key ✅ |
angosr
left a comment
Review WITHDRAWN — PR #12 rejected on deeper analysis
Previous approval is retracted. On re-examination, the 4 templates do not genuinely fill the claimed capability gaps.
BLOCKING: Templates do not test the capabilities they claim
| Template | Claimed Gap | Actual Capability Tested | Already Covered By |
|---|---|---|---|
| T110 (burst_count) | Time-sensitive events | Compare unix timestamps, count within window | T75-78 (HN numerical comparison) — same dimension |
| T111 (comment_tree) | Nested structure navigation | Count comment children from API data | Agent sees flat HTML on HN page — no tree navigation needed |
| T112 (keyword_scan) | Search-driven interaction | Use Algolia search with query given in question | Violates "NO navigation hints in questions" (CLAUDE.md §3) |
| T113 (karma_gap) | User-generated content | Read two numbers from two profile pages, subtract | T75-78 (numerical comparison) — same dimension |
T110 is not "time-sensitive events"
"Time-sensitive events" means evaluating ability to handle breaking news, deadlines, scheduled events. T110 compares unix timestamps on a list page — this is the same numerical comparison capability as existing HN templates, just using the time field instead of score or descendants.
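The criticized capability is essentially a one-line timestamp filter (a sketch with illustrative names, not the template's actual code):

```python
def burst_count(timestamps: list[int], window_s: int, now: int) -> int:
    """Count stories whose unix timestamps fall within the last window_s seconds."""
    return sum(1 for ts in timestamps if 0 <= now - ts <= window_s)

now = 1_700_000_000
stamps = [now - 60, now - 1800, now - 7200]  # 1 min, 30 min, 2 h ago
print(burst_count(stamps, 3600, now))  # 2 stories in the last hour
```

Swap `timestamps` for scores or comment counts and the task is unchanged — which is why it lands in the same dimension as T75-78.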
T111's "nested navigation" is illusory from the agent's perspective
The GT logic recursively traverses the comment tree via API. But the agent sees HN comments as a flat, indented HTML list on the story page. There is no tree to "navigate" — the depth information is visual indentation, not interactive nested structure. The agent reads a rendered page, not a data structure.
Furthermore, the GT requires _comment_items to be collected for each child comment. This means each comment's individual item page must be visited — but the agent has no reason to visit /item?id=<comment_id> since comments are displayed inline on the story page.
T112 gives navigation hints in the question
CLAUDE.md Template Design §3: "NO navigation hints in questions — no URLs, symbols, selectors, or shortcuts. Finding the source is part of the test."
T112 questions say: "Use Hacker News search for 'rust'" — this tells the agent exactly what to search for. The question should test whether the agent can figure out WHERE to find the information, not just read a result from a page it's directed to. The search query IS the navigation hint.
T113 is numerical comparison, not "user-generated content"
Reading karma from /user?id=X and subtracting is a two-page numerical comparison — the same capability as T86 (CoinGecko comparison) or existing HN templates. "User-generated content" should test understanding of posts, comments, discussions — not reading a single integer from a profile page.
What would genuinely fill these gaps
- Time-sensitive events: A template on a website with event calendars, deadlines, or real-time feeds where the agent must identify what's happening NOW vs what happened before (not just comparing timestamps on a list)
- Nested structure navigation: A website with actual interactive tree structures — expandable/collapsible sections, threaded discussions with "load more replies" buttons, multi-level category hierarchies requiring clicks to traverse
- Search-driven interaction: Questions where the agent must FIGURE OUT what to search for based on the question context, not be told the query directly
- User-generated content: Questions about the semantic content of user posts/reviews/comments — not just metadata (karma, counts)
Recommendation
Close this PR. The templates are well-implemented technically (variant spaces, GT logic, tests are solid), but they test capabilities that are already covered rather than genuinely new dimensions. Contributing templates that truly fill the gaps requires choosing websites and interaction patterns that force the agent into genuinely new behavior.
Thanks @angosr — the withdrawn approval was fair: the earlier HN templates claimed four benchmark gaps but did not test them, and T112-style prompts risked violating CLAUDE.md §3 depending on wording. This update removes that direction and rebinds registry IDs 110–113 to cross-site templates with an explicit gap mapping plus review artifacts.

What changed
- Reviewer-facing artifacts

Validation: `PYTHONPATH=. pytest -q tests/`
angosr
left a comment
Review: PR #12 — REJECT on scope and coherence
BLOCKING: Uncontrolled scope expansion
This PR has been reworked 4-5 times:
- Original: 6 OpenMeteo templates (rejected — duplicates covered dimensions)
- V2: 4 HackerNews gap-filling templates (rejected — didn't genuinely fill gaps)
- V3: Now 8+ templates across 4 plugins (ArXiv, HackerNews, OpenLibrary, OpenMeteo) + core infrastructure changes (`cache.py`, `gt_collector.py`)
Each iteration ADDS scope instead of converging. 1,327 additions across 4 plugins in a single PR is not reviewable.
BLOCKING: PR description does not match code
The title says "replace openmeteo expansion with gap-filling templates" and the body describes 4 HackerNews templates. The actual code now includes:
- 4 ArXiv templates (`category_discovery_hints`, `category_infer_author_filter`, `category_infer_title_substring`, `title_substring_clues`)
- OpenLibrary templates (`book_work_title_clues`, `nested_work_title_substring`, `subject_hub_infer`)
- 1 OpenMeteo template (`daily_precip_peak_day`)
- HackerNews `common.py` changes + API client changes
- Core `cache.py` and `gt_collector.py` modifications
None of this is described in the PR body.
BLOCKING: Core infrastructure changes bundled with templates
cache.py and gt_collector.py are core pipeline files. Changes to these must be in separate PRs with their own justification and testing, not bundled into a template PR.
Recommendation
Close this PR and start fresh with focused, single-plugin PRs:
- One PR per plugin (e.g., "feat(arxiv): add 2 search-driven templates")
- Each PR ≤ 300 lines, with updated description matching the code
- Core infrastructure changes (`cache.py`, `gt_collector.py`) in a separate PR
- Each PR must independently pass Red Team Review and include real API GT verification
The pattern of expanding scope on each rejection is not productive. Smaller, focused PRs are easier to review and merge.
Scope (matches diff):
- OpenMeteo: daily_precip_peak_day (T110)
- Open Library: nested subject/work title drill-down (T111)
- arXiv: category inference title substring + author filter (T112–T113)
- task_registry wiring, RED_TEAM review, integration tests

Infra:
- fix(cache): delete expired cache file when `_load_cache` rejects a stale entry
- Remove stray `liveweb_arena/plugins/hackernews/templates/common.py` (not on main)

No agent loop, gt_collector, taostats, or hackernews plugin changes. Made-with: Cursor
Force-pushed bd7debe to d0c3b6a
Thanks — I’ve scoped the PR to templates + registry + tests and dropped unrelated core/plugin changes. Title/description now match the diff. There’s a one-line cache.py fix for expired-cache deletion; I can move it to a separate PR if you want zero core here. Happy to split by plugin if that’s the preferred workflow.
Summary
This PR includes all updates currently in the branch for Version 9 task coverage, including `liveweb_arena/core/task_registry.py` and `liveweb_arena/core/cache.py` (delete expired cache file when rejected).

Task Mapping
- openmeteo_daily_precip_peak_day
- openlibrary_subject_nested_work_title
- arxiv_category_infer_title_substring
- arxiv_category_infer_author_filter

Files/Areas Updated

Core
- `liveweb_arena/core/task_registry.py` (register T110–T113)
- `liveweb_arena/core/cache.py` (delete expired cache file on reject path)

OpenMeteo templates
- `liveweb_arena/plugins/openmeteo/templates/__init__.py`
- `liveweb_arena/plugins/openmeteo/templates/daily_precip_peak_day.py`

Open Library templates
- `liveweb_arena/plugins/openlibrary/templates/__init__.py`
- `liveweb_arena/plugins/openlibrary/templates/common.py`
- `liveweb_arena/plugins/openlibrary/templates/book_work_title_clues.py`
- `liveweb_arena/plugins/openlibrary/templates/nested_work_title_substring.py`
- `liveweb_arena/plugins/openlibrary/templates/subject_hub_infer.py`

arXiv templates
- `liveweb_arena/plugins/arxiv/templates/__init__.py`
- `liveweb_arena/plugins/arxiv/templates/category_discovery_hints.py`
- `liveweb_arena/plugins/arxiv/templates/category_infer_title_substring.py`
- `liveweb_arena/plugins/arxiv/templates/category_infer_author_filter.py`
- `liveweb_arena/plugins/arxiv/templates/title_substring_clues.py`

Tests / review docs
- `tests/test_openmeteo_integration.py`
- `tests/test_version9_cross_site_templates.py`
- `tests/RED_TEAM_REVIEW_VERSION9_CROSS_SITE.md`

Validation
`PYTHONPATH=. pytest -q tests/`