refactor(hackernews): replace openmeteo expansion with gap-filling templates #12
kiannidev wants to merge 2 commits into AffineFoundation:main from
This adds richer forecast reasoning tasks, shared GT extraction helpers, registry integration, and broad deterministic test coverage to increase evaluation depth and template diversity.
Made-with: Cursor
Hi, maintainers.
angosr
left a comment
Review: PR #12 — REJECT on direction (Significance Gate failure)
This PR fails the Significance Gate — do not iterate on details.
1. Duplicates existing capability dimensions, fills zero gaps
CLAUDE.md Evaluation Value table identifies these gaps:
- Time-sensitive events ❌
- Nested structure navigation ❌
- Search-driven interaction ❌
- User-generated content ❌
This PR adds 6 more templates to OpenMeteo (weather numerical computation) — a dimension already fully covered by templates 85-88, with PR #14 adding 3 more. None of the 6 templates address any gap:
| PR #12 Template | Capability | Already covered by |
|---|---|---|
| `daily_range` (max−min temp) | Single-page arithmetic | T87 (`hourly_extrema`), T88 (`forecast_trend`) |
| `precip_window_count` (sliding window) | Threshold counting | PR #14 T99 (`hourly_threshold`) |
| `humidity_band_hours` (count hours in band) | Threshold counting | PR #14 T99 (`hourly_threshold`), almost identical |
| `wind_shift` (max consecutive Δ) | Hourly scan + arithmetic | T87 (`hourly_extrema`) variant |
| `city_pair_forecast_gap` (two-city compare) | Cross-city comparison | T86 (`comparison`), same dimension |
| `comfort_index` (formula from 3 fields) | Derived metric computation | See issue #2 below |
Adding 6 templates in a covered dimension while 4 gap dimensions remain empty is the wrong priority.
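To make the overlap claim concrete, here is a minimal sketch of why a band count and a threshold count exercise the same capability. The function names and the sample readings are hypothetical and are not taken from this PR or from PR #14; they only illustrate the shape of the computation.

```python
from typing import Sequence


def count_hours_above(values: Sequence[float], threshold: float) -> int:
    """Count hourly readings strictly above a threshold (threshold counting)."""
    return sum(1 for v in values if v > threshold)


def count_hours_in_band(values: Sequence[float], low: float, high: float) -> int:
    """Count hourly readings inside the closed band [low, high] (band counting)."""
    return sum(1 for v in values if low <= v <= high)


# Hypothetical hourly relative-humidity readings (%), for illustration only.
humidity = [35.0, 42.0, 55.0, 61.0, 70.0, 48.0]

above_40 = count_hours_above(humidity, 40.0)
in_band = count_hours_in_band(humidity, 40.0, 60.0)
```

Both functions are a single pass over the same hourly array with a predicate; only the predicate differs, so the agent needs the same page and the same reading skill in either case.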
2. comfort_index has a fundamental design flaw
The template asks the agent to compute CI = T - 0.2W - 0.05H — a formula that does NOT exist on the Open-Meteo website. The agent must:
- Read temperature, wind speed, humidity from the page
- Apply an arbitrary formula the question defines
This tests arithmetic ability, not web interaction ability. An LLM that reads the three values from the question + makes up plausible numbers could score well. The "comfort index" is not a real metric on Open-Meteo — it's a synthetic computation injected by the template.
CLAUDE.md Template Design §2 (Verifiability): "API response and website display must share the same data source." The comfort index has no data source — it's computed by the template.
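As a worked illustration of the point above, the whole task reduces to one line of arithmetic once the three readings are known. The function name, the units (°C, km/h, %), and the sample readings below are assumptions made for this sketch; only the formula itself comes from the template.

```python
def comfort_index(temp_c: float, wind_kmh: float, humidity_pct: float) -> float:
    """Apply the template's formula CI = T - 0.2*W - 0.05*H.

    Units are assumed here (deg C, km/h, percent); the formula is the one
    quoted from the template, not a metric published by Open-Meteo.
    """
    return temp_c - 0.2 * wind_kmh - 0.05 * humidity_pct


# Hypothetical readings, for illustration only.
ci = comfort_index(temp_c=20.0, wind_kmh=10.0, humidity_pct=50.0)
```

With these inputs the terms are 20 - 2 - 2.5, so `ci` is 15.5. Nothing on the site displays that number, which is why there is no shared data source to verify against.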
3. Template ID conflict
IDs 96-101 conflict with both PR #13 (96-98, OpenLibrary) and PR #14 (99-101, OpenMeteo). This PR was created before either, but the IDs must be coordinated.
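The overlap can be checked mechanically. The sketch below is a hypothetical helper, not part of the repository; it only encodes the ID ranges stated above and reports which pairs of PRs claim the same IDs.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

# ID ranges as stated in this review (range end is exclusive in Python).
claimed_ids: Dict[str, Iterable[int]] = {
    "PR #12": range(96, 102),  # 96-101
    "PR #13": range(96, 99),   # 96-98
    "PR #14": range(99, 102),  # 99-101
}


def find_id_conflicts(
    claims: Dict[str, Iterable[int]],
) -> Dict[Tuple[str, str], List[int]]:
    """Return the shared template IDs for every pair of PRs that overlap."""
    conflicts: Dict[Tuple[str, str], List[int]] = {}
    for (pr_a, ids_a), (pr_b, ids_b) in combinations(claims.items(), 2):
        shared = sorted(set(ids_a) & set(ids_b))
        if shared:
            conflicts[(pr_a, pr_b)] = shared
    return conflicts


conflicts = find_id_conflicts(claimed_ids)
```

Here `conflicts` maps `("PR #12", "PR #13")` to `[96, 97, 98]` and `("PR #12", "PR #14")` to `[99, 100, 101]`; PR #13 and PR #14 do not overlap with each other.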
4. No Red Team Review, no eval.py
- Zero Red Team analysis for any of the 6 templates
- No eval.py results
- Single Cursor-generated commit with 1808 lines — no evidence of iterative design or manual verification
5. Unrelated scope change: lazy-loading BrowserEngine
The __init__.py change (lazy-load BrowserEngine/BrowserSession) is unrelated to the templates and should not be bundled.
Recommendation
Close this PR. If the author wants to contribute OpenMeteo templates, focus on a capability dimension that is NOT already covered — or contribute templates for an entirely different website that fills a gap (time-sensitive events, nested navigation, search-driven interaction, user-generated content).
Hi, @angosr
Summary
This PR now focuses on gap-filling capability coverage in Hacker News, replacing the earlier OpenMeteo-heavy direction.
Final scope includes:
- Added 4 new HackerNews templates aimed at uncovered dimensions:
  - `hackernews_recent_burst_count` (time-sensitive events)
  - `hackernews_comment_tree_focus` (nested structure navigation)
  - `hackernews_keyword_scan_rank` (search-driven interaction)
  - `hackernews_user_karma_gap` (user-generated content via user profiles)
- Added HackerNews `/newest` data support:
  - `get_new_stories()` in API client
  - `fetch_newest_api_data()`
- Added shared helper utilities for robust GT extraction in advanced HackerNews templates.
- Added dedicated test suite:
  - `tests/test_hackernews_gap_templates.py`
- Removed previously added OpenMeteo expansion that duplicated covered dimensions.
- Updated TaskRegistry with non-conflicting IDs for the new HackerNews templates:
  - `110`, `111`, `112`, `113`
What changed over the full PR timeline
This PR originally started with additional OpenMeteo templates and broad test additions, then was reworked after review feedback to align with significance priorities.
The final result is a directional pivot to HackerNews templates that target capability gaps, with OpenMeteo expansion removed.
Why this improves evaluation value
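The new templates are designed to cover capabilities that were previously weakly represented: time-sensitive events, nested structure navigation, search-driven interaction, and user-generated content.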
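To make the time-sensitive dimension concrete, here is a minimal sketch of how newest-story data could feed a burst count. It assumes the public Hacker News Firebase API (`newstories.json` and `item/<id>.json`); the signatures, the `fetch` parameter, and the `recent_burst_count` helper are illustrative assumptions, not the code in this PR.

```python
import json
import time
from typing import Any, Callable, Dict, List, Optional
from urllib.request import urlopen

HN_API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url: str) -> Any:
    """Fetch and decode one JSON document over HTTP."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def get_new_stories(
    limit: int = 30, fetch: Callable[[str], Any] = fetch_json
) -> List[Dict[str, Any]]:
    """Return the newest stories as item dicts (hypothetical signature).

    `fetch` is injectable so the function can be tested without network access.
    """
    ids = fetch(f"{HN_API}/newstories.json")[:limit]
    items = (fetch(f"{HN_API}/item/{story_id}.json") for story_id in ids)
    # Skip items the API returns as null (e.g. missing or deleted entries).
    return [item for item in items if item]


def recent_burst_count(
    stories: List[Dict[str, Any]],
    window_minutes: int,
    now: Optional[float] = None,
) -> int:
    """Count stories whose Unix `time` falls within the last `window_minutes`."""
    now = time.time() if now is None else now
    cutoff = now - window_minutes * 60
    return sum(1 for s in stories if cutoff <= s.get("time", 0) <= now)
```

Passing `now` explicitly keeps the ground truth deterministic in tests, while the answer on the live site still changes from minute to minute, which is what makes the task time-sensitive.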
Validation
Ran local tests for final scope and regression confidence:
- `PYTHONPATH=. pytest -q tests/test_hackernews_gap_templates.py`
- `PYTHONPATH=. pytest -q tests/test_openmeteo_integration.py tests/test_arxiv_integration.py`
- `PYTHONPATH=. pytest -q tests/plugins/taostats/test_api_client.py tests/plugins/taostats/test_empty_subnet_name.py`
All passed locally.