refactor(hackernews): replace openmeteo expansion with gap-filling templates #12

Open
kiannidev wants to merge 2 commits into AffineFoundation:main from kiannidev:feat/openmeteo-expanded-templates
@kiannidev kiannidev commented Mar 20, 2026

Summary

This PR now focuses on filling capability-coverage gaps with HackerNews templates, replacing the earlier OpenMeteo-heavy direction.

Final scope includes:

  • Added 4 new HackerNews templates aimed at uncovered dimensions:

    • hackernews_recent_burst_count (time-sensitive events)
    • hackernews_comment_tree_focus (nested structure navigation)
    • hackernews_keyword_scan_rank (search-driven interaction)
    • hackernews_user_karma_gap (user-generated content via user profiles)
  • Added HackerNews /newest data support:

    • get_new_stories() in API client
    • fetch_newest_api_data()
    • plugin routing updated to use real newest feed handling
  • Added shared helper utilities for robust GT extraction in advanced HackerNews templates.

  • Added dedicated test suite:

    • tests/test_hackernews_gap_templates.py
  • Removed previously added OpenMeteo expansion that duplicated covered dimensions:

    • removed 6 new OpenMeteo templates
    • removed corresponding OpenMeteo expansion tests
    • removed conflicting template IDs in that range
  • Updated TaskRegistry with non-conflicting IDs for the new HackerNews templates:

    • 110, 111, 112, 113
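
For readers unfamiliar with the newest feed, here is a minimal sketch of what a helper like `get_new_stories()` might look like. It assumes the public Hacker News Firebase endpoint `/v0/newstories.json`, which returns a JSON list of item IDs. The function body, the `limit` parameter, and the injectable `fetch` argument are illustrative assumptions, not the code in this PR.

```python
import json
from typing import Callable, List
from urllib.request import urlopen

# Public Hacker News API endpoint for the newest story IDs.
HN_NEWSTORIES_URL = "https://hacker-news.firebaseio.com/v0/newstories.json"


def _http_get_json(url: str) -> object:
    """Default fetcher: plain HTTP GET, decoded as JSON."""
    with urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def get_new_stories(
    limit: int = 30,
    fetch: Callable[[str], object] = _http_get_json,  # hypothetical hook for tests
) -> List[int]:
    """Return up to `limit` newest story IDs, newest first."""
    ids = fetch(HN_NEWSTORIES_URL)
    if not isinstance(ids, list):
        raise ValueError("expected a JSON list of story IDs")
    # Keep only integer IDs so downstream GT extraction never sees junk.
    return [i for i in ids if isinstance(i, int)][:limit]
```

Passing a fake `fetch` keeps tests deterministic and offline, for example `get_new_stories(2, fetch=lambda url: [3, 2, 1])`.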

What changed over the full PR timeline

This PR originally started with additional OpenMeteo templates and broad test additions, then was reworked after review feedback to align with the Significance Gate priorities.
The final result is a pivot to HackerNews templates that target capability gaps, with the OpenMeteo expansion removed.

Why this improves evaluation value

The new templates are designed to cover capabilities that were previously weakly represented:

  • Time-window reasoning on newest story timestamps
  • Story → item nested comment structure checks
  • Keyword scanning across ranked newest titles
  • Author/user profile navigation and karma comparison
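
To make two of these capabilities concrete, the sketch below shows how ground truth for a time-window count and a karma comparison might be computed. It assumes items and users are dicts shaped like Hacker News API responses (`time` as Unix seconds on items, `karma` on users). The function names and parameters are hypothetical and are not taken from the PR's templates.

```python
from typing import Iterable, Mapping


def recent_burst_count(
    stories: Iterable[Mapping[str, object]],
    now_ts: int,
    window_minutes: int,
) -> int:
    """Count stories posted within the last `window_minutes` before `now_ts`."""
    cutoff = now_ts - window_minutes * 60
    count = 0
    for story in stories:
        posted = story.get("time")
        # Skip items with a missing or malformed timestamp.
        if isinstance(posted, int) and cutoff <= posted <= now_ts:
            count += 1
    return count


def user_karma_gap(user_a: Mapping[str, int], user_b: Mapping[str, int]) -> int:
    """Absolute karma difference between two user profiles."""
    return abs(user_a["karma"] - user_b["karma"])
```

Both functions take `now_ts` and the profile data as inputs rather than reading the clock or the network, so the expected answer can be pinned to a fixed snapshot.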

Validation

Ran local tests for final scope and regression confidence:

  • PYTHONPATH=. pytest -q tests/test_hackernews_gap_templates.py
  • PYTHONPATH=. pytest -q tests/test_openmeteo_integration.py tests/test_arxiv_integration.py
  • PYTHONPATH=. pytest -q tests/plugins/taostats/test_api_client.py tests/plugins/taostats/test_empty_subnet_name.py

All passed locally.

This adds gap-filling HackerNews tasks, shared GT extraction helpers, registry integration, and deterministic test coverage to increase evaluation depth and template diversity.

Made-with: Cursor
@kiannidev
Author

Hi, maintainers.
Please review the PR and give me some feedback.
Thanks

Contributor

@angosr angosr left a comment

Review: PR #12 — REJECT on direction (Significance Gate failure)

This PR fails the Significance Gate — do not iterate on details.


1. Duplicates existing capability dimensions, fills zero gaps

CLAUDE.md Evaluation Value table identifies these gaps:

  • Time-sensitive events ❌
  • Nested structure navigation ❌
  • Search-driven interaction ❌
  • User-generated content ❌

This PR adds 6 more templates to OpenMeteo (weather numerical computation) — a dimension already fully covered by templates 85-88, with PR #14 adding 3 more. None of the 6 templates address any gap:

| PR #12 Template | Capability | Already covered by |
| --- | --- | --- |
| daily_range (max−min temp) | Single-page arithmetic | T87 (hourly_extrema), T88 (forecast_trend) |
| precip_window_count (sliding window) | Threshold counting | PR #14 T99 (hourly_threshold) |
| humidity_band_hours (count hours in band) | Threshold counting | PR #14 T99 (hourly_threshold), almost identical |
| wind_shift (max consecutive Δ) | Hourly scan + arithmetic | T87 (hourly_extrema) variant |
| city_pair_forecast_gap (two-city compare) | Cross-city comparison | T86 (comparison), same dimension |
| comfort_index (formula from 3 fields) | Derived metric computation | See issue #2 below |

Adding 6 templates in a covered dimension while 4 gap dimensions remain empty is the wrong priority.

2. comfort_index has a fundamental design flaw

The template asks the agent to compute CI = T - 0.2W - 0.05H — a formula that does NOT exist on the Open-Meteo website. The agent must:

  1. Read temperature, wind speed, humidity from the page
  2. Apply an arbitrary formula the question defines

This tests arithmetic ability, not web interaction ability. An LLM that takes the three field names from the question and makes up plausible numbers could score well. The "comfort index" is not a real metric on Open-Meteo; it is a synthetic computation injected by the template.

CLAUDE.md Template Design §2 (Verifiability): "API response and website display must share the same data source." The comfort index has no data source — it's computed by the template.
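
To illustrate the point, the whole task reduces to the few lines below once three numbers are in hand. This is a sketch of the formula as quoted above; the function name and the units (°C, km/h, %) are assumptions for the example, not taken from the template.

```python
def comfort_index(temp_c: float, wind_kmh: float, humidity_pct: float) -> float:
    """CI = T - 0.2*W - 0.05*H, as defined by the question text.

    Units are assumed here; nothing on the Open-Meteo page displays this value.
    """
    return temp_c - 0.2 * wind_kmh - 0.05 * humidity_pct


# Any plausible-looking inputs yield a plausible-looking answer,
# whether or not they were actually read from the page.
example = comfort_index(20.0, 10.0, 50.0)  # 20 - 2 - 2.5 = 15.5
```

Because the result never appears on the site, there is no displayed value to check the agent's answer against.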

3. Template ID conflict

IDs 96-101 conflict with both PR #13 (96-98, OpenLibrary) and PR #14 (99-101, OpenMeteo). This PR was created before either, but the IDs must be coordinated.
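
The overlap can be checked mechanically. The sketch below uses only the ID ranges stated above; the `CLAIMED_IDS` mapping and `find_conflicts` helper are hypothetical and do not reflect the actual TaskRegistry API.

```python
from typing import Dict, Iterable, List

# ID ranges as stated in this review (range end is exclusive).
CLAIMED_IDS: Dict[str, Iterable[int]] = {
    "PR #12": range(96, 102),  # 96-101
    "PR #13": range(96, 99),   # 96-98, OpenLibrary
    "PR #14": range(99, 102),  # 99-101, OpenMeteo
}


def find_conflicts(claims: Dict[str, Iterable[int]]) -> Dict[int, List[str]]:
    """Map each template ID claimed by more than one PR to the PRs claiming it."""
    owners: Dict[int, List[str]] = {}
    for pr, ids in claims.items():
        for template_id in ids:
            owners.setdefault(template_id, []).append(pr)
    return {tid: prs for tid, prs in owners.items() if len(prs) > 1}


conflicts = find_conflicts(CLAIMED_IDS)
```

With these ranges, every ID from 96 to 101 is claimed twice: 96-98 by PR #12 and PR #13, and 99-101 by PR #12 and PR #14.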

4. No Red Team Review, no eval.py

  • Zero Red Team analysis for any of the 6 templates
  • No eval.py results
  • Single Cursor-generated commit with 1808 lines — no evidence of iterative design or manual verification

5. Unrelated scope change: lazy-loading BrowserEngine

The __init__.py change (lazy-load BrowserEngine/BrowserSession) is unrelated to the templates and should not be bundled.

Recommendation

Close this PR. If the author wants to contribute OpenMeteo templates, focus on a capability dimension that is NOT already covered — or contribute templates for an entirely different website that fills a gap (time-sensitive events, nested navigation, search-driven interaction, user-generated content).

@kiannidev changed the title from "feat(openmeteo): add six advanced templates and expanded GT test coverage" to "feat(hackernews): add gap-filling templates for time, nested, search, and UGC" on Mar 26, 2026
@kiannidev changed the title from "feat(hackernews): add gap-filling templates for time, nested, search, and UGC" to "refactor(hackernews): replace openmeteo expansion with gap-filling templates" on Mar 26, 2026
@kiannidev
Author

Hi, @angosr
I’ve reworked the PR direction based on your Significance Gate feedback.
Please check it again.
Thanks
