refactor(hackernews): replace openmeteo expansion with gap-filling templates by kiannidev · Pull Request #12 · AffineFoundation/liveweb-arena

kiannidev · 2026-03-20T18:39:30Z

Summary

This PR now focuses on gap-filling capability coverage in Hacker News, replacing the earlier OpenMeteo-heavy direction.

Final scope includes:

- Added 4 new HackerNews templates aimed at uncovered dimensions:
  - hackernews_recent_burst_count (time-sensitive events)
  - hackernews_comment_tree_focus (nested structure navigation)
  - hackernews_keyword_scan_rank (search-driven interaction)
  - hackernews_user_karma_gap (user-generated content via user profiles)
- Added HackerNews /newest data support:
  - get_new_stories() in API client
  - fetch_newest_api_data()
  - plugin routing updated to use real newest feed handling
- Added shared helper utilities for robust GT extraction in advanced HackerNews templates.
- Added dedicated test suite:
  - tests/test_hackernews_gap_templates.py
- Removed previously added OpenMeteo expansion that duplicated covered dimensions:
  - removed 6 new OpenMeteo templates
  - removed corresponding OpenMeteo expansion tests
  - removed conflicting template IDs in that range
- Updated TaskRegistry with non-conflicting IDs for the new HackerNews templates:
  - 110, 111, 112, 113
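
For readers unfamiliar with the /newest feed, the following is a minimal sketch of what a `get_new_stories()` helper could look like against the public Hacker News Firebase API. It is not the code from this PR: the signature, the `limit` and `fetch` parameters, and the filtering rules are assumptions made for illustration. Only the two endpoint URLs and the item fields (`type`, `dead`, `deleted`) come from the public API.

```python
import json
from typing import Callable, List
from urllib.request import urlopen

HN_API = "https://hacker-news.firebaseio.com/v0"


def _http_get_json(url: str):
    """Default fetcher: plain HTTP GET that decodes a JSON body."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def get_new_stories(limit: int = 30,
                    fetch: Callable[[str], object] = _http_get_json) -> List[dict]:
    """Return up to `limit` live stories from the newest feed, newest first.

    Hypothetical signature: `fetch` is injectable so the function can be
    exercised offline with canned responses.
    """
    ids = fetch(f"{HN_API}/newstories.json") or []
    stories = []
    for story_id in ids[:limit]:
        item = fetch(f"{HN_API}/item/{story_id}.json")
        # Items can be missing (null), dead, or deleted; skip those.
        if not item or item.get("type") != "story":
            continue
        if item.get("dead") or item.get("deleted"):
            continue
        stories.append(item)
    return stories
```

Injecting the fetcher keeps tests deterministic, which matters because the newest feed changes from minute to minute.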

What changed over the full PR timeline

This PR originally added more OpenMeteo templates and broad test coverage. After review feedback it was reworked to meet the Significance Gate priorities.
The final result is a pivot to HackerNews templates that target capability gaps, with the OpenMeteo expansion removed.

Why this improves evaluation value

The new templates are designed to cover capabilities that were previously weakly represented:

- Time-window reasoning on newest story timestamps
- Story → item nested comment structure checks
- Keyword scanning across ranked newest titles
- Author/user profile navigation and karma comparison
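
The four capabilities above can be illustrated with a rough sketch of the kind of ground-truth computation each one implies. These functions are hypothetical and are not taken from the PR; the function names and exact rules (inclusive window bounds, case-insensitive substring match, absolute karma difference) are assumptions. The field names `time` and `kids` follow the public Hacker News item schema.

```python
from typing import Dict, List, Optional


def recent_burst_count(stories: List[dict], now_ts: int, window_minutes: int) -> int:
    """Count stories whose Unix `time` falls in [now - window, now]."""
    cutoff = now_ts - window_minutes * 60
    return sum(1 for s in stories if cutoff <= s.get("time", 0) <= now_ts)


def comment_tree_size(items: Dict[int, dict], root_id: int) -> int:
    """Count all descendants of `root_id` by following `kids` ID lists."""
    count = 0
    stack = list(items.get(root_id, {}).get("kids", []))
    while stack:
        item = items.get(stack.pop())
        if item is None:
            continue  # child not present in the fetched data
        count += 1
        stack.extend(item.get("kids", []))
    return count


def keyword_scan_rank(titles: List[str], keyword: str) -> Optional[int]:
    """Return the 1-based rank of the first title containing `keyword`."""
    needle = keyword.casefold()
    for rank, title in enumerate(titles, start=1):
        if needle in title.casefold():
            return rank
    return None


def karma_gap(karma_a: int, karma_b: int) -> int:
    """Absolute karma difference between two user profiles."""
    return abs(karma_a - karma_b)
```

Each function reads only values that an agent would also see on the site (timestamps, comment nesting, titles, karma), which is what lets the API and the page serve as the same data source.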

Validation

Ran local tests for final scope and regression confidence:

PYTHONPATH=. pytest -q tests/test_hackernews_gap_templates.py
PYTHONPATH=. pytest -q tests/test_openmeteo_integration.py tests/test_arxiv_integration.py
PYTHONPATH=. pytest -q tests/plugins/taostats/test_api_client.py tests/plugins/taostats/test_empty_subnet_name.py

All passed locally.

This adds richer forecast reasoning tasks, shared GT extraction helpers, registry integration, and broad deterministic test coverage to increase evaluation depth and template diversity. Made-with: Cursor

kiannidev · 2026-03-24T23:13:59Z

Hi, maintainers.
Please review the PR and give me some feedback.
Thanks

angosr

Review: PR #12 — REJECT on direction (Significance Gate failure)

This PR fails the Significance Gate — do not iterate on details.

1. Duplicates existing capability dimensions, fills zero gaps

CLAUDE.md Evaluation Value table identifies these gaps:

Time-sensitive events ❌
Nested structure navigation ❌
Search-driven interaction ❌
User-generated content ❌

This PR adds 6 more templates to OpenMeteo (weather numerical computation) — a dimension already fully covered by templates 85-88, with PR #14 adding 3 more. None of the 6 templates address any gap:

| PR #12 template | Capability | Already covered by |
| --- | --- | --- |
| `daily_range` (max−min temp) | Single-page arithmetic | T87 (hourly_extrema), T88 (forecast_trend) |
| `precip_window_count` (sliding window) | Threshold counting | PR #14 T99 (hourly_threshold) |
| `humidity_band_hours` (count hours in band) | Threshold counting | PR #14 T99 (hourly_threshold), almost identical |
| `wind_shift` (max consecutive Δ) | Hourly scan + arithmetic | T87 (hourly_extrema) variant |
| `city_pair_forecast_gap` (two-city compare) | Cross-city comparison | T86 (comparison), same dimension |
| `comfort_index` (formula from 3 fields) | Derived metric computation | See issue #2 below |

Adding 6 templates in a covered dimension while 4 gap dimensions remain empty is the wrong priority.

2. `comfort_index` has a fundamental design flaw

The template asks the agent to compute CI = T - 0.2W - 0.05H — a formula that does NOT exist on the Open-Meteo website. The agent must:

- Read temperature, wind speed, humidity from the page
- Apply an arbitrary formula the question defines

This tests arithmetic ability, not web interaction ability. An LLM that reads the three values from the question + makes up plausible numbers could score well. The "comfort index" is not a real metric on Open-Meteo — it's a synthetic computation injected by the template.

CLAUDE.md Template Design §2 (Verifiability): "API response and website display must share the same data source." The comfort index has no data source — it's computed by the template.
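
To make the objection concrete, here is a minimal sketch of the formula quoted above. The function name, parameter names, and units are assumptions for illustration; the template's actual code is not shown in this thread.

```python
def comfort_index(temp: float, wind: float, humidity: float) -> float:
    """CI = T - 0.2W - 0.05H, as quoted in the review.

    Units are assumed (for example degrees C, km/h, percent); the review
    does not state them.
    """
    return temp - 0.2 * wind - 0.05 * humidity


# Illustrative inputs only, not real Open-Meteo readings.
ci = comfort_index(temp=20.0, wind=10.0, humidity=50.0)
```

With these made-up inputs the arithmetic is 20 − 2 − 2.5 = 15.5. The three inputs appear on the page, but the result appears nowhere on the site, so there is no displayed value to check the ground truth against.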

3. Template ID conflict

IDs 96-101 conflict with both PR #13 (96-98, OpenLibrary) and PR #14 (99-101, OpenMeteo). This PR was created before either, but the IDs must be coordinated.
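
The overlap described here can be checked mechanically. The sketch below is a hypothetical helper, not part of the repository's TaskRegistry; it compares a proposed ID range against the ranges claimed by other PRs.

```python
from typing import Dict, Iterable, Set


def find_id_conflicts(proposed: Iterable[int],
                      claimed: Dict[str, Iterable[int]]) -> Dict[str, Set[int]]:
    """Map each owner to the proposed IDs it has already claimed."""
    proposed_set = set(proposed)
    conflicts: Dict[str, Set[int]] = {}
    for owner, ids in claimed.items():
        overlap = proposed_set & set(ids)
        if overlap:
            conflicts[owner] = overlap
    return conflicts


# ID ranges as stated in this review.
claimed_ids = {"PR #13": range(96, 99), "PR #14": range(99, 102)}
original_conflicts = find_id_conflicts(range(96, 102), claimed_ids)
reworked_conflicts = find_id_conflicts(range(110, 114), claimed_ids)
```

Against these two PRs, the original range 96-101 collides on every ID, while the reworked range 110-113 collides on none.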

4. No Red Team Review, no eval.py

- Zero Red Team analysis for any of the 6 templates
- No eval.py results
- Single Cursor-generated commit with 1808 lines, with no evidence of iterative design or manual verification

5. Unrelated scope change: lazy-loading BrowserEngine

The __init__.py change (lazy-load BrowserEngine/BrowserSession) is unrelated to the templates and should not be bundled.

Recommendation

Close this PR. If the author wants to contribute OpenMeteo templates, focus on a capability dimension that is NOT already covered — or contribute templates for an entirely different website that fills a gap (time-sensitive events, nested navigation, search-driven interaction, user-generated content).

kiannidev · 2026-03-26T12:26:39Z

Hi, @angosr
I’ve reworked the PR direction based on your Significance Gate feedback.
Please check it again.
Thanks

Expand OpenMeteo task coverage with six new high-variance templates.

74ce7bd

This adds richer forecast reasoning tasks, shared GT extraction helpers, registry integration, and broad deterministic test coverage to increase evaluation depth and template diversity. Made-with: Cursor

angosr requested changes Mar 26, 2026

update from feedback

6d59593

kiannidev changed the title ~~feat(openmeteo): add six advanced templates and expanded GT test coverage~~ feat(hackernews): add gap-filling templates for time, nested, search, and UGC Mar 26, 2026

kiannidev changed the title ~~feat(hackernews): add gap-filling templates for time, nested, search, and UGC~~ refactor(hackernews): replace openmeteo expansion with gap-filling templates Mar 26, 2026

kiannidev requested a review from angosr March 26, 2026 12:26

angosr mentioned this pull request Mar 27, 2026

feat: add template red team dashboard CLI #16

Open

kiannidev wants to merge 2 commits into AffineFoundation:main from kiannidev:feat/openmeteo-expanded-templates
