
feat: stabilize liveweb arena eval execution#8

Open
wangtong10086 wants to merge 23 commits into AffineFoundation:audit-fixes-2026-03-13 from wangtong10086:codex/liveweb-arena-stability-20260314

Conversation

@wangtong10086

Summary

  • stabilize eval execution and supporting core flows
  • bypass proxy for local/private LLM endpoints
  • add regression coverage for llm client proxy handling

Testing

  • env UV_CACHE_DIR=/tmp/uv-cache uv run --no-sync python -m compileall liveweb_arena/utils/llm_client.py tests/core/test_llm_client.py
  • uv run pytest tests/core/test_llm_client.py could not be run in the sandbox: dependency installation requires network access, which is blocked

MkDev11 added a commit to MkDev11/liveweb-arena that referenced this pull request Mar 25, 2026
…COUNTS

Address PR review AffineFoundation#8:

1. BLOCKING: Restore original AUTHOR_POOL (70 authors) exactly as on main
   to preserve author_editions reproducibility. Create separate
   ENGAGEMENT_AUTHOR_POOL (81 authors) for T96/T97/T98.

2. BLOCKING: Add _LOWEST_RESULT_COUNTS=[3,5,7] for lowest extrema to
   avoid missing-as-zero domination of want_to_read_count at high
   work_counts (41% of authors affected at work_count=25).

3. NON-BLOCKING: Add comment explaining limit=25 in openlibrary.py.

Variant space update: T96 = 81 × (2×7 + 1×3) = 1,377 nominal variants.
angosr pushed a commit that referenced this pull request Mar 27, 2026
* refactor(openlibrary): extract author-search helpers to common.py

Move normalize_author_fragment, extract_author_filter, and
find_author_search_entry from author_editions.py class methods into
common.py as module-level functions. This eliminates duplication for
upcoming author-based templates that need the same lookup logic.

* feat(openlibrary): add author_engagement_extrema template (ID 96)

Find the book with the highest/lowest engagement metric among an
author's top N search results. Uses confirmed-visible fields only:
want_to_read_count, already_read_count, ratings_count.

Variant space: 70 authors × 2 extrema × 3 metrics × 4 counts = 1,680.
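The extrema lookup this commit describes can be sketched as follows. The field names mirror those listed above, but the entry shape (plain dicts) and the tie-breaking rule (first entry in search order wins, which is what Python's max()/min() give for free) are assumptions, not the repository's actual implementation:

```python
def pick_extremum(entries, metric, extremum="highest"):
    """Return the entry with the highest/lowest value for `metric`.

    Ties: max()/min() keep the first maximal/minimal entry, i.e. the
    earliest result in search order (assumed tie-break rule).
    Missing counts are read as 0 (the OL API omits zero counts).
    """
    if not entries:
        raise ValueError("no search results to rank")
    chooser = max if extremum == "highest" else min
    return chooser(entries, key=lambda e: e.get(metric, 0))

results = [
    {"title": "Book A", "want_to_read_count": 12},
    {"title": "Book B", "want_to_read_count": 40},
    {"title": "Book C"},  # field absent -> treated as 0
]
print(pick_extremum(results, "want_to_read_count")["title"])            # Book B
print(pick_extremum(results, "want_to_read_count", "lowest")["title"])  # Book C
```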

* feat(openlibrary): add author_comparison template (ID 97)

Compare aggregate engagement metrics between two authors' top N search
results. Requires two separate author searches and cross-page comparison.

Variant space: C(70,2) × 3 metrics × 2 counts = 14,490.

* feat(openlibrary): add reading_stats_filter template (ID 98)

Count books in an author's catalog meeting an engagement threshold.
Requires scanning each book's metric against a threshold — cannot be
solved by sorting a single column.

Variant space: 70 authors × 3 metrics × 4 thresholds × 2 counts = 1,680.
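The threshold count described above amounts to a linear scan over the author's works. A minimal sketch, assuming dict-shaped entries, missing-as-zero handling, and a >= comparison (the exact operator is an assumption):

```python
def count_above_threshold(entries, metric, threshold):
    """Count works whose engagement metric meets the threshold (>= assumed).

    Missing fields are read as 0, matching how the OL API omits zero counts.
    """
    return sum(1 for e in entries if e.get(metric, 0) >= threshold)

works = [
    {"want_to_read_count": 3},
    {"want_to_read_count": 10},
    {},  # missing -> 0
]
print(count_above_threshold(works, "want_to_read_count", 5))  # 1
```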

* test(openlibrary): add tests for engagement & comparison templates

56 tests covering:
- Template registration and generation invariants
- author_engagement_extrema GT: highest/lowest, tie-breaking, missing data
- author_comparison GT: higher total, reverse winner, tie, missing author
- reading_stats_filter GT: threshold counting, zero matches, exact boundary
- Task registry wiring (IDs 96, 97, 98, Version 7)
- Shared helper refactoring (common.py functions)
- Cross-template consistency (serialization, GT source, cache source)

* fix: accept plain-text author queries in find_author_search_entry

* fix(openlibrary): reduce live GT not_collected for author templates

* docs(pr): update description

* fix: address PR #13 review — remove broken authors, drop already_read_count, clean up

BLOCKING fixes:
- Remove 9 authors from AUTHOR_POOL: 4 broken on OL API (<10 results:
  Dostoevsky, Murakami, Chekhov, Octavia Butler) and 5 with sparse
  ratings_count (<50% present in top 10: Bronte, Tolstoy, Whitman,
  Dickinson, Tagore). Pool: 70 → 61.
- Remove already_read_count from EngagementMetric, AuthorMetric, and
  ReaderMetric enums — not visible on search results page (only
  want_to_read and ratings counts are rendered).

NON-BLOCKING fixes:
- Add comment in author_editions.py documenting allow_unsorted_fallback
  asymmetry between existing and new templates.
- Remove pr_description.md from repository.

Tests updated to reflect metric and pool changes. 106 passed.

* fix: treat missing engagement metrics as 0 instead of hard-failing

The OL API omits count fields (ratings_count, want_to_read_count) when
the value is zero, rather than returning 0. Previously the GT methods
returned GroundTruthResult.fail() for missing fields, causing hard
failures for works that simply haven't been rated yet.

Now treats absent metrics as 0.0, which is semantically correct and
consistent with how the OL API represents zero-count data. This
prevents GT failures for individual works missing ratings_count even
among authors that generally have good data coverage.

Also fixes _make_search_entry type hint (sort: Optional[str]) and
removes unused title variables flagged by ruff.

* fix: handle non-numeric metric values without TypeError

If a metric field contains a non-numeric string like 'N/A',
parse_numeric() returns None. Previously this None was passed to
int(value) or numeric comparisons, causing a TypeError at runtime.

Now the fallback chain is: raw → parse_numeric(raw) → 0.0 if None.
This covers both absent fields (raw is None) and non-numeric strings
(parse_numeric returns None).

Adds regression test for 'N/A' metric values.

* refactor: extract safe_metric_value helper to reduce duplication

The 3-line metric normalization pattern (raw → parse_numeric → fallback
to 0.0) was duplicated across all 3 new templates. Extracted to
safe_metric_value() in common.py, reducing each call site to a single
line and ensuring consistent handling of absent/non-numeric fields.
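At this stage of the PR, the helper normalizes raw → parse_numeric(raw) → 0.0 as described. A minimal sketch; the real parse_numeric lives in the repo's common.py, so the version below is a stand-in:

```python
from typing import Any, Optional

def parse_numeric(raw: Any) -> Optional[float]:
    """Stand-in for the repo's parser: best-effort float conversion."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None

def safe_metric_value(entry: dict, metric: str) -> float:
    """Normalize a metric field: raw -> parse_numeric(raw) -> 0.0.

    Covers both absent fields (raw is None) and non-numeric strings
    ('N/A' -> parse_numeric returns None), as the commit describes.
    """
    raw = entry.get(metric)
    value = parse_numeric(raw)
    return 0.0 if value is None else value

print(safe_metric_value({"ratings_count": "N/A"}, "ratings_count"))  # 0.0
print(safe_metric_value({}, "ratings_count"))                        # 0.0
print(safe_metric_value({"ratings_count": 7}, "ratings_count"))      # 7.0
```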

* fix: drop ratings_count from all templates, fail on non-numeric data

BLOCKING: ratings_count is missing for 56% of authors in the OL API,
causing wrong GT for extrema-lowest queries (missing-as-zero always
wins). Dropped ratings_count from EngagementMetric, AuthorMetric, and
ReaderMetric — all templates now use only want_to_read_count.

Expanded RESULT_COUNTS to keep variant space above 500 minimum:
- T96 (engagement_extrema): [3,5,7,10,15] → 61×2×1×5 = 610
- T97 (comparison): unchanged [3,5] → C(61,2)×1×2 = 3,660
- T98 (reading_stats_filter): [5,10,15] → 61×1×4×3 = 732

NON-BLOCKING: safe_metric_value now raises ValueError on non-null
non-numeric values (e.g. 'N/A') instead of silently treating them
as 0. Missing (None) values still default to 0. Callers catch
ValueError and surface it as GroundTruthResult.fail().

* fix: docstring drift and add non-numeric regression tests for comparison/filter

- Fix docstrings in author_engagement_extrema.py and reading_stats_filter.py
  that still mentioned 'ratings' after ratings_count was dropped.
- Add non-numeric metric regression tests for comparison and filter templates
  to match the existing extrema test, ensuring all 3 safe_metric_value
  call sites are explicitly tested for ValueError handling.

* fix: restore ratings_count with targeted exclusions for anti-memorization

BLOCKING: With a single metric (want_to_read_count), the entire answer
space was enumerable from 61 API calls (~5,000 entries). Restoring
ratings_count as a second metric dimension breaks trivial enumeration.

Changes:
- Remove 5 authors with worst ratings_count coverage (Emerson, Joyce,
  Melville, Hawthorne, P.K. Dick). Pool: 61 → 56.
- Restore ratings_count to EngagementMetric, AuthorMetric, ReaderMetric.
- T96: exclude ratings_count from extrema=lowest only (where
  missing-as-zero would always win). Highest/comparison/filter are
  unaffected by the bias.
- T96 RESULT_COUNTS expanded to [3,5,7,10,12,15] (6 values).
- Restore THRESHOLDS for ratings_count in T98.

Variant spaces (all >1000):
- T96: 56 × (highest×2 + lowest×1) × 6 = 1,008
- T97: C(56,2) × 2 × 2 = 6,160
- T98: 56 × 2 × 4 × 3 = 1,344

Adds test_extrema_lowest_excludes_ratings_count to verify the
per-extrema metric filtering. 364 tests pass.
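The variant-space figures in this commit can be checked directly (factor meanings taken from the commit text):

```python
from math import comb

pool = 56                    # 61 authors minus the 5 removals
t96 = pool * (2 + 1) * 6     # (2 metrics for highest + 1 for lowest) x 6 counts
t97 = comb(pool, 2) * 2 * 2  # author pairs x 2 metrics x 2 counts
t98 = pool * 2 * 4 * 3       # authors x 2 metrics x 4 thresholds x 3 counts
print(t96, t97, t98)  # 1008 6160 1344
```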

* fix(openlibrary): expand AUTHOR_POOL and RESULT_COUNTS for T96 variant space

- Add 25 authors to AUTHOR_POOL (56→81) for anti-memorization
- Change T96 RESULT_COUNTS from [3,5,7,10,12,15] to [3,5,7,10,15,20,25]
  to increase lowest-extrema differentiation
- Effective variant space: ~583 (16.6% margin above 500 threshold)
- Update docstrings: T96=1,701 T97=12,960 T98=1,944 variants
- Fix AUTHOR_POOL section comments to reflect actual counts
- Split test file (481+490 lines, both <500)
- Remove unused get_registered_templates import
- Add tests: pool size=81, no duplicates, ratings_count GT

* fix(openlibrary): raise search fetch limit to 25 for T96 work_count=25

The collector hardcoded limit=20 but RESULT_COUNTS includes 25, causing
guaranteed GT failure for 1/7 of T96 variants. Raise limit to match.

Add regression test: test_extrema_gt_succeeds_with_25_works

* fix(openlibrary): separate ENGAGEMENT_AUTHOR_POOL, cap lowest RESULT_COUNTS

Address PR review #8:

1. BLOCKING: Restore original AUTHOR_POOL (70 authors) exactly as on main
   to preserve author_editions reproducibility. Create separate
   ENGAGEMENT_AUTHOR_POOL (81 authors) for T96/T97/T98.

2. BLOCKING: Add _LOWEST_RESULT_COUNTS=[3,5,7] for lowest extrema to
   avoid missing-as-zero domination of want_to_read_count at high
   work_counts (41% of authors affected at work_count=25).

3. NON-BLOCKING: Add comment explaining limit=25 in openlibrary.py.

Variant space update: T96 = 81 × (2×7 + 1×3) = 1,377 nominal variants.
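The per-extrema count split yields the 1,377 figure as follows. The list values are taken from the commits above; the variable names are illustrative, not necessarily the repo's:

```python
POOL_SIZE = 81                               # ENGAGEMENT_AUTHOR_POOL
RESULT_COUNTS = [3, 5, 7, 10, 15, 20, 25]    # 7 values, used for extrema=highest
_LOWEST_RESULT_COUNTS = [3, 5, 7]            # 3 values, capped for extrema=lowest

# highest: 2 metrics x 7 counts; lowest: 1 metric (ratings_count excluded) x 3 counts
per_author = 2 * len(RESULT_COUNTS) + 1 * len(_LOWEST_RESULT_COUNTS)
print(POOL_SIZE * per_author)  # 1377
```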

* fix(openlibrary): address PR #13 review — deterministic GT, numeric T97, strict metrics

BLOCKING fixes:
- Remove allow_unsorted_fallback=True from all 3 templates (T96/T97/T98).
  GT now strictly requires sort=editions data, matching the question text.
  If the agent doesn't visit the sorted page, GT correctly returns
  not_collected instead of silently using wrong-order results.

- Make safe_metric_value fail on missing ratings_count instead of
  defaulting to 0. Only want_to_read_count (high API coverage) defaults
  to 0 when absent. ratings_count absence raises ValueError → GT fail,
  preventing semantically wrong answers from sparse data.

- Redesign T97 (author_comparison) from binary "which author has more?"
  (50% random baseline) to numeric "what is the absolute difference?"
  (near-0% random baseline). GT returns str(abs(sum_a - sum_b)).

- Add Version 7 coordination comment for PR #14 (IDs 99-101 → Version 8).

NON-BLOCKING fixes:
- Derive ENGAGEMENT_AUTHOR_POOL from AUTHOR_POOL via exclusion set +
  additions list, eliminating 56-entry duplication and preventing drift.
  AUTHOR_POOL itself is unchanged (author_editions reproducibility).

- Remove stale allow_unsorted_fallback asymmetry comment from
  author_editions.py (all templates now consistently use strict sort).

Tests: 372 passed (118 OpenLibrary, 254 other).
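Under these changes, the helper's default becomes metric-dependent and T97's ground truth is a numeric difference. A hedged sketch: the names follow the commit, but the exact signatures and error messages are assumptions:

```python
def safe_metric_value(entry: dict, metric: str) -> float:
    """Metric-dependent handling of absent fields, per this commit:

    - want_to_read_count: absent -> 0.0 (high API coverage, omitted-as-zero)
    - ratings_count: absent -> ValueError, surfaced by callers as GT fail
    Non-null non-numeric values (e.g. 'N/A') always raise.
    """
    raw = entry.get(metric)
    if raw is None:
        if metric == "want_to_read_count":
            return 0.0
        raise ValueError(f"{metric} missing; cannot default to 0")
    try:
        return float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"non-numeric {metric}: {raw!r}")

def comparison_gt(works_a, works_b, metric):
    """T97 ground truth: absolute difference of the two authors' totals."""
    sum_a = sum(safe_metric_value(w, metric) for w in works_a)
    sum_b = sum(safe_metric_value(w, metric) for w in works_b)
    return str(abs(sum_a - sum_b))
```

Returning the difference as a string matches the commit's GT description (str(abs(sum_a - sum_b))); whether the repo formats it as an int or float is not stated here.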

* fix(openlibrary): cap ratings_count variants to low N to reduce GT-fail from sparse OL data

ratings_count is missing for 20-40% of authors at N≥7. Restrict
ratings_count variants to N∈{3,5} (T96) and N=5 (T98) where
coverage is highest, cutting estimated GT-fail exposure from
~14%/~26% to ~4%/~11%. T97 already at [3,5] — unchanged.

* test(openlibrary): verify GT computation with real OL API data

Fetch live data (March 26, 2026) for Agatha Christie, Stephen King, and Neil Gaiman via the sort=editions search API. Inject it into the GT collector and verify that all three templates (T96/T97/T98) return concrete values with both want_to_read_count and ratings_count metrics.

12 tests cover: highest/lowest extrema, cross-author numeric difference,
and threshold counting — satisfying CLAUDE.md §5 item 1.

---------

Co-authored-by: mkdev11 <[email protected]>
