
feat: stabilize liveweb arena eval execution#8

Open
wangtong10086 wants to merge 23 commits into AffineFoundation:audit-fixes-2026-03-13 from wangtong10086:codex/liveweb-arena-stability-20260314

Conversation

@wangtong10086

Summary

  • stabilize eval execution and supporting core flows
  • bypass proxy for local/private LLM endpoints
  • add regression coverage for llm client proxy handling

Testing

  • env UV_CACHE_DIR=/tmp/uv-cache uv run --no-sync python -m compileall liveweb_arena/utils/llm_client.py tests/core/test_llm_client.py
  • uv run pytest tests/core/test_llm_client.py could not be run in the sandbox: dependency installation requires network access, which is blocked

MkDev11 added a commit to MkDev11/liveweb-arena that referenced this pull request Mar 25, 2026
…COUNTS

Address PR review AffineFoundation#8:

1. BLOCKING: Restore original AUTHOR_POOL (70 authors) exactly as on main
   to preserve author_editions reproducibility. Create separate
   ENGAGEMENT_AUTHOR_POOL (81 authors) for T96/T97/T98.

2. BLOCKING: Add _LOWEST_RESULT_COUNTS=[3,5,7] for lowest extrema to
   avoid missing-as-zero domination of want_to_read_count at high
   work_counts (41% of authors affected at work_count=25).

3. NON-BLOCKING: Add comment explaining limit=25 in openlibrary.py.

Variant space update: T96 = 81 × (2×7 + 1×3) = 1,377 nominal variants.
angosr pushed a commit that referenced this pull request Mar 27, 2026
* refactor(openlibrary): extract author-search helpers to common.py

Move normalize_author_fragment, extract_author_filter, and
find_author_search_entry from author_editions.py class methods into
common.py as module-level functions. This eliminates duplication for
upcoming author-based templates that need the same lookup logic.

* feat(openlibrary): add author_engagement_extrema template (ID 96)

Find the book with the highest/lowest engagement metric among an
author's top N search results. Uses confirmed-visible fields only:
want_to_read_count, already_read_count, ratings_count.

Variant space: 70 authors × 2 extrema × 3 metrics × 4 counts = 1,680.
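The extrema lookup this commit describes can be sketched as follows. The field names mirror those listed above, but the entry shape (plain dicts) and the tie-breaking rule (first entry in search order wins, which is what Python's max()/min() give for free) are assumptions, not the repository's actual implementation:

```python
def pick_extremum(entries, metric, extremum="highest"):
    """Return the entry with the highest/lowest value for `metric`.

    Ties: max()/min() keep the first maximal/minimal entry, i.e. the
    earliest result in search order (assumed tie-break rule).
    Missing counts are read as 0 (the OL API omits zero counts).
    """
    if not entries:
        raise ValueError("no search results to rank")
    chooser = max if extremum == "highest" else min
    return chooser(entries, key=lambda e: e.get(metric, 0))

results = [
    {"title": "Book A", "want_to_read_count": 12},
    {"title": "Book B", "want_to_read_count": 40},
    {"title": "Book C"},  # field absent -> treated as 0
]
print(pick_extremum(results, "want_to_read_count")["title"])            # Book B
print(pick_extremum(results, "want_to_read_count", "lowest")["title"])  # Book C
```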

* feat(openlibrary): add author_comparison template (ID 97)

Compare aggregate engagement metrics between two authors' top N search
results. Requires two separate author searches and cross-page comparison.

Variant space: C(70,2) × 3 metrics × 2 counts = 14,490.

* feat(openlibrary): add reading_stats_filter template (ID 98)

Count books in an author's catalog meeting an engagement threshold.
Requires scanning each book's metric against a threshold — cannot be
solved by sorting a single column.

Variant space: 70 authors × 3 metrics × 4 thresholds × 2 counts = 1,680.
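The threshold count described above amounts to a linear scan over the author's works. A minimal sketch, assuming dict-shaped entries, missing-as-zero handling, and a >= comparison (the exact operator is an assumption):

```python
def count_above_threshold(entries, metric, threshold):
    """Count works whose engagement metric meets the threshold (>= assumed).

    Missing fields are read as 0, matching how the OL API omits zero counts.
    """
    return sum(1 for e in entries if e.get(metric, 0) >= threshold)

works = [
    {"want_to_read_count": 3},
    {"want_to_read_count": 10},
    {},  # missing -> 0
]
print(count_above_threshold(works, "want_to_read_count", 5))  # 1
```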

* test(openlibrary): add tests for engagement & comparison templates

56 tests covering:
- Template registration and generation invariants
- author_engagement_extrema GT: highest/lowest, tie-breaking, missing data
- author_comparison GT: higher total, reverse winner, tie, missing author
- reading_stats_filter GT: threshold counting, zero matches, exact boundary
- Task registry wiring (IDs 96, 97, 98, Version 7)
- Shared helper refactoring (common.py functions)
- Cross-template consistency (serialization, GT source, cache source)

* fix: accept plain-text author queries in find_author_search_entry

* fix(openlibrary): reduce live GT not_collected for author templates

* docs(pr): update description

* fix: address PR #13 review — remove broken authors, drop already_read_count, clean up

BLOCKING fixes:
- Remove 9 authors from AUTHOR_POOL: 4 broken on OL API (<10 results:
  Dostoevsky, Murakami, Chekhov, Octavia Butler) and 5 with sparse
  ratings_count (<50% present in top 10: Bronte, Tolstoy, Whitman,
  Dickinson, Tagore). Pool: 70 → 61.
- Remove already_read_count from EngagementMetric, AuthorMetric, and
  ReaderMetric enums — not visible on search results page (only
  want_to_read and ratings counts are rendered).

NON-BLOCKING fixes:
- Add comment in author_editions.py documenting allow_unsorted_fallback
  asymmetry between existing and new templates.
- Remove pr_description.md from repository.

Tests updated to reflect metric and pool changes. 106 passed.

* fix: treat missing engagement metrics as 0 instead of hard-failing

The OL API omits count fields (ratings_count, want_to_read_count) when
the value is zero, rather than returning 0. Previously the GT methods
returned GroundTruthResult.fail() for missing fields, causing hard
failures for works that simply haven't been rated yet.

Now treats absent metrics as 0.0, which is semantically correct and
consistent with how the OL API represents zero-count data. This
prevents GT failures for individual works missing ratings_count even
among authors that generally have good data coverage.

Also fixes _make_search_entry type hint (sort: Optional[str]) and
removes unused title variables flagged by ruff.

* fix: handle non-numeric metric values without TypeError

If a metric field contains a non-numeric string like 'N/A',
parse_numeric() returns None. Previously this None was passed to
int(value) or numeric comparisons, causing a TypeError at runtime.

Now the fallback chain is: raw → parse_numeric(raw) → 0.0 if None.
This covers both absent fields (raw is None) and non-numeric strings
(parse_numeric returns None).

Adds regression test for 'N/A' metric values.

* refactor: extract safe_metric_value helper to reduce duplication

The 3-line metric normalization pattern (raw → parse_numeric → fallback
to 0.0) was duplicated across all 3 new templates. Extracted to
safe_metric_value() in common.py, reducing each call site to a single
line and ensuring consistent handling of absent/non-numeric fields.
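At this stage of the PR, the helper normalizes raw → parse_numeric(raw) → 0.0 as described. A minimal sketch; the real parse_numeric lives in the repo's common.py, so the version below is a stand-in:

```python
from typing import Any, Optional

def parse_numeric(raw: Any) -> Optional[float]:
    """Stand-in for the repo's parser: best-effort float conversion."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None

def safe_metric_value(entry: dict, metric: str) -> float:
    """Normalize a metric field: raw -> parse_numeric(raw) -> 0.0.

    Covers both absent fields (raw is None) and non-numeric strings
    ('N/A' -> parse_numeric returns None), as the commit describes.
    """
    raw = entry.get(metric)
    value = parse_numeric(raw)
    return 0.0 if value is None else value

print(safe_metric_value({"ratings_count": "N/A"}, "ratings_count"))  # 0.0
print(safe_metric_value({}, "ratings_count"))                        # 0.0
print(safe_metric_value({"ratings_count": 7}, "ratings_count"))      # 7.0
```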

* fix: drop ratings_count from all templates, fail on non-numeric data

BLOCKING: ratings_count is missing for 56% of authors in the OL API,
causing wrong GT for extrema-lowest queries (missing-as-zero always
wins). Dropped ratings_count from EngagementMetric, AuthorMetric, and
ReaderMetric — all templates now use only want_to_read_count.

Expanded RESULT_COUNTS to keep variant space above 500 minimum:
- T96 (engagement_extrema): [3,5,7,10,15] → 61×2×1×5 = 610
- T97 (comparison): unchanged [3,5] → C(61,2)×1×2 = 3,660
- T98 (reading_stats_filter): [5,10,15] → 61×1×4×3 = 732

NON-BLOCKING: safe_metric_value now raises ValueError on non-null
non-numeric values (e.g. 'N/A') instead of silently treating them
as 0. Missing (None) values still default to 0. Callers catch
ValueError and surface it as GroundTruthResult.fail().

* fix: docstring drift and add non-numeric regression tests for comparison/filter

- Fix docstrings in author_engagement_extrema.py and reading_stats_filter.py
  that still mentioned 'ratings' after ratings_count was dropped.
- Add non-numeric metric regression tests for comparison and filter templates
  to match the existing extrema test, ensuring all 3 safe_metric_value
  call sites are explicitly tested for ValueError handling.

* fix: restore ratings_count with targeted exclusions for anti-memorization

BLOCKING: With a single metric (want_to_read_count), the entire answer
space was enumerable from 61 API calls (~5,000 entries). Restoring
ratings_count as a second metric dimension breaks trivial enumeration.

Changes:
- Remove 5 authors with worst ratings_count coverage (Emerson, Joyce,
  Melville, Hawthorne, P.K. Dick). Pool: 61 → 56.
- Restore ratings_count to EngagementMetric, AuthorMetric, ReaderMetric.
- T96: exclude ratings_count from extrema=lowest only (where
  missing-as-zero would always win). Highest/comparison/filter are
  unaffected by the bias.
- T96 RESULT_COUNTS expanded to [3,5,7,10,12,15] (6 values).
- Restore THRESHOLDS for ratings_count in T98.

Variant spaces (all >1000):
- T96: 56 × (highest×2 + lowest×1) × 6 = 1,008
- T97: C(56,2) × 2 × 2 = 6,160
- T98: 56 × 2 × 4 × 3 = 1,344

Adds test_extrema_lowest_excludes_ratings_count to verify the
per-extrema metric filtering. 364 tests pass.
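The variant-space figures in this commit can be checked directly (factor meanings taken from the commit text):

```python
from math import comb

pool = 56                    # 61 authors minus the 5 removals
t96 = pool * (2 + 1) * 6     # (2 metrics for highest + 1 for lowest) x 6 counts
t97 = comb(pool, 2) * 2 * 2  # author pairs x 2 metrics x 2 counts
t98 = pool * 2 * 4 * 3       # authors x 2 metrics x 4 thresholds x 3 counts
print(t96, t97, t98)  # 1008 6160 1344
```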

* fix(openlibrary): expand AUTHOR_POOL and RESULT_COUNTS for T96 variant space

- Add 25 authors to AUTHOR_POOL (56→81) for anti-memorization
- Change T96 RESULT_COUNTS from [3,5,7,10,12,15] to [3,5,7,10,15,20,25]
  to increase lowest-extrema differentiation
- Effective variant space: ~583 (16.6% margin above 500 threshold)
- Update docstrings: T96=1,701 T97=12,960 T98=1,944 variants
- Fix AUTHOR_POOL section comments to reflect actual counts
- Split test file (481+490 lines, both <500)
- Remove unused get_registered_templates import
- Add tests: pool size=81, no duplicates, ratings_count GT

* fix(openlibrary): raise search fetch limit to 25 for T96 work_count=25

The collector hardcoded limit=20 but RESULT_COUNTS includes 25, causing
guaranteed GT failure for 1/7 of T96 variants. Raise limit to match.

Add regression test: test_extrema_gt_succeeds_with_25_works

* fix(openlibrary): separate ENGAGEMENT_AUTHOR_POOL, cap lowest RESULT_COUNTS

Address PR review #8:

1. BLOCKING: Restore original AUTHOR_POOL (70 authors) exactly as on main
   to preserve author_editions reproducibility. Create separate
   ENGAGEMENT_AUTHOR_POOL (81 authors) for T96/T97/T98.

2. BLOCKING: Add _LOWEST_RESULT_COUNTS=[3,5,7] for lowest extrema to
   avoid missing-as-zero domination of want_to_read_count at high
   work_counts (41% of authors affected at work_count=25).

3. NON-BLOCKING: Add comment explaining limit=25 in openlibrary.py.

Variant space update: T96 = 81 × (2×7 + 1×3) = 1,377 nominal variants.
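The per-extrema count split yields the 1,377 figure as follows. The list values are taken from the commits above; the variable names are illustrative, not necessarily the repo's:

```python
POOL_SIZE = 81                               # ENGAGEMENT_AUTHOR_POOL
RESULT_COUNTS = [3, 5, 7, 10, 15, 20, 25]    # 7 values, used for extrema=highest
_LOWEST_RESULT_COUNTS = [3, 5, 7]            # 3 values, capped for extrema=lowest

# highest: 2 metrics x 7 counts; lowest: 1 metric (ratings_count excluded) x 3 counts
per_author = 2 * len(RESULT_COUNTS) + 1 * len(_LOWEST_RESULT_COUNTS)
print(POOL_SIZE * per_author)  # 1377
```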

* fix(openlibrary): address PR #13 review — deterministic GT, numeric T97, strict metrics

BLOCKING fixes:
- Remove allow_unsorted_fallback=True from all 3 templates (T96/T97/T98).
  GT now strictly requires sort=editions data, matching the question text.
  If the agent doesn't visit the sorted page, GT correctly returns
  not_collected instead of silently using wrong-order results.

- Make safe_metric_value fail on missing ratings_count instead of
  defaulting to 0. Only want_to_read_count (high API coverage) defaults
  to 0 when absent. ratings_count absence raises ValueError → GT fail,
  preventing semantically wrong answers from sparse data.

- Redesign T97 (author_comparison) from binary "which author has more?"
  (50% random baseline) to numeric "what is the absolute difference?"
  (near-0% random baseline). GT returns str(abs(sum_a - sum_b)).

- Add Version 7 coordination comment for PR #14 (IDs 99-101 → Version 8).

NON-BLOCKING fixes:
- Derive ENGAGEMENT_AUTHOR_POOL from AUTHOR_POOL via exclusion set +
  additions list, eliminating 56-entry duplication and preventing drift.
  AUTHOR_POOL itself is unchanged (author_editions reproducibility).

- Remove stale allow_unsorted_fallback asymmetry comment from
  author_editions.py (all templates now consistently use strict sort).

Tests: 372 passed (118 OpenLibrary, 254 other).
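Under these changes, the helper's default becomes metric-dependent and T97's ground truth is a numeric difference. A hedged sketch: the names follow the commit, but the exact signatures and error messages are assumptions:

```python
def safe_metric_value(entry: dict, metric: str) -> float:
    """Metric-dependent handling of absent fields, per this commit:

    - want_to_read_count: absent -> 0.0 (high API coverage, omitted-as-zero)
    - ratings_count: absent -> ValueError, surfaced by callers as GT fail
    Non-null non-numeric values (e.g. 'N/A') always raise.
    """
    raw = entry.get(metric)
    if raw is None:
        if metric == "want_to_read_count":
            return 0.0
        raise ValueError(f"{metric} missing; cannot default to 0")
    try:
        return float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"non-numeric {metric}: {raw!r}")

def comparison_gt(works_a, works_b, metric):
    """T97 ground truth: absolute difference of the two authors' totals."""
    sum_a = sum(safe_metric_value(w, metric) for w in works_a)
    sum_b = sum(safe_metric_value(w, metric) for w in works_b)
    return str(abs(sum_a - sum_b))
```

Returning the difference as a string matches the commit's GT description (str(abs(sum_a - sum_b))); whether the repo formats it as an int or float is not stated here.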

* fix(openlibrary): cap ratings_count variants to low N to reduce GT-fail from sparse OL data

ratings_count is missing for 20-40% of authors at N≥7. Restrict
ratings_count variants to N∈{3,5} (T96) and N=5 (T98) where
coverage is highest, cutting estimated GT-fail exposure from
~14%/~26% to ~4%/~11%. T97 already at [3,5] — unchanged.

* test(openlibrary): verify GT computation with real OL API data

Fetch live data (March 26, 2026) for Agatha Christie, Stephen King, and Neil Gaiman via the sort=editions search API. Inject it into the GT collector and verify that all three templates (T96/T97/T98) return concrete values with both want_to_read_count and ratings_count metrics.

12 tests cover: highest/lowest extrema, cross-author numeric difference,
and threshold counting — satisfying CLAUDE.md §5 item 1.

---------

Co-authored-by: mkdev11 <[email protected]>
