fix: eliminate redundant ArXiv fetch and add page retry for transient failures#15

Open
MkDev11 wants to merge 6 commits into AffineFoundation:main from MkDev11:fix/arxiv-page-fetch-failed

Conversation


@MkDev11 MkDev11 commented Mar 25, 2026

Problem

ArXiv listing pages fail with "cached error: Page fetch failed" due to two compounding issues: (1) Playwright and aiohttp concurrently fetch the exact same URL, doubling rate-limit exposure, and (2) _fetch_page has zero retries for transient Playwright failures (timeouts, 5xx, connection resets). Additionally, ArxivClient.fetch_listing silently returns [] on HTTP 429 instead of retrying.

Root Cause

_ensure_single in cache.py runs _fetch_page (Playwright) and plugin.fetch_api_data (aiohttp) concurrently for ArXiv — but both hit arxiv.org/list/<cat>/new. Two requests to the same host within milliseconds trigger rate limiting. When Playwright fails transiently, there is no retry — the error propagates immediately as CacheFatalError.

What I changed

  • BasePlugin.extract_api_data_from_html: New optional method that lets plugins derive GT data from already-fetched browser HTML. Returns None by default (no behavior change for existing plugins).
  • ArxivPlugin.extract_api_data_from_html: Overrides the above — parses GT directly from Playwright HTML using the existing parse_listing_html, eliminating the redundant aiohttp request entirely.
  • CacheManager._fetch_page_with_retry: Wraps _fetch_page with 1 retry (3s delay) for transient failures. Permanent failures (HTTP 4xx, CAPTCHA) raise immediately without retry.
  • CacheManager._ensure_single: Restructured to call extract_api_data_from_html first, falling back to fetch_api_data only when the plugin returns None.
  • ArxivClient.fetch_listing: Retries HTTP 429 (with Retry-After support), raises APIFetchError on failure instead of silently returning [].
  • build_listing_api_data: Extracted as a reusable pure function from fetch_listing_api_data.
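The retry wrapper described above can be sketched roughly as follows. This is a minimal sketch, not the repository's actual code: the `CacheFatalError` class, the `_is_permanent` message check, and the `delay` parameter are simplified stand-ins for illustration.

```python
import asyncio

class CacheFatalError(Exception):
    """Simplified stand-in for the project's fatal cache error."""

_MAX_PAGE_RETRIES = 2    # total attempts, i.e. 1 retry
_PAGE_RETRY_DELAY = 3.0  # seconds between attempts

def _is_permanent(msg: str) -> bool:
    # Permanent failures (HTTP 4xx, CAPTCHA) should not be retried.
    return "HTTP 4" in msg or "CAPTCHA" in msg

async def fetch_page_with_retry(fetch_page, url, delay=_PAGE_RETRY_DELAY):
    """Retry `fetch_page` once on a transient CacheFatalError."""
    last_err = None
    for attempt in range(_MAX_PAGE_RETRIES):
        try:
            return await fetch_page(url)
        except CacheFatalError as err:
            if _is_permanent(str(err)):
                raise  # permanent failure: propagate immediately
            last_err = err
            if attempt + 1 < _MAX_PAGE_RETRIES:
                await asyncio.sleep(delay)
    raise last_err
```

The key property is asymmetric handling: transient errors burn a retry with a delay, while permanent errors short-circuit on the first attempt.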

Testing

pytest tests/ -q  # 309 passed

New tests (14):

  • test_extract_api_data_from_html_parses_listing_page — regression test
  • test_extract_api_data_from_html_returns_none_for_non_listing_url
  • test_extract_api_data_from_html_raises_when_no_new_papers
  • test_extract_api_data_from_html_handles_recent_url
  • test_build_listing_api_data_assigns_ranks
  • test_build_listing_api_data_raises_on_empty_list
  • test_succeeds_on_first_attempt
  • test_retries_on_transient_error
  • test_retries_on_http_5xx
  • test_does_not_retry_http_4xx
  • test_does_not_retry_captcha
  • test_raises_after_all_retries_exhausted
  • test_skips_fetch_api_data_when_html_extraction_works
  • test_falls_back_to_fetch_api_data_when_html_extraction_returns_none

Edge Cases Handled

  • Non-listing ArXiv URLs (e.g. /abs/) → extract_api_data_from_html returns None, falls back to fetch_api_data
  • Empty listing page (weekend/holiday) → APIFetchError raised, not silent empty
  • Permanent Playwright failures (HTTP 4xx, CAPTCHA) → no retry, immediate raise
  • Transient Playwright failures (timeout, 5xx) → 1 retry with delay
  • HTTP 429 from ArXiv → retry with Retry-After header support
  • Plugins without extract_api_data_from_html → default returns None, existing concurrent fetch behavior preserved
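The HTTP 429 handling above can be sketched like this. The `get` callable, the `_MAX_API_RETRIES` budget, and the `_DEFAULT_BACKOFF` fallback are illustrative assumptions, not the actual ArxivClient API.

```python
import asyncio

class APIFetchError(Exception):
    """Stand-in for the client's fetch error."""

_MAX_API_RETRIES = 3     # illustrative attempt budget
_DEFAULT_BACKOFF = 2.0   # fallback when Retry-After is absent or malformed

async def fetch_listing_with_retry(get, url):
    """`get(url)` returns (status, headers, body); retry 429 honoring Retry-After."""
    for attempt in range(_MAX_API_RETRIES):
        status, headers, body = await get(url)
        if status == 200:
            return body
        if status == 429 and attempt + 1 < _MAX_API_RETRIES:
            try:
                delay = float(headers.get("Retry-After", _DEFAULT_BACKOFF))
            except (TypeError, ValueError):
                delay = _DEFAULT_BACKOFF
            await asyncio.sleep(delay)
            continue
        # Raise instead of silently returning [] so failures surface.
        raise APIFetchError(f"HTTP {status} for {url}")
```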

MkDev11 force-pushed the fix/arxiv-page-fetch-failed branch from be6e96d to 1dd663b on March 25, 2026 at 07:37
Resolve test conflict in tests/core/test_cache.py by preserving both upstream cache-validation coverage and branch retry/HTML-extraction coverage.

@angosr angosr left a comment


Review: PR #15 — fix: eliminate redundant ArXiv fetch and add page retry

Significance Gate: PASS

This fixes a real reliability problem — ArXiv listing pages fail because Playwright and aiohttp concurrently hit the same URL, doubling rate-limit exposure. The direction is correct.

BLOCKING: Performance regression for all non-ArXiv plugins

The PR replaces the concurrent page+API fetch in _ensure_single with a sequential flow: fetch page first → try extract_api_data_from_html → fall back to fetch_api_data.

This is correct for ArXiv (same URL for both), but removes concurrency for ALL plugins. CoinGecko, Stooq, OpenLibrary, OpenMeteo, Taostats, and HackerNews all have separate API endpoints — their fetch_api_data hits different URLs than the browser page. For these plugins:

  • Before: max(page_time, api_time) ≈ 5-30s (concurrent)
  • After: page_time + api_time ≈ 6-35s (sequential)

This adds 1-5 seconds per page for 6 out of 7 plugins. Over a full evaluation run with dozens of page visits, this compounds significantly.

Fix: Preserve concurrency for plugins that don't override extract_api_data_from_html. Check the method first, then choose the fetch strategy:

if need_api:
    html, accessibility_tree = await self._fetch_page_with_retry(url, plugin)
    api_data = plugin.extract_api_data_from_html(url, html)
    if api_data is None:
        # Plugin needs a separate API call; awaiting here is still sequential.
        # Preserving concurrency requires launching the API fetch alongside
        # the page fetch, not after it:
        api_data = await plugin.fetch_api_data(url)

Or better: check if the plugin overrides extract_api_data_from_html before fetching the page, and branch:

# Compare on the class: a bound method is never `is` the base function
if need_api and type(plugin).extract_api_data_from_html is not BasePlugin.extract_api_data_from_html:
    # Plugin can derive GT from HTML — sequential is fine
    html, tree = await self._fetch_page_with_retry(url, plugin)
    api_data = plugin.extract_api_data_from_html(url, html)
else:
    # Concurrent fetch (existing behavior with retry added)
    page_task = asyncio.ensure_future(self._fetch_page_with_retry(url, plugin))
    api_task = asyncio.ensure_future(plugin.fetch_api_data(url)) if need_api else None
    html, tree = await page_task
    api_data = await api_task if api_task else None

Everything else is solid

  1. extract_api_data_from_html pattern ✅ — Clean extension point. BasePlugin returns None (no-op), ArxivPlugin parses listing HTML. Well-designed.

  2. _fetch_page_with_retry ✅ — Retries transient failures (timeout, 5xx), raises immediately on permanent failures (4xx, CAPTCHA). _MAX_PAGE_RETRIES=2 with _PAGE_RETRY_DELAY=3.0 is reasonable.

  3. fetch_listing error handling ✅ — Changing return [] to raise APIFetchError is the correct fix per the CLAUDE.md "No Fallback" rule. Silent empty returns mask errors.

  4. build_listing_api_data extraction ✅ — Clean refactoring to enable reuse between fetch_listing_api_data and extract_api_data_from_html.

  5. Test coverage ✅ — 14 new tests covering retry, HTML extraction, fallback, and edge cases. All well-structured.
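A pure `build_listing_api_data` in the spirit of the PR might look like the sketch below. The record shape and rank assignment are hypothetical, inferred only from the test names (`test_build_listing_api_data_assigns_ranks`, `test_build_listing_api_data_raises_on_empty_list`).

```python
class APIFetchError(Exception):
    """Stand-in for the client's fetch error."""

def build_listing_api_data(papers):
    """Turn parsed listing entries into ranked GT records (hypothetical shape)."""
    if not papers:
        # Empty listing (weekend/holiday) must raise, not return silently.
        raise APIFetchError("empty listing page")
    return [{"rank": i + 1, **paper} for i, paper in enumerate(papers)]
```

Keeping this a pure function of its input is what makes it reusable from both the aiohttp path and the Playwright-HTML path.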

Minor (non-blocking)

  • _fetch_page_with_retry uses string matching for permanent error detection ("HTTP 4" in msg, "CAPTCHA" in msg). This is fragile if error message format changes. Consider using error attributes or a subclass instead. Note that "HTTP 4" would also match "HTTP 429" (rate limit), which should be retryable — but in practice _fetch_page (Playwright) is unlikely to surface HTTP 429 as a CacheFatalError message, so this is a theoretical concern.
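One way to replace the string matching, sketched under the assumption that all call sites raising `CacheFatalError` can be updated: carry an explicit flag on the exception so the raiser, not a substring check, decides retryability.

```python
class CacheFatalError(Exception):
    """Variant carrying an explicit flag instead of encoding it in the message."""
    def __init__(self, message, *, permanent=False):
        super().__init__(message)
        self.permanent = permanent

def should_retry(err: CacheFatalError) -> bool:
    # The raiser marks 4xx/CAPTCHA as permanent at the raise site, so an
    # "HTTP 429" message can no longer accidentally match an "HTTP 4" check.
    return not err.permanent
```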

Required Actions

  1. Preserve concurrent page+API fetch for plugins that don't override extract_api_data_from_html. Only use sequential flow when the plugin can derive GT from HTML.
