fix: eliminate redundant ArXiv fetch and add page retry for transient failures (#15)
MkDev11 wants to merge 6 commits into AffineFoundation:main
Force-pushed from be6e96d to 1dd663b:
Resolve test conflict in tests/core/test_cache.py by preserving both upstream cache-validation coverage and branch retry/HTML-extraction coverage.
angosr left a comment
Review: PR #15 — fix: eliminate redundant ArXiv fetch and add page retry
Significance Gate: PASS
This fixes a real reliability problem — ArXiv listing pages fail because Playwright and aiohttp concurrently hit the same URL, doubling rate-limit exposure. The direction is correct.
BLOCKING: Performance regression for all non-ArXiv plugins
The PR replaces the concurrent page+API fetch in `_ensure_single` with a sequential flow: fetch the page first → try `extract_api_data_from_html` → fall back to `fetch_api_data`.
This is correct for ArXiv (same URL for both), but it removes concurrency for ALL plugins. CoinGecko, Stooq, OpenLibrary, OpenMeteo, Taostats, and HackerNews all have separate API endpoints — their `fetch_api_data` hits different URLs than the browser page. For these plugins:
- Before: `max(page_time, api_time)` ≈ 5-30s (concurrent)
- After: `page_time + api_time` ≈ 6-35s (sequential)
This adds 1-5 seconds per page for 6 out of 7 plugins. Over a full evaluation run with dozens of page visits, this compounds significantly.
Fix: Preserve concurrency for plugins that don't override `extract_api_data_from_html`. Check the method first, then choose the fetch strategy:
```python
if need_api:
    html, accessibility_tree = await self._fetch_page_with_retry(url, plugin)
    api_data = plugin.extract_api_data_from_html(url, html)
    if api_data is None:
        # Plugin needs separate API call — but this is now sequential.
        # To preserve concurrency, start API fetch BEFORE page fetch:
        api_data = await plugin.fetch_api_data(url)
```

Or better: check if the plugin overrides `extract_api_data_from_html` before fetching the page, and branch:
```python
# Compare on the class: `plugin.extract_api_data_from_html` is a bound method
# and would never be identical to the plain function on BasePlugin.
if need_api and type(plugin).extract_api_data_from_html is not BasePlugin.extract_api_data_from_html:
    # Plugin can derive GT from HTML — sequential is fine
    html, tree = await self._fetch_page_with_retry(url, plugin)
    api_data = plugin.extract_api_data_from_html(url, html)
else:
    # Concurrent fetch (existing behavior with retry added)
    page_task = asyncio.ensure_future(self._fetch_page_with_retry(url, plugin))
    api_task = asyncio.ensure_future(plugin.fetch_api_data(url)) if need_api else None
    html, tree = await page_task
    api_data = await api_task if api_task else None
```

Everything else is solid
- `extract_api_data_from_html` pattern ✅ — Clean extension point. `BasePlugin` returns `None` (no-op), `ArxivPlugin` parses listing HTML. Well-designed.
- `_fetch_page_with_retry` ✅ — Retries transient failures (timeout, 5xx), raises immediately on permanent failures (4xx, CAPTCHA). `_MAX_PAGE_RETRIES=2` with `_PAGE_RETRY_DELAY=3.0` is reasonable.
- `fetch_listing` error handling ✅ — `return []` → `raise APIFetchError` is the correct fix per the CLAUDE.md "No Fallback" rule. Silent empty returns mask errors.
- `build_listing_api_data` extraction ✅ — Clean refactoring to enable reuse between `fetch_listing_api_data` and `extract_api_data_from_html`.
- Test coverage ✅ — 14 new tests covering retry, HTML extraction, fallback, and edge cases. All well-structured.
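For reference, the retry behavior praised above can be sketched roughly as follows. This is illustrative, not the PR's actual code: the helper is shown as a free function instead of a `CacheManager` method, `CacheFatalError` is reduced to a plain exception, and a `delay` parameter is added for testability.

```python
import asyncio

_MAX_PAGE_RETRIES = 2   # total attempts, i.e. one retry
_PAGE_RETRY_DELAY = 3.0

class CacheFatalError(Exception):
    """Stand-in for the PR's page-fetch error type."""

def _looks_permanent(exc: Exception) -> bool:
    # Mirrors the PR's string matching; its fragility is noted below.
    msg = str(exc)
    return "HTTP 4" in msg or "CAPTCHA" in msg

async def fetch_page_with_retry(fetch, url: str, delay: float = _PAGE_RETRY_DELAY):
    """Retry transient failures once; raise permanent failures immediately."""
    last_exc = None
    for attempt in range(_MAX_PAGE_RETRIES):
        try:
            return await fetch(url)
        except CacheFatalError as exc:
            if _looks_permanent(exc):
                raise                       # 4xx / CAPTCHA: retrying cannot help
            last_exc = exc
            if attempt < _MAX_PAGE_RETRIES - 1:
                await asyncio.sleep(delay)
    raise last_exc
```

A transient first failure (e.g. a Playwright timeout) is swallowed once and the fetch is re-attempted after the delay; a permanent failure propagates on the first attempt.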
Minor (non-blocking)
`_fetch_page_with_retry` uses string matching for permanent error detection (`"HTTP 4" in msg`, `"CAPTCHA" in msg`). This is fragile if the error message format changes. Consider using error attributes or a subclass instead. Note that `"HTTP 4"` would also match `"HTTP 429"` (rate limit), which should be retryable — but in practice `_fetch_page` (Playwright) is unlikely to surface HTTP 429 as a `CacheFatalError` message, so this is a theoretical concern.
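A sketch of the structured alternative suggested above. The field names `status` and `captcha` are hypothetical, not the PR's actual error shape; the point is that classification reads typed attributes instead of message text, which also handles the 429 case correctly:

```python
from typing import Optional

class CacheFatalError(Exception):
    """Page-fetch failure carrying structured fields (assumed shape for illustration)."""
    def __init__(self, message: str, status: Optional[int] = None, captcha: bool = False):
        super().__init__(message)
        self.status = status
        self.captcha = captcha

def is_permanent(exc: CacheFatalError) -> bool:
    # CAPTCHA challenges never succeed on a blind retry.
    if exc.captcha:
        return True
    # 4xx is permanent EXCEPT 429 (rate limit), which is worth retrying.
    if exc.status is not None and 400 <= exc.status < 500:
        return exc.status != 429
    # 5xx, timeouts, connection resets: transient.
    return False
```

With this shape, the message format can change freely without affecting retry decisions.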
Required Actions
- Preserve concurrent page+API fetch for plugins that don't override `extract_api_data_from_html`. Only use the sequential flow when the plugin can derive GT from HTML.
Problem
ArXiv listing pages fail with "cached error: Page fetch failed" due to two compounding issues: (1) Playwright and aiohttp concurrently fetch the exact same URL, doubling rate-limit exposure, and (2) `_fetch_page` has zero retries for transient Playwright failures (timeouts, 5xx, connection resets). Additionally, `ArxivClient.fetch_listing` silently returns `[]` on HTTP 429 instead of retrying.
Root Cause
`_ensure_single` in `cache.py` runs `_fetch_page` (Playwright) and `plugin.fetch_api_data` (aiohttp) concurrently for ArXiv — but both hit `arxiv.org/list/<cat>/new`. Two requests to the same host within milliseconds trigger rate limiting. When Playwright fails transiently, there is no retry — the error propagates immediately as `CacheFatalError`.
What I changed
- `BasePlugin.extract_api_data_from_html`: New optional method that lets plugins derive GT data from already-fetched browser HTML. Returns `None` by default (no behavior change for existing plugins).
- `ArxivPlugin.extract_api_data_from_html`: Overrides the above — parses GT directly from Playwright HTML using the existing `parse_listing_html`, eliminating the redundant aiohttp request entirely.
- `CacheManager._fetch_page_with_retry`: Wraps `_fetch_page` with 1 retry (3s delay) for transient failures. Permanent failures (HTTP 4xx, CAPTCHA) raise immediately without retry.
- `CacheManager._ensure_single`: Restructured to call `extract_api_data_from_html` first, falling back to `fetch_api_data` only when the plugin returns `None`.
- `ArxivClient.fetch_listing`: Retries HTTP 429 (with Retry-After support), raises `APIFetchError` on failure instead of silently returning `[]`.
- `build_listing_api_data`: Extracted as a reusable pure function from `fetch_listing_api_data`.
Testing
New tests (14):
- `test_extract_api_data_from_html_parses_listing_page` — regression test
- `test_extract_api_data_from_html_returns_none_for_non_listing_url`
- `test_extract_api_data_from_html_raises_when_no_new_papers`
- `test_extract_api_data_from_html_handles_recent_url`
- `test_build_listing_api_data_assigns_ranks`
- `test_build_listing_api_data_raises_on_empty_list`
- `test_succeeds_on_first_attempt`
- `test_retries_on_transient_error`
- `test_retries_on_http_5xx`
- `test_does_not_retry_http_4xx`
- `test_does_not_retry_captcha`
- `test_raises_after_all_retries_exhausted`
- `test_skips_fetch_api_data_when_html_extraction_works`
- `test_falls_back_to_fetch_api_data_when_html_extraction_returns_none`
Edge Cases Handled
- Non-listing URLs (e.g. `/abs/`) → `extract_api_data_from_html` returns `None`, falls back to `fetch_api_data`
- API fetch failure → `APIFetchError` raised, not silent empty
- Plugins that don't override `extract_api_data_from_html` → default returns `None`, existing concurrent fetch behavior preserved
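For illustration, the extension point plus the override check the review asks for can be sketched together. Class and method names follow the PR; the method bodies and the `overrides_html_extraction` helper are hypothetical (the real `ArxivPlugin` delegates to `parse_listing_html`):

```python
from typing import Optional

class BasePlugin:
    # Default: plugin cannot derive GT from browser HTML;
    # the caller falls back to fetch_api_data.
    def extract_api_data_from_html(self, url: str, html: str) -> Optional[dict]:
        return None

class ArxivPlugin(BasePlugin):
    # Listing pages carry the GT in the HTML itself,
    # so no separate aiohttp request is needed.
    def extract_api_data_from_html(self, url: str, html: str) -> Optional[dict]:
        if "/list/" not in url:            # non-listing URLs (e.g. /abs/) keep the fallback path
            return None
        return {"html_length": len(html)}  # placeholder for parse_listing_html(html)

def overrides_html_extraction(plugin: BasePlugin) -> bool:
    # Compare on the class, not the instance: `plugin.extract_api_data_from_html`
    # is a freshly bound method, never identical to the function on BasePlugin.
    return type(plugin).extract_api_data_from_html is not BasePlugin.extract_api_data_from_html
```

The cache layer can then keep the concurrent fetch for every plugin where `overrides_html_extraction` is false, and use the new sequential flow only for plugins like `ArxivPlugin` that can derive GT from the page HTML.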