perf(sync): incremental journal/conference fetch via Railway archive #20
Merged
Journal- and conference-sync were re-downloading the entire corpus every run (the journal step alone took ~2.6 hrs, dominated by OpenAlex NatComms ~44k works), because the fetch never consulted what was already stored.

Make all heavy sync fetches incremental, backed by the complete uncapped archive that already lives in Railway Postgres (the committed *_db.json files are capped top-N working sets and can't be complete archives: a full conferences_db.json would exceed GitHub's 100MB file limit):

- Journals: OpenAlex `from_created_date` filter pulls only works indexed since the last run. Watermark in site/data/journal_sync_state.json (minus 7-day lookback). See _load_journal_watermark / _fetch_journal_papers.
- Conferences: settled past-year (venue, year) blocks are skipped via skip_keys threaded into every conference connector's fetch_all (openreview/pmlr/cvf/s2/acl); current calendar year always refetched. See _settled_conf_keys.
- OpenAlex concept bulk-fetch (ML/NLP/CV/IR) is incremental too; it was re-paginating entire concepts (100k+ works) uncapped each run.

Safety: incremental/skip engages only when Railway loads successfully (railway_store.load / _load_complete_archive). If Railway is down → full re-fetch, never silent data loss. A venue-name mismatch only causes a missed skip (full fetch), never a wrong skip.

After one seeding backfill, sync runs drop from hours to minutes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
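The conference skip rule described in the commit message can be sketched as follows. This is a minimal illustration, not the code in this PR: `settled_conf_keys` and `plan_fetch` are hypothetical names (the real helper is `_settled_conf_keys`, whose signature is not shown here), and representing the archive as an iterable of `(venue, year)` tuples is an assumption.

```python
from datetime import date
from typing import Iterable, List, Optional, Set, Tuple

ConfKey = Tuple[str, int]  # (venue, year)


def settled_conf_keys(
    archived: Optional[Iterable[ConfKey]],
    today: Optional[date] = None,
) -> Set[ConfKey]:
    """Return the (venue, year) blocks that are safe to skip.

    `archived` is None when the archive could not be loaded. In that case
    nothing is skipped, so the caller falls back to a full re-fetch.
    """
    if archived is None:
        return set()
    current_year = (today or date.today()).year
    # Only past years count as settled; the current year is always refetched.
    return {(venue, year) for venue, year in archived if year < current_year}


def plan_fetch(wanted: Iterable[ConfKey], skip_keys: Set[ConfKey]) -> List[ConfKey]:
    """Return the blocks that still need fetching, in their original order."""
    # Exact-match lookup: a venue spelled differently in the archive simply
    # fails to match and is fetched in full (a missed skip, not a wrong skip).
    return [key for key in wanted if key not in skip_keys]
```

Because the lookup is an exact match on the tuple, a naming mismatch can only make the sync do more work, which is the failure direction the Safety note asks for.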
Problem
Journal-sync and conference-sync re-download the entire corpus every run. The latest journal-sync run took ~2.6 hrs, with 157.9 min (99%) spent in the OpenAlex journal fetch — re-pulling the full 2022→now window across all 25 journals (NatComms alone is ~44k works). The fetch never consulted what was already stored, so dedup silently discarded ~99% of re-downloaded papers.
Fix
Make every heavy sync fetch incremental, backed by the complete uncapped archive that already lives in Railway Postgres. (The committed `*_db.json` files are capped top-N working sets, not complete archives, and a full `conferences_db.json` would exceed GitHub's 100 MB file limit, so it can't live in git.)
- Journals: the OpenAlex `from_created_date` filter pulls only works indexed since the last run. The watermark is stored in `site/data/journal_sync_state.json` (minus a 7-day lookback).
- Conferences: blocks are keyed by `(venue, year)`; settled past-year blocks are skipped via `skip_keys` threaded into every conference connector's `fetch_all` (openreview/pmlr/cvf/s2/acl). The current calendar year is always refetched.
Safety
Incremental/skip engages only when Railway loads successfully (`railway_store.load` / `_load_complete_archive`). If Railway is down → full re-fetch, never silent data loss.
Rollout
After one seeding backfill, sync runs drop from hours to minutes.
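As a rough sketch of how the journal watermark, the 7-day lookback, and the Railway fallback might fit together: the function names, the `last_sync_date` key in the state file, and the `primary_location.source.id` filter are assumptions made for illustration, and the source ID passed in is a placeholder. Only `from_created_date`, the state-file path, and the 7-day lookback come from this PR.

```python
import json
from datetime import date, timedelta
from pathlib import Path
from typing import Optional

LOOKBACK_DAYS = 7  # lookback stated in the PR description


def load_watermark(state_path: Path, archive_ok: bool) -> Optional[date]:
    """Return the date to fetch from, or None to request a full fetch."""
    # If the archive failed to load or no state exists yet, go full.
    if not archive_ok or not state_path.exists():
        return None
    state = json.loads(state_path.read_text())
    last_run = state.get("last_sync_date")  # hypothetical key name
    if not last_run:
        return None
    # Step back a few days so works indexed late are not missed.
    return date.fromisoformat(last_run) - timedelta(days=LOOKBACK_DAYS)


def build_filter(source_id: str, watermark: Optional[date]) -> str:
    """Build an OpenAlex `filter=` value for one journal."""
    parts = [f"primary_location.source.id:{source_id}"]
    if watermark is not None:
        # Incremental mode: only works created on or after the watermark.
        parts.append(f"from_created_date:{watermark.isoformat()}")
    return ",".join(parts)
```

With a stored date of 2025-03-10, `load_watermark` yields 2025-03-03; when `archive_ok` is false it yields `None`, and `build_filter` then omits `from_created_date` so the run re-fetches the full window.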
Tests
142 tests pass locally (2 pre-existing env-only collection errors, `libopenblas` and `tinydb`, are unrelated). Verified that the incremental filters emit `from_created_date`, that full mode omits it, the settled-keys logic, and the Railway dict → `Paper` round-trip.
🤖 Generated with Claude Code