perf(sync): incremental journal/conference fetch via Railway archive #20

Merged: kishormorol merged 1 commit into main from feat/incremental-sync-railway on Jun 7, 2026
@kishormorol

Owner

Problem

Journal-sync and conference-sync re-download the entire corpus every run. The latest journal-sync run took ~2.6 hrs, with 157.9 min (99%) spent in the OpenAlex journal fetch — re-pulling the full 2022→now window across all 25 journals (NatComms alone is ~44k works). The fetch never consulted what was already stored, so dedup silently discarded ~99% of re-downloaded papers.

Fix

Make every heavy sync fetch incremental, backed by the complete uncapped archive that already lives in Railway Postgres. (The committed *_db.json files are capped top-N working sets, not complete archives — and a full conferences_db.json would exceed GitHub's 100 MB file limit, so it can't live in git.)

  • Journals — OpenAlex from_created_date filter pulls only works indexed since the last run. Watermark in site/data/journal_sync_state.json (minus 7-day lookback).
  • Conferences — proceedings are immutable per (venue, year); settled past-year blocks are skipped via skip_keys threaded into every conference connector's fetch_all (openreview/pmlr/cvf/s2/acl). Current calendar year is always refetched.
  • OpenAlex concept bulk-fetch (ML/NLP/CV/IR) — was re-paginating entire concepts (100k+ works) uncapped each run; now incremental too.
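
The journal watermark described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `last_sync` state key, the function names, and the 2022-01-01 window start are assumptions made for the example. Only the `from_created_date` filter and the 7-day lookback come from the description.

```python
import json
from datetime import date, timedelta
from pathlib import Path
from typing import Optional

LOOKBACK_DAYS = 7  # lookback stated in the PR description


def load_watermark(state_path: Path) -> Optional[date]:
    """Return the last sync date minus the lookback, or None if unusable.

    Assumes a state file shaped like {"last_sync": "YYYY-MM-DD"}; the key
    name is hypothetical.
    """
    try:
        state = json.loads(state_path.read_text())
        last_sync = date.fromisoformat(state["last_sync"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing or malformed state means "no watermark": do a full fetch.
        return None
    return last_sync - timedelta(days=LOOKBACK_DAYS)


def build_filter(source_id: str, watermark: Optional[date]) -> str:
    """Build an OpenAlex works filter string for one journal."""
    parts = [
        f"primary_location.source.id:{source_id}",
        "from_publication_date:2022-01-01",  # assumed start of the sync window
    ]
    if watermark is not None:
        # Incremental mode: only works indexed on or after the watermark.
        parts.append(f"from_created_date:{watermark.isoformat()}")
    return ",".join(parts)
```

With no state file, `load_watermark` returns `None` and `build_filter` omits `from_created_date`, which matches the full-mode behaviour the tests below describe. The lookback trades a few days of redundant download for tolerance of late indexing.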

Safety

  • Incremental/skip engages only when Railway loads successfully (railway_store.load / _load_complete_archive). If Railway is down → full re-fetch, never silent data loss.
  • A venue-name mismatch can only cause a missed skip (full fetch), never a wrong skip.
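
The two safety properties above can be illustrated with a small sketch. It is written under assumptions: the archive is modelled as an iterable of dicts with `venue` and `year` fields, `None` stands for a failed Railway load, and the function names are illustrative rather than the actual `_settled_conf_keys` implementation.

```python
from datetime import date
from typing import Iterable, List, Optional, Set, Tuple

ConfKey = Tuple[str, int]  # (venue, year)


def settled_conf_keys(archive: Optional[Iterable[dict]], today: date) -> Set[ConfKey]:
    """Return the (venue, year) blocks that are safe to skip.

    archive=None models a failed archive load: nothing is skipped, so the
    caller falls back to a full re-fetch.
    """
    if archive is None:
        return set()
    return {
        (paper["venue"], paper["year"])
        for paper in archive
        if paper.get("venue")
        and isinstance(paper.get("year"), int)
        and paper["year"] < today.year  # current calendar year is never settled
    }


def blocks_to_fetch(wanted: Iterable[ConfKey], skip_keys: Set[ConfKey]) -> List[ConfKey]:
    """Keep every wanted block whose exact key is not in skip_keys."""
    return [key for key in wanted if key not in skip_keys]
```

Because skipping requires an exact key match, a venue stored under a different spelling simply fails to match and is fetched again. The cost of a mismatch is wasted time, not missing data.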

Rollout

  1. Merge.
  2. One seeding run does a full backfill (last slow run) to populate Railway + write the watermark.
  3. Every run after → incremental. Hours → minutes.

Tests

142 tests pass locally (2 pre-existing env-only collection errors, libopenblas and tinydb, are unrelated). Verified that the incremental filters emit from_created_date and that full mode omits it, along with the settled-keys logic and the Railway dict → Paper round-trip.

🤖 Generated with Claude Code

Journal- and conference-sync were re-downloading the entire corpus every
run (the journal step alone took ~2.6 hrs, dominated by OpenAlex NatComms
~44k works), because the fetch never consulted what was already stored.

Make all heavy sync fetches incremental, backed by the complete uncapped
archive that already lives in Railway Postgres (the committed *_db.json
files are capped top-N working sets and can't be complete archives — a full
conferences_db.json would exceed GitHub's 100MB file limit):

- Journals: OpenAlex `from_created_date` filter pulls only works indexed
  since the last run. Watermark in site/data/journal_sync_state.json
  (minus 7-day lookback). See _load_journal_watermark / _fetch_journal_papers.
- Conferences: settled past-year (venue, year) blocks are skipped via
  skip_keys threaded into every conference connector's fetch_all
  (openreview/pmlr/cvf/s2/acl); current calendar year always refetched.
  See _settled_conf_keys.
- OpenAlex concept bulk-fetch (ML/NLP/CV/IR) is incremental too — it was
  re-paginating entire concepts (100k+ works) uncapped each run.

Safety: incremental/skip engages only when Railway loads successfully
(railway_store.load / _load_complete_archive). If Railway is down → full
re-fetch, never silent data loss. A venue-name mismatch only causes a
missed skip (full fetch), never a wrong skip.

After one seeding backfill, sync runs drop from hours to minutes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kishormorol kishormorol merged commit c80757a into main Jun 7, 2026
1 check passed
@kishormorol kishormorol deleted the feat/incremental-sync-railway branch June 7, 2026 01:54