perf(sync): incremental journal/conference fetch via Railway archive by kishormorol · Pull Request #20 · kishormorol/ResearchScope

kishormorol · 2026-06-07T01:52:52Z

Problem

Journal-sync and conference-sync re-download the entire corpus every run. The latest journal-sync run took ~2.6 hrs, with 157.9 min (99%) spent in the OpenAlex journal fetch — re-pulling the full 2022→now window across all 25 journals (NatComms alone is ~44k works). The fetch never consulted what was already stored, so dedup silently discarded ~99% of re-downloaded papers.

Fix

Make every heavy sync fetch incremental, backed by the complete uncapped archive that already lives in Railway Postgres. (The committed *_db.json files are capped top-N working sets, not complete archives — and a full conferences_db.json would exceed GitHub's 100 MB file limit, so it can't live in git.)

- Journals: the OpenAlex from_created_date filter pulls only works indexed since the last run. The watermark lives in site/data/journal_sync_state.json (minus a 7-day lookback).
- Conferences: proceedings are immutable per (venue, year); settled past-year blocks are skipped via skip_keys threaded into every conference connector's fetch_all (openreview/pmlr/cvf/s2/acl). The current calendar year is always refetched.
- OpenAlex concept bulk-fetch (ML/NLP/CV/IR): was re-paginating entire concepts (100k+ works) uncapped each run; now incremental too.
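
A minimal sketch of the journal watermark idea, not the code in this PR. It assumes the state file stores an ISO date under a hypothetical `last_run` key and that the 2022 window maps to `from_publication_date:2022-01-01`. The helper names are illustrative and are not the PR's `_load_journal_watermark` / `_fetch_journal_papers`.

```python
import json
from datetime import date, timedelta
from pathlib import Path
from typing import Optional

LOOKBACK_DAYS = 7  # lookback named in the PR description


def read_state(path: Path) -> dict:
    """Read the sync-state JSON; a missing or malformed file yields {}."""
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def watermark_from_state(state: dict) -> Optional[date]:
    """Last-run date minus the lookback, or None to request a full fetch."""
    try:
        last_run = date.fromisoformat(state["last_run"])  # hypothetical key
    except (KeyError, TypeError, ValueError):
        return None
    return last_run - timedelta(days=LOOKBACK_DAYS)


def build_works_filter(source_id: str,
                       watermark: Optional[date],
                       window_start: str = "2022-01-01") -> str:
    """Build an OpenAlex works filter string for one journal source."""
    parts = [
        f"primary_location.source.id:{source_id}",
        f"from_publication_date:{window_start}",  # assumed window start
    ]
    if watermark is not None:
        # Incremental mode: only works indexed since the watermark.
        parts.append(f"from_created_date:{watermark.isoformat()}")
    return ",".join(parts)
```

With no usable state, `watermark_from_state` returns `None` and the filter omits `from_created_date`, which corresponds to the full-fetch mode described above.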

Safety

Incremental/skip engages only when Railway loads successfully (railway_store.load / _load_complete_archive). If Railway is down → full re-fetch, never silent data loss.
A venue-name mismatch can only cause a missed skip (full fetch), never a wrong skip.
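
The two safety properties can be sketched as follows. This is an illustration under assumptions, not the PR's `_settled_conf_keys`: the function names, the `(venue, year)` tuple shape, and the use of `None` to signal a failed archive load are all hypothetical.

```python
from datetime import date
from typing import Iterable, List, Optional, Set, Tuple

ConfKey = Tuple[str, int]  # (venue, year)


def settled_conf_keys(archived: Optional[Iterable[ConfKey]],
                      today: Optional[date] = None) -> Set[ConfKey]:
    """Keys that are safe to skip: archived blocks from past years only."""
    if archived is None:
        # Archive failed to load: skip nothing, so the run does a full fetch.
        return set()
    current_year = (today or date.today()).year
    # The current calendar year is never settled, so it is always refetched.
    return {(venue, year) for venue, year in archived if year < current_year}


def blocks_to_fetch(wanted: Iterable[ConfKey],
                    skip_keys: Set[ConfKey]) -> List[ConfKey]:
    """Keep every wanted block whose exact key is not in skip_keys."""
    return [key for key in wanted if key not in skip_keys]
```

Because skipping needs an exact key match, an archived `("NIPS", 2023)` never suppresses a wanted `("NeurIPS", 2023)`; the mismatch only costs an extra fetch.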

Rollout

1. Merge.
2. One seeding run does a full backfill (the last slow run) to populate Railway and write the watermark.
3. Every run after that is incremental: hours drop to minutes.

Tests

142 tests pass locally (2 pre-existing env-only collection errors, libopenblas and tinydb, are unrelated). Verified that the incremental filters emit from_created_date and full mode omits it, and covered the settled-keys logic and the Railway dict → Paper round-trip.

🤖 Generated with Claude Code

Journal- and conference-sync were re-downloading the entire corpus every run (the journal step alone took ~2.6 hrs, dominated by OpenAlex NatComms ~44k works), because the fetch never consulted what was already stored.

Make all heavy sync fetches incremental, backed by the complete uncapped archive that already lives in Railway Postgres (the committed *_db.json files are capped top-N working sets and can't be complete archives; a full conferences_db.json would exceed GitHub's 100MB file limit):

- Journals: OpenAlex `from_created_date` filter pulls only works indexed since the last run. Watermark in site/data/journal_sync_state.json (minus 7-day lookback). See _load_journal_watermark / _fetch_journal_papers.
- Conferences: settled past-year (venue, year) blocks are skipped via skip_keys threaded into every conference connector's fetch_all (openreview/pmlr/cvf/s2/acl); current calendar year always refetched. See _settled_conf_keys.
- OpenAlex concept bulk-fetch (ML/NLP/CV/IR) is incremental too; it was re-paginating entire concepts (100k+ works) uncapped each run.

Safety: incremental/skip engages only when Railway loads successfully (railway_store.load / _load_complete_archive). If Railway is down → full re-fetch, never silent data loss. A venue-name mismatch only causes a missed skip (full fetch), never a wrong skip.

After one seeding backfill, sync runs drop from hours to minutes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

kishormorol merged commit c80757a into main Jun 7, 2026
1 check passed

kishormorol deleted the feat/incremental-sync-railway branch June 7, 2026 01:54
