Refactor ROR reference data to use S3-backed dynamic caching #1486
Conversation
Performance Concern: Per-Lookup Hash Assembly at Scale

I have a concern about the chunked caching design given how these mappings are actually used in production.

The Problem

The current implementation assembles the entire mapping Hash on every single lookup:
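The snippet the reviewer quoted did not survive extraction; the following is a minimal sketch of the pattern being criticized, with class and key names assumed from the PR description and a plain Hash standing in for Rails.cache:

```ruby
require "json"

# Hypothetical reconstruction of the per-lookup chunk reassembly pattern.
# `cache` stands in for Rails.cache; key names are assumptions.
class ChunkedRorStore
  def initialize(cache)
    @cache = cache
  end

  # Reassembles the full mapping Hash from its cached chunks on every call,
  # then discards it after extracting a single value.
  def funder_to_ror(funder_id)
    chunk_count = @cache["ror_ref/funder_to_ror/chunk_count"].to_i
    mapping = {}
    chunk_count.times do |i|
      json = @cache["ror_ref/funder_to_ror/chunk_#{i}"]
      mapping.merge!(JSON.parse(json)) if json
    end
    mapping[funder_id] # the large Hash becomes garbage immediately
  end
end
```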
This is called from within Shoryuken workers and can run several million times. At that scale this means:
Previously, the frozen constants benefited from Ruby/Passenger's copy-on-write optimization: with 12 Passenger workers, there was effectively one shared copy of the data in physical RAM. The new approach could result in up to 12 simultaneous 7 MB Hash allocations per concurrent request wave, all being GC'd after each use.

Proposed Alternative: Value-Level Caching

Rather than caching at the chunk level and reassembling the whole Hash, I'd suggest caching at the individual key-value level. On S3 download, iterate through the Hash and write each entry as its own cache key. Lookups then go directly to Memcached with the specific key needed: no Hash assembly, no JSON parsing, no large allocations. The warm-up cost (writing potentially tens of thousands of keys) is negligible because:
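The value-level proposal above might look like the following sketch. The class and key names are assumptions, and a plain Hash stands in for Memcached:

```ruby
# Sketch of value-level caching: each mapping entry gets its own cache key,
# so a lookup is a single direct cache read. Names are hypothetical.
class ValueLevelRorStore
  NAMESPACE = "ror_ref/funder_to_ror"

  def initialize(cache)
    @cache = cache
  end

  # Warm-up: write each entry of the downloaded mapping as its own key.
  def populate(mapping)
    mapping.each { |id, ror| @cache["#{NAMESPACE}/#{id}"] = ror }
    @cache["#{NAMESPACE}/populated"] = "true"
  end

  # Lookup: no Hash assembly, no JSON parsing, no large allocations.
  def funder_to_ror(funder_id)
    @cache["#{NAMESPACE}/#{funder_id}"]
  end
end
```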
Since reindexing processes millions of DOIs, value-level caching would result in millions of cache network calls. Instead, we could lazily load the mapping once per Shoryuken process and reuse it in memory, eliminating per-DOI network overhead while keeping boot time unaffected, given our current dataset size.
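The per-process lazy load suggested here can be sketched as simple memoization; the injected callable represents the S3 download, and all names are illustrative:

```ruby
# Sketch of per-process lazy loading: each Shoryuken worker process fetches
# the mapping once and reuses it in memory for every subsequent DOI.
class ProcessLocalRorStore
  def initialize(&fetcher)
    @fetcher = fetcher # e.g. downloads and parses the mapping from S3
    @mutex = Mutex.new
  end

  def funder_to_ror(funder_id)
    mapping[funder_id]
  end

  private

  # Memoized: the fetcher runs at most once per process, not once per DOI.
  def mapping
    @mapping || @mutex.synchronize { @mapping ||= @fetcher.call }
  end
end
```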
Updating the description for the new approach
…ils cache

- Add RorReferenceStore service with chunked Memcached caching (512 KB chunks)
- Add ror:refresh_reference_cache rake task
- Update Rorable concern to use RorReferenceStore
- Remove load_ror_data.rb initializer and app/resources JSON files
- Update rorable_spec and add ror_reference_store_spec

Co-authored-by: ashwinisukale <[email protected]>
…ve nil assertions
Force-pushed 0e3fc6e to daebf99
Issue - https://github.com/datacite/product-backlog/issues/672
Purpose
The purpose of this PR is to refactor how ROR (Research Organization Registry) reference data is stored and accessed. It moves away from
loading large JSON files into memory at boot time and instead implements a dynamic caching strategy that fetches data from S3.
Approach
The implementation introduces a new service, RorReferenceStore, which manages the retrieval and caching of ROR mappings. Instead of loading the global constants FUNDER_TO_ROR and ROR_HIERARCHY from local files, the system now performs per-key lookups against the Rails cache (e.g., Redis or Memcached). If the cache is cold, it downloads the mapping files from an S3 bucket and populates the cache.

Key Modifications
- Added app/services/ror_reference_store.rb to handle S3 downloads and cache management.
- Updated app/models/concerns/rorable.rb to use the new service for ROR lookups instead of local constants.
- Removed config/initializers/load_ror_data.rb and references to local JSON files in app/resources/ (assumed, as the initializer was removed).
- Added lib/tasks/ror_reference_cache.rake to allow manual or scheduled refreshing of the cache from S3.
- Added specs for RorReferenceStore and updated rorable_spec.rb to mock the new service.

Important Technical Details
- Values are cached under namespaced per-key entries (e.g., ror_ref/funder_to_ror/<id>). This prevents high memory overhead and "fat" cache entries.
- A populated key suffix is used to determine whether a cache miss is due to a missing value or a cold cache that needs refreshing from S3.
- Requires the ROR_ANALYSIS_S3_BUCKET environment variable to be set.
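The populated sentinel described above can be sketched as follows; this is a minimal illustration assuming a Rails.cache-like store, with the refresh hook and key names hypothetical:

```ruby
# Sketch of cold-cache detection via a "populated" sentinel key.
# `cache` stands in for Rails.cache; the refresher represents the S3
# download (from the bucket named by ROR_ANALYSIS_S3_BUCKET).
class SentinelRorStore
  NAMESPACE = "ror_ref/funder_to_ror"

  def initialize(cache, &refresher)
    @cache = cache
    @refresher = refresher
  end

  def funder_to_ror(funder_id)
    value = @cache["#{NAMESPACE}/#{funder_id}"]
    return value if value
    # A miss with the sentinel present means the id genuinely has no
    # mapping; a miss without it means the cache is cold.
    return nil if @cache["#{NAMESPACE}/populated"]
    @refresher.call
    @cache["#{NAMESPACE}/#{funder_id}"]
  end
end
```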