Skip to content

Refactor ROR reference data to use S3-backed dynamic caching#1486

Open
Copilot wants to merge 9 commits intomasterfrom
copilot/switch-ror-mapping-to-s3-cache
Open

Refactor ROR reference data to use S3-backed dynamic caching#1486
Copilot wants to merge 9 commits intomasterfrom
copilot/switch-ror-mapping-to-s3-cache

Conversation

Copy link

Copilot AI commented Feb 24, 2026

Issue - https://github.com/datacite/product-backlog/issues/672

Purpose

The purpose of this PR is to refactor how ROR (Research Organization Registry) reference data is stored and accessed. It moves away from
loading large JSON files into memory at boot time and instead implements a dynamic caching strategy that fetches data from S3.

Approach

The implementation introduces a new service, RorReferenceStore, which manages the retrieval and caching of ROR mappings. Instead of loading global constants FUNDER_TO_ROR and ROR_HIERARCHY from local files, the system now performs per-key lookups against the Rails cache (e.g., Redis or Memcached). If the cache is cold, it downloads the mapping files from an S3 bucket and populates the cache.

Key Modifications

  • New Service: Created app/services/ror_reference_store.rb to handle S3 downloads and cache management.
  • Model Concern: Updated app/models/concerns/rorable.rb to use the new service for ROR lookups instead of local constants.
  • Removed Static Assets: Deleted config/initializers/load_ror_data.rb and references to local JSON files in app/resources/ (assumed, as the initializer was removed).
  • Rake Task: Added lib/tasks/ror_reference_cache.rake to allow manual or scheduled refreshing of the cache from S3.
  • Testing: Added comprehensive specs for RorReferenceStore and updated rorable_spec.rb to mock the new service.

Important Technical Details

  • Granular Caching: Instead of caching the entire JSON hash as one blob, keys are stored individually (e.g., ror_ref/funder_to_ror/<id>). This prevents high memory overhead and "Fat" cache entries.
  • Sentinel Key: Uses a populated key suffix to determine if a cache miss is due to a missing value or a cold cache that needs refreshing from S3.
  • TTL: Cached items have a default expiration of 31 days.
  • S3 Integration: Requires the ROR_ANALYSIS_S3_BUCKET environment variable to be set.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Reviewer, please remember our guidelines:

  • Be humble in the language and feedback you give, ask don't tell.
  • Consider using positive language as opposed to neutral when offering feedback. This is to avoid the negative bias that can occur with neutral language appearing negative.
  • Offer suggestions on how to improve code e.g. simplification or expanding clarity.
  • Ensure you give reasons for the changes you are proposing.

Copilot AI changed the title [WIP] Switch ROR mapping files to S3-backed caching Switch ROR mapping files from local resources + boot-time constants to S3-backed chunked Rails cache Feb 24, 2026
Copilot AI requested a review from ashwinisukale February 24, 2026 11:20
@jrhoads
Copy link
Contributor

jrhoads commented Feb 24, 2026

Performance Concern: Per-Lookup Hash Assembly at Scale

I have a concern about the chunked caching design given how these mappings are actually used in production.

The Problem

The current implementation assembles the entire mapping Hash on every single lookup:

  1. Reads N chunks from Memcached
  2. Joins them into a full JSON string
  3. Parses the full JSON into a Ruby Hash (up to 7 MB for ror_to_countries)
  4. Returns the Hash to the caller, who extracts one value
  5. The Hash is then garbage collected

This is called from within Shuriken workers and can run several million times. At that scale this means:

  • Millions of multi-chunk Memcached reads
  • Millions of large Hash allocations (up to 7 MB each)
  • Millions of full JSON parse operations
  • Constant GC pressure from large short-lived objects

Previously, the frozen constants benefited from Ruby/Passenger's copy-on-write optimization — with 12 Passenger workers, there was effectively one shared copy of the data in physical RAM. The new approach could result in up to 12 simultaneous 7 MB Hash allocations per concurrent request wave, all being GC'd after each use.

Proposed Alternative: Value-Level Caching

Rather than caching at the chunk level and reassembling the whole Hash, I'd suggest caching at the individual key-value level. On S3 download, iterate through the Hash and write each entry as its own cache key. Lookups then go directly to Memcached with the specific key needed — no Hash assembly, no JSON parsing, no large allocations.

The warm-up cost (writing potentially tens of thousands of keys) is negligible because:

  • These mappings are only regenerated once a month by the sample-ror-analysis workflow
  • Warm-up happens via the rake task, not during live requests
  • Memcached can handle hundreds of thousands of writes per second

@ashwinisukale
Copy link
Contributor

Since reindexing processes millions of DOIs, value-level caching would result in millions of cache network calls. Instead, we could lazily load the mapping once per Shoryuken process and reuse it in memory, eliminating per-DOI network overhead while keeping boot-time unaffected. Given our current dataset size.

@jrhoads jrhoads changed the title Switch ROR mapping files from local resources + boot-time constants to S3-backed chunked Rails cache Refactor ROR reference data to use S3-backed dynamic caching Feb 25, 2026
@jrhoads
Copy link
Contributor

jrhoads commented Feb 25, 2026

Updating the description for the new approach

Copilot AI and others added 9 commits February 26, 2026 13:45
…ils cache

- Add RorReferenceStore service with chunked Memcached caching (512 KB chunks)
- Add ror:refresh_reference_cache rake task
- Update Rorable concern to use RorReferenceStore
- Remove load_ror_data.rb initializer and app/resources JSON files
- Update rorable_spec and add ror_reference_store_spec

Co-authored-by: ashwinisukale <[email protected]>
@jrhoads jrhoads force-pushed the copilot/switch-ror-mapping-to-s3-cache branch from 0e3fc6e to daebf99 Compare February 26, 2026 12:47
@jrhoads jrhoads marked this pull request as ready for review February 26, 2026 13:01
@jrhoads jrhoads requested a review from a team February 26, 2026 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants