Refactor ROR reference data to use S3-backed dynamic caching #1486
Conversation
Performance Concern: Per-Lookup Hash Assembly at Scale

I have a concern about the chunked caching design given how these mappings are actually used in production.

The Problem

The current implementation assembles the entire mapping Hash on every single lookup:
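The snippet the reviewer quoted did not survive extraction; the following is a minimal sketch of the pattern being criticized, with class and key names assumed from the PR description and a plain Hash standing in for Rails.cache:

```ruby
require "json"

# Hypothetical reconstruction of the per-lookup chunk reassembly pattern.
# `cache` stands in for Rails.cache; key names are assumptions.
class ChunkedRorStore
  def initialize(cache)
    @cache = cache
  end

  # Reassembles the full mapping Hash from its cached chunks on every call,
  # then discards it after extracting a single value.
  def funder_to_ror(funder_id)
    chunk_count = @cache["ror_ref/funder_to_ror/chunk_count"].to_i
    mapping = {}
    chunk_count.times do |i|
      json = @cache["ror_ref/funder_to_ror/chunk_#{i}"]
      mapping.merge!(JSON.parse(json)) if json
    end
    mapping[funder_id] # the large Hash becomes garbage immediately
  end
end
```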
This is called from within Shoryuken workers and can run several million times. At that scale this means:
Previously, the frozen constants benefited from Ruby/Passenger's copy-on-write optimization: with 12 Passenger workers, there was effectively one shared copy of the data in physical RAM. The new approach could result in up to 12 simultaneous 7 MB Hash allocations per concurrent request wave, all being GC'd after each use.

Proposed Alternative: Value-Level Caching

Rather than caching at the chunk level and reassembling the whole Hash, I'd suggest caching at the individual key-value level. On S3 download, iterate through the Hash and write each entry as its own cache key. Lookups then go directly to Memcached with the specific key needed: no Hash assembly, no JSON parsing, no large allocations. The warm-up cost (writing potentially tens of thousands of keys) is negligible because:
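The value-level proposal above might look like the following sketch. The class and key names are assumptions, and a plain Hash stands in for Memcached:

```ruby
# Sketch of value-level caching: each mapping entry gets its own cache key,
# so a lookup is a single direct cache read. Names are hypothetical.
class ValueLevelRorStore
  NAMESPACE = "ror_ref/funder_to_ror"

  def initialize(cache)
    @cache = cache
  end

  # Warm-up: write each entry of the downloaded mapping as its own key.
  def populate(mapping)
    mapping.each { |id, ror| @cache["#{NAMESPACE}/#{id}"] = ror }
    @cache["#{NAMESPACE}/populated"] = "true"
  end

  # Lookup: no Hash assembly, no JSON parsing, no large allocations.
  def funder_to_ror(funder_id)
    @cache["#{NAMESPACE}/#{funder_id}"]
  end
end
```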
Since reindexing processes millions of DOIs, value-level caching would result in millions of cache network calls. Instead, we could lazily load the mapping once per Shoryuken process and reuse it in memory, eliminating per-DOI network overhead while keeping boot time unaffected, given our current dataset size.
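The per-process lazy load suggested here can be sketched as simple memoization; the injected callable represents the S3 download, and all names are illustrative:

```ruby
# Sketch of per-process lazy loading: each Shoryuken worker process fetches
# the mapping once and reuses it in memory for every subsequent DOI.
class ProcessLocalRorStore
  def initialize(&fetcher)
    @fetcher = fetcher # e.g. downloads and parses the mapping from S3
    @mutex = Mutex.new
  end

  def funder_to_ror(funder_id)
    mapping[funder_id]
  end

  private

  # Memoized: the fetcher runs at most once per process, not once per DOI.
  def mapping
    @mapping || @mutex.synchronize { @mapping ||= @fetcher.call }
  end
end
```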
Updating the description for the new approach
…ils cache

- Add RorReferenceStore service with chunked Memcached caching (512 KB chunks)
- Add ror:refresh_reference_cache rake task
- Update Rorable concern to use RorReferenceStore
- Remove load_ror_data.rb initializer and app/resources JSON files
- Update rorable_spec and add ror_reference_store_spec

Co-authored-by: ashwinisukale <[email protected]>
…ve nil assertions
Force-pushed 0e3fc6e to daebf99
Issue - https://github.com/datacite/product-backlog/issues/672
Purpose
The purpose of this PR is to refactor how ROR (Research Organization Registry) reference data is stored and accessed. It moves away from
loading large JSON files into memory at boot time and instead implements a dynamic caching strategy that fetches data from S3.
Approach
The implementation introduces a new service, RorReferenceStore, which manages the retrieval and caching of ROR mappings. Instead of loading the global constants FUNDER_TO_ROR and ROR_HIERARCHY from local files, the system now performs per-key lookups against the Rails cache (e.g., Redis or Memcached). If the cache is cold, it downloads the mapping files from an S3 bucket and populates the cache.

Key Modifications
- Added app/services/ror_reference_store.rb to handle S3 downloads and cache management.
- Updated app/models/concerns/rorable.rb to use the new service for ROR lookups instead of local constants.
- Removed config/initializers/load_ror_data.rb and references to local JSON files in app/resources/ (assumed, as the initializer was removed).
- Added lib/tasks/ror_reference_cache.rake to allow manual or scheduled refreshing of the cache from S3.
- Added specs for RorReferenceStore and updated rorable_spec.rb to mock the new service.

Important Technical Details
- Values are cached under namespaced per-key entries (e.g., ror_ref/funder_to_ror/<id>). This prevents high memory overhead and "fat" cache entries.
- A populated key suffix is used to determine whether a cache miss is due to a missing value or a cold cache that needs refreshing from S3.
- Requires the ROR_ANALYSIS_S3_BUCKET environment variable to be set.
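The populated sentinel described above can be sketched as follows; this is a minimal illustration assuming a Rails.cache-like store, with the refresh hook and key names hypothetical:

```ruby
# Sketch of cold-cache detection via a "populated" sentinel key.
# `cache` stands in for Rails.cache; the refresher represents the S3
# download (from the bucket named by ROR_ANALYSIS_S3_BUCKET).
class SentinelRorStore
  NAMESPACE = "ror_ref/funder_to_ror"

  def initialize(cache, &refresher)
    @cache = cache
    @refresher = refresher
  end

  def funder_to_ror(funder_id)
    value = @cache["#{NAMESPACE}/#{funder_id}"]
    return value if value
    # A miss with the sentinel present means the id genuinely has no
    # mapping; a miss without it means the cache is cold.
    return nil if @cache["#{NAMESPACE}/populated"]
    @refresher.call
    @cache["#{NAMESPACE}/#{funder_id}"]
  end
end
```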