Reduce LRU image cache's memory footprint #46

Open
nothingface0 wants to merge 2 commits into dev128 from refactor_image_lru_cache
@nothingface0
Contributor

nothingface0 commented Dec 9, 2025

While investigating #41, we found that the high memory usage stems from the lruimg cache kept by the server, which holds the last 1024 rendered images in memory.

Among other things, this cache stores the databytes of the histogram used to render each image. As far as we can tell, these are neither strictly needed nor reused, unlike the pngbytes, which hold the actual rendered image. This PR changes the logic to clear the databytes of cached images completely, in order to reduce overall memory usage.
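
The idea can be illustrated with a minimal Python sketch. This is not the server's actual code: the `CachedImage` and `ImageLRU` classes and the `spec` key are hypothetical, and only the `databytes` and `pngbytes` field names are taken from the description above.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class CachedImage:
    """Hypothetical cache entry; field names mirror the PR description."""
    spec: str          # render request identifier (illustrative)
    databytes: bytes   # serialized histogram used as render input
    pngbytes: bytes    # rendered image, the only payload served on a hit


class ImageLRU:
    """Tiny LRU keyed by spec; a sketch, not the real lruimg cache."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._entries: "OrderedDict[str, CachedImage]" = OrderedDict()

    def put(self, image: CachedImage) -> None:
        # Drop the render input before caching: only pngbytes is reused.
        image.databytes = b""
        self._entries[image.spec] = image
        self._entries.move_to_end(image.spec)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, spec: str):
        entry = self._entries.get(spec)
        if entry is not None:
            self._entries.move_to_end(spec)  # mark as most recently used
        return entry

    def payload_bytes(self) -> int:
        return sum(len(e.databytes) + len(e.pngbytes)
                   for e in self._entries.values())


cache = ImageLRU(capacity=2)
cache.put(CachedImage("plotA", databytes=b"x" * 1000, pngbytes=b"p" * 10))
cache.put(CachedImage("plotB", databytes=b"y" * 1000, pngbytes=b"q" * 10))
print(cache.payload_bytes())  # 20: only the PNG payloads remain
```

In this sketch the per-entry footprint is bounded by the size of the rendered PNG rather than by the size of the input histogram.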

Warning

Note that the changes in this PR also affect how the cache hit check is done: databytes are no longer considered during this check, since they will be empty. In general, this should not affect cache performance. It does, however, affect correctness in the following corner case:
1. A histogram is rendered and stored in the LRU cache.
2. The same histogram is updated in the DQMGUI's index, e.g., by uploading a newer file containing the same histogram. At this point, the cache still contains the previous version of the histogram.
3. The user refreshes the page containing the histogram. The cache checks whether the histogram is already rendered without considering the updated databytes of the new version. This gives a false cache hit, and the cached, older version of the histogram is returned.
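
The three steps above can be reproduced in a small Python sketch. The `Entry` class, the `spec` key, the `render` stand-in and both `is_hit_*` functions are hypothetical; they only assume that the previous check compared databytes and the new one cannot.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    spec: str          # render request identifier (illustrative)
    databytes: bytes   # histogram data stored with the entry
    pngbytes: bytes    # rendered image


def render(databytes: bytes) -> bytes:
    # Stand-in for the real renderer: the "image" just embeds its input.
    return b"PNG:" + databytes


def is_hit_old(entry: Entry, spec: str, databytes: bytes) -> bool:
    # Assumed pre-PR check: request and histogram data must both match.
    return entry.spec == spec and entry.databytes == databytes


def is_hit_new(entry: Entry, spec: str, databytes: bytes) -> bool:
    # Assumed post-PR check: cached databytes are empty, so only the
    # request is compared and the incoming data is ignored.
    return entry.spec == spec


v1, v2 = b"histogram-v1", b"histogram-v2"

# Step 1: version 1 is rendered and cached (with and without databytes).
old_entry = Entry("run1/plotA", v1, render(v1))
new_entry = Entry("run1/plotA", b"", render(v1))

# Steps 2 and 3: the index now holds version 2 and the page is refreshed.
stale_detected = not is_hit_old(old_entry, "run1/plotA", v2)
false_hit = is_hit_new(new_entry, "run1/plotA", v2)
print(stale_detected, false_hit)  # True True
```

With the databytes comparison the update is noticed and the image would be re-rendered; without it, the entry rendered from version 1 is served again.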

Tests were done with:

  • An exceptionally large ROOT histogram which, when unpacked, reaches ~960 MB in memory.
  • DQMNet's MESSAGE_SIZE_LIMIT increased to 1 GB to accommodate the above histogram.
  • Compilation with -O0 (no optimization).
  • valgrind --tool=massif --num-callers=500 --error-limit=no --trace-children=yes --suppressions=/data/srv/root/v6-28-10/etc/valgrind-root.supp --suppressions=/data/srv/root/v6-28-10/etc/valgrind-root-python.supp --suppressions=/usr/libexec/valgrind/default.supp --vgdb=yes

Further optimizations (TODO?):

  • Avoid loading the data from the index unless the histogram actually needs to be rendered. This would prevent unnecessary memory usage spikes when the histograms are already in the cache.
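
A minimal Python sketch of this TODO, assuming the cache can be queried before any data is read: the `IndexStub` class, `render` stand-in, `get_image` function and `spec` key are all hypothetical and are not part of this PR.

```python
from typing import Dict


class IndexStub:
    """Hypothetical stand-in for the index; counts how often it is read."""

    def __init__(self) -> None:
        self.reads = 0

    def load(self, spec: str) -> bytes:
        self.reads += 1
        return f"data-for-{spec}".encode()


def render(databytes: bytes) -> bytes:
    # Stand-in for the real renderer.
    return b"PNG:" + databytes


def get_image(spec: str, index: IndexStub, png_cache: Dict[str, bytes]) -> bytes:
    # Check the cache first, so a hit never touches the index.
    cached = png_cache.get(spec)
    if cached is not None:
        return cached
    # Miss: only now is the (potentially huge) histogram loaded.
    png = render(index.load(spec))
    png_cache[spec] = png
    return png


index = IndexStub()
png_cache: Dict[str, bytes] = {}
first = get_image("run1/plotA", index, png_cache)
second = get_image("run1/plotA", index, png_cache)
print(index.reads)  # 1: the second request is served from the cache
```

Note that a lookup of this kind cannot compare against the new data at all, so it would be subject to the same stale-entry corner case described in the warning above.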

Performance

Baseline memory usage

[image: baseline memory usage]

With the PR

[image: memory usage with the PR]

To avoid copying data unnecessarily altogether
@gabrielmscampos
Member

Will this also fix the issue with the offline/relval DQMGUI flavour going down from time to time due to high memory usage?

@nothingface0
Contributor Author

> Will this also fix the issue with the offline/relval DQMGUI flavour going down from time to time due to high memory usage?

Probably not. The cache, as it is in production, should not, in principle, exceed 8 GB.

@gabrielmscampos
Member

So what exactly is this fixing? The HGCAL issue when loading multiple TH2Poly objects in one page? I'm a little bit worried about this:

> One corner case scenario that will affect its correctness [...] and returns the cached, older version of the histogram.

This is unlikely to be a big deal in offline/relval, but in Online the shifters are constantly looking for new plots. Could you elaborate more on this corner case?
