Make cache configuration available from cpp api. #946

Closed
@mgautierfr

Description

After discussion with @kelson42 and @rgaudin about caching issues, it appears that:

  • We have issues with cache memory usage
  • Some small/fast reads are slower than they should be, because the fast lookup array has to be populated first.
  • Different use cases imply different cache strategies and we should handle that.

As a reminder, here are the cache strategies currently in use:

  • Cluster cache: Store a number of compressed clusters in memory to avoid decompressing the same cluster too many times.
    This number is controlled by the env var ZIM_CLUSTERCACHE.
    The memory used by this cache is not obvious, as we do partial decompression. So, on top of the decompressed data, we also store the decompressor stream/context, which itself holds some data.
  • Dirent cache: Store a number of parsed dirents in memory to avoid parsing them too many times.
    This number is controlled by the env var ZIM_DIRENTCACHE.
    The memory used by this cache is not really known (mainly because each dirent has a variable size, depending on url/title size) but can be "easily" calculated at runtime.
  • Fast lookup cache: Store a fixed number of dirents, evenly spread, to define the subranges in which to search.
    This number of ranges is controlled by the env var ZIM_DIRENTLOOKUPCACHE.
    The question of memory usage is the same as for the dirent cache, but less important.
    Contrary to the other caches, this cache has a fixed size and is fully populated at first access.
    This cache speeds up subsequent reads, but if very few reads follow, populating the cache may slow down the whole process.
    The default value for ZIM_DIRENTLOOKUPCACHE being 1024, populating the cache costs 1024 dirent reads, so it has to save at least as many reads afterwards to pay off. This is almost impossible when doing only a few reads (such as getting the metadata from the zim file).
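
To make the cost trade-off of the fast lookup cache concrete, here is a minimal sketch of the idea. It is not libzim's actual implementation: `DirentSource` and `FastLookup` are hypothetical names, and a sorted vector of strings stands in for the on-disk dirent list. Every call to `read()` counts as one dirent read, so the up-front cost of populating the samples can be observed directly.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the dirent list: keys sorted in ascending order.
// Each read() is counted, to mimic an expensive dirent read.
struct DirentSource {
    std::vector<std::string> sortedKeys;
    mutable std::size_t reads = 0;

    std::size_t size() const { return sortedKeys.size(); }
    const std::string& read(std::size_t i) const { ++reads; return sortedKeys[i]; }
};

// Hypothetical fast lookup: samples evenly spread keys once, then uses them
// to narrow the index range in which a key has to be searched.
class FastLookup {
public:
    FastLookup(const DirentSource& src, std::size_t rangeCount) : src_(src) {
        const std::size_t n = src.size();
        if (rangeCount < 2 || n < 2) return;  // "deactivated": no up-front reads
        const std::size_t count = std::min(rangeCount, n);
        for (std::size_t k = 0; k < count; ++k) {
            const std::size_t idx = k * (n - 1) / (count - 1);
            samples_.emplace_back(src.read(idx), idx);
        }
    }

    // Half-open index range [first, second) that contains `key` if it exists.
    std::pair<std::size_t, std::size_t> range(const std::string& key) const {
        if (samples_.empty()) return {0, src_.size()};
        // First sample whose key is strictly greater than `key`.
        auto it = std::upper_bound(
            samples_.begin(), samples_.end(), key,
            [](const std::string& k, const Sample& s) { return k < s.first; });
        const std::size_t begin = (it == samples_.begin()) ? 0 : std::prev(it)->second;
        const std::size_t end = (it == samples_.end()) ? src_.size() : it->second + 1;
        return {begin, end};
    }

    // Binary search restricted to the narrowed range.
    // Returns the index of `key`, or src.size() when it is absent.
    std::size_t find(const std::string& key) const {
        const auto r = range(key);
        std::size_t lo = r.first, hi = r.second;
        while (lo < hi) {
            const std::size_t mid = lo + (hi - lo) / 2;
            if (src_.read(mid) < key) lo = mid + 1; else hi = mid;
        }
        return (lo < src_.size() && src_.read(lo) == key) ? lo : src_.size();
    }

private:
    using Sample = std::pair<std::string, std::size_t>;
    const DirentSource& src_;
    std::vector<Sample> samples_;
};
```

With 1024 ranges, constructing `FastLookup` performs 1024 reads before the first search can start, whereas a plain binary search over a few thousand entries needs only about a dozen reads per lookup. For a handful of lookups the sampling never pays for itself, which is the situation described above.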

Proposal:

  • Introduce a CacheConfig structure which contains information about the cache strategy (for now, the size of the different caches).
  • Extend the libzim API to accept this cache config at zim reader creation and use it instead of env vars.
  • Make FastDirentLookup "deactivate" itself if the value is zero or one.
  • Keep a compatible API which uses a cache config with the same values as the env var defaults (otherwise, we would have to do a major version).
  • Provide some default configs to help users (classic cache, no cache, ...).
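
As a rough illustration of the points above, the structure and the compatible constructor could look like the sketch below. All names here (`CacheConfig` fields, `classic()`, `noCache()`, `ArchiveReader`) are assumptions made for the example, not the final libzim API. Only the value 1024 for the lookup cache comes from this issue; the other two defaults are placeholders.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical cache configuration: for now, only the size of each cache.
struct CacheConfig {
    std::size_t clusterCacheSize;       // number of clusters kept in memory
    std::size_t direntCacheSize;        // number of parsed dirents kept in memory
    std::size_t direntLookupCacheSize;  // number of ranges for the fast lookup

    // Preset mirroring the env var defaults (16 and 512 are placeholders).
    static CacheConfig classic() { return {16, 512, 1024}; }

    // Preset for short-lived readers that only do a few reads.
    static CacheConfig noCache() { return {0, 0, 0}; }

    // The fast lookup "deactivates" itself when the value is zero or one.
    bool fastLookupEnabled() const { return direntLookupCacheSize > 1; }
};

// Hypothetical reader class; it does not open any file in this sketch.
class ArchiveReader {
public:
    // Compatible constructor: behaves as before by using the classic preset.
    explicit ArchiveReader(std::string path)
        : ArchiveReader(std::move(path), CacheConfig::classic()) {}

    // New constructor: the caller chooses the cache strategy.
    ArchiveReader(std::string path, CacheConfig config)
        : path_(std::move(path)), config_(config) {}

    const CacheConfig& cacheConfig() const { return config_; }

private:
    std::string path_;
    CacheConfig config_;
};
```

A tool that only reads the metadata could then construct its reader with `CacheConfig::noCache()`, while existing callers keep using the one-argument constructor unchanged.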

If a tool (zim-tools, kiwix-desktop, ...) still wants to use env vars to control the caches, this should be implemented there.
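
For a tool that wants to keep the env vars, the translation could be done on the tool side with something like the following sketch. The `CacheConfig` layout and the helper names `sizeFromEnv` and `cacheConfigFromEnv` are hypothetical; only the three env var names come from this issue. Unset or unparsable values fall back to the defaults passed in by the caller.

```cpp
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>

// Hypothetical config layout, assumed for this example.
struct CacheConfig {
    std::size_t clusterCacheSize;
    std::size_t direntCacheSize;
    std::size_t direntLookupCacheSize;
};

// Parses an env var as a non-negative integer.
// Returns `fallback` when the variable is unset, empty, or not a plain number.
std::size_t sizeFromEnv(const char* name, std::size_t fallback) {
    const char* raw = std::getenv(name);
    if (raw == nullptr || !std::isdigit(static_cast<unsigned char>(raw[0]))) {
        return fallback;
    }
    char* end = nullptr;
    errno = 0;
    const unsigned long long value = std::strtoull(raw, &end, 10);
    if (errno != 0 || *end != '\0') return fallback;  // overflow or trailing junk
    return static_cast<std::size_t>(value);
}

// Builds a config from the environment, starting from caller-provided defaults.
CacheConfig cacheConfigFromEnv(const CacheConfig& defaults) {
    return {
        sizeFromEnv("ZIM_CLUSTERCACHE", defaults.clusterCacheSize),
        sizeFromEnv("ZIM_DIRENTCACHE", defaults.direntCacheSize),
        sizeFromEnv("ZIM_DIRENTLOOKUPCACHE", defaults.direntLookupCacheSize),
    };
}
```

Keeping this helper in the tool rather than in the library means the library itself no longer depends on the process environment.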

Tools would have to be adapted to use this new feature. As we keep a compatible API, there is no need to adapt them right now, so they will not be adapted in this round.

Limiting the cache memory size is a much more complex matter, as we would need to make the cache global to all opened zims (so loading a dirent/cluster in one zim may imply dropping a cached dirent/cluster of another zim). I consider this out of scope for this issue.

Testing:

Automatic testing is a bit complex here. We would have to mock the cache system to get information about what was cached or not and test that.
I will simply test the functional part and check that we get the same results whatever the cache config is.
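
The shape of such a functional check can be sketched on a toy example. `ToyReader` below is a hypothetical stand-in with a small LRU cache, not libzim code; the point is only that the same read pattern is replayed under several cache sizes and the results are compared against the uncached run.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy reader: entry i has the content "entry-<i>", built on demand and kept
// in an LRU cache of configurable capacity (0 disables caching).
class ToyReader {
public:
    explicit ToyReader(std::size_t cacheCapacity) : capacity_(cacheCapacity) {}

    std::string get(std::size_t index) {
        auto hit = lookup_.find(index);
        if (hit != lookup_.end()) {
            lru_.splice(lru_.begin(), lru_, hit->second);  // mark as most recent
            return hit->second->second;
        }
        std::string value = "entry-" + std::to_string(index);
        if (capacity_ == 0) return value;
        if (lru_.size() == capacity_) {  // evict the least recently used entry
            lookup_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(index, value);
        lookup_[index] = lru_.begin();
        return value;
    }

private:
    using Item = std::pair<std::size_t, std::string>;
    std::size_t capacity_;
    std::list<Item> lru_;
    std::unordered_map<std::size_t, std::list<Item>::iterator> lookup_;
};

// Replays a fixed read pattern (with repeats, to exercise hits and evictions).
std::vector<std::string> readPattern(std::size_t cacheCapacity) {
    const std::vector<std::size_t> pattern = {0, 1, 2, 0, 3, 1, 4, 0, 2, 2};
    ToyReader reader(cacheCapacity);
    std::vector<std::string> results;
    for (std::size_t index : pattern) results.push_back(reader.get(index));
    return results;
}

// The functional check: every cache size must give the uncached results.
bool sameResultsForAllConfigs() {
    const std::vector<std::size_t> capacities = {1, 2, 16};
    const std::vector<std::string> reference = readPattern(0);
    for (std::size_t capacity : capacities) {
        if (readPattern(capacity) != reference) return false;
    }
    return true;
}
```

This only verifies that caching does not change what is returned; it says nothing about what was actually cached, which is the part that would require mocking.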
