
Make cache configuration available from cpp api. #946

Open
mgautierfr opened this issue Jan 16, 2025 · 8 comments

@mgautierfr
Collaborator

After discussion with @kelson42 and @rgaudin about caching issues, it appears that:

  • We have issues with cache memory usage.
  • Some reads that should be small and fast are slowed down because we need to populate the fast lookup array first.
  • Different use cases call for different cache strategies, and we should support that.

As a reminder, here are the cache strategies currently in use:

  • Cluster cache: stores a number of compressed clusters in memory to avoid decompressing the same cluster too many times.
    This number is controlled by the env var ZIM_CLUSTERCACHE.
    The memory used by this cache is not obvious, as we do partial decompression: on top of the decompressed data, we also store the decompressor stream/context, which itself stores some data.
  • Dirent cache: stores a number of parsed dirents in memory to avoid parsing them too many times.
    This number is controlled by the env var ZIM_DIRENTCACHE.
    The memory used by this cache is not really known (mainly because each dirent has a variable size due to its url/title), but it can be "easily" calculated at runtime.
  • Fast lookup cache: stores a fixed number of dirents, evenly spread, to define subranges in which to search.
    The number of ranges is controlled by the env var ZIM_DIRENTLOOKUPCACHE.
    The memory usage question is the same as for the dirent cache, but less important.
    Contrary to the other caches, this cache has a fixed size and is fully populated at first access.
    This cache speeds up subsequent reads, but if there are only a few reads afterwards, populating it may slow down the whole process.
    The default value of ZIM_DIRENTLOOKUPCACHE being 1024, the cache has to save us 1024 dirent reads before it pays off. That is almost impossible when doing only a few reads (e.g. only getting the metadata from the zim file).
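For illustration, the fast-lookup idea can be sketched as follows. All names here are hypothetical, not libzim's actual implementation; it only shows how an evenly spread sample of a sorted dirent table narrows the range a full search has to scan, and why populating the sample is an upfront cost:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the "fast lookup" cache: keep N evenly spaced
// keys from a sorted dirent table and use them to narrow the subrange
// that a full binary search has to examine.
struct FastLookup {
    std::vector<std::string> keys;   // evenly spread sample of the table
    std::vector<size_t> positions;   // index of each sampled key

    FastLookup(const std::vector<std::string>& table, size_t nRanges) {
        // Populating this sample is the upfront cost discussed above:
        // every sampled dirent must be read once before the first lookup.
        size_t step = std::max<size_t>(1, table.size() / nRanges);
        for (size_t i = 0; i < table.size(); i += step) {
            keys.push_back(table[i]);
            positions.push_back(i);
        }
    }

    // Return [begin, end) bounds inside which `key` must lie.
    std::pair<size_t, size_t> subrange(const std::string& key,
                                       size_t tableSize) const {
        auto it = std::upper_bound(keys.begin(), keys.end(), key);
        size_t hi = (it == keys.end()) ? tableSize
                                       : positions[it - keys.begin()];
        size_t lo = (it == keys.begin()) ? 0
                                         : positions[it - keys.begin() - 1];
        return {lo, hi};
    }
};
```

With nRanges at 0 or 1 the sample degenerates to (almost) nothing, which is why deactivating the cache for short reads is cheap.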

Proposition:

  • Introduce a CacheConfig structure which contains information about the cache strategy (for now, the sizes of the different caches).
  • Extend the libzim API to take this cache config at zim reader creation and use it instead of env vars.
  • Make FastDirentLookup "deactivate" itself if the value is zero or one.
  • Keep a compatible API which will use a cache config with the same values as the env var defaults (otherwise, we would have to do a major version bump).
  • Provide some default configs to help users (classic cache, no cache, ...).
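A minimal sketch of what the proposed CacheConfig could look like. Everything here is an assumption, not a final API: the field names and the cluster/dirent defaults are placeholders, and only the 1024 lookup default is stated in this issue:

```cpp
#include <cstddef>

// Hypothetical CacheConfig as proposed above. Defaults are meant to
// mirror the current env-var defaults so existing behaviour is kept;
// the cluster/dirent values below are placeholders, not libzim's.
struct CacheConfig {
    size_t clusterCacheSize = 16;     // ZIM_CLUSTERCACHE equivalent
    size_t direntCacheSize  = 512;    // ZIM_DIRENTCACHE equivalent
    size_t direntLookupSize = 1024;   // ZIM_DIRENTLOOKUPCACHE equivalent

    // Pre-baked profiles, as suggested in the proposition.
    static CacheConfig classic() { return CacheConfig{}; }
    static CacheConfig noCache() {
        CacheConfig c;
        c.clusterCacheSize = 0;
        c.direntCacheSize  = 0;
        c.direntLookupSize = 1;  // 0 or 1 deactivates FastDirentLookup
        return c;
    }
};

// A reader constructor could then take the config, e.g. (hypothetical):
//   zim::Archive archive(path, CacheConfig::noCache());
```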

If a tool (zim-tools, kiwix-desktop, ...) still wants to use env vars to control the caches, this should be implemented there.

Tools would have to be adapted to use this new feature. As we keep a compatible API, there is no need to adapt them right now, so they will not be adapted in this round.

Limiting the cache memory size is a far more complex matter, as we would need to make the cache global to all opened zims (so loading a dirent/cluster from one zim may imply dropping a cached dirent/cluster of another zim). I consider this out of scope for this issue.

Testing:

Automated testing is a bit complex here: we would have to mock the cache system to get information about what it cached or not, and test that.
I will simply test the functional part and check that we get the same results whatever the cache config is.

@mgautierfr
Collaborator Author

Note that a lot of this has been discussed in #311.
And a lot of the questions there still apply (i.e., we don't really know what we want).

This issue is mainly intended to address one thing:
being able to deactivate the pre-populated cache in case of a "short" read of a zim file.

@benoit74

I'm not in favor of moving this issue forward until we have confirmed that tuning all these settings would help.

I have two main use-cases in mind:

  • library generation (for ), for which we've already set ZIM_DIRENTLOOKUPCACHE but would like to make it run even faster (if possible)
  • library.kiwix.org, whose memory keeps growing over and over

We could maybe also have a look at other readers, but I'm less experienced with them.

To me, the main problem currently is not really that these settings are not available from the cpp API, but rather that most tool creators (scrapers, library generation, many readers probably...) and ops people do not know how to tweak these settings properly.

So I would prefer to start from the end:

  • Provide some default configs to help users (classic cache, no cache...)
  • Confirm these configs are working properly on our use cases
  • If needed, make these available from the CPP API

@mgautierfr
Collaborator Author

I'm not in favor of moving this issue forward until we have confirmed that tuning all these settings would help.

I thought that was already the case.

Provide some default configs to help user (Classic cache, no cache...)

Without knowing the use case and without some investigation, it is difficult to provide a specific configuration for each use case.

However, if you want to reduce caching to the minimum, you can set ZIM_DIRENTCACHE and ZIM_CLUSTERCACHE to 0 and ZIM_DIRENTLOOKUPCACHE to 1.
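Until a config API exists, the env-var workaround above can also be applied from application code. A sketch under the assumption of a POSIX platform (`setenv`); the variables must be set before the zim reader is created, since they are read at open time:

```cpp
#include <cstdlib>
#include <string>

// Minimise libzim caching via the existing env vars, as described above.
// Must be called before any zim file is opened.
void disableZimCaches() {
    setenv("ZIM_CLUSTERCACHE", "0", /*overwrite=*/1);
    setenv("ZIM_DIRENTCACHE", "0", /*overwrite=*/1);
    setenv("ZIM_DIRENTLOOKUPCACHE", "1", /*overwrite=*/1);
}
```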

Confirm these configs are working properly on our use cases

Yet again, I thought you had already done it.

@benoit74

The only thing we've done so far is set ZIM_DIRENTLOOKUPCACHE to 1 for library generation. It definitely provided slightly better performance. I don't know if this is sufficient or if we need to dive deeper.

Then @kelson42 decided to set up some kind of task force / hackathon to better understand what could be done but, unless I missed it, it never happened.

@kelson42
Contributor

  • A proposition should be made ASAP about how we could deal with memory exhaustion problems. The software developer has no clue how much memory should be reserved for caches (because they have no clue about the hardware the software runs on), and neither does the user (most of the time). We need a smart system.
  • Remove the env variables; things should be tunable via the API. Env variables are, at this stage, too complex and we lack clear control over them. In addition, it is primarily not the role of the software user to tune this, but of the software developer.
  • Each of the 3 cache memory variables should be controlled via a dedicated method. Default values should stay the same.
  • Loading/opening of ZIM files should be customisable via a set of boolean options (combined via the | operator). A new option should be introduced to stop ZIM_DIRENTLOOKUPCACHE from being prefilled at start.
  • I like the approach via "profiles" and this might be part of the solution IMHO, but it mostly offers a solution for the "memory strategy", not for the "memory amount" (my first point).
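The flag-style open options could look like the following sketch. All names here are hypothetical; none of them exist in libzim today:

```cpp
#include <cstdint>

// Hypothetical boolean open options, combinable with the | operator
// as suggested above. Names are illustrative only.
enum OpenFlags : uint32_t {
    OPEN_DEFAULT           = 0,
    OPEN_NO_LOOKUP_PREFILL = 1u << 0,  // do not prefill the dirent lookup cache
    OPEN_NO_CLUSTER_CACHE  = 1u << 1,
    OPEN_NO_DIRENT_CACHE   = 1u << 2,
};

inline OpenFlags operator|(OpenFlags a, OpenFlags b) {
    return static_cast<OpenFlags>(static_cast<uint32_t>(a) |
                                  static_cast<uint32_t>(b));
}

// Usage (hypothetical):
//   zim::Archive a(path, OPEN_NO_LOOKUP_PREFILL | OPEN_NO_DIRENT_CACHE);
```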

@mgautierfr
Collaborator Author

A proposition should be made ASAP about how we could deal with memory exhaustion problems. The software developer has no clue how much memory should be reserved for caches (because they have no clue about the hardware the software runs on), and neither does the user (most of the time). We need a smart system.

If neither the software developer nor the user knows how to configure a memory limit, it will be difficult to design a smart system.

Remove the env variables; things should be tunable via the API. Env variables are, at this stage, too complex and we lack clear control over them.

To be clear: moving the configuration from env vars to the cpp API will NOT remove complexity.
At best it will stay the same, as the same question still applies (what is the right value?).
Otherwise, it will need changes in the cpp code using libzim.
And designing a smart system means more or less rewriting the whole cache system, which is not a small part of libzim.

Each of the 3 cache memory variables should be controlled via a dedicated method.

Why? That a cache must be configurable for specific use cases, yes.
But why must each cache be configurable via a dedicated method?

Loading/Opening of ZIM files should be able to be customisable via a set of boolean options (via | operator)

Designing this kind of API seems a bit premature as we still don't know what we want.

I like the approach via "profiles" and this might be part of the solution IMHO, but it mostly offers a solution for the "memory strategy", not for the "memory amount"

Yes, this is clearly stated in my proposition.


Limiting the cache by memory is a whole different project (and far more complex):

  • We have to calculate the memory size of each item.
  • The cache must be global to all opened zim files.
  • Cache invalidation must be totally rethought.

For now, all caches (of different types and for different zims) are independent. When we want to add an item to a cache and it is full, we simply drop the least recently used item in the cache.

If we reason by memory, dropping one item may not be enough; we may have to drop several least recently used items to free enough memory. Maybe it is even better to drop one "big but not least recently used" item than several "least recently used but small" ones.
The same question applies to a multi-zim cache. Should we drop the least recently used cluster (for instance) whatever the zim file, or should we prefer dropping clusters of the same zim file? What are the conditions for dropping a cluster of one zim file when opening a new zim file?
And there will be a resource (memory) conflict between the different caches (dirent vs cluster). Can caching a new dirent evict a cluster from the cache?
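The first of those points, evicting by bytes rather than by item count, can be sketched as below. This is purely illustrative (not libzim code) and deliberately keeps the simplest policy: evict strictly least-recently-used entries until enough memory is freed:

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Illustrative size-aware LRU cache: inserting one entry may require
// evicting SEVERAL least recently used entries, because eviction is
// driven by a byte budget, not an item count.
class MemoryBoundedLru {
    struct Entry { std::string key; size_t bytes; };
    std::list<Entry> lru_;  // front = most recent, back = least recent
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
    size_t used_ = 0;
    size_t limit_;

public:
    explicit MemoryBoundedLru(size_t limitBytes) : limit_(limitBytes) {}

    void put(const std::string& key, size_t bytes) {
        auto it = index_.find(key);
        if (it != index_.end()) {  // re-inserting: drop the old copy
            used_ -= it->second->bytes;
            lru_.erase(it->second);
            index_.erase(it);
        }
        // Evict as many LRU entries as needed, not just one.
        while (!lru_.empty() && used_ + bytes > limit_) {
            used_ -= lru_.back().bytes;
            index_.erase(lru_.back().key);
            lru_.pop_back();
        }
        lru_.push_front({key, bytes});
        index_[key] = lru_.begin();
        used_ += bytes;
    }

    size_t usedBytes() const { return used_; }
    bool contains(const std::string& key) const { return index_.count(key) != 0; }
};
```

The harder questions raised above (big-but-recent vs small-but-old, cross-zim fairness, dirent vs cluster budgets) are policy choices layered on top of this loop.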

I'm not against implementing such a system. But it should probably be a project in a whole and not just a small issue.

@kelson42
Contributor

A proposition should be made ASAP about how we could deal with memory exhaustion problems. The software developer has no clue how much memory should be reserved for caches (because they have no clue about the hardware the software runs on), and neither does the user (most of the time). We need a smart system.

If neither the software developer nor the user knows how to configure a memory limit, it will be difficult to design a smart system.

Requesting this information from the user is a no-go, so we will have to be smart. If we look at similar software, this is what a web browser manages to do, AFAIK. Seems doable to me.

Limiting the cache by memory is a whole different project (and far more complex):

  • We have to calculate the memory size of each item.
  • The cache must be global to all opened zim files.
  • Cache invalidation must be totally rethought.

Whatever is needed, we obviously need tight control over cache memory consumption. This is something I expect to be treated now, and not in an unclear future.

For now, all caches (of different types and for different zims) are independent. When we want to add an item to a cache and it is full, we simply drop the least recently used item in the cache.

If we reason by memory, dropping one item may not be enough; we may have to drop several least recently used items to free enough memory. Maybe it is even better to drop one "big but not least recently used" item than several "least recently used but small" ones. The same question applies to a multi-zim cache. Should we drop the least recently used cluster (for instance) whatever the zim file, or should we prefer dropping clusters of the same zim file? What are the conditions for dropping a cluster of one zim file when opening a new zim file? And there will be a resource (memory) conflict between the different caches (dirent vs cluster). Can caching a new dirent evict a cluster from the cache?

I'm not against implementing such a system. But it should probably be a project in a whole and not just a small issue.

All of these questions seem legitimate, but if you think twice, none of them seem that hard to answer. I don't think that conceiving a first version of such a system takes more than a day... then, of course, a bit of time will be needed for the implementation. Remember, I'm not asking to build the smartest cache system. I'm just asking for a caching system which does not run out of memory and works in a reasonable manner.

@mgautierfr
Collaborator Author

I have created #947 to discuss the memory-limited cache. I don't think there is an issue dedicated to this, and it is a different (though related) subject from cache configuration.
