fs-cache: Add Cache Struct #95
base: main
Conversation
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 603c1db
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 20fe5a6
fs-cache/src/cache.rs
Outdated
/// Load most recent cached items into memory based on timestamps
pub fn load_recent(&mut self) -> Result<()> {
    self.storage.load_fs()
}
Actually, we don't need to expose this function. Only a `set`/`get` API is needed. The rest should happen under the hood (a sketch of the `set` path follows this list):
- any `set` should write both to memory and disk
- one-way sync from disk to memory is needed when users `get` values
- if we hit our own limit for bytes stored in the in-memory mapping, we erase the oldest entries from it
- but entries are always stored on disk, so there's no need to sync from memory to disk explicitly
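A minimal sketch of that write-through `set`, with illustrative method and helper names (`write_value_to_disk` appears in this PR; `cache_in_memory` is assumed):

```rust
// Hypothetical sketch: every `set` writes through to disk and
// mirrors the value in the in-memory mapping.
pub fn set(&mut self, key: K, value: V) -> Result<()> {
    // Entries always persist on disk...
    self.write_value_to_disk(&key, &value)?;
    // ...and are cached in memory, evicting the oldest entries
    // if the byte budget is exceeded.
    self.cache_in_memory(&key, &value)?;
    Ok(())
}
```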
Primary usage scenario: keys are of type `ResourceId`
- App indexes a folder.
- App may populate the cache before using it, but it's not required.
- App will query caches by key:
  - if the entry is in memory already, that's great, we just return the value
  - otherwise, we check disk for an entry with the requested key:
    - if it is on disk, we add it to in-memory storage and return the value
    - otherwise, we return `None`
- Index can notify the app about recently discovered resources. Corresponding values can be in the cache already, but this is not required. App can initialize values for new resources.

Secondary usage scenario: keys are of arbitrary type, mapped to any deterministic computation.
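A minimal sketch of that `get` path, assuming `serde`-serializable values (`cache_in_memory` is an illustrative helper name, not from this PR):

```rust
// Hypothetical sketch of the lookup order described above.
pub fn get(&mut self, key: &K) -> Result<Option<V>> {
    // 1. In-memory hit: return the value right away.
    if let Some(entry) = self.memory_cache.get(key) {
        return Ok(Some(entry.value.clone()));
    }
    // 2. Otherwise, check disk for the requested key.
    let file_path = self.path.join(format!("{}.json", key));
    if !file_path.exists() {
        // 3. Not on disk either: return None.
        return Ok(None);
    }
    // 4. On disk: promote into the in-memory cache, then return it.
    let file = std::fs::File::open(&file_path)?;
    let value: V = serde_json::from_reader(file)?;
    self.cache_in_memory(key, &value)?;
    Ok(Some(value))
}
```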
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 840a337
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 50fe163
Thank you for the review.
// Try to load from disk
let file_path = self.path.join(format!("{}.json", key));
if file_path.exists() {
    // Doubt: update the file's modified time (on disk) on read, to preserve LRU across app restarts?
Let's track this feature and work on it later. Better to keep the implementation simple for the moment and avoid redundant state. Btw, we could also simply write cached keys into a file and apply atomic versioning to it, so all peers would have the same view of the LRU order.
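For when we pick this up, a minimal sketch of the touch-on-read idea, assuming the `filetime` crate (not currently a dependency of this PR):

```rust
use filetime::FileTime;
use std::path::Path;

// Hypothetical sketch: refresh the file's modified time on read,
// so LRU ordering derived from mtimes survives app restarts.
fn touch_on_read(path: &Path) -> std::io::Result<()> {
    filetime::set_file_mtime(path, FileTime::now())
}
```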
// Write a single value to disk
fn write_value_to_disk(&mut self, key: &K, value: &V) -> Result<()> {
    let file_path = self.path.join(format!("{}.json", key));
    let mut file = File::create(&file_path)?;
Let's add a `debug_assert` that the file doesn't exist.
Also, we should use lightweight atomic writing to avoid dirty writes. Keep in mind the scenario where several ARK apps on the same device use the same folder and write to the cache in parallel.
I believe that atomic versions would be excessive here, but I'm not 100% sure yet.
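A common lightweight pattern here is write-to-temp-then-rename; a minimal sketch (the per-process temp suffix is an assumption to keep parallel writers from colliding):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Hypothetical sketch: write the payload to a temp file first, then
// atomically rename it over the target. Readers never observe a
// half-written file; concurrent writers each use their own temp file.
fn atomic_write(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension(format!("tmp.{}", std::process::id()));
    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush to disk before the rename becomes visible
    fs::rename(&tmp, path) // rename is atomic on POSIX filesystems
}
```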
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 42bb74c
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for c7341e8
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for d698fcf
fs-cache/src/cache.rs
Outdated
// TODO: NEED FIX
memory_cache: LruCache::new(
    NonZeroUsize::new(max_memory_bytes)
        .expect("Capacity can't be zero"),
),
`LruCache` requires the capacity (number of items) to be specified during initialization. However, our Cache is designed to be limited by `max_memory_bytes`. So, my question is: what would be the best way to initialize the `LruCache`?
Note: in all other functions, we are already comparing based on the number of bytes, not the number of items.
I think we can create another parameter (`max_items`) of type `Option<usize>`, with a default of 100.
I think the number of items should be left up to the developer calling the function. Instead of taking `max_memory_bytes` as an argument, we could take `max_memory_items`. This would require redesigning the implementation to focus on the number of items rather than memory size, but it would give developers the flexibility to decide based on the average size of the items they store.
If prioritizing memory size over the number of items is a hard requirement, then I can think of two options (see also the sketch below):
- We could implement our own version of `LruCache`
- Or, `LruCache` has a `resize()` method, and we could use it to resize the cache based on other metadata we track

Also, I looked into `uluru`, and it uses the number of items to initialize the cache as well. Just mentioning this in case you were considering it.
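A third possibility, sketched here as an assumption (it stays on the `lru` crate): construct the cache with `LruCache::unbounded()` so the item-count capacity never kicks in, and let the byte accounting we already do drive eviction. Names are illustrative:

```rust
use lru::LruCache;
use std::hash::Hash;

// Hypothetical sketch: no item cap at all; eviction is driven purely
// by the byte budget tracked elsewhere in the cache.
struct ByteBudgetCache<K: Hash + Eq, V> {
    inner: LruCache<K, V>,
    current_bytes: usize,
    max_bytes: usize,
}

impl<K: Hash + Eq, V> ByteBudgetCache<K, V> {
    fn new(max_bytes: usize) -> Self {
        Self {
            inner: LruCache::unbounded(),
            current_bytes: 0,
            max_bytes,
        }
    }
}
```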
Guys, what about this? https://docs.rs/lru-mem/latest/lru_mem/
But it has only 3 stars on GitHub..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually an interesting option and would have been a perfect fit 😃
I wouldn't recommend it though, because if we find any issues in the crate later, we'd have to fork it and fix the problem ourselves, and we're not familiar with the code. Plus, since it's not actively maintained or used, there wouldn't be anyone around to help us either.
Benchmark for a683a1e
Benchmark for 92357c6
fs-cache/src/cache.rs
Outdated
// Remove oldest entries until we have space for the new value
while self.current_memory_bytes + size > self.max_memory_bytes {
    let (_, old_entry) = self
        .memory_cache
        .pop_lru()
        .expect("Cache should have entries to evict");
    debug_assert!(
        self.current_memory_bytes >= old_entry.size,
        "Memory tracking inconsistency detected"
    );
    self.current_memory_bytes = self
        .current_memory_bytes
        .saturating_sub(old_entry.size);
}
But yeah, I think we should remove this code at all costs. It's currently undermining the purpose of using the external LRU cache crate. If there's absolutely no other way around this, then we may need to implement our own LRU cache solution.
This operation should be O(1).
Tried replacing the existing logic with `resize()`. However, this operation is still not O(1), because `lru`'s `resize` has O(N) complexity.
Additionally, using `resize` requires extra calculations to determine the target size for the LRU cache. Considering this overhead, I think the existing approach is better.
I will explore the pros & cons of implementing our own LRU cache.
struct CacheEntry<V> {
    value: V,
    size: usize,
}
Why do we need to store the size of the value? Can’t we just read it from fs when needed? I don’t see it being read often.
If it’s for convenience to avoid I/O calls…
We need to track memory consumption in bytes to be precise about when to offload values, and to support large values too. We probably can't avoid keeping value sizes in memory; otherwise, when we hit the limit, we can't know how many values we need to offload.
However, that's where we could split the crate into 2 flavours:
- dynamically-sized values, e.g. byte vectors and text strings
- statically-sized values, e.g. integers

For the 2nd flavour we could utilize some standard Rust trait. Is there a way to have these 2 flavours combined nicely in a single crate?
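One possible shape, sketched as an assumption: there's no standard trait that reports a value's heap footprint, so this uses a small custom trait; fixed-size types lean on `std::mem::size_of`, while dynamically-sized ones add their heap payload:

```rust
// Hypothetical sketch: one trait covering both flavours.
trait MemSize {
    fn mem_size(&self) -> usize;
}

// Statically-sized flavour: the compiler already knows the size.
impl MemSize for u64 {
    fn mem_size(&self) -> usize {
        std::mem::size_of::<u64>()
    }
}

// Dynamically-sized flavour: stack part plus heap payload.
impl MemSize for String {
    fn mem_size(&self) -> usize {
        std::mem::size_of::<String>() + self.len()
    }
}

impl MemSize for Vec<u8> {
    fn mem_size(&self) -> usize {
        std::mem::size_of::<Vec<u8>>() + self.len()
    }
}
```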
> We need to track memory consumption in bytes to be precise about when to offload values...

That's fine, but we're actually reading the file size from the disk again in the `get_file_size` method, even though we already have it stored in the metadata. That seems a bit wasteful. Check out the next comment for more on this.

> dynamically-sized values e.g. byte vectors and text strings

That's a good point. I completely missed the dynamic types aspect when I looked at this. Now it makes a lot more sense why we need to track the data size instead of just the number of items.

> Is there a way to have these 2 flavours combined nicely in a single crate?

If we're dealing with types that have a fixed size, like `usize`, it doesn't really matter whether we count how many items there are or the total size they take up. But this completely breaks with the second flavour you mentioned.
The only solution I can think of right now is to treat all types of data as if they were the second type – basically, keep track of how much memory they use instead of how many there are. But that brings up the question of how to do this in a clean way.
Main thread: #95 (comment)
log::debug!("cache/{}: caching in memory for key {}", self.label, key);
let size = self.get_file_size(key)?;
... then we would essentially be defeating the purpose here, as we're reading the file size from disk again
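One way to avoid the second read, sketched with illustrative changes to the PR's own helper: capture the serialized size once at write time and hand it back to the caller, instead of calling `get_file_size` later:

```rust
// Hypothetical sketch: serialize once, record the byte count, and let
// the caller store it in CacheEntry, avoiding a later fs::metadata call.
fn write_value_to_disk(&mut self, key: &K, value: &V) -> Result<usize> {
    let bytes = serde_json::to_vec(value)?;
    let file_path = self.path.join(format!("{}.json", key));
    std::fs::write(&file_path, &bytes)?;
    Ok(bytes.len())
}
```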
    }
}

// Sort by size before loading
file_metadata.sort_by(|a, b| b.1.cmp(&a.1));
// Sort by modified time (most recent first)
Actually, I'm not sure that pre-loading the most recently modified values would really be beneficial.
We could implement a more sophisticated approach that gathers query statistics and records them somewhere on disk for future pre-loading. But I would do that in a separate PR, not right now.
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for df8396c
Benchmark for c50cdbb
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 252b2ff
`MemoryLimitedStorage` under the hood.