Conversation
tkellogg
commented
Dec 2, 2022
Reduce disk I/O by moving most directory listing to a probabilistic cache approach. For one, I don't want to wear out disks by constantly accessing them. Spinning disks are especially problematic because latency can be quite high.

This introduces `CachedDirIter`, which replaces usages of `fs::ReadDir`. It's an enum that can either wrap a `fs::ReadDir` or represent a cache hit by iterating a vector of paths. The cache itself is a Trie, so it shouldn't take much memory to hold all paths under `$HOME`, for example. The cache lives at program scope and is passed down to where it's needed.

Cache invalidation is a problem, though. I don't want to refresh the entire Trie all at once, but I also want guarantees that a new directory will be recognized within a certain time limit, e.g. within 10 minutes. Here I take a probabilistic approach: each individual directory entry is invalidated independently. On any given pass, there's an `X%` chance that a single directory will be invalidated. `X` is calculated such that caches will be invalidated within some maximum time bound (10 minutes) 95% of the time. In the remaining 5% of cases, they're force-invalidated at the 10-minute mark.

The duralite sub-project has been about reducing dura's presence on host machines by using as few resources as possible. I don't want people to avoid dura because "my computer runs slow with it". Something I've been observing is that, as I reduce these I/O-intensive bottlenecks, dura uses even more CPU. In a follow-up PR I want to add some strategically placed `thread::sleep`s to spread the CPU usage evenly across the entire 5-second polling interval.
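The enum described above can be sketched roughly as follows. This is a minimal illustration, not the actual dura code: the variant names, the stored `Vec<PathBuf>` iterator, and the error-skipping behavior on the miss path are all assumptions.

```rust
use std::fs;
use std::path::PathBuf;

// Sketch of a CachedDirIter-style enum: either a real directory read
// (cache miss) or a replay of previously cached paths (cache hit).
enum CachedDirIter {
    // Cache miss: wrap the live filesystem iterator.
    Miss(fs::ReadDir),
    // Cache hit: iterate a cached vector of paths, no disk I/O.
    Hit(std::vec::IntoIter<PathBuf>),
}

impl Iterator for CachedDirIter {
    type Item = PathBuf;

    fn next(&mut self) -> Option<PathBuf> {
        match self {
            // Skip unreadable entries on the miss path; yield each entry's path.
            CachedDirIter::Miss(rd) => rd.next().and_then(|e| e.ok()).map(|e| e.path()),
            CachedDirIter::Hit(it) => it.next(),
        }
    }
}
```

Because both variants implement `Iterator<Item = PathBuf>`, call sites that previously looped over `fs::ReadDir` can loop over `CachedDirIter` unchanged, regardless of whether the listing came from disk or from the cache.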
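For the invalidation probability, one way to derive `X` from the numbers in this PR (5-second polling, 10-minute bound, 95% target) is to solve for the per-pass probability such that a directory survives all passes at most 5% of the time: `(1 - p)^n = 0.05` with `n = 120` passes. The formula is my reading of the description, not necessarily how dura computes it.

```rust
// Per-pass invalidation probability p such that the chance of surviving
// `passes` consecutive passes uninvalidated equals `survival_bound`:
//   (1 - p)^passes = survival_bound  =>  p = 1 - survival_bound^(1/passes)
fn per_pass_invalidation_prob(passes: u32, survival_bound: f64) -> f64 {
    1.0 - survival_bound.powf(1.0 / passes as f64)
}

fn main() {
    let passes: u32 = (10 * 60) / 5; // 120 polling passes in 10 minutes
    let p = per_pass_invalidation_prob(passes, 0.05);
    println!("per-pass invalidation probability: {:.4}", p); // ~0.0246, i.e. X ≈ 2.5%
    // Sanity check: probability of invalidation within the 10-minute bound.
    let within_bound = 1.0 - (1.0 - p).powi(passes as i32);
    println!("invalidated within 10 min: {:.3}", within_bound); // ~0.950
}
```

So a roughly 2.5% chance per pass is enough to hit the 95% target, and the explicit force-invalidation at the 10-minute mark covers the remaining tail.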
TODO: this needs more tests before it can be released into the wild