-
-
Notifications
You must be signed in to change notification settings - Fork 337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Start blame from cache #1852
base: main
Are you sure you want to change the base?
Start blame from cache #1852
Conversation
Add an algorithm that takes an existing blame and diff changes and computes the new blame and unblamed hunks.
0173783
to
68abfa1
Compare
Thanks a lot for contributing! It's great to see what can be done with a cache and I'd love to see this go further. What if there was enough tooling around it so that This PR should of course remain minimal, it would just be interesting to make these changes with user-value in mind. @cruessler has been working towards a first integration into Speaking of, all I did was read the PR text and make CI pass, and I wonder if @cruessler would like to take a first closer look? Thanks everyone 🙏 |
I’ll be happy to have a look! (Just FYI, I’ll be travelling for a week starting next Tuesday, so it might take some time until I finally get to it.) |
68abfa1
to
904285c
Compare
@cruessler Is this something you'd like to see merged, or need changes to have it merged? Thanks again. |
I’ll respond in depth tomorrow! |
@holodorum @jtwaleson First: I absolutely love the idea of speeding up Recently, I’ve added This is all very high-level and I’m potentially missing important counter-arguments, but before diving deeper into an actual review, I wanted to make sure we’re choosing the best approach possible. |
Thanks a lot for your review, @cruessler, it's much appreciated! I see this as real opportunity, and think that together with @holodorum we have yet another valid perspective that can help to make this implementation an incredibly versatile one. Think about the recent Let's hope we can figure this out, I'd really want to avoid someone having to fork the codebase, and rather want to keep it all here and in the open to maximise the benefit of what seems to be a growing community of users. |
@cruessler , fyi, I've hired @holodorum to build the caching feature for my company, but now we have to rely on his voluntary time to chime in. I myself have not looked into the gitoxide changes that closely and am not an expert in the code. However, I understand the problem fundamentally as follows: a git blame is a File (with a Path) for a specific Commit, with for every line in the File, a pointer to the commit where this line originated. To store it a bit more efficiently, we can use a Now, I would like to do three things with this:
That being said, I don't know if the current implementation that works on |
Thanks for chiming in @jtwaleson. My response is probably only partial and maybe not very satisfying, but here we go. Right - previously I thought of This would turn That's how I can imagine it to be incremental, and inherently interruptible while naturally supporting caching without having to know the cache. There is probably no reason to not eventually have an own cache implementation, it's just that the place to store it and the exact format would probably be application defined (this is a plumbing crate after all). If nothing else then at least food for thought, and I am confident that we will eventually get there. @cruessler I wonder if this PR can be merged and refactored into its final shape, or if it should be closed instead, while making use of it as reference when considering to make the current implementation more incremental (eventually). (I am saying all of this without having looked at the implementation in detail) |
Thanks everyone for the input, and @cruessler — hope you had a good trip! 😉 Even though I was hired by @jtwaleson, I’d be more than happy to volunteer some time to help integrate this properly into gitoxide. Currently, @cruessler — if I understand you correctly, you're proposing to pull that cache-handling logic out of Regarding @Byron — is this roughly what you meant by “incremental” and “cache-friendly without the cache being built in”? Let me know if that matches your thinking, or if I’m misunderstanding something. If you decide what the best way forward is, I’ll try to implement it ASAP. |
It's great to have you here, @holodorum! Thanks to your summary I also realise that I might have added salt to the cake by sharing an idea that might be misaligned with what @cruessler was suggesting. |
Sorry for the long delay! There’s quite a few things I’m working on simultaneously that it becomes harder to do each of them justice. :-) My suggestion/idea was related to this comment #1848 (comment) that mentioned creating full blames for certain “checkpoints” and only having to do a blame until it gets to one of the checkpoints.
In the example above, let’s say there are three checkpoints, For a commit between What I was proposing was a separate function that would call I hope this makes more sense now! (Also, we can hop on a short call if it still doen’t. 😄) And definitely let me know when there are major downsides to this approach or constraints it doesn’t meet. |
As discussed in #1848, @jtwaleson and I have been working on a way to speed up git blame by introducing a caching mechanism. This allows us to start a blame operation from a checkpoint instead of computing it from scratch, significantly reducing computation time.
Proposed Changes
The function
function::file
now accepts aBlameCacheObject
, which stores:Commit ID at which the blame was previously computed.
Blame entries corresponding to that commit.
Using the cached data, we compute the differences between the cached blob and the new target blob at the suspect commit.
If the file has been rewritten, this will probably error, so the BlameCacheObject might need to store the file path as well.
Cached blame entries are updated based on detected changes.
Only
UnblamedHunks
(caused byAddedOrReplace
changes) are recomputed using the standard blame algorithm.Previously, the entire file or a range was marked as
UnblamedHunk
, but now this only happens when necessary.So far the results show significant speed-ups. These are results for the README file in the linux repo starting with a blame at commit
bf4401f3ec700e1a7376a4cbf05ef40c7ffce064
.Next Step's
BlameCacheObject
Curious to hear what you think!