Start blame from cache #1852

Open

holodorum wants to merge 3 commits into main from cache-blame-pr

Conversation

holodorum

As discussed in #1848, @jtwaleson and I have been working on a way to speed up git blame by introducing a caching mechanism. This allows us to start a blame operation from a checkpoint instead of computing it from scratch, significantly reducing computation time.

Proposed Changes

  1. Introduce BlameCacheObject
    function::file now accepts a BlameCacheObject (see the sketch after this list), which stores:
    the commit ID at which the blame was previously computed, and
    the blame entries corresponding to that commit.
  2. Detect and Process Changes
    Using the cached data, we compute the differences between the cached blob and the new target blob at the suspect commit.
    If the file has been rewritten, this will probably error, so the BlameCacheObject might need to store the file path as well.
  3. Efficiently Update Blame Entries
    Cached blame entries are updated based on detected changes.
    Only UnblamedHunks (caused by AddedOrReplaced changes) are recomputed using the standard blame algorithm.
    Previously, the entire file or a range was marked as an UnblamedHunk; now this only happens when necessary.
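
For illustration, here is a rough sketch of what such a BlameCacheObject could look like. The field names (and the use of gix_blame::BlameEntry) are illustrative assumptions, not necessarily the exact types in this PR:

```rust
use gix_hash::ObjectId;

/// Hypothetical sketch of the cache checkpoint described above; the actual
/// struct in this PR may differ.
pub struct BlameCacheObject {
    /// The commit at which the cached blame was computed.
    pub commit_id: ObjectId,
    /// The blame entries that were valid for the file at `commit_id`.
    pub entries: Vec<gix_blame::BlameEntry>,
    // A "Next Steps" item: also storing the file path would allow detecting
    // rewritten or renamed files instead of erroring.
    // pub file_path: BString,
}
```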

So far the results show significant speed-ups. These are results for the README file in the linux repo starting with a blame at commit bf4401f3ec700e1a7376a4cbf05ef40c7ffce064.

Performing blame operations
Elapsed time for blame on bf4401f3ec700e1a7376a4cbf05ef40c7ffce064: 6604ms
Statistics: Statistics { commits_traversed: 18008, trees_decoded: 18030, trees_diffed: 6, blobs_diffed: 5 }

Performing blame with cache
Elapsed time for blame on 8c93d454027ffceea663ce6ea5b87557b8aaeb8a: 4ms
Statistics: Statistics { commits_traversed: 0, trees_decoded: 2, trees_diffed: 0, blobs_diffed: 1 }

Performing blame without cache
Elapsed time for blame on 8c93d454027ffceea663ce6ea5b87557b8aaeb8a: 313ms
Statistics: Statistics { commits_traversed: 20592, trees_decoded: 20620, trees_diffed: 8, blobs_diffed: 7 } 

time git blame README >/dev/null
Blaming lines: 100% (14/14), done.
git blame README > /dev/null  0.26s user 0.21s system 33% cpu 1.382 total

Next Steps

  • Add tests that compare the outcome of a blame with and without cache
  • Add filepath to BlameCacheObject

Curious to hear what you think!

Add an algorithm that takes an existing blame and diff changes and computes the new blame and unblamed hunks.
@Byron
Member

Byron commented Feb 22, 2025

Thanks a lot for contributing! It's great to see what can be done with a cache and I'd love to see this go further. What if there was enough tooling around it so that gitui could build such a cache in memory to allow walking through/digging into blames of the same file on the fly? For instance, if one commit changed every line because it converted tabs to spaces, it should be possible to quickly 'go through' that veil without recomputing everything up to that point.

This PR should of course remain minimal; it would just be interesting to make these changes with user-value in mind.

@cruessler has been working towards a first integration into gitui as well, hence the note above.

Speaking of, all I did was read the PR text and make CI pass, and I wonder if @cruessler would like to take a first closer look?

Thanks everyone 🙏

@cruessler
Contributor

I’ll be happy to have a look! (Just FYI, I’ll be travelling for a week starting next Tuesday, so it might take some time until I finally get to it.)

@Byron
Member

Byron commented Mar 22, 2025

@cruessler Is this something you'd like to see merged, or need changes to have it merged? Thanks again.

@cruessler
Contributor

@cruessler Is this something you'd like to see merged, or need changes to have it merged? Thanks again.

I’ll respond in depth tomorrow!

@cruessler
Contributor

@holodorum @jtwaleson First: I absolutely love the idea of speeding up gix blame through a cache, and I also like that it seems reasonably straightforward to plug a cache directly into blame::file. I was wondering, though, whether there might be ways of building a cached version on top of blame::file instead of into it. I see reduced API surface and decreased implementation complexity of blame::file as the main benefits.

Recently, I’ve added since as an option to blame::file, and I could see a separate version of this function that takes care of managing the cache and calling out to blame::file with specific combinations of since and range (another option you can pass to blame::file) to backfill missing pieces. If it would be necessary or helpful, we could also add a field is_boundary: bool to BlameEntry, to be used in connection with since to indicate whether a BlameEntry was already successfully blamed on a source file or whether blaming would need to continue beyond the boundary to get to the source file. git blame also keeps track of that piece of information when running partial blames (partial in the sense that there are certain boundaries at which git will stop the blame, e.g. through the use of --since). Right now, this is just an idea, but would you be interested in considering it?
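
To illustrate the idea, here is a simplified stand-in for a blame entry carrying the proposed flag; none of this exists in gix-blame today, and the other fields are placeholders:

```rust
use std::ops::Range;
use gix_hash::ObjectId;

/// Simplified stand-in for a blame entry, only to illustrate the proposal.
struct BlameEntrySketch {
    /// The commit this hunk is currently attributed to.
    commit_id: ObjectId,
    /// The lines covered by this hunk in the file the blame was started from.
    range_in_blamed_file: Range<u32>,
    /// Proposed addition: `true` if `commit_id` is only a boundary (the blame
    /// stopped here because of `since` or a checkpoint) rather than the commit
    /// that actually introduced these lines; this mirrors the `^` marker in
    /// the output of `git blame --since=...`.
    is_boundary: bool,
}
```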

This is all very high-level and I’m potentially missing important counter-arguments, but before diving deeper into an actual review, I wanted to make sure we’re choosing the best approach possible.

@Byron
Member

Byron commented Mar 24, 2025

Thanks a lot for your review, @cruessler, it's much appreciated! gitoxide plumbing crates also have the notion of Outcome return types which provide a lot of details about the invoked function. What I get from the above comment is that it might be possible to design gix-blame in such a way that it's cache-friendly without the cache being built in. This could be achieved by passing &mut State that makes it 'more' resumable, or by designing some Outcome along with its input Context in such a way that the work it does can be controlled precisely.
This certainly means complexity for gix-blame as well, but it can also be a step towards a more incremental approach that is more suitable for (streaming) user interfaces.

I see this as a real opportunity, and think that together with @holodorum we have yet another valid perspective that can help to make this implementation an incredibly versatile one. Think about the recent helix PR that adds git-lens-like functionality - how great could that be if it was cached, along with aspects of incremental computation, and maybe also with aspects of easily 'pushing through the boundary' to get rid of those pesky "someone formatted the whole codebase" commits.

Let's hope we can figure this out; I'd really want to avoid someone having to fork the codebase, and would rather keep it all here and in the open to maximise the benefit of what seems to be a growing community of users.

@jtwaleson

@cruessler , fyi, I've hired @holodorum to build the caching feature for my company, but now we have to rely on his voluntary time to chime in. I myself have not looked into the gitoxide changes that closely and am not an expert in the code.

However, I understand the problem fundamentally as follows: a git blame is a File (with a Path) at a specific Commit, with, for every line in the File, a pointer to the commit where that line originated. To store it a bit more efficiently, we can use a (start_line, end_line) -> commit_id map.

Now, I would like to do three things with this:

  • Create a cache that stores roughly Map<(commit, file_path), Map<(start_line, end_line), ObjectId>> (see the sketch after this list). For README.md in the root commit, this would be {(<root_id>, 'README.md'): {(0, <len of file>): <root_id>}}
  • Calculate the blame forwards, so start at the root and populate the cache for every file. This can be done outside of the gitoxide codebase (not sure if we want that included).
  • Adjust the gix blame algorithm, so that if it looks at the parents and sees that a cache entry exists for them, it stops the calculation there and takes the cached blame as the intermediate result. Only the difference between the requested blame commit and the cached parent will be calculated.
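
A minimal sketch of that cache layout (all names are illustrative, not an API this PR defines):

```rust
use std::collections::HashMap;
use gix_hash::ObjectId;

/// Half-open line range `[start_line, end_line)` within the blamed file.
type LineRange = (u32, u32);

/// For every (commit, file path) that has been blamed, map each line range to
/// the commit that introduced those lines.
type BlameCache = HashMap<(ObjectId, String), HashMap<LineRange, ObjectId>>;

/// Seed the cache for a file in the root commit: every line originates there.
fn seed_root_commit(cache: &mut BlameCache, root: ObjectId, path: &str, file_len: u32) {
    cache
        .entry((root.clone(), path.to_owned()))
        .or_default()
        .insert((0, file_len), root);
}
```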

That being said, I don't know if the current implementation that works on file is the most suitable. I think it was the most pragmatic solution that @holodorum could get to work in a reasonable amount of time. If you think that there is a better solution, then I am all for it!

@Byron
Member

Byron commented Mar 26, 2025

Thanks for chiming in @jtwaleson. My response is probably only partial and maybe not very satisfying, but here we go.

Right - previously I thought of gix-blame only as something to support building the cache by being very flexible and incremental, but I didn't think of it as using the cache explicitly. Maybe it wouldn't have to if it were incremental enough that the caller could always check whether the next unblamed set of lines is already included in a cache they have, possibly even filling that information into the blame state so the next call would make use of it.

This would turn gix-blame into a function whose state would be passed in and which would be called until everything is blamed and each call is a no-op, while exposing all of its state to the caller so they can fill things in.

That's how I can imagine it to be incremental, and inherently interruptible while naturally supporting caching without having to know the cache.
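
Purely as an illustration of that calling convention (none of these names exist in gix-blame, and the real state would be much richer):

```rust
use std::ops::Range;

/// Toy stand-in for the state that would be passed in on every call.
struct BlameState {
    /// Line ranges that still need to be attributed to a commit.
    unblamed: Vec<Range<u32>>,
    /// Line ranges that have been attributed (the commit is omitted here).
    blamed: Vec<Range<u32>>,
}

impl BlameState {
    /// One incremental step; returns `false` once the call is a no-op.
    /// A real implementation would walk history and diff blobs here.
    fn step(&mut self) -> bool {
        match self.unblamed.pop() {
            Some(hunk) => {
                self.blamed.push(hunk);
                true
            }
            None => false,
        }
    }
}

fn blame_until_done(state: &mut BlameState) {
    // Between steps the caller could consult its own cache and move hunks from
    // `unblamed` to `blamed` directly, which is what makes this cache-friendly
    // without the cache being built into gix-blame itself.
    while state.step() {}
}
```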

There is probably no reason not to eventually have our own cache implementation; it's just that the place to store it and the exact format would probably be application-defined (this is a plumbing crate, after all).

If nothing else, this is at least food for thought, and I am confident that we will eventually get there.

@cruessler I wonder if this PR can be merged and refactored into its final shape, or if it should be closed instead, while using it as a reference when considering making the current implementation more incremental (eventually).

(I am saying all of this without having looked at the implementation in detail)

@holodorum
Author

Thanks everyone for the input, and @cruessler — hope you had a good trip! 😉

Even though I was hired by @jtwaleson, I’d be more than happy to volunteer some time to help integrate this properly into gitoxide.

Currently, blame::file receives a blame_cache: Option<BlameCacheObject>, and the logic for separating cached and uncached blames is implemented directly in blame::file. Cached blames are updated to match the new file, and the remaining uncached lines are marked as UnblamedHunk.

@cruessler — if I understand you correctly, you're proposing to pull that cache-handling logic out of blame::file into a separate function. That new function would handle determining which parts of the file are covered by the cache and which parts still need to be blamed. It could then call blame::file only for the UnblamedHunks, passing in the relevant range and possibly a since argument (though maybe since isn’t necessary, since we already know the cached commit is the earliest relevant one?).

Regarding is_boundary, I wasn’t entirely sure what you meant there. Could you clarify?

@Byron — is this roughly what you meant by “incremental” and “cache-friendly without the cache being built in”?

Let me know if that matches your thinking, or if I’m misunderstanding something. If you decide what the best way forward is, I’ll try to implement it ASAP.

@Byron
Member

Byron commented Mar 26, 2025

It's great to have you here, @holodorum!

Thanks to your summary I also realise that I might have added salt to the cake by sharing an idea that might be misaligned with what @cruessler was suggesting.
So I will take myself out of there as his opinion is the one that should count here. (Even though I do have a vision of what it could be, I am sure it can be realised in various ways)

@cruessler
Contributor

Sorry for the long delay! There are quite a few things I’m working on simultaneously, so it’s becoming harder to do each of them justice. :-)

My suggestion/idea was related to this comment #1848 (comment) that mentioned creating full blames for certain “checkpoints” and only having to do a blame until it gets to one of the checkpoints.

    a   b   c   d
<---x---x---x---x

In the example above, let’s say there are three checkpoints, a, b and c, with d being the initial commit (the history starting at d and going to the left). For c, the cache would need to store full blames, starting at c and potentially going back to d (but not further since d is the initial commit). For b (and a, respectively), the cache would not need to store the full blame, but only the blame going from b to c at most. Doing a blame for any commit between b and c then becomes equivalent to git blame c..<commit-in-question>, at which point you could start reading the rest of the blame from cache.

For a commit between a and b, you would blame until you reach the cache at b, then you would read from the cache. In order to know whether you would need to continue reading the cache from c, or whether a hunk originated somewhere between b and c (in which case you were done for that hunk), you could use an additional boolean flag to store that piece of information. (This is what I was referring to above when I mentioned is_boundary; it is the piece of information indicated by the ^ in the output of git blame --since="40 weeks ago" Cargo.toml.)

What I was proposing was a separate function that would call blame::file(since: <datetime of entries at checkpoint a>) in order to get UnblamedHunks until checkpoint a, then continuing from there by reading from the appropriate caches.
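
As a rough illustration of how boundary hunks could then be resolved from the next-older checkpoint’s cache instead of continuing the blame algorithm (all types here are made up and don’t mirror the gix-blame API):

```rust
use std::ops::Range;

#[derive(Clone)]
struct Hunk {
    lines: Range<u32>,
    commit: String,
    /// `true` if `commit` is only the checkpoint the blame stopped at, not the
    /// commit that actually introduced these lines (the proposed is_boundary).
    boundary: bool,
}

/// After blaming target..checkpoint (e.g. via `since`), replace boundary hunks
/// with entries read from the older checkpoint's cache.
fn resolve_boundaries(mut hunks: Vec<Hunk>, older_cache: &[Hunk]) -> Vec<Hunk> {
    for hunk in &mut hunks {
        if hunk.boundary {
            if let Some(cached) = older_cache.iter().find(|c| c.lines == hunk.lines) {
                *hunk = cached.clone();
            }
        }
    }
    hunks
}
```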

I hope this makes more sense now! (Also, we can hop on a short call if it still doesn’t. 😄) And definitely let me know if there are major downsides to this approach or constraints it doesn’t meet.

@holodorum holodorum closed this Apr 13, 2025
@holodorum holodorum deleted the cache-blame-pr branch April 13, 2025 20:48
@holodorum holodorum restored the cache-blame-pr branch April 13, 2025 20:51
@holodorum holodorum reopened this Apr 13, 2025