Start blame from cache #1852

Open

holodorum wants to merge 3 commits into main from cache-blame-pr

Conversation

holodorum

As discussed in #1848, @jtwaleson and I have been working on a way to speed up git blame by introducing a caching mechanism. This allows us to start a blame operation from a checkpoint instead of computing it from scratch, significantly reducing computation time.

Proposed Changes

  1. Introduce BlameCacheObject
    function::file now accepts a BlameCacheObject (see the sketch after this list), which stores:
    the commit ID at which the blame was previously computed, and
    the blame entries corresponding to that commit.
  2. Detect and Process Changes
    Using the cached data, we compute the differences between the cached blob and the new target blob at the suspect commit.
    If the file has been rewritten, this will probably error, so the BlameCacheObject might need to store the file path as well.
  3. Efficiently Update Blame Entries
    Cached blame entries are updated based on detected changes.
    Only UnblamedHunks (caused by AddedOrReplaced changes) are recomputed using the standard blame algorithm.
    Previously, the entire file or a range was marked as an UnblamedHunk; now this only happens when necessary.
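
For illustration, here is a rough sketch of what such a BlameCacheObject could look like. The field names (and the use of gix_blame::BlameEntry) are illustrative assumptions, not necessarily the exact types in this PR:

```rust
use gix_hash::ObjectId;

/// Hypothetical sketch of the cache checkpoint described above; the actual
/// struct in this PR may differ.
pub struct BlameCacheObject {
    /// The commit at which the cached blame was computed.
    pub commit_id: ObjectId,
    /// The blame entries that were valid for the file at `commit_id`.
    pub entries: Vec<gix_blame::BlameEntry>,
    // A "Next Steps" item: also storing the file path would allow detecting
    // rewritten or renamed files instead of erroring.
    // pub file_path: BString,
}
```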

So far the results show significant speed-ups. These are results for the README file in the linux repo starting with a blame at commit bf4401f3ec700e1a7376a4cbf05ef40c7ffce064.

Performing blame operations
Elapsed time for blame on bf4401f3ec700e1a7376a4cbf05ef40c7ffce064: 6604ms
Statistics: Statistics { commits_traversed: 18008, trees_decoded: 18030, trees_diffed: 6, blobs_diffed: 5 }

Performing blame with cache
Elapsed time for blame on 8c93d454027ffceea663ce6ea5b87557b8aaeb8a: 4ms
Statistics: Statistics { commits_traversed: 0, trees_decoded: 2, trees_diffed: 0, blobs_diffed: 1 }

Performing blame without cache
Elapsed time for blame on 8c93d454027ffceea663ce6ea5b87557b8aaeb8a: 313ms
Statistics: Statistics { commits_traversed: 20592, trees_decoded: 20620, trees_diffed: 8, blobs_diffed: 7 } 

time git blame README >/dev/null
Blaming lines: 100% (14/14), done.
git blame README > /dev/null  0.26s user 0.21s system 33% cpu 1.382 total

Next Steps

  • Add tests that compare the outcome of a blame with and without cache
  • Add filepath to BlameCacheObject

Curious to hear what you think!

Add an algorithm that takes an existing blame and diff changes and computes the new blame and unblamed hunks.
@Byron
Member

Byron commented Feb 22, 2025

Thanks a lot for contributing! It's great to see what can be done with a cache and I'd love to see this go further. What if there was enough tooling around it so that gitui could build such a cache in memory to allow walking through/digging into blames of the same file on the fly? For instance, if one commit changed every line because it converted tabs to spaces, it should be possible to quickly 'go through' that veil without recomputing everything up to that point.

This PR should of course remain minimal; it would just be interesting to make these changes with user-value in mind.

@cruessler has been working towards a first integration into gitui as well, hence the note above.

Speaking of, all I did was read the PR text and make CI pass, and I wonder if @cruessler would like to take a first closer look?

Thanks everyone 🙏

@cruessler
Contributor

I’ll be happy to have a look! (Just FYI, I’ll be travelling for a week starting next Tuesday, so it might take some time until I finally get to it.)

@Byron
Member

Byron commented Mar 22, 2025

@cruessler Is this something you'd like to see merged, or need changes to have it merged? Thanks again.

@cruessler
Contributor

@cruessler Is this something you'd like to see merged, or need changes to have it merged? Thanks again.

I’ll respond in depth tomorrow!

@cruessler
Contributor

@holodorum @jtwaleson First: I absolutely love the idea of speeding up gix blame through a cache, and I also like that it seems reasonably straightforward to plug a cache directly into blame::file. I was wondering, though, whether there might be ways of building a cached version on top of blame::file instead of into it. I see reduced API surface and decreased implementation complexity of blame::file as the main benefits.

Recently, I’ve added since as an option to blame::file, and I could see a separate version of this function that takes care of managing the cache and calling out to blame::file with specific combinations of since and range (another option you can pass to blame::file) to backfill missing pieces. If it would be necessary or helpful, we could also add a field is_boundary: bool to BlameEntry, to be used in connection with since to indicate whether a BlameEntry was already successfully blamed on a source file or whether blaming would need to continue beyond the boundary to get to the source file. git blame also keeps track of that piece of information when running partial blames (partial in the sense that there are certain boundaries at which git will stop the blame, e.g. through the use of --since). Right now, this is just an idea, but would you be interested in considering it?
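
To illustrate the idea, here is a simplified stand-in for a blame entry carrying the proposed flag; none of this exists in gix-blame today, and the other fields are placeholders:

```rust
use std::ops::Range;
use gix_hash::ObjectId;

/// Simplified stand-in for a blame entry, only to illustrate the proposal.
struct BlameEntrySketch {
    /// The commit this hunk is currently attributed to.
    commit_id: ObjectId,
    /// The lines covered by this hunk in the file the blame was started from.
    range_in_blamed_file: Range<u32>,
    /// Proposed addition: `true` if `commit_id` is only a boundary (the blame
    /// stopped here because of `since` or a checkpoint) rather than the commit
    /// that actually introduced these lines; this mirrors the `^` marker in
    /// the output of `git blame --since=...`.
    is_boundary: bool,
}
```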

This is all very high-level and I’m potentially missing important counter-arguments, but before diving deeper into an actual review, I wanted to make sure we’re choosing the best approach possible.

@Byron
Member

Byron commented Mar 24, 2025

Thanks a lot for your review, @cruessler, it's much appreciated! gitoxide plumbing crates also have the notion of Outcome return types which provide a lot of details about the invoked function. What I get from the above comment is that it might be possible to design gix-blame in such a way that it's cache-friendly without the cache being built in. This could be achieved by passing &mut State that makes it 'more' resumable, or by designing some Outcome along with its input Context in such a way that the work it does can be controlled precisely.
This certainly means complexity for gix-blame as well, but it can also be a step towards a more incremental approach that is more suitable for (streaming) user interfaces.

I see this as a real opportunity, and think that together with @holodorum we have yet another valid perspective that can help to make this implementation an incredibly versatile one. Think about the recent helix PR that adds git-lens-like functionality - how great could that be if it was cached, along with aspects of incremental computation, and maybe also with aspects of easily 'pushing through the boundary' to get rid of those pesky "someone formatted the whole codebase" commits.

Let's hope we can figure this out; I'd really want to avoid someone having to fork the codebase, and would rather keep it all here and in the open to maximise the benefit of what seems to be a growing community of users.

@jtwaleson

@cruessler , fyi, I've hired @holodorum to build the caching feature for my company, but now we have to rely on his voluntary time to chime in. I myself have not looked into the gitoxide changes that closely and am not an expert in the code.

However, I understand the problem fundamentally as follows: a git blame is a File (with a Path) at a specific Commit, with, for every line in the File, a pointer to the commit where that line originated. To store it a bit more efficiently, we can use a (start_line, end_line) -> commit_id map.

Now, I would like to do three things with this:

  • Create a cache that stores roughly Map<(commit, file_path), Map<(start_line, end_line), ObjectId>> (see the sketch after this list). For README.md in the root commit, this would be {(<root_id>, 'README.md'): {(0, <len of file>): <root_id>}}
  • Calculate the blame forwards, so start at the root and populate the cache for every file. This can be done outside of the gitoxide codebase (not sure if we want that included).
  • Adjust the gix blame algorithm, so that if it looks at the parents and sees that a cache entry exists for them, it stops the calculation there and takes the cached blame as the intermediate result. Only the difference between the requested blame commit and the cached parent will be calculated.
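
A minimal sketch of that cache layout (all names are illustrative, not an API this PR defines):

```rust
use std::collections::HashMap;
use gix_hash::ObjectId;

/// Half-open line range `[start_line, end_line)` within the blamed file.
type LineRange = (u32, u32);

/// For every (commit, file path) that has been blamed, map each line range to
/// the commit that introduced those lines.
type BlameCache = HashMap<(ObjectId, String), HashMap<LineRange, ObjectId>>;

/// Seed the cache for a file in the root commit: every line originates there.
fn seed_root_commit(cache: &mut BlameCache, root: ObjectId, path: &str, file_len: u32) {
    cache
        .entry((root.clone(), path.to_owned()))
        .or_default()
        .insert((0, file_len), root);
}
```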

That being said, I don't know if the current implementation that works on file is the most suitable. I think it was the most pragmatic solution that @holodorum could get to work in a reasonable amount of time. If you think that there is a better solution, then I am all for it!

@Byron
Member

Byron commented Mar 26, 2025

Thanks for chiming in @jtwaleson. My response is probably only partial and maybe not very satisfying, but here we go.

Right - previously I thought of gix-blame only as something to support building the cache by being very flexible and incremental, but I didn't think of it as using the cache explicitly. Maybe it wouldn't have to if it were incremental enough that the caller could always check whether the next unblamed set of lines is already included in a cache they have, possibly even filling that information into the blame state so the next call would make use of it.

This would turn gix-blame into a function whose state would be passed in and which would be called until everything is blamed and each call is a no-op, while exposing all of its state to the caller so they can fill things in.

That's how I can imagine it to be incremental, and inherently interruptible while naturally supporting caching without having to know the cache.
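
Purely as an illustration of that calling convention (none of these names exist in gix-blame, and the real state would be much richer):

```rust
use std::ops::Range;

/// Toy stand-in for the state that would be passed in on every call.
struct BlameState {
    /// Line ranges that still need to be attributed to a commit.
    unblamed: Vec<Range<u32>>,
    /// Line ranges that have been attributed (the commit is omitted here).
    blamed: Vec<Range<u32>>,
}

impl BlameState {
    /// One incremental step; returns `false` once the call is a no-op.
    /// A real implementation would walk history and diff blobs here.
    fn step(&mut self) -> bool {
        match self.unblamed.pop() {
            Some(hunk) => {
                self.blamed.push(hunk);
                true
            }
            None => false,
        }
    }
}

fn blame_until_done(state: &mut BlameState) {
    // Between steps the caller could consult its own cache and move hunks from
    // `unblamed` to `blamed` directly, which is what makes this cache-friendly
    // without the cache being built into gix-blame itself.
    while state.step() {}
}
```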

There is probably no reason not to eventually have our own cache implementation; it's just that the place to store it and the exact format would probably be application-defined (this is a plumbing crate, after all).

If nothing else, this is at least food for thought, and I am confident that we will eventually get there.

@cruessler I wonder if this PR can be merged and refactored into its final shape, or if it should be closed instead, while using it as a reference when considering making the current implementation more incremental (eventually).

(I am saying all of this without having looked at the implementation in detail)

@holodorum
Author

Thanks everyone for the input, and @cruessler — hope you had a good trip! 😉

Even though I was hired by @jtwaleson, I’d be more than happy to volunteer some time to help integrate this properly into gitoxide.

Currently, blame::file receives a blame_cache: Option<BlameCacheObject>, and the logic for separating cached and uncached blames is implemented directly in blame::file. Cached blames are updated to match the new file, and the remaining uncached lines are marked as UnblamedHunk.

@cruessler — if I understand you correctly, you're proposing to pull that cache-handling logic out of blame::file into a separate function. That new function would handle determining which parts of the file are covered by the cache and which parts still need to be blamed. It could then call blame::file only for the UnblamedHunks, passing in the relevant range and possibly a since argument (though maybe since isn’t necessary, since we already know the cached commit is the earliest relevant one?).

Regarding is_boundary, I wasn’t entirely sure what you meant there. Could you clarify?

@Byron — is this roughly what you meant by “incremental” and “cache-friendly without the cache being built in”?

Let me know if that matches your thinking, or if I’m misunderstanding something. If you decide what the best way forward is, I’ll try to implement it ASAP.

@Byron
Member

Byron commented Mar 26, 2025

It's great to have you here, @holodorum!

Thanks to your summary I also realise that I might have added salt to the cake by sharing an idea that might be misaligned with what @cruessler was suggesting.
So I will take myself out of there as his opinion is the one that should count here. (Even though I do have a vision of what it could be, I am sure it can be realised in various ways)

@cruessler
Contributor

Sorry for the long delay! There are quite a few things I’m working on simultaneously, so it’s becoming harder to do each of them justice. :-)

My suggestion/idea was related to this comment #1848 (comment) that mentioned creating full blames for certain “checkpoints” and only having to do a blame until it gets to one of the checkpoints.

    a   b   c   d
<---x---x---x---x

In the example above, let’s say there are three checkpoints, a, b and c, with d being the initial commit (the history starting at d and going to the left). For c, the cache would need to store full blames, starting at c and potentially going back to d (but not further since d is the initial commit). For b (and a, respectively), the cache would not need to store the full blame, but only the blame going from b to c at most. Doing a blame for any commit between b and c then becomes equivalent to git blame c..<commit-in-question>, at which point you could start reading the rest of the blame from cache.

For a commit between a and b, you would blame until you reach the cache at b, then you would read from the cache. In order to know whether you would need to continue reading the cache from c, or whether a hunk originated somewhere between b and c (in which case you were done for that hunk), you could use an additional boolean flag to store that piece of information. (This is what I was referring to above when I mentioned is_boundary; it is the piece of information indicated by the ^ in the output of git blame --since="40 weeks ago" Cargo.toml.)

What I was proposing was a separate function that would call blame::file(since: <datetime of entries at checkpoint a>) in order to get UnblamedHunks until checkpoint a, then continuing from there by reading from the appropriate caches.
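
As a rough illustration of how boundary hunks could then be resolved from the next-older checkpoint’s cache instead of continuing the blame algorithm (all types here are made up and don’t mirror the gix-blame API):

```rust
use std::ops::Range;

#[derive(Clone)]
struct Hunk {
    lines: Range<u32>,
    commit: String,
    /// `true` if `commit` is only the checkpoint the blame stopped at, not the
    /// commit that actually introduced these lines (the proposed is_boundary).
    boundary: bool,
}

/// After blaming target..checkpoint (e.g. via `since`), replace boundary hunks
/// with entries read from the older checkpoint's cache.
fn resolve_boundaries(mut hunks: Vec<Hunk>, older_cache: &[Hunk]) -> Vec<Hunk> {
    for hunk in &mut hunks {
        if hunk.boundary {
            if let Some(cached) = older_cache.iter().find(|c| c.lines == hunk.lines) {
                *hunk = cached.clone();
            }
        }
    }
    hunks
}
```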

I hope this makes more sense now! (Also, we can hop on a short call if it still doesn’t. 😄) And definitely let me know if there are major downsides to this approach or constraints it doesn’t meet.

@holodorum holodorum closed this Apr 13, 2025
@holodorum holodorum deleted the cache-blame-pr branch April 13, 2025 20:48
@holodorum holodorum restored the cache-blame-pr branch April 13, 2025 20:51
@holodorum holodorum reopened this Apr 13, 2025