Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't detect copies/renames in restore-mtime #72

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

benjaminjkraft
Copy link

The big benefit of this is in a blobless clone, where it lets you avoid downloading all the blobs (which you shouldn't need just to know whether a file changed). But it's probably a perf win always: it makes git log do a bit less work. (And it should always be at least as correct: for the purposes of mtime restoration, a move/copy should count as a modification -- your build tool should assume the file must be rebuilt in its new location.)

I tested this by:

  • make a clone of a large repo (git clone --filter=blob:none ...)
  • run git restore-mtime in the repo; after this change it takes ~10s

Before this change, in a blobless repo, it would loop for a very long time with periodic log lines indicating it was fetching more blobs:

remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (1/1), done.
Receiving objects: 100% (2/2), 6.75 KiB | 2.25 MiB/s, done.
remote: Total 2 (delta 0), reused 0 (delta 0), pack-reused 1

The big benefit of this is in a [blobless clone], where it lets you
avoid downloading all the blobs (which you shouldn't need just to know
_whether_ a file changed). But it's probably a perf win always: it makes
`git log` do a bit less work. (And it should always be at least as
correct: for the purposes of mtime restoration, a move/copy should count
as a modification -- your build tool should assume the file must be
rebuilt in its new location.)

[blobless clone]: https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/

I tested this by:
- make a clone of a large repo (`git clone --filter=blob:none ...`)
- run `git restore-mtime` in the repo; after this change it takes ~10s

Before this change, in a blobless repo, it would loop for a very long
time with periodic log lines indicating it was fetching more blobs:
```
remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (1/1), done.
Receiving objects: 100% (2/2), 6.75 KiB | 2.25 MiB/s, done.
remote: Total 2 (delta 0), reused 0 (delta 0), pack-reused 1
```
@MestreLion
Copy link
Owner

... And it should always be at least as correct: for the purposes of mtime restoration, a move/copy should count as a modification -- your build tool should assume the file must be rebuilt in its new location.

That's a very interesting rationale, one which I've never considered. Actually, I was planning a major change for the next version in the opposite direction: to make git-restore-mtime more accurate regarding file renames, moves, and directory mtime. There are several open issues and PRs regarding this, including #47 and #61 , and a re-thinking of the "update" selection algorithm is in my TO-DO list for years. I was planning to tackle them all when I had the time and then do a new release, but unfortunately that didn't happen yet.

So I'm quite on the fence now... should git-restore-mtime remain "naive" about renames and moves for the sake of build tools? Should it be "smarter" and try to re-create the "proper" time stamps given the actions performed on a file's lifetime for the sake of tarballs/archiving? Maybe use the new algorithm only when the planned --accurate mode is selected, and leave current behavior as-is? (but fixing the empty/grandparent dirs bug)?

Does any interested party have an opinion on this? How should git-restore-mtime behave on renames, moves, and directories?

@MestreLion MestreLion added enhancement RFC - Request for Comments Need opinions from user base! labels Apr 23, 2024
@Repiteo
Copy link
Contributor

Repiteo commented May 1, 2024

I don't see why making git-restore-mtime more accurate would be mutually exclusive with this proposal. Ben makes a great point: a moved/copied file is inherently modified, so there's no real need to track their copies/originals. I wouldn't call this less accurate at all; if anything, it's more precise in an mtime context. If there were to be an option to track the copies/originals, it should be under something like --preserve-lifetime.

@benjaminjkraft
Copy link
Author

FWIW I agree it's less accurate in the sense that it's inconsistent with the behavior of the mtime field of a file on a file system. My guess is in practice most callers either don't care about the precise behavior as long as it's deterministic (reproducible tarballs, etc.) or prefer the mtime to update (build caches, etc.). But it is a divergence from the "standard" [1].

With that said, I do think the behavior differences are likely very small for most repos; my experience is moves/copies with no changes are a very small fraction of file-changes although I'm sure it depends on the language in question. So the perf win may be a better default even aside from callers who prefer one way or the other. (Although it's possible setting the rename modification threshold to zero suffices for the perf win with the same behavior? I haven't looked into whether that allows git to just compare blob shas without looking at the blob itself.)

[1] As far as I can see POSIX doesn't say this explicitly, but it seems to be a de facto standard.

@MestreLion
Copy link
Owner

a moved/copied file is inherently modified, so there's no real need to track their copies/originals. [...] If there were to be an option [...], it should be under something like --preserve-lifetime.

This --preserve-lifetime is exactly what I mean by an --accurate mode. An algorithm that tries its best to re-create a file's lifetime of actions, including "pure renames" and "pure moves", and the effect on its timestamp and the timestamp of affected directories, to obtain the timestamp such file would have on the filesystem. In that sense, a purely renamed or moved file does not have its mtime changed. It does however change the source and destination directory mtime.

The great point Ben made is that perhaps this "real" timestamp, while more "correct", might be not as useful for the intended use and audience of git-restore-mtime.

@alerque
Copy link
Contributor

alerque commented May 28, 2024

This seems to be a very close parallel to the issue I just noted in #77. In the case of merges there may not be a rename or copy. There may or may not also be local changes to the file (e.g. merge conflict resolution) in the merge commit, but it's getting entirely passed over. The result is a less accurate time stamp: it is not the timestamp of the last point the file is known to have changed in history.

Whether there is a use case for the other dates I don't know because my use case is decidedly weighted towards coping with build systems. For anybody else with a need for dates targeted at builds, my competitor tool (that I wrote before I heard about this project) git-warp-time was setup with build scenarios in mind and takes both merges and copy/rename operations into account.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement RFC - Request for Comments Need opinions from user base!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants