-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: remove isdir() check in copy_file(), about 10% of the run time was checking if the file to copy is a directory. #258
base: main
Are you sure you want to change the base?
Conversation
# Conflicts: # pyproject.toml
Closes pypa#259 Ref pytest-dev/pytest#12490
6a54b19
to
106561e
Compare
…was checking if the file to copy is a directory.
106561e
to
745640e
Compare
@jaraco I saw you fixed the tests on master. I've rebased and tests are green now for this PR. Can you have another look? There is a diffcov workflow complaining about less test coverage? If i understand correctly, it's saying 2 lines are not covered by the tests, inside bdist_rpm code that's copying the icon of the RPM. I hope that's not a blocker. |
dest = os.path.join(source_dir, os.path.basename(source)) | ||
self.copy_file(source, dest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm uneasy with this pattern being now duplicated 5 times across the codebase. I'd rather see something like:
dest = os.path.join(source_dir, os.path.basename(source)) | |
self.copy_file(source, dest) | |
self.copy_file_to_dir(source, source_dir) |
And then do some refactoring to make copy_file_to_dir
perform the dest operation.
"""Copy a file 'src' to 'dst'. | ||
(If the file exists, it will be ruthlessly clobbered.) If 'preserve_mode' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can assume that distutils is the only consumer of this function. Maybe no one else is using it, but we can't simply remove it without risking breaking users. Let's do this instead:
- Rename
copy_file
to_copy_file_direct
, implementing this new, limited interface. - Create a new
copy_file
that implements the backward-compatible behavior (infer dst from dir), raises a DeprecationWarning, and then calls_copy_file_direct
. - Retain the logic of emitting only the dirname if the basenames match (I don't think that change is necessary to get the performance gain).
- Create a new
_copy_file_to_dir
that assumes the dst is a dir and performs the join and basename, calling_copy_file_direct
.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for 3. the logic is broken as far as I can tell. we can't keep the syscall to isdir()
to determine whether we're dealing with a directory because that's what makes the function slow.
for context, this function is called for thousands of files, it's used by distutils and setuptools/pip when working with packages. (the profiling run was for python setup.py bdist_wheel
)
I think we can't have a replacement function be private with a underscore. It's a public API that is used in other places. it think linters will complain if we import a private _ function in other places (note that setuptools is vendoring this repo and might have more CI rules).
searching on github distutils.file_util AND copy_file AND path:.py
, it's finding 2k files using this function. Going over the results, I see a few are passing a directory. That is problematic. :(
I don't think adding a warning would be helpful, distutils is removed in python 3.12 and already raising warnings/errors. If people ain't fixing errors because the module is removed, they're not going to care about yet another warning. ^^
what I can think of?
- make two functions copy_file_direct() and copy_file_to_dir().
- fix code inside of distutils to use the appropriate function.
- keep copy_file() logic to check whether the argument is a directory + call one of the 2 functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've started the refactoring... but unfortunately found out copy_file()
is wrapped inside of distutils.cmd::Command
class and there are a dozen of invocations through self.copy_file
across the codebase.
If we don't want to fix copy_file() but would rather ship a copy_file_direct() + copy_file_to_dir(), then we would have to expose both functions though the class and adjust a dozen invocations. it's not hard to do but that's a fair amount of boilerplate and code to refactor.
I would like to have confirmation we want to do that before engaging into that.
realistically, distutils
is removed from the interpreter as of python 3.12, it's only available inside of setuptools where it's vendored. I initially made the PR on setuptools where it works and pass all tests, it provides a welcome performance improvement to package operations.
Is there really a risk in fixing copy_file() to break other packages? packages that are not already broken because distutils is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hyrum's law usually hits hard on distutils/setuptools...
If you want to define 2 other new functions (one for files or one for directories) it should be fine. But my opinion is that we should keep the old behaviour (maybe behind a deprecated warning) for a couple of years before dropping it (if ever dropping would be an option... we have to remember that some packages in the ecosystem, haven't even moved to setuptools yet...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, that sounds messy. Let's consider some other options.
A couple of alternative ways we might be able to improve performance without changing the API:
|
I wonder if it makes sense to shortcircuit the condition with a (yet another) new keyword only parameter (e.g. def copy_file( # noqa: C901
src,
dst,
preserve_mode=1,
preserve_times=1,
update=0,
link=None,
verbose=1,
dry_run=0,
+ *,
+ isdir=None,
):
...
- if os.path.isdir(dst):
+ if isdir is True or (isdir is None and os.path.isdir(dst)):
dir = dst
dst = os.path.join(dst, os.path.basename(src))
else:
dir = os.path.dirname(dst) The |
I am trying to think of what to do without getting into horrible hacks 🤔 🤔 🤔 |
That's a good point, although since My instinct - unless there's a clean way to implement this performance improvement, we'll probably want to defer this improvement until distutils is no longer an independent code base (importable as |
( I originally raised the PR to setuptools pypa/setuptools#4406 and they redirected me to this repo. )
Hello,
I was doing a bit of profiling on my CI builds. The step to build package (python setup.py bdist_wheel) is taking more time than I would like.
Going in with a commercial profiler, I am finding that nearly 10% of the time is taken in isdir() calls, mostly in the copy_file() function.
Which is quite surprising IMO because the copy file function should not be given directories anyway :D
Can I offer to remove the isdir() calls to get a speedup?
I've found 2 calls that were passing directories instead of a filepath and corrected them.
By the way, the whole _copy_file_contents() could be replaced by shutil.copyfile() for another little boost. The code goes back to year 2000 or before, which predates optimized shutil functions.
Screenshot of the profiler run. The whole run is about 20+ seconds.
All tests are passing in the setuptools repos, including all integration tests.
They don't seem to be able to run in this repo, failing in some macos imports
Regards.