-
-
Notifications
You must be signed in to change notification settings - Fork 570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allow populating binary's venv site-packages with symlinks #2617
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! A few thoughts:
I would love the init files to be created at build time where we could avoid generating them if we are using venv stuff.
I want to get the pip whl extracting as close to http archive as possible so that we can drop the python interpreter requirement.
Off topic but the above is related to what I want to do:
- implement env marker eval in starlark. A PR will come in the next 2 weeks.
- target for each provided extra in the hub and spoke repo.
- proper full python version support for metadata eval.
- separate the repo rules that just extract wheels and the ones that build/extract with pip.
One of the things that I needed/thought that could be a problem was the init file generation that we currently have, so I would love to see that go in one or another form from the repo context.
python/private/attributes.bzl
Outdated
:::{attention} | ||
Setting both this and the {attr}`site_packages_root` attribute may result in | ||
undefined behavior. Both will result in the code being importable, but from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we error out here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Yeah. I can't imagine why one would want it on sys.path twice.
Even in the case where there are conflicts creating the symlinks, the later entry on sys.path from the imports attr wouldn't help -- sys.path is coming first. I suppose it would help for namespace packages but, if they're namespace packages, then the files probably wouldn't be conflicting?
Yeah, I'm having a hard time seeing the value in both being set. I'll change it to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
A design goal I wasn't able to quite accomplish was not requiring raw relative symlinks. I really wish something like (str sp_dir, depset[File])
could be passed up, and then we call something like ctx.actions.symlink_relative_tree(relative_to=sp_dir, files)
. The main point being, this is a more accurate file-by-file "install" of a library into site-packages that better matches how a regular py install works (where distributions can somewhat arbitrarily overlap into a directory).
At the least, given (sp_dir, files)
, we'd be able to more cheaply resolve conflicts instead of failing (we can flatten the depsets of just conflicts, instead of the entire runfiles).
The disadvantage is it requires N operations to materialize the files instead of 1. So, still not quite perfect.
I guess what it needs is something like (sp_dir, runfiles_sp_dir, files)
. And it'd create a raw symlink when possible, then fall back to flattening if there are conflicts/overlaps. This feels complicated/over-engineered, though. So feels like I'm missing something.
python/private/attributes.bzl
Outdated
:::{attention} | ||
Setting both this and the {attr}`site_packages_root` attribute may result in | ||
undefined behavior. Both will result in the code being importable, but from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Yeah. I can't imagine why one would want it on sys.path twice.
Even in the case where there are conflicts creating the symlinks, the later entry on sys.path from the imports attr wouldn't help -- sys.path is coming first. I suppose it would help for namespace packages but, if they're namespace packages, then the files probably wouldn't be conflicting?
Yeah, I'm having a hard time seeing the value in both being set. I'll change it to fail.
* fail if both imports and site_packages_root set * doc the topological ordering more * skip, not fail, for child-path symlinks * Also treat .pyi files as namespace anchors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
I also update the logic for conflicting sub-paths of a symlink: I changed it to ignore instead of fail. To match how exact-matches are handled, the first symlink has precedence, and any sub-links are skipped.
python/private/attributes.bzl
Outdated
:::{attention} | ||
Setting both this and the {attr}`site_packages_root` attribute may result in | ||
undefined behavior. Both will result in the code being importable, but from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
I was thinking about your above statement for some time:
I think that in the long term having a bulletproof solution would pay off if the modelling is right. Imagine one wanted at some point to create a venv to run their py_binary in a managed cloud providers python runtime and wanted to have a single venv with the userspace code also included. They may have one py_library per python file in their dependency tree and then the model of symlinking the dirs only instead of falling back to merging individual files may completely not work here. I understand that the current solution may cover 95% if not more of the usecases, but the remaining 5% might be annoying to fix later. Could you outline a little bit more of your thinking about which part feels too complicated/overengineered? |
Yeah, I agree. It'd be nice if we could reduce "make things show up in the venv" to e.g. a single provider field.
I think the part that feels complicated is py_library has to pass extra info up just in case, and then py_binary has to flatten, merge and handle duplicates, then Another thing on my mind was the pkgutil-style namespace packages stuff. We can't look inside those files at analysis time, so they'd have to be specially called out somewhere, and then stored somewhere in the provider (another provider field, or another entry in the tuple?). When merging, they'd require special handling (I think?).
How so? What file system structure and targets did you have in mind? Below is the structure I'm thinking of, which should work OK:
This case still works because the files are part of the same package, i.e they'd have the same site_packages_root value (and thus In order for it to not work, then it has to have the same structure as if two pypi distributions install into the same directory (different (sp_dir, rf path) tuples). This does point out a new case (is this the case you had in mind?): a monorepo with many targets that are all part of the same python package. e.g. |
Ah thanks for the explanation, having a comment written somewhere outlining this example would be good.
I am not sure I fully understand that. Let me think about it a little bit more. Ah now I see. Yeah, the
I would say that no - your That said you may have a layout of your monorepo as:
and you may want the venv
But I am definitely against the users being able to specify a
I am not sure why I am against it though. It feels like there is a footgun somewhere here. |
@@ -140,6 +142,31 @@ A depset of import path strings to be added to the `PYTHONPATH` of executable | |||
Python targets. These are accumulated from the transitive `deps`. | |||
The order of the depset is not guaranteed and may be changed in the future. It | |||
is recommended to use `default` order (the default). | |||
""", | |||
"site_packages_symlinks": """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a note that this is still unstable and may be removed/changed in between minor version releases of rules_python
.
python/private/py_library.bzl
Outdated
@@ -55,6 +58,47 @@ LIBRARY_ATTRS = union_attrs( | |||
create_srcs_version_attr(values = SRCS_VERSION_ALL_VALUES), | |||
create_srcs_attr(mandatory = False), | |||
{ | |||
"site_packages_root": attr.string( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similarly to my comment above.
I might have come up with a variation of the "all code lives in main repo but causes conflicting symlinks" situation: if a target generates a parallel directory structure. The case I have in mind are protos, in particular, when their virtual imports feature is used. I think you can end up with a runfiles looking like this?
And the desired end-result is e.g. |
A few extra thoughts:
I probably need a better mental image of how this incremental building transfers to the bazel sandbox layouts... It would be great if each The Considering your statement before:
I think in some sense the flatten, merge and handling of duplicates needs to happen at each node in the graph. Maybe then it could work? Another complication could be how package overrides work. i.e. if I have
However, that goes very interestingly alongside the namespace packages
The virtual packages for |
Hm. So stepping back, what So the naive impl is simple: flatten everything, and declare_file()/symlink() it into the binary's venv directory. Simple, but expensive.
Yeah, that would be perfect, but the main problem is a py_library doesn't know where site-packages is (that's decided by the final consumer). The closet I can think of to that is to delegate file creation back to the library via a callback. e.g.
Yeah...not sure I like that in comparison. It seems better to put more logic in the caller (py_binary). Another idea: an aspect? a py_binary-level aspect visits targets, collects...something...and then...py_binary ends up having to do the same conflict/merge logic. It really seems like falling back to flatten and merge is the best option. I just don't see a way to nicely avoid it.
They should work...assuming the directory is symlinked. This does point out a complication of merging conflicts: its not just the py files we need to merge; we need to merge all the files that would have been accessible if the directory was symlinked.
This is part of the reason I opted for having the topological ordering -- you can put the version you want earlier in the dependency ordering. Good point about how this might affect conflict merging, though. It sounds like we need to add e.g. "version" to the (sp_dir, rf_path, files) tuple? And then when merging, the first version wins, and files from other versions are ignored. Or change "version" to "dist-info" -- the case that came to mind is how e.g. pytorch has multiple distributions at the same version, but for different gpus.
Yeah, I'm starting to think the same. Having a global flag to switch between "add a bunch of things to sys.path" and "add a bunch of things to a generated venv" sounds like a nicer way to switch than "add an attribute to targets".
Hm. Maybe. Having py_library merge doesn't help much (eventually py_binary has N py_libraries to merge). but this could help with the combination of using Basically: lets say we use |
…o venv.site.packages
…into venv.site.packages
This implements functionality to allow libraries to populate the site-packages directory
of downstream binaries. The basic implementation is:
(runfile path, site packages path)
in thePyInfo.site_packages_symlinks
field.pointing to the runfiles paths libraries provide.
The design was chosen because of the following properties:
O(number 3p dependencies)
)how many how many binaries there that use them. This minimizes disk usage,
file counts, inodes, etc.
The
site_packages_symlinks
field is a depset with topological ordering. Using topologicalordering allows dependencies closer to the binary to have precedence, which gives some
basic control over what entries are used.
Additionally, the runfiles path to link to can be None/empty, in which case, the
directory in site-packages won't be created. This allows binaries to prevent creation
of directories that might e.g. conflict.
This functionality is exposed using the new
py_library.site_packages_root
attribute.When set, it acts as a path prefix for the files in
srcs
to identify what files/directoriesshould be put into
site_packages_symlinks
. This attribute does basic detection ofimplicit namespace directories, which allows multiple distributions to "install" into
the the same site-packages directory.
Though this functionality is primarily useful for dependencies from pypi (e.g. via
pip.parse), it is not yet activated for those targets, for two main reasons:
__init__.py
shims during the repo-phase.The build phase can't distinguish these artifical rules_python generated shims from
actual
__init__.py
files, which breaks the implicit namespace detection logic.added to sys.path is an implementation detail, the behavior has been there for many
years, so an escape hatch should be added.
Work towards #2156