feat: allow populating binary's venv site-packages with symlinks #2617

Open · wants to merge 8 commits into base: main

Conversation

rickeylev (Collaborator)

This implements functionality to allow libraries to populate the site-packages directory
of downstream binaries. The basic implementation is:

  • Libraries provide tuples of (runfiles path, site-packages path) in the
    PyInfo.site_packages_symlinks field.
  • Binaries create symlinks (using declare_symlink) in their site-packages directory
    pointing to the runfiles paths libraries provide.

The design was chosen because of the following properties:

  • The site-packages directory is relocatable
  • Populating site-packages is cheap: O(number of third-party dependencies)
  • Dependencies are only created once in the runfiles, no matter
    how many binaries use them. This minimizes disk usage,
    file counts, inodes, etc.

The site_packages_symlinks field is a depset with topological ordering. Using topological
ordering allows dependencies closer to the binary to have precedence, which gives some
basic control over what entries are used.

Additionally, the runfiles path to link to can be None/empty, in which case the
directory in site-packages won't be created. This allows binaries to prevent the
creation of directories that might otherwise conflict.
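
To make the mechanism concrete, here is a rough sketch of both halves. This is illustrative only, not the PR's exact code; relative_path_to is a hypothetical helper that computes the ../..-style path from the symlink's location to the runfiles path, keeping the venv relocatable:

# Sketch: py_library contributes (runfiles path, site-packages path) entries.
site_packages_symlinks = depset(
    [("_main/src/site-packages/mylib", "mylib")],
    transitive = [d[PyInfo].site_packages_symlinks for d in ctx.attr.deps],
    order = "topological",
)

# Sketch: py_binary materializes one relative symlink per surviving entry.
seen = {}
for runfiles_path, sp_path in site_packages_symlinks.to_list():
    if sp_path in seen:
        continue  # an entry closer to the binary already claimed this path
    seen[sp_path] = True
    if not runfiles_path:
        continue  # None/empty suppresses creating this directory
    link = ctx.actions.declare_symlink(
        "_{}.venv/site-packages/{}".format(ctx.label.name, sp_path),
    )
    # relative_path_to is a hypothetical helper (see note above).
    ctx.actions.symlink(output = link, target_path = relative_path_to(link, runfiles_path))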

This functionality is exposed via the new py_library.site_packages_root attribute.
When set, it acts as a path prefix for the files in srcs, identifying which
files/directories should be put into site_packages_symlinks. The attribute does basic
detection of implicit namespace directories, which allows multiple distributions to
"install" into the same site-packages directory.
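
For example, a hypothetical BUILD file where two distributions map into the same site-packages tree through the implicit namespace directory ns (which has no __init__.py of its own, so it is detected as a namespace):

py_library(
    name = "pkg_a",
    srcs = ["site-packages/ns/a/__init__.py"],
    site_packages_root = "site-packages/",
)
py_library(
    name = "pkg_b",
    srcs = ["site-packages/ns/b/__init__.py"],
    site_packages_root = "site-packages/",
)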

Though this functionality is primarily useful for dependencies from PyPI (e.g. via
pip.parse), it is not yet activated for those targets, for two main reasons:

  1. The wheel extraction code creates pkgutil-style __init__.py shims during the repo phase.
    The build phase can't distinguish these artificial rules_python-generated shims from
    actual __init__.py files, which breaks the implicit namespace detection logic.
  2. A flag guard is needed before changing the behavior. Even though how third-party
    libraries are added to sys.path is an implementation detail, the behavior has been
    there for many years, so an escape hatch should be added.

Work towards #2156

@rickeylev rickeylev requested a review from aignas as a code owner February 17, 2025 05:43

@aignas aignas left a comment


Nice! A few thoughts:

I would love the __init__.py files to be created at build time, where we could avoid generating them when using the venv approach.

I want to get the pip whl extraction as close to http_archive as possible, so that we can drop the Python interpreter requirement.


Off-topic, but the above is related to what I want to do:

  • implement env marker evaluation in Starlark. A PR will come in the next 2 weeks.
  • a target for each provided extra in the hub and spoke repos.
  • proper full Python version support for metadata evaluation.
  • separate the repo rules that just extract wheels from the ones that build/extract with pip.

One of the things that I thought could be a problem was the __init__.py file generation that we currently have, so I would love to see that move out of the repo context in one form or another.

:::{attention}
Setting both this and the {attr}`site_packages_root` attribute may result in
undefined behavior. Both will result in the code being importable, but from
aignas (Collaborator)

Should we error out here?

rickeylev (Collaborator, Author)

Hm. Yeah. I can't imagine why one would want it on sys.path twice.

Even in the case where there are conflicts creating the symlinks, the later sys.path entry from the imports attr wouldn't help -- the site-packages entry comes first on sys.path. I suppose it would help for namespace packages, but if they're namespace packages, then the files probably wouldn't be conflicting?

Yeah, I'm having a hard time seeing the value in both being set. I'll change it to fail.

rickeylev (Collaborator, Author)

Done


@rickeylev rickeylev left a comment


Thanks!

A design goal I wasn't quite able to accomplish was not requiring raw relative symlinks. I really wish something like (str sp_dir, depset[File]) could be passed up, and then we could call something like ctx.actions.symlink_relative_tree(relative_to=sp_dir, files). The main point being: this is a more accurate file-by-file "install" of a library into site-packages, which better matches how a regular py install works (where distributions can somewhat arbitrarily overlap into a directory).

At the least, given (sp_dir, files), we'd be able to resolve conflicts more cheaply instead of failing (we can flatten the depsets of just the conflicts, instead of the entire runfiles).

The disadvantage is it requires N operations to materialize the files instead of 1. So, still not quite perfect.

I guess what it needs is something like (sp_dir, runfiles_sp_dir, files). It'd create a raw symlink when possible, then fall back to flattening if there are conflicts/overlaps. This feels complicated/over-engineered, though, so it feels like I'm missing something.


* fail if both imports and site_packages_root set
* doc the topological ordering more
* skip, not fail, for child-path symlinks
* Also treat .pyi files as namespace anchors

@rickeylev rickeylev left a comment


PTAL

I also updated the logic for conflicting sub-paths of a symlink: I changed it to ignore instead of fail. To match how exact matches are handled, the first symlink has precedence, and any sub-links are skipped.
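
Roughly, the skip rule is (an illustrative sketch, not the exact code in the PR):

# If "mylib" was already symlinked, then both another "mylib" entry and a
# "mylib/sub" entry are skipped: the earlier symlink already covers them.
def _keep_entry(sp_path, existing_links):
    for prev in existing_links:
        if sp_path == prev or sp_path.startswith(prev + "/"):
            return False
    return True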



aignas commented Feb 24, 2025

I was thinking about your above statement for some time:

I guess what it needs is something like (sp_dir, runfiles_sp_dir, files). And it'd create a raw symlink when possible, then fall back to flattening if there are conflicts/overlaps. This feels complicated/over-engineered, though. So feels like I'm missing something.

I think that in the long term, having a bulletproof solution would pay off if the modelling is right. Imagine one wanted, at some point, to create a venv to run their py_binary in a managed cloud provider's Python runtime, with the userspace code also included in that single venv. They may have one py_library per Python file in their dependency tree, and then the model of only symlinking dirs, rather than falling back to merging individual files, may not work here at all.

I understand that the current solution may cover 95% of the use cases, if not more, but the remaining 5% might be annoying to fix later.

Could you outline a little more of your thinking about which part feels too complicated/over-engineered?

@rickeylev

I understand that the current solution may cover 95% if not more of the usecases, but the remaining 5% might be annoying to fix later.

Yeah, I agree. It'd be nice if we could reduce "make things show up in the venv" to e.g. a single provider field.

what's complicated/over-engineered about (sp_dir, runfiles_path, files)

I think the part that feels complicated is that py_library has to pass extra info up just in case, and then py_binary has to flatten, merge, and handle duplicates, then declare_{file,directory,symlink}, then symlink. That code isn't particularly complicated, though. I am just generally annoyed there isn't some ctx.symlink_relative_tree() type of API.

Another thing on my mind was the pkgutil-style namespace packages stuff. We can't look inside those files at analysis time, so they'd have to be specially called out somewhere, and then stored somewhere in the provider (another provider field, or another entry in the tuple?). When merging, they'd require special handling (I think?).

if deps are one file per py_library, then symlinking dirs won't work

How so? What file system structure and targets did you have in mind? Below is the structure I'm thinking of, which should work OK:

# //src/BUILD
py_library(
    name = "foo",
    srcs = ["site-packages/mylib/foo.py", "site-packages/mylib/__init__.py"],
    site_packages_root = "site-packages/",
)
py_library(
    name = "bar",
    srcs = ["site-packages/mylib/bar.py", "site-packages/mylib/__init__.py"],
    site_packages_root = "site-packages/",
)

# Resulting runfiles:
bin.runfiles/
  _main/src/site-packages/mylib
    foo.py, bar.py
  _main/bin.venv/site-packages/
    mylib -> .../_main/src/site-packages/mylib

This case still works because the files are part of the same package, i.e. they'd have the same site_packages_root value (and thus the same (sp_dir, runfiles_path) tuple).

In order for it to not work, it has to have the same structure as if two PyPI distributions install into the same directory (different (sp_dir, rf path) tuples).

This does point out a new case (is this the case you had in mind?): a monorepo with many targets that are all part of the same Python package. e.g. //src/foo/bar:baz doesn't have a way to say "I should be importable as foo.bar.baz", because site_packages_root is Bazel-package relative, not repo-relative. The PyInfo.site_packages_symlinks info can represent that as (foo, _main/src/foo), though.

aignas commented Feb 24, 2025

Ah, thanks for the explanation. Having a comment written somewhere outlining this example would be good.

This does point out a new case (is this the case you had in mind?): a monorepo with many targets that are all part of the same Python package. e.g. //src/foo/bar:baz doesn't have a way to say "I should be importable as foo.bar.baz", because site_packages_root is Bazel-package relative, not repo-relative. The PyInfo.site_packages_symlinks info can represent that as (foo, _main/src/foo), though.

I am not sure I fully understand that. Let me think about it a little bit more. Ah, now I see. Yeah, site_packages_symlinks is somewhat non-orthogonal to the imports field, but I think that may be fine, up to a point. I think this brings up an interesting question:

Should we allow multiple site_packages_symlinks values for a dep tree for a given workspace?

I would say no - your venv should always be the same/unambiguous, no matter how big the monorepo is. This way we can do optimizations and reason more easily about how different workspaces are laid out into the venv.

That said you may have a layout of your monorepo as:

/src/a
/src/b
/standalone_tools/foo
/standalone_tools/bar
/standalone_tools/bar/baz

and you may want the venv site-packages layout to be

.../site-packages/a
.../site-packages/b
.../site-packages/foo
.../site-packages/bar
.../site-packages/bar/baz

But I am definitely against the users being able to specify a site-packages layout

.../site-packages/a
.../site-packages/b
.../site-packages/foo
.../site-packages/bar
.../site-packages/bar/baz
.../site-packages/baz

I am not sure why I am against it, though. It feels like there is a footgun somewhere here.

@@ -140,6 +142,31 @@ A depset of import path strings to be added to the `PYTHONPATH` of executable
Python targets. These are accumulated from the transitive `deps`.
The order of the depset is not guaranteed and may be changed in the future. It
is recommended to use `default` order (the default).
""",
"site_packages_symlinks": """
aignas (Collaborator)

We should add a note that this is still unstable and may be removed/changed in between minor version releases of rules_python.

@@ -55,6 +58,47 @@ LIBRARY_ATTRS = union_attrs(
create_srcs_version_attr(values = SRCS_VERSION_ALL_VALUES),
create_srcs_attr(mandatory = False),
{
"site_packages_root": attr.string(
aignas (Collaborator)

Similarly to my comment above.

@rickeylev

I might have come up with a variation of the "all code lives in the main repo but causes conflicting symlinks" situation: a target that generates a parallel directory structure. The case I have in mind is protos, in particular when their virtual imports feature is used. I think you can end up with runfiles looking like this?

$runfilesRoot/
  _main/foo/
    foolib.py
    _virtual_imports/
      foo/protos/
        myproto_pb2.py

And the desired end-result is e.g. import foo; import foo.protos.myproto_pb2.

aignas commented Feb 26, 2025

A few extra thoughts:

  • How does this interact with the .pyc compilation feature we have?
  • I somehow think that always doing file symlinking, instead of directory symlinking, would allow us to construct the venv incrementally? Could we fail the build of a py_library if we detect duplicate symlinks?

I probably need a better mental image of how this incremental building transfers to the Bazel sandbox layouts... It would be great if each py_library could build a portion of site-packages; however, it seems that is not possible.

The runfiles tree is built only in py_binary and py_library, but I don't know how to do that. It seems that bazelbuild/bazel#23728 is needed for our perfect solution here.

Considering your statement before:

I think the part that feels complicated is py_library has to pass extra info up just in case, and then py_binary has to flatten, merge and handle duplicates, then declare_{file,directory,symlink}, then symlink. That code isn't particularly complicated, though. I am just generally annoyed there isn't some ctx.symlink_relative_tree() type of API.

I think in some sense the flatten, merge, and handling of duplicates needs to happen at each node in the graph. Maybe then it could work?

Another complication could be how package overrides work, i.e. if I have pydantic v1 somewhere in the tree and then the terminal py_binary adds pydantic v2, there are a few types of behaviour you may want:

  • Fail, because that means that one has colliding deps.
  • Override (replace pydantic) and continue.
  • What do we do about other common files?

However, that interacts interestingly with namespace packages:

  • Merge, because the .dist-info folders that the py_library values provide are different.
  • Fail if we have the same .dist-info folders provided by two packages (similar to above)

The virtual packages for py_proto_library that you brought up sound like imports should be used for the py_venv_root flag, and when we are in the symlink venv mode, there should be 1 or 0 imports values allowed so that we have a deterministic place to put the symlinks.

@rickeylev

getting the modelling right

Hm. So stepping back, what (sp_dir=x/y, runfiles_path=..., files=F) is trying to say is something like: I have files to contribute to the package x.y; make them available for import.

So the naive impl is simple: flatten everything, and declare_file()/symlink() it into the binary's venv directory. Simple, but expensive.

it'd be great if each py_library could build a portion of site-packages

Yeah, that would be perfect, but the main problem is a py_library doesn't know where site-packages is (that's decided by the final consumer).

The closet I can think of to that is to delegate file creation back to the library via a callback. e.g.

# File: py_binary.bzl
site_packages_dir = "_{}.venv/site-packages".format(ctx.label.name)
# map of venv-relative path -> file, used to detect and skip duplicates
site_packages_map = {}
for dep in ctx.attr.deps:
    entry = dep[PyInfo].site_packages
    entry.populate_site_packages(ctx, site_packages_dir, site_packages_map, entry)

# File: py_library.bzl
def populate_site_packages(ctx, sp_dir, sp_map, entry):
    for file in entry.files.to_list():
        relpath = file.short_path.removeprefix(entry.runfiles_site_packages_prefix)
        if relpath in sp_map:
            continue  # first entry wins
        sp_file = ctx.actions.declare_file(sp_dir + "/" + relpath)
        ctx.actions.symlink(output = sp_file, target_file = file)
        sp_map[relpath] = sp_file

return PyInfo(
    site_packages = struct(populate_site_packages = populate_site_packages, ...),
)
  • The above basically equates to flattening the runfiles.
  • It's O(total number of files) instead of O(number of dirs).
  • I think this makes optimizing (symlinking to a directory) and conflict handling (at the binary level) problematic (maybe mutually exclusive)? Once declare_XXX is called, the output has to be generated, or the file not included.

Yeah...not sure I like that in comparison. It seems better to put more logic in the caller (py_binary).

Another idea: an aspect? a py_binary-level aspect visits targets, collects...something...and then...py_binary ends up having to do the same conflict/merge logic.

It really seems like falling back to flatten and merge is the best option. I just don't see a way to nicely avoid it.

What about pyc files?

They should work... assuming the directory is symlinked. This does point out a complication of merging conflicts: it's not just the py files we need to merge; we need to merge all the files that would have been accessible if the directory was symlinked.

Another complication could be how package overrides work. i.e. if I have pydantic v1 somewhere in the tree and then the terminal py_binary adds pydantic v2 there are few types of behaviour you may want

This is part of the reason I opted for having the topological ordering -- you can put the version you want earlier in the dependency ordering.

Good point about how this might affect conflict merging, though. It sounds like we need to add e.g. "version" to the (sp_dir, rf_path, files) tuple? And then, when merging, the first version wins, and files from other versions are ignored. Or change "version" to "dist-info" -- the case that came to mind is how e.g. pytorch has multiple distributions at the same version, but for different GPUs.
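
A minimal sketch of that idea, with hypothetical field and helper names:

def _merge_same_dist(a, b):
    # Hypothetical: combine file depsets for the same distribution.
    return struct(
        sp_dir = a.sp_dir,
        dist_info = a.dist_info,
        files = depset(transitive = [a.files, b.files]),
    )

def _resolve_versions(entries):
    winners = {}  # sp_dir -> first entry seen
    for entry in entries:  # from a topologically ordered depset
        prev = winners.get(entry.sp_dir)
        if prev == None:
            winners[entry.sp_dir] = entry
        elif prev.dist_info == entry.dist_info:
            # Same distribution contributing more files: merging is fine.
            winners[entry.sp_dir] = _merge_same_dist(prev, entry)
        # else: a different version/variant (pydantic v1 vs v2, or two
        # pytorch builds for different GPUs): the first winner keeps the dir.
    return winners.values()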

imports should be used for the py_venv_root flag

Yeah, I'm starting to think the same. Having a global flag to switch between "add a bunch of things to sys.path" and "add a bunch of things to a generated venv" sounds like a nicer way to switch than "add an attribute to targets".

merge at every node of the graph

Hm. Maybe. Having py_library merge doesn't help much (eventually py_binary has N py_libraries to merge), but this could help with the combination of using attr.imports and avoiding having to flatten the full runfiles.

Basically: let's say we use attr.imports to determine the prefix for the site-packages relative paths. The catch is a library may not set that itself, but instead rely on another target setting it. This means that, in py_binary, we have to flatten runfiles, then build up the venv paths based on runfiles and imports. I think we can avoid this if py_library does this instead? And then py_binary only has to fully flatten if there are conflicts (which should occur less often with libraries doing this).
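
A sketch of that division of labor (hypothetical names; the key point is that py_library resolves venv-relative paths eagerly, using its own imports prefix):

def _library_venv_entries(import_prefix, files):
    entries = []
    for f in files:
        # e.g. short_path "src/site-packages/mylib/foo.py" with prefix
        # "src/site-packages" becomes "mylib/foo.py" inside the venv.
        entries.append(struct(
            venv_path = f.short_path.removeprefix(import_prefix + "/"),
            file = f,
        ))
    return entries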
