`experimental_index_url` fetching of large dep graphs is slow #2014
I have a branch and it seems that using the …

For 900 packages … maybe the only action item from this issue is to update documentation with the recommendations above.

@aignas I finally migrated to Bazel 7.2 and am giving this a shot. So far, so good. I have these two issues/PRs that are blocking the setup.

Closing this issue as docs in general can always be improved and this will be part of #260.
For context see: https://bazelbuild.slack.com/archives/CA306CEV6/p1719321260320259?thread_ts=1715871259.645269&cid=CA306CEV6
Some of our users have a lot of unique packages (e.g. 700+) and the fetching of PyPI index metadata takes a lot of time. This has to be done at some point, though, and if users are using `MODULE.bazel.lock` it is done once when updating the lock file, and then users don't have to do it again. The problem comes when users need to update the lock file or the `rules_python` version is updated: they have to refetch the PyPI index data for all of the 700+ packages, because we are not caching the results of the PyPI index data, because that data may change at any time, and because we want to call the index once per package when evaluating the extension, thus reusing the results across the hubs.

The reuse across hubs vs caching between runs is a trade-off we have to make here, and I'd be willing to bet that we should instead isolate the calls and do one call to PyPI per package per hub, where we use the requirement line and the URL of the index as a canonical id so that we can store the result of `module_ctx.download` in the Bazel downloader cache. This means that when users update `rules_python` or their lock files they do not need to refetch the PyPI index data, and it would be much faster. Also, people who are not yet using the lock file but have persistent workers would benefit from this because the extension evaluation would speed up drastically. What is more, the `rules_python` code handling all of this could be simplified.
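Below is a minimal sketch of that idea, not the actual rules_python implementation: the helper name, parameter names, and the canonical-id format are assumptions, while `module_ctx.download` and its `canonical_id` parameter are existing Bazel APIs.

```starlark
# Hypothetical helper illustrating the proposed caching scheme; the names and
# the canonical-id format are illustrative, not rules_python's real code.
def _download_simpleapi_metadata(module_ctx, index_url, requirement_line, output):
    # Key the download on the requirement line and the index URL so that the
    # Bazel downloader cache can reuse the result across lock file updates
    # and rules_python upgrades instead of refetching every package.
    canonical_id = "{}::{}".format(index_url, requirement_line)
    return module_ctx.download(
        url = [index_url],
        output = output,
        canonical_id = canonical_id,
        # Tolerate index outages; the caller decides how to handle failures.
        allow_fail = True,
    )
```

With a canonical id set, the download is only repeated when the requirement line or the index URL changes; the trade-off is that the result is cached per hub rather than shared across hubs within a single evaluation.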
Alternatives that I thought about and discarded:

* Moving the metadata fetching into `whl_library`. We still need to get the `filenames` from somewhere, and since that goes into the lock file, it seems that this could only work with lock file formats similar to `poetry.lock` or `pdm.lock`, which have the whl filenames in the lock file but do not specify the URL. Since the URLs are useful in the lock file, it is better to just have them retrieved as part of pip extension evaluation.
So the plan of attack:

* Update the `pip.parse` extension to collect all of the requirements from each invocation of `pip.parse`.
* Update the `simpleapi_download` function to create a canonical id using the `requirement_line` (or a list of them) and the URL that we are calling.
* Add to the docs a line stating that users would have to do `bazel clean --expunge --async` if they want to clear the cache of the PyPI metadata.

cc: @elviswianda
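For reference, a setup that would benefit from this looks roughly like the following `MODULE.bazel` snippet; the version, hub names, requirements files, and index URL here are illustrative, not taken from this issue.

```starlark
bazel_dep(name = "rules_python", version = "0.33.0")

pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")

# Two hubs sharing the same index. With the metadata downloads keyed in the
# Bazel downloader cache, updating one lock file or bumping rules_python
# would not force a refetch of every package's index metadata.
pip.parse(
    hub_name = "pypi",
    python_version = "3.11",
    requirements_lock = "//:requirements_lock.txt",
    experimental_index_url = "https://pypi.org/simple",
)
pip.parse(
    hub_name = "pypi_tools",
    python_version = "3.11",
    requirements_lock = "//tools:requirements_lock.txt",
    experimental_index_url = "https://pypi.org/simple",
)
use_repo(pip, "pypi", "pypi_tools")
```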