experimental_index_url fetching of large dep graphs is slow #2014

Closed as not planned

Description

@aignas

For context see: https://bazelbuild.slack.com/archives/CA306CEV6/p1719321260320259?thread_ts=1715871259.645269&cid=CA306CEV6

Some of our users have a lot of unique packages (e.g. 700+), and fetching the PyPI index metadata for all of them takes a long time. This fetching has to happen at some point, and users who rely on MODULE.bazel.lock pay the cost once, when updating the lock file. The problem is that whenever they need to update the lock file, or the rules_python version changes, they have to refetch the PyPI index data for all of the 700+ packages. We do not cache the results, both because the index data may change at any time and because we want to call the index once per package when evaluating the extension, reusing the results across the hubs.

Reuse across hubs versus caching between runs is the trade-off we have to make here, and I'd be willing to bet we should instead isolate the calls: do one call to PyPI per package per hub, using the requirement line and the URL of the index as a canonical id, so that we can store the result of the module_ctx.download in the Bazel downloader cache. This means that when users update rules_python or their lock files, they would not need to refetch the PyPI index data, which would be much faster. People who are not yet using the lock file but have persistent workers would also benefit, because the extension evaluation would speed up drastically. What is more, the rules_python code handling all of this could be simplified.
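
A minimal sketch of the idea in Starlark; the helper and its parameters are hypothetical, but canonical_id is a real parameter of module_ctx.download:

```starlark
# Hypothetical helper, not the actual rules_python API. The real piece is
# module_ctx.download's canonical_id parameter: a cached download is reused
# only when the canonical id matches, so deriving the id from the
# requirement line and the index URL keeps the cache entry valid across
# rules_python upgrades and lock file updates.
def _download_simpleapi_page(module_ctx, index_url, requirement_line, output):
    # e.g. index_url = "https://pypi.org/simple/requests/"
    #      requirement_line = "requests==2.32.3"
    canonical_id = "{} {}".format(index_url, requirement_line)
    return module_ctx.download(
        url = [index_url],
        output = output,
        canonical_id = canonical_id,
    )
```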

Alternatives that I thought about and discarded:

  • Make it non-eager: fetch the PyPI data for each package inside whl_library. We would still need to get the filenames from somewhere, and since they go into the lock file, this could only work with lock file formats similar to poetry.lock or pdm.lock, which record the whl filenames but not the URLs. Since the URLs are useful in the lock file, it is better to retrieve them as part of pip extension evaluation.
  • Have a separate repository rule that fetches only the PyPI metadata, and then pass a label to a file that has the URL instead of the URL itself. This would make the fetching of PyPI metadata lazy, but it suffers from the same problem: we need the filenames of the distributions that we are working with, and it makes the code more complex just to avoid using MODULE.bazel.lock.

So the plan of attack:

  • Rework the pip.parse extension to collect all of the requirements from each invocation of pip.parse.
  • Rework the simpleapi_download function to create a canonical id using the requirement line (or a list of them) and the URL that we are calling (see the sketch after this list). Add a line to the docs stating that users have to run bazel clean --expunge --async if they want to clear the cached PyPI metadata.
  • Ask users to test whether this makes a difference.
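
A rough sketch of how the two pieces could fit together; the tag attribute names (hub_name, requirements) and the requirement parsing are illustrative only, not the final implementation:

```starlark
# Hypothetical sketch of the reworked extension flow.
def _pip_impl(module_ctx):
    # Step 1: collect the requirement lines from every pip.parse
    # invocation, grouped per hub, so each package is fetched once per hub.
    requirements = {}
    for mod in module_ctx.modules:
        for parse in mod.tags.parse:
            requirements.setdefault(parse.hub_name, []).extend(parse.requirements)

    # Step 2: one isolated, cache-friendly download per package per hub,
    # reusing the helper sketched above. Splitting on "==" only handles
    # pinned requirements; the real code would parse the line properly.
    for hub_name, lines in requirements.items():
        for line in lines:
            package = line.partition("==")[0]
            _download_simpleapi_page(
                module_ctx,
                index_url = "https://pypi.org/simple/{}/".format(package),
                requirement_line = line,
                output = "{}/{}.html".format(hub_name, package),
            )
```

Because the result would now live in the downloader cache rather than in extension state, the only way to force a refetch is to clear that cache, hence the bazel clean --expunge --async note above.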

cc: @elviswianda
