Description
For context see: https://bazelbuild.slack.com/archives/CA306CEV6/p1719321260320259?thread_ts=1715871259.645269&cid=CA306CEV6
Some of our users have a lot of unique packages (e.g. 700+) and fetching the PyPI index metadata for them takes a lot of time. This has to be done at some point, and if users are using `MODULE.bazel.lock` it is done once when the lock file is updated, so they don't have to do it again. The problem comes when users need to update the lock file or the `rules_python` version is updated: they have to refetch the PyPI index data for all of the 700+ packages. We are not caching the results of the PyPI index calls because the data may change at any time and because we want to call the index once per package when evaluating the extension, reusing the results across the hubs.

The reuse across hubs vs caching between runs is a trade-off we have to make here, and I'd be willing to bet that we should instead isolate the calls and do one call to PyPI per package per hub, using the requirement line and the URL of the index as a canonical id so that we can store the result of `module_ctx.download` in the Bazel downloader cache (see the sketch below). This means that when users update `rules_python` or their lock files they would not need to refetch the PyPI index data, and it would be much faster. People who are not yet using the lock file but have persistent workers would also benefit, because the extension evaluation would speed up drastically. What is more, the `rules_python` code handling all of this could be simplified.
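A minimal sketch of the idea, assuming a hypothetical helper inside the extension. The helper name, output path, and canonical id format below are made up for illustration; only `module_ctx.download` and its `canonical_id`/`allow_fail` parameters are existing Bazel API:

```starlark
# Hypothetical sketch: fetch the SimpleAPI page for one package and let the
# Bazel downloader cache key on the requirement line + index URL.
def _fetch_simpleapi_page(module_ctx, index_url, distribution, requirement_line):
    # The canonical id ties the cached download to both the index URL and the
    # exact requirement, so a lock file or rules_python update that does not
    # change the requirement reuses the cached metadata instead of refetching.
    canonical_id = "{} {}".format(index_url, requirement_line)
    output = "metadata/{}.html".format(distribution)
    result = module_ctx.download(
        url = "{}/{}/".format(index_url.rstrip("/"), distribution),
        output = output,
        canonical_id = canonical_id,
        allow_fail = True,
    )
    if not result.success:
        return None
    return module_ctx.read(output)
```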
Alternatives that I thought about and discarded:
- Make it non-eager, i.e. fetch the PyPI data for each package inside `whl_library`. We still need to get the filenames from somewhere, and since they go into the lock file, this could only work with lock file formats similar to `poetry.lock` or `pdm.lock`, which have the whl filenames in the lock file but do not specify the URL. Since the URLs are useful in the lock file, it is better to just have them retrieved as part of `pip` extension evaluation.
- Have a separate repository rule that fetches the PyPI metadata only and then pass a label to a file that has the URL instead of the URL itself. This would make the fetching of PyPI metadata lazy, but it suffers from the same problem: we need the filenames of the distributions that we are working with, and it makes the code more complex just to avoid using `MODULE.bazel.lock`.
So the plan of attack:
- Rework the `pip.parse` extension to collect all of the requirements from each invocation of `pip.parse`.
- Rework the `simpleapi_download` function to create a canonical id using the `requirement_line` (or a list of them) and the URL that we are calling (a rough sketch of what that could look like is below the list). Add a line to the docs stating that users would have to run `bazel clean --expunge --async` if they want to clear the cache of the PyPI metadata.
- Ask users to test to see if that makes a difference.
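A rough sketch of the first two items, under the assumption that the extension can see the requirement lines per distribution; the tag field name and the exact id format are placeholders, not the final design:

```starlark
# Hypothetical shape of the pieces described above; field and helper names
# are placeholders and do not follow the current pip.parse attributes exactly.
def _collect_requirement_lines(module_ctx):
    # Plan item 1: gather requirement lines from every pip.parse invocation
    # across all modules, keyed by a (simplistically derived) distribution name.
    requirements_by_distribution = {}
    for mod in module_ctx.modules:
        for parse in mod.tags.parse:
            for line in parse.requirement_lines:  # placeholder field name
                name = line.partition("==")[0].strip().lower()
                requirements_by_distribution.setdefault(name, []).append(line)
    return requirements_by_distribution

def _canonical_id(index_url, requirement_lines):
    # Plan item 2: a stable id built from the index URL plus the sorted
    # requirement lines, to be passed as canonical_id to module_ctx.download
    # inside simpleapi_download. Sorting keeps the id independent of the order
    # in which pip.parse invocations were processed.
    return index_url + " " + " ".join(sorted(requirement_lines))
```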
cc: @elviswianda