experimental_index_url fetching of large dep graphs is slow #2014

Closed as not planned

Description

@aignas

For context see: https://bazelbuild.slack.com/archives/CA306CEV6/p1719321260320259?thread_ts=1715871259.645269&cid=CA306CEV6

Some of our users have a lot of unique packages (e.g. 700+), and fetching the PyPI index metadata for all of them takes a long time. This fetching has to happen at some point, and users who rely on MODULE.bazel.lock pay the cost once, when updating the lock file. The problem is that whenever they need to update the lock file, or the rules_python version changes, they have to refetch the PyPI index data for all of the 700+ packages. We do not cache the results, both because the index data may change at any time and because we want to call the index once per package when evaluating the extension, reusing the results across the hubs.

Reuse across hubs versus caching between runs is the trade-off we have to make here, and I'd be willing to bet we should instead isolate the calls: do one call to PyPI per package per hub, using the requirement line and the URL of the index as a canonical id, so that we can store the result of the module_ctx.download in the Bazel downloader cache. This means that when users update rules_python or their lock files, they would not need to refetch the PyPI index data, which would be much faster. People who are not yet using the lock file but have persistent workers would also benefit, because the extension evaluation would speed up drastically. What is more, the rules_python code handling all of this could be simplified.
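
A minimal sketch of the idea in Starlark; the helper and its parameters are hypothetical, but canonical_id is a real parameter of module_ctx.download:

```starlark
# Hypothetical helper, not the actual rules_python API. The real piece is
# module_ctx.download's canonical_id parameter: a cached download is reused
# only when the canonical id matches, so deriving the id from the
# requirement line and the index URL keeps the cache entry valid across
# rules_python upgrades and lock file updates.
def _download_simpleapi_page(module_ctx, index_url, requirement_line, output):
    # e.g. index_url = "https://pypi.org/simple/requests/"
    #      requirement_line = "requests==2.32.3"
    canonical_id = "{} {}".format(index_url, requirement_line)
    return module_ctx.download(
        url = [index_url],
        output = output,
        canonical_id = canonical_id,
    )
```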

Alternatives that I thought about and discarded:

  • Make it non-eager: fetch the PyPI data for each package inside whl_library. We would still need to get the filenames from somewhere, and since they go into the lock file, this could only work with lock file formats similar to poetry.lock or pdm.lock, which record the whl filenames but not the URLs. Since the URLs are useful in the lock file, it is better to retrieve them as part of pip extension evaluation.
  • Have a separate repository rule that fetches only the PyPI metadata, and then pass a label to a file that has the URL instead of the URL itself. This would make the fetching of PyPI metadata lazy, but it suffers from the same problem: we need the filenames of the distributions that we are working with, and it makes the code more complex just to avoid using MODULE.bazel.lock.

So the plan of attack:

  • Rework the pip.parse extension to collect all of the requirements from each invocation of pip.parse.
  • Rework the simpleapi_download function to create a canonical id using the requirement line (or a list of them) and the URL that we are calling (see the sketch after this list). Add a line to the docs stating that users have to run bazel clean --expunge --async if they want to clear the cached PyPI metadata.
  • Ask users to test whether this makes a difference.
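
A rough sketch of how the two pieces could fit together; the tag attribute names (hub_name, requirements) and the requirement parsing are illustrative only, not the final implementation:

```starlark
# Hypothetical sketch of the reworked extension flow.
def _pip_impl(module_ctx):
    # Step 1: collect the requirement lines from every pip.parse
    # invocation, grouped per hub, so each package is fetched once per hub.
    requirements = {}
    for mod in module_ctx.modules:
        for parse in mod.tags.parse:
            requirements.setdefault(parse.hub_name, []).extend(parse.requirements)

    # Step 2: one isolated, cache-friendly download per package per hub,
    # reusing the helper sketched above. Splitting on "==" only handles
    # pinned requirements; the real code would parse the line properly.
    for hub_name, lines in requirements.items():
        for line in lines:
            package = line.partition("==")[0]
            _download_simpleapi_page(
                module_ctx,
                index_url = "https://pypi.org/simple/{}/".format(package),
                requirement_line = line,
                output = "{}/{}.html".format(hub_name, package),
            )
```

Because the result would now live in the downloader cache rather than in extension state, the only way to force a refetch is to clear that cache, hence the bazel clean --expunge --async note above.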

cc: @elviswianda
