
Conversation

@ksobrenat32
Contributor

Refactor how Python package dependencies are staged and cached in the Apache Beam portability runner. The main change is to track and use the exact set of downloaded package files, rather than globbing all files in the cache directory. Also updated the tests accordingly.

… on the required files

This was achieved by saving a list of dependencies and downloading only those files
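For readers skimming the PR, here is a minimal sketch of the approach described above. It is illustrative only, not the exact Beam code: the function signature, the pip arguments, and the reliance on pip's "Saved ..." log lines are assumptions.

```python
# Illustrative sketch only. Run `pip download` into the requirements cache
# and collect just the files pip reports as newly saved, instead of
# globbing everything already sitting in the cache directory.
import subprocess


def populate_requirements_cache(requirements_file, cache_dir):
  cmd_args = [
      'pip', 'download', '--dest', cache_dir, '-r', requirements_file,
      '--exists-action', 'i', '--no-deps'
  ]
  output = subprocess.check_output(cmd_args, stderr=subprocess.STDOUT)
  downloaded_packages = []
  for line in output.decode('utf-8').split('\n'):
    # Assumption: pip logs a "Saved <path>" line for each newly downloaded file.
    if line.startswith('Saved '):
      downloaded_packages.append(line[len('Saved '):].strip())
  return downloaded_packages
```

As the review thread below discusses, files that are already present in the cache are not reported as downloads by pip, so under this approach they would not be staged.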
@gemini-code-assist
Contributor

Summary of Changes

Hello @ksobrenat32, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the Python package dependency staging mechanism within the Apache Beam portability runner. The core change involves transitioning from a generic glob-based discovery of cached packages to an explicit tracking of precisely downloaded package files. This enhancement aims to improve the accuracy and reliability of dependency management by ensuring that only the intended and verified packages are staged for execution, thereby preventing potential issues arising from ambiguous file selection.

Highlights

  • Explicit Package Tracking: The system now explicitly tracks and utilizes the exact file paths of downloaded Python packages for staging, moving away from a broad glob-based approach that could inadvertently include unintended files from the cache directory.
  • populate_requirements_cache Return Value: The populate_requirements_cache function in stager.py has been updated to return a list of strings, representing the absolute paths of the successfully downloaded packages, which are then used for staging.
  • Test Updates: Corresponding test cases in stager_test.py have been modified to reflect the new return type and behavior of the populate_requirements_cache function, ensuring accurate testing of the updated staging logic.

@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@ksobrenat32
Contributor Author

assign set of reviewers

@github-actions
Contributor

Assigning reviewers:

R: @liferoad for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

The review comments below are attached to this change in stager.py:

-   processes.check_output(cmd_args, stderr=processes.STDOUT)
+   output = processes.check_output(cmd_args, stderr=subprocess.STDOUT)
+   downloaded_packages = []
+   for line in output.decode('utf-8').split('\n'):
Contributor


what happens to packages that were previously in the requirements cache, hence not downloaded?

Contributor Author


Packages already in the requirements cache will not appear in the output, because pip will not try to download them again.

Contributor


that means they won't be staged, right?

Contributor Author


Exactly, just the packages needed

Contributor


My point is that packages which are already in the local cache dir, but still required by requirements.txt, won't get staged, and that is not as intended. Does this match your understanding?

Contributor Author


My initial understanding, as reflected in the current implementation, was that we only needed to stage newly downloaded or updated packages. I had assumed that since packages in the local cache are already available, staging them would be redundant.

My logic is as follows:

  1. Identify all packages required by the requirements.txt file.
  2. Identify all other required PyPI packages.
  3. Download any of these packages that are not already in the cache.
  4. Stage only the newly downloaded packages.

You've correctly pointed out that this means cached packages aren't staged. To make sure I get the fix right, could you help me understand the downstream process and why it's necessary to stage the cached packages as well?

Contributor

@tvalentyn Oct 8, 2025


When a user supplies a --requirements_file option, Beam stages packages to allow a runner to execute a pipeline even if the runner environment doesn't have access to PyPI to download the packages on the fly.

To stage packages, we download the packages into the local requirements_cache folder, and then stage the entire folder. The disadvantage is that over time the requirements_cache folder might accumulate packages that are no longer in requirements.txt. That can cause additional uploads of files that are not necessary. Possible solutions:

  • Clean the requirements cache folder periodically: rm -rf /tmp/dataflow-requirements-cache
  • Use a custom container image (--sdk_container_image) instead of the --requirements_file, and install the packages in your image. This is a recommended option to have self-contained reproducible pipeline environments.
  • Don't stage the requirements cache with --requirements_cache=skip (the pipeline will then depend on PyPI at runtime); see the sketch below.
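As a small illustrative example of the last option (the flag names come from the comment above; the script contents and file names are assumptions):

```python
# Illustrative only: skip staging the requirements cache; the worker
# environment must then be able to reach PyPI at runtime.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--requirements_file=requirements.txt',
    '--requirements_cache=skip',
])

with beam.Pipeline(options=options) as p:
  _ = p | beam.Create(['placeholder'])
```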

Contributor

@tvalentyn Oct 8, 2025


Re how to improve the logic, I looked at the discussion we had on this topic:

https://lists.apache.org/thread/pqc2yl15kjdpxfp3pnocrrhkk3m6gsmh

and there are a couple of ideas:

  1. Parse the pip log output to infer which dependencies were downloaded, but also note down files that were skipped because they already existed in the cache (likely brittle, since it depends on pip keeping a particular output format)
  2. Download twice (https://lists.apache.org/thread/v35bgj67hqrwl4ldymo8bqkybgt3z096), something like the following (haven't tested):
pip download --dest /tmp/dataflow_requirements_cache -r requirements.txt --exists-action i --no-deps

pip download --dest /tmp/temporary_folder_that_will_be_cleaned_up -r requirements.txt --find-links /tmp/dataflow_requirements_cache

then, stage deps from temporary_folder_that_will_be_cleaned_up.
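A rough Python sketch of idea 2, untested as noted above (the function name, staging destination, and file handling are placeholders, not Beam's actual code):

```python
# Untested sketch of the "download twice" idea; not Beam's implementation.
import os
import shutil
import subprocess
import tempfile


def packages_to_stage(requirements_file, cache_dir, staging_dir):
  # Pass 1: top up the long-lived cache, keeping files already present.
  subprocess.check_call([
      'pip', 'download', '--dest', cache_dir, '-r', requirements_file,
      '--exists-action', 'i', '--no-deps'
  ])
  # Pass 2: resolve the same requirements into a throwaway folder, served
  # from the local cache via --find-links. Its contents are exactly the
  # files the pipeline currently needs, including ones cached earlier.
  staged = []
  with tempfile.TemporaryDirectory() as tmp_dir:
    subprocess.check_call([
        'pip', 'download', '--dest', tmp_dir, '-r', requirements_file,
        '--find-links', cache_dir
    ])
    for name in os.listdir(tmp_dir):
      staged.append(shutil.copy(os.path.join(tmp_dir, name), staging_dir))
  return staged
```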

@tarun-google
Contributor

@ksobrenat32 this PR has been stale for a week. Please resolve the comments.

@github-actions
Contributor

Reminder, please take a look at this pr: @liferoad

@tvalentyn
Contributor

waiting on author

@ksobrenat32
Contributor Author

I’m working on this in my free time, but I’ve had a few rough weeks lately. I’ll get back to it when I have more free time

I hope this isn’t blocking anyone

@github-actions
Contributor

Reminder, please take a look at this pr: @liferoad

@liferoad
Contributor

waiting on author

@tvalentyn
Contributor

Hmm, somehow our PR bot seems to be ignoring the waiting on author command.

@tvalentyn
Contributor

tvalentyn commented Oct 29, 2025

Ah, looks like the last comment from the author passed the ball back into the reviewer's court again. OK, now the bot should be quiet.

> I hope this isn’t blocking anyone

Yes, this issue is not urgent as there are workarounds mentioned in #36249 (comment)

Successfully merging this pull request may close these issues.

Python SDK should stage only up-to-date versions of pipeline dependencies defined by requirements file.
