
Conversation

@ksobrenat32
Contributor

Refactor how Python package dependencies are staged and cached in the Apache Beam portability runner. The main change is to track and use the exact set of downloaded package files, rather than globbing all files in the cache directory. Also updated the tests accordingly.

… on the required files

This was achieved by saving a list of dependencies and downloading only those files
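For readers skimming the PR, here is a minimal sketch of the approach described above. It is illustrative only, not the exact Beam code: the function signature, the pip arguments, and the reliance on pip's "Saved ..." log lines are assumptions.

```python
# Illustrative sketch only. Run `pip download` into the requirements cache
# and collect just the files pip reports as newly saved, instead of
# globbing everything already sitting in the cache directory.
import subprocess


def populate_requirements_cache(requirements_file, cache_dir):
  cmd_args = [
      'pip', 'download', '--dest', cache_dir, '-r', requirements_file,
      '--exists-action', 'i', '--no-deps'
  ]
  output = subprocess.check_output(cmd_args, stderr=subprocess.STDOUT)
  downloaded_packages = []
  for line in output.decode('utf-8').split('\n'):
    # Assumption: pip logs a "Saved <path>" line for each newly downloaded file.
    if line.startswith('Saved '):
      downloaded_packages.append(line[len('Saved '):].strip())
  return downloaded_packages
```

As the review thread below discusses, files that are already present in the cache are not reported as downloads by pip, so under this approach they would not be staged.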
@gemini-code-assist
Contributor

Summary of Changes

Hello @ksobrenat32, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the Python package dependency staging mechanism within the Apache Beam portability runner. The core change involves transitioning from a generic glob-based discovery of cached packages to an explicit tracking of precisely downloaded package files. This enhancement aims to improve the accuracy and reliability of dependency management by ensuring that only the intended and verified packages are staged for execution, thereby preventing potential issues arising from ambiguous file selection.

Highlights

  • Explicit Package Tracking: The system now explicitly tracks and utilizes the exact file paths of downloaded Python packages for staging, moving away from a broad glob-based approach that could inadvertently include unintended files from the cache directory.
  • populate_requirements_cache Return Value: The populate_requirements_cache function in stager.py has been updated to return a list of strings, representing the absolute paths of the successfully downloaded packages, which are then used for staging.
  • Test Updates: Corresponding test cases in stager_test.py have been modified to reflect the new return type and behavior of the populate_requirements_cache function, ensuring accurate testing of the updated staging logic.

@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@ksobrenat32
Contributor Author

assign set of reviewers

@github-actions
Contributor

Assigning reviewers:

R: @liferoad for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

The review comments below are attached to this change in stager.py:

-   processes.check_output(cmd_args, stderr=processes.STDOUT)
+   output = processes.check_output(cmd_args, stderr=subprocess.STDOUT)
+   downloaded_packages = []
+   for line in output.decode('utf-8').split('\n'):
Contributor


what happens to packages that were previously in the requirements cache, hence not downloaded?

Contributor Author


Packages already in the requirements cache will not appear in the output, because pip will not try to download them again.

Contributor


that means they won't be staged, right?

Contributor Author


Exactly, just the packages needed

Contributor


My point is that packages which are already in the local cache dir, but still required by requirements.txt, won't get staged, and that is not as intended. Does this match your understanding?

Contributor Author


My initial understanding, as reflected in the current implementation, was that we only needed to stage newly downloaded or updated packages. I had assumed that since packages in the local cache are already available, staging them would be redundant.

My logic is as follows:

  1. Identify all packages required by the requirements.txt file.
  2. Identify all other required PyPI packages.
  3. Download any of these packages that are not already in the cache.
  4. Stage only the newly downloaded packages.

You've correctly pointed out that this means cached packages aren't staged. To make sure I get the fix right, could you help me understand the downstream process and why it's necessary to stage the cached packages as well?

Contributor

@tvalentyn Oct 8, 2025


When a user supplies a --requirements_file option, Beam stages packages to allow a runner to execute a pipeline even if the runner environment doesn't have access to PyPI to download the packages on the fly.

To stage packages, we download the packages into the local requirements_cache folder, and then stage the entire folder. The disadvantage is that over time the requirements_cache folder might accumulate packages that are no longer in requirements.txt. That can cause additional uploads of files that are not necessary. Possible solutions:

  • Clean the requirements cache folder periodically: rm -rf /tmp/dataflow-requirements-cache
  • Use a custom container image (--sdk_container_image) instead of the --requirements_file, and install the packages in your image. This is a recommended option to have self-contained reproducible pipeline environments.
  • Don't stage the requirements cache with --requirements_cache=skip (the pipeline will then depend on PyPI at runtime); see the sketch below.
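As a small illustrative example of the last option (the flag names come from the comment above; the script contents and file names are assumptions):

```python
# Illustrative only: skip staging the requirements cache; the worker
# environment must then be able to reach PyPI at runtime.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--requirements_file=requirements.txt',
    '--requirements_cache=skip',
])

with beam.Pipeline(options=options) as p:
  _ = p | beam.Create(['placeholder'])
```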

Contributor

@tvalentyn Oct 8, 2025


Re how to improve the logic, I looked at the discussion we had on this topic:

https://lists.apache.org/thread/pqc2yl15kjdpxfp3pnocrrhkk3m6gsmh

and there are a couple of ideas:

  1. Parse the pip log output to infer which dependencies were downloaded, but also note down files that were skipped because they already existed in the cache (likely brittle, since it depends on pip keeping a particular output format)
  2. Download twice (https://lists.apache.org/thread/v35bgj67hqrwl4ldymo8bqkybgt3z096), something like the following (haven't tested):
pip download --dest /tmp/dataflow_requirements_cache -r requirements.txt --exists-action i --no-deps

pip download --dest /tmp/temporary_folder_that_will_be_cleaned_up -r requirements.txt --find-links /tmp/dataflow_requirements_cache

then, stage deps from temporary_folder_that_will_be_cleaned_up.
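A rough Python sketch of idea 2, untested as noted above (the function name, staging destination, and file handling are placeholders, not Beam's actual code):

```python
# Untested sketch of the "download twice" idea; not Beam's implementation.
import os
import shutil
import subprocess
import tempfile


def packages_to_stage(requirements_file, cache_dir, staging_dir):
  # Pass 1: top up the long-lived cache, keeping files already present.
  subprocess.check_call([
      'pip', 'download', '--dest', cache_dir, '-r', requirements_file,
      '--exists-action', 'i', '--no-deps'
  ])
  # Pass 2: resolve the same requirements into a throwaway folder, served
  # from the local cache via --find-links. Its contents are exactly the
  # files the pipeline currently needs, including ones cached earlier.
  staged = []
  with tempfile.TemporaryDirectory() as tmp_dir:
    subprocess.check_call([
        'pip', 'download', '--dest', tmp_dir, '-r', requirements_file,
        '--find-links', cache_dir
    ])
    for name in os.listdir(tmp_dir):
      staged.append(shutil.copy(os.path.join(tmp_dir, name), staging_dir))
  return staged
```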

@tarun-google
Contributor

@ksobrenat32 this PR has been stale for a week. Please resolve the comments.

@github-actions
Contributor

Reminder, please take a look at this pr: @liferoad

@tvalentyn
Contributor

waiting on author

@ksobrenat32
Contributor Author

I’m working on this in my free time, but I’ve had a few rough weeks lately. I’ll get back to it when I have more free time

I hope this isn’t blocking anyone

@github-actions
Contributor

Reminder, please take a look at this pr: @liferoad

@liferoad
Contributor

waiting on author

@tvalentyn
Contributor

Hmm, somehow our PR bot seems to be ignoring the waiting on author command.

@tvalentyn
Contributor

tvalentyn commented Oct 29, 2025

Ah, looks like the last comment from the author passed the ball back into the reviewer's court again. OK, now the bot should be quiet.

> I hope this isn’t blocking anyone

Yes, this issue is not urgent as there are workarounds mentioned in #36249 (comment)

Successfully merging this pull request may close these issues.

Python SDK should stage only up-to-date versions of pipeline dependencies defined by requirements file.
