Solution for #19968, Python SDK just stages up-to-date versions of the required files #36249
base: master
Conversation
… on the required files. This was achieved by saving a list of dependencies and downloading only those files.
Summary of Changes (Gemini Code Assist): This pull request refactors the Python package dependency staging mechanism within the Apache Beam portability runner. The core change replaces generic glob-based discovery of cached packages with explicit tracking of the exact package files that were downloaded. This improves the accuracy and reliability of dependency management by ensuring that only the intended, verified packages are staged for execution, preventing issues that arise from ambiguous file selection.
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`.
Force-pushed from 723eab5 to 23d06ad.
assign set of reviewers
Assigning reviewers: R: @liferoad for label python. Note: If you would like to opt out of this review, comment … Available commands: … The PR bot will only process comments in the main thread (not review comments).
```diff
- processes.check_output(cmd_args, stderr=processes.STDOUT)
+ output = processes.check_output(cmd_args, stderr=subprocess.STDOUT)
+ downloaded_packages = []
+ for line in output.decode('utf-8').split('\n'):
```
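For context, here is a minimal, untested sketch of the tracking approach this diff suggests. The `Saved <path>` line prefix is an assumption about pip's output format (which can vary across pip versions), and `download_and_track` is a hypothetical name, not the SDK's:

```python
import subprocess

def download_and_track(cmd_args):
    # Run pip (e.g. `pip download -r requirements.txt --dest <cache>`) and
    # capture its combined stdout/stderr.
    output = subprocess.check_output(cmd_args, stderr=subprocess.STDOUT)
    downloaded_packages = []
    for line in output.decode('utf-8').split('\n'):
        # Assumption: pip prints a "Saved <path>" line for each file it
        # writes to --dest; files already present in the cache produce no
        # such line, which is exactly what the thread below discusses.
        if line.startswith('Saved '):
            downloaded_packages.append(line[len('Saved '):].strip())
    return downloaded_packages
```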
what happens to packages that were previously in the requirements cache, hence not downloaded?
Packages already in the requirements cache will not appear in the output because pip will not try to download them again.
that means they won't be staged, right?
Exactly, just the packages needed
My point is: the packages already in the local cache dir, but still required by requirements.txt, won't get staged, and this is not as intended. Does this match your understanding?
My initial understanding, as reflected in the current implementation, was that we only needed to stage newly downloaded or updated packages. I had assumed that since packages in the local cache are already available, staging them would be redundant.
My logic is as follows:
- Identify all packages required by the requirements.txt file.
- Identify all other required PyPI packages.
- Download any of these packages that are not already in the cache.
- Stage only the newly downloaded packages.
You've correctly pointed out that this means cached packages aren't staged. To make sure I get the fix right, could you help me understand the downstream process and why it's necessary to stage the cached packages as well?
When a user supplies a `--requirements_file` option, Beam stages packages to allow a runner to execute a pipeline even if the runner environment doesn't have access to PyPI to download the packages on the fly.

To stage packages, we download the packages into the local requirements_cache folder and then stage the entire folder. The disadvantage is that over time the requirements_cache folder might accumulate packages that are no longer in requirements.txt, which can cause additional uploads of files that are not necessary. Possible solutions:

- Clean the requirements cache folder periodically: `rm -rf /tmp/dataflow-requirements-cache`
- Use a custom container image (`--sdk_container_image`) instead of `--requirements_file`, and install the packages in your image. This is the recommended option for self-contained, reproducible pipeline environments.
- Don't stage the requirements cache (`--requirements_cache=skip`); the pipeline will depend on PyPI at runtime.
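For reference, a minimal sketch of how these options are typically passed to a Beam pipeline from Python; the runner name and file paths here are placeholders, not part of this PR:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# --requirements_file stages the listed packages for the workers;
# --requirements_cache=skip would make workers fetch from PyPI instead.
options = PipelineOptions([
    '--runner=DataflowRunner',  # placeholder runner
    '--requirements_file=requirements.txt',
    # '--requirements_cache=skip',
])
```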
Re how to improve the logic, I looked at the discussion we had on this topic:
https://lists.apache.org/thread/pqc2yl15kjdpxfp3pnocrrhkk3m6gsmh
and there are a couple of ideas:

- Parse pip's log output to infer which dependencies were downloaded, but also note the files that were skipped because they already existed in the cache (likely brittle, because it depends on pip emitting a particular output format).
- Download twice (https://lists.apache.org/thread/v35bgj67hqrwl4ldymo8bqkybgt3z096), something like the following (haven't tested):

```
pip download --dest /tmp/dataflow_requirements_cache -r requirements.txt --exists-action i --no-deps
pip download --dest /tmp/temporary_folder_that_will_be_cleaned_up -r requirements.txt --find-links /tmp/dataflow_requirements_cache
```

Then, stage the deps from `temporary_folder_that_will_be_cleaned_up`.
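A minimal, untested Python sketch of that two-step idea, following the commands above; the function name and directory handling are illustrative, not what this PR implements:

```python
import os
import subprocess

def resolve_exact_staging_set(requirements_file, cache_dir, staging_dir):
    # Step 1: top up the long-lived cache; --exists-action i ignores files
    # that are already present instead of failing or re-downloading them.
    subprocess.check_call([
        'pip', 'download', '--dest', cache_dir, '-r', requirements_file,
        '--exists-action', 'i', '--no-deps',
    ])
    # Step 2: resolve the same requirements into a throwaway folder,
    # preferring files already in the cache (--find-links), so the folder
    # ends up holding exactly the files the pipeline needs right now,
    # with stale cache entries excluded.
    subprocess.check_call([
        'pip', 'download', '--dest', staging_dir, '-r', requirements_file,
        '--find-links', cache_dir,
    ])
    return sorted(os.listdir(staging_dir))  # files to stage
```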
@ksobrenat32 this PR has been stale for a week. Please resolve the comments.
Reminder, please take a look at this PR: @liferoad
waiting on author
I'm working on this in my free time, but I've had a few rough weeks lately. I'll get back to it when I have more free time. I hope this isn't blocking anyone.
Reminder, please take a look at this PR: @liferoad
waiting on author
hmm, somehow our PR bot seems to be ignoring the `waiting on author` comment.
ah, looks like the last comment from the author passed the ball back into the reviewer's court again. OK, now the bot should be quiet.

Yes, this issue is not urgent as there are workarounds mentioned in #36249 (comment).
Refactor how Python package dependencies are staged and cached in the Apache Beam portability runner. The main change is to track and use the exact set of downloaded package files, rather than globbing all files in the cache directory. Also updated the tests accordingly.
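To make that before/after concrete, here is a minimal sketch contrasting the two staging strategies; the function names are illustrative, not the SDK's actual identifiers:

```python
import glob
import os

def staged_files_before(cache_dir):
    # Old behavior: glob everything in the cache dir, including stale
    # packages left over from earlier runs with different requirements.
    return glob.glob(os.path.join(cache_dir, '*'))

def staged_files_after(downloaded_packages):
    # New behavior: stage exactly the files recorded while downloading,
    # so stale cache entries are never uploaded.
    return list(downloaded_packages)
```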