
Add S3CopyPrefixOperator to copy all objects under a prefix#68946

Merged
o-nikolas merged 2 commits into
apache:mainfrom
bentorb:s3-operator
Jun 26, 2026

Conversation


@bentorb bentorb commented Jun 24, 2026

Contributor

Description

This PR introduces a new S3CopyPrefixOperator that enables copying all S3 objects under a specified prefix from a source bucket to a destination bucket. This operator fills a gap in the current S3 operators by providing prefix-based bulk copy functionality.

What does this operator do?

• Copies all objects matching a specified prefix from source to destination S3 bucket
• Supports cross-bucket copying
• Provides configurable error handling (continue on failure or stop on first error)
• Integrates with OpenLineage for data lineage tracking
• Supports Airflow templating for dynamic parameter values
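The prefix-based copy described above can be pictured roughly as follows. This is a minimal sketch and not the operator's actual implementation: `iter_prefix_copies` and `copy_prefix` are hypothetical helper names, and `client` is assumed to behave like a boto3 S3 client (only `get_paginator("list_objects_v2")` and `copy_object` are used).

```python
from typing import Any, Iterator, Tuple


def iter_prefix_copies(
    client: Any, source_bucket: str, source_prefix: str, dest_prefix: str
) -> Iterator[Tuple[str, str]]:
    """Yield (source_key, dest_key) for every object under source_prefix."""
    # Listing is paginated, so walk every page rather than a single response.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=source_bucket, Prefix=source_prefix):
        # A page has no "Contents" entry when nothing matches the prefix.
        for obj in page.get("Contents", []):
            source_key = obj["Key"]
            # Swap the leading source prefix for the destination prefix.
            yield source_key, dest_prefix + source_key[len(source_prefix):]


def copy_prefix(
    client: Any,
    source_bucket: str,
    source_prefix: str,
    dest_bucket: str,
    dest_prefix: str,
) -> int:
    """Copy each listed object one at a time and return how many were copied."""
    copied = 0
    for source_key, dest_key in iter_prefix_copies(
        client, source_bucket, source_prefix, dest_prefix
    ):
        client.copy_object(
            Bucket=dest_bucket,
            Key=dest_key,
            CopySource={"Bucket": source_bucket, "Key": source_key},
        )
        copied += 1
    return copied
```

With a real client this would be called as `copy_prefix(boto3.client("s3"), "source-bucket", "data/2023/", "dest-bucket", "archive/data/2023/")`. The sketch copies sequentially and leaves out error handling and encryption options.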

Why is this needed?

Currently, Airflow's S3 operators only copy individual objects. For use cases that involve copying entire "directory" structures or large numbers of objects sharing a common prefix, users must implement custom solutions or use multiple operator instances. This operator provides a native solution for prefix-based bulk copies.

Key Features

• Error Handling: Configurable continue_on_failure parameter for resilient operations
• Template Fields: All dynamic parameters support Jinja templating
• OpenLineage Integration: Automatic data lineage tracking for copied objects
• Standard Exception Handling: Uses RuntimeError instead of AirflowException per project conventions
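The error-handling behaviour can be illustrated with a small sketch. It is an assumption about the general pattern, not the operator's code: `copy_all` and `copy_one` are hypothetical names, and returning the list of failed keys is an illustrative choice that the PR does not confirm.

```python
from typing import Callable, Iterable, List


def copy_all(
    keys: Iterable[str],
    copy_one: Callable[[str], None],
    continue_on_failure: bool = False,
) -> List[str]:
    """Try to copy every key and return the keys whose copy failed."""
    failed: List[str] = []
    for key in keys:
        try:
            copy_one(key)
        except Exception as err:
            if not continue_on_failure:
                # "from err" stores the original exception as __cause__,
                # so its traceback is shown alongside the RuntimeError.
                raise RuntimeError(f"Failed to copy {key!r}") from err
            # Record the failure and move on to the next key.
            failed.append(key)
    return failed
```

With `continue_on_failure=False` the loop stops at the first failing key. With `continue_on_failure=True` the keys after a failure are still attempted.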

Testing

Includes 10 new unit tests (11 test cases) covering:

  • Basic prefix copying, same-bucket copying, and empty-prefix handling
  • Full s3:// URL inputs and invalid bucket/URL combinations
  • Error scenarios and continue_on_failure behaviour
  • OpenLineage integration (bucket+prefix and s3:// URL variants)
  • Template field validation

System test integration in providers/amazon/tests/system/amazon/aws/example_s3.py
All tests pass in Breeze testing environment
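For the s3:// URL cases listed above, the sketch below shows one way a bucket name and prefix might be resolved from either a plain prefix or a full URL. `resolve_bucket_and_prefix` is a hypothetical helper built on the standard library; the validation rules and error messages are assumptions for illustration, not the operator's behaviour.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def resolve_bucket_and_prefix(
    bucket_name: Optional[str], prefix: str
) -> Tuple[str, str]:
    """Return (bucket, prefix) from a bucket name plus prefix, or from an s3:// URL."""
    parsed = urlsplit(prefix)
    if parsed.scheme != "s3":
        # Plain prefix: the bucket has to be supplied separately.
        if not bucket_name:
            raise ValueError("bucket_name is required when prefix is not an s3:// URL")
        return bucket_name, prefix
    if bucket_name:
        # Assumed rule: a full URL already names the bucket, so reject both.
        raise ValueError("pass either bucket_name or a full s3:// URL, not both")
    if not parsed.netloc:
        raise ValueError(f"no bucket found in {prefix!r}")
    # The URL path starts with "/", which is not part of the S3 key prefix.
    # Note: "?" or "#" inside a key would be split off by urlsplit.
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, `resolve_bucket_and_prefix(None, "s3://source-bucket/data/2023/")` and `resolve_bucket_and_prefix("source-bucket", "data/2023/")` resolve to the same pair.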

Usage Example

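from airflow.providers.amazon.aws.operators.s3 import S3CopyPrefixOperator

# Copy every object under data/2023/ in source-bucket to archive/data/2023/ in dest-bucket.
copy_prefix = S3CopyPrefixOperator(
    task_id='copy_data_files',
    source_bucket_name='source-bucket',
    source_bucket_prefix='data/2023/',
    dest_bucket_name='dest-bucket',
    dest_bucket_prefix='archive/data/2023/',
    continue_on_failure=True,  # keep copying the remaining objects if one copy fails
    aws_conn_id='aws_default',
)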

Checklist

• [x] Tests included (10 comprehensive unit tests)
• [x] Documentation updated
• [x] Code follows project coding standards
• [x] All static code checks pass
• [x] Apache license headers added
• [x] PR is focused on single feature
• [x] Local tests pass
• [x] No unrelated changes included

Was generative AI tooling used to co-author this PR?
  • Yes

Generated-by: Claude Code (claude-sonnet-4-6) following the guidelines

@bentorb bentorb requested a review from o-nikolas as a code owner June 24, 2026 13:48
@bentorb bentorb force-pushed the s3-operator branch 3 times, most recently from f122c71 to 0e560de Compare June 24, 2026 17:46
@eladkal eladkal requested review from eladkal and vincbeck June 24, 2026 18:30

@o-nikolas o-nikolas left a comment

Contributor

Looks quite good, thanks for the contribution!

In the future we might want to parallelize the copy since prefixes may include thousands or millions of objects depending on the situation. But this is a great start.

Comment thread providers/amazon/tests/unit/amazon/aws/operators/test_s3.py
Comment thread providers/amazon/src/airflow/providers/amazon/aws/operators/s3.py Outdated
@bentorb bentorb force-pushed the s3-operator branch 2 times, most recently from 75b71de to 25e826f Compare June 25, 2026 08:28

@eladkal eladkal left a comment

Contributor

LGTM

@bentorb bentorb force-pushed the s3-operator branch 2 times, most recently from 7f2f917 to 1cc3ef8 Compare June 25, 2026 09:29
@bentorb bentorb force-pushed the s3-operator branch 2 times, most recently from cfe4257 to ec1db49 Compare June 25, 2026 13:57

bentorb commented Jun 25, 2026

Contributor Author

Thank you all for taking the time to check this out :)

@o-nikolas does your approval mean we're good to go, or would you still prefer that I address your comments?

@vincbeck

Contributor

> Thank you all for taking the time to check this out :)
>
> @o-nikolas does your approval mean we're good to go, or would you still prefer that I address your comments?

I think they are nice to have, so if you have a bit of time to address them, that would be awesome :)


bentorb commented Jun 25, 2026

Contributor Author

Thanks for the heads up, makes sense @vincbeck

I just pushed another commit with the adjustments.

@bentorb bentorb force-pushed the s3-operator branch 7 times, most recently from ea39882 to 6cfd917 Compare June 26, 2026 13:35
S3CopyObjectOperator handles a single object at a time. Users who need
to copy all objects sharing a prefix must implement their own pagination,
error handling, and encryption support. This operator encapsulates that
pattern so it can be used directly in a Dag.
Chain RuntimeError from the original exception to preserve the traceback.
Verify the succeeding object is attempted when continue_on_failure=True.
@o-nikolas o-nikolas merged commit 5a8409c into apache:main Jun 26, 2026
81 checks passed
@boring-cyborg

boring-cyborg Bot commented Jun 26, 2026


Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@bentorb bentorb deleted the s3-operator branch June 27, 2026 12:46
karenbraganz pushed a commit to karenbraganz/airflow that referenced this pull request Jun 30, 2026
…8946)

S3CopyObjectOperator handles a single object at a time. Users who need
to copy all objects sharing a prefix must implement their own pagination,
error handling, and encryption support. This operator encapsulates that
pattern so it can be used directly in a Dag.

4 participants