Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/added s3 region related to 2349 #2389

Open
wants to merge 6 commits into
base: devel
Choose a base branch
from

Conversation

yannik207
Copy link

Description

the pr is focusing on adding the s3 region for the copy into job

Related Issues

Additional Context

Is the default value correct, or should I switch to another region or just do not use a default value

Copy link

netlify bot commented Mar 9, 2025

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 683b91a
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/67d09d4e8fcb130008758029
😎 Deploy Preview https://deploy-preview-2389--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@@ -65,25 +65,27 @@ def __init__(
file_path: str,
staging_credentials: Optional[CredentialsConfiguration] = None,
staging_iam_role: str = None,
s3_region: str = "us-east-1", # Add region as a parameter
Copy link
Collaborator

@djudjuu djudjuu Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi julius here, just onboarding to the dlt team, so maybe some questions i'll ask are obvious or me lacking context and/or understanding

issue: what is the consequence of having the wrong region as (default) parameter. here? lets say my data actually lies in europe, and I am not specifying it.
my hunch is not having a default makes more sense here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Julius,
Yeah that's what I asked in the issue #2349 :D
but I think it make sense to swap it to not specifying it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another suggestion might be to extract the region via boto3: something like this:

import boto3 s3 = boto3.client("s3") bucket_region = s3.get_bucket_location(Bucket=bucket_name)["LocationConstraint"]

Copy link
Collaborator

@djudjuu djudjuu Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's what I asked in the issue #2349 :D

true :)
so let's not have a default value.
I think its cleanes to make it optional string and load it from the config. We would then have to add it to RedshiftClientConfiguration-dataclass (redshift/configuration.py)(see other comment). We would then add a section to the docs, that warns about your issue, saying that you have to define the staging-region if you want to copy across regions.

either here: https://dlthub.com/docs/dlt-ecosystem/destinations/redshift#authentication-iam-role

I am wondering though if there is an even cleaner way....I have to dig a bit deeper and understand how credentials are resolved. I'll get back to you shortly

btw: the test we have to keep green is this one:

poetry run pytest tests/load/pipeline/test_pipelines.py::test_parquet_loading -k "redshift-parquet-staging-s3-role"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import boto3 s3 = boto3.client("s3") bucket_region = s3.get_bucket_location(Bucket=bucket_name)["LocationConstraint"]

hmm, my first reaction is that this would get messy. what if the role didnt get the right to query the region? I think its cleaner to keep it simple and have good docs for the config

@@ -65,9 +65,11 @@ def __init__(
file_path: str,
staging_credentials: Optional[CredentialsConfiguration] = None,
staging_iam_role: str = None,
staging_region_name: str = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would have to be made part of the RedshiftClientConfiguration and then passed to the load-job

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2nd thought here: if we add to the configuration, we will have to add a lot of tests, i assume so maybe let me have a chat with my colleagues here
@sh-rp

if region_name:
credentials = (
"CREDENTIALS"
f" 'aws_access_key_id={aws_access_key};aws_secret_access_key={aws_secret_key};region={region_name};'"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if region is unspecified, than there shouldnbt a "region=" at the end of the string (or tempvar)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that's what I am doing if region_name is unspecified it will use the query template mentioned above. that's why I initially thought it would be easier to append the region if it is specified and to use two "different" templates :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When using s3 as a staging destination for Redshift, allow users to specify region of s3 bucket
2 participants