Fix/added s3 region related to 2349 #2389
base: devel
Conversation
```diff
@@ -65,25 +65,27 @@ def __init__(
     file_path: str,
     staging_credentials: Optional[CredentialsConfiguration] = None,
     staging_iam_role: str = None,
+    s3_region: str = "us-east-1",  # Add region as a parameter
```
Hi, Julius here. I'm just onboarding to the dlt team, so some of my questions may be obvious or come from me lacking context and/or understanding.

Issue: what is the consequence of having the wrong region as the default parameter here? Let's say my data actually lives in Europe and I am not specifying a region.

My hunch is that not having a default makes more sense here.
Hey Julius,

Yeah, that's what I asked in issue #2349 :D But I think it makes sense to switch to not specifying it.
Another suggestion might be to extract the region via boto3, something like this:

```python
import boto3

s3 = boto3.client("s3")
bucket_region = s3.get_bucket_location(Bucket=bucket_name)["LocationConstraint"]
```
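One caveat with the `get_bucket_location` approach (a side note from me, not part of the PR): S3 returns `LocationConstraint: None` for buckets in `us-east-1`, and the legacy value `"EU"` for some old `eu-west-1` buckets, so the raw value would need normalizing before it can be used as a region. A minimal sketch (the helper name is mine):

```python
from typing import Optional


def resolve_bucket_region(location_constraint: Optional[str]) -> str:
    """Normalize the LocationConstraint value returned by S3 GetBucketLocation.

    S3 reports None for buckets in us-east-1 and the legacy string "EU"
    for some old eu-west-1 buckets, so the raw value cannot be passed
    straight through as a region name.
    """
    if location_constraint is None:
        return "us-east-1"
    if location_constraint == "EU":
        return "eu-west-1"
    return location_constraint
```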
> Yeah, that's what I asked in issue #2349 :D

True :) So let's not have a default value.

I think it's cleanest to make it an optional string and load it from the config. We would then have to add it to the `RedshiftClientConfiguration` dataclass (`redshift/configuration.py`, see other comment). We would also add a section to the docs that warns about your issue, saying that you have to define the staging region if you want to copy across regions, e.g. here: https://dlthub.com/docs/dlt-ecosystem/destinations/redshift#authentication-iam-role

I am wondering, though, if there is an even cleaner way... I have to dig a bit deeper and understand how credentials are resolved. I'll get back to you shortly.

Btw, the test we have to keep green is this one: `poetry run pytest tests/load/pipeline/test_pipelines.py::test_parquet_loading -k "redshift-parquet-staging-s3-role"`
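To make the config suggestion concrete, here is a hedged sketch of what the change might look like. The field names follow the diff in this PR, but the class body is an assumption of mine: the real `RedshiftClientConfiguration` in dlt has more fields and uses dlt's `@configspec` machinery rather than a plain dataclass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RedshiftClientConfiguration:
    """Hypothetical sketch of the suggested configuration change."""

    staging_iam_role: Optional[str] = None
    # No default on purpose: if unset, the COPY statement should omit
    # the region entirely and Redshift assumes the bucket is in the
    # cluster's own region.
    staging_region_name: Optional[str] = None
```

With an optional field like this, the cross-region warning in the docs reduces to "set `staging_region_name` when your staging bucket is in a different region than your Redshift cluster".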
> `bucket_region = s3.get_bucket_location(Bucket=bucket_name)["LocationConstraint"]`

Hmm, my first reaction is that this would get messy. What if the role doesn't have the permission to query the region? I think it's cleaner to keep it simple and have good docs for the config.
```diff
@@ -65,9 +65,11 @@ def __init__(
     file_path: str,
     staging_credentials: Optional[CredentialsConfiguration] = None,
     staging_iam_role: str = None,
+    staging_region_name: str = None,
```
This would have to be made part of the `RedshiftClientConfiguration` and then passed to the load job.
Second thought here: if we add this to the configuration, we will have to add a lot of tests, I assume. So maybe let me have a chat with my colleagues here.

@sh-rp
```python
if region_name:
    credentials = (
        "CREDENTIALS"
        f" 'aws_access_key_id={aws_access_key};aws_secret_access_key={aws_secret_key};region={region_name};'"
    )
```
If the region is unspecified, then there shouldn't be a `region=` at the end of the string (or temp var).
But that's what I am doing: if `region_name` is unspecified, it will use the query template mentioned above. That's why I initially thought it would be easier to append the region if it is specified and to use two "different" templates :)
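To illustrate the alternative to two separate templates, here is a hedged sketch (the function and parameter names are mine, not the PR's) that builds the `CREDENTIALS` clause from parts and only appends the region when one is given, so an unspecified region never produces a dangling `region=` fragment:

```python
from typing import Optional


def build_credentials_clause(
    aws_access_key: str,
    aws_secret_key: str,
    region_name: Optional[str] = None,
) -> str:
    """Build the Redshift COPY CREDENTIALS clause from its parts.

    The region segment is included only when region_name is set,
    which avoids maintaining two near-identical query templates.
    """
    parts = [
        f"aws_access_key_id={aws_access_key}",
        f"aws_secret_access_key={aws_secret_key}",
    ]
    if region_name:
        parts.append(f"region={region_name}")
    return "CREDENTIALS '" + ";".join(parts) + ";'"
```

This keeps a single code path and makes the "no region" case a plain omission rather than a second template.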
Description
This PR focuses on adding the s3 region for the COPY INTO job.
Related Issues
Additional Context
Is the default value correct, or should I switch to another region, or just not use a default value at all?