
Snowflake COPY INTO cannot use Snowflake S3 Stage #2342

Open
stevenayers opened this issue Feb 24, 2025 · 3 comments · May be fixed by #2354
Labels
bug Something isn't working

Comments

@stevenayers

dlt version

1.7.0

Describe the problem

Unless your Snowflake S3 stage points at the root of an S3 bucket, gen_copy_sql builds the relative path of your staged files incorrectly.

The load fails because the staged file path is taken from parsed_file_url.path.lstrip('/'), which does not account for the fact that the Snowflake S3 stage's own prefix may already make up part of parsed_file_url.path.

if bucket_scheme in AZURE_BLOB_STORAGE_PROTOCOLS + S3_PROTOCOLS and stage_name:
from_clause = f"FROM '@{stage_name}'"
files_clause = f"FILES = ('{parsed_file_url.path.lstrip('/')}')"
# referencing an staged files via a bucket URL requires explicit AWS credentials
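A minimal demonstration of the mismatch (a sketch using only the standard library; the file name is the one from this issue):

from urllib.parse import urlparse

# File written by the filesystem staging destination:
file_url = "s3://my-bucket-name/dlt/my_dataset/jaffle_shop/1740333014.947686.5fc1448d6d.jsonl"
parsed_file_url = urlparse(file_url)

# What gen_copy_sql currently puts into FILES:
print(parsed_file_url.path.lstrip("/"))
# -> dlt/my_dataset/jaffle_shop/1740333014.947686.5fc1448d6d.jsonl
# Snowflake resolves FILES entries relative to the stage URL
# (here s3://my-bucket-name/dlt/), so it looks under
# s3://my-bucket-name/dlt/dlt/... and fails with "file was not found".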

Example Code:

import os
import dlt
from dlt.destinations import snowflake, filesystem

MY_STAGE_PATH = "s3://my-bucket-name/dlt/"

@dlt.resource
def jaffle_shop():
    for i in range(10):
        yield {"id": i, "name": f"This is item {i}"}

destination = snowflake(
    credentials={
        "host": os.environ["SNOWFLAKE_ACCOUNT"],
        "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
        "database": f"{os.environ['SNOWFLAKE_DATABASE']}",
        "role": os.environ["SNOWFLAKE_ROLE"],
        "username": os.environ["SNOWFLAKE_USER"],
        "password": os.environ["SNOWFLAKE_PASSWORD"],
    },
    stage_name="STAGING.DLT" # Set the path on this to MY_STAGE_PATH
)

staging = filesystem(
    bucket_url=MY_STAGE_PATH,
    credentials={
        "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "region_name": os.environ["AWS_REGION"],
    },
)

pipeline = dlt.pipeline(
    pipeline_name="test",
    destination=destination,
    staging=staging,
    dataset_name="my_dataset",
)

pipeline.run(jaffle_shop)

The staging files are saved to: s3://my-bucket-name/dlt/my_dataset/jaffle_shop/
But the COPY INTO looks for files at: s3://my-bucket-name/dlt/dlt/my_dataset/jaffle_shop/

Generated COPY INTO

COPY INTO "my_dataset"."jaffle_shop"
FROM '@STAGING.DLT'
FILES = ('dlt/my_dataset/jaffle_shop/1740333014.947686.5fc1448d6d.jsonl')
FILE_FORMAT = ( TYPE = 'JSON', BINARY_FORMAT = 'BASE64' )
MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE'

Error:

Remote file 's3://my-bucket-name/dlt/dlt/my_dataset/jaffle_shop/1740333014.947686.5fc1448d6d.jsonl' was not found. If you are running a copy command, please make sure files are not deleted when they are being loaded or files are not being loaded into two different tables concurrently with auto purge option.

Expected behavior

gen_copy_sql should generate:

COPY INTO "my_dataset"."jaffle_shop"
FROM '@STAGING.DLT'
FILES = ('my_dataset/jaffle_shop/1740333014.947686.5fc1448d6d.jsonl')
FILE_FORMAT = ( TYPE = 'JSON', BINARY_FORMAT = 'BASE64' )
MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE'
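One possible shape of the fix, sketched as a standalone helper (not dlt's actual code; the helper name and passing the staging bucket_url in are assumptions):

from urllib.parse import urlparse

def path_relative_to_stage(file_url: str, stage_bucket_url: str) -> str:
    # Path of the staged file relative to the bucket root, e.g. dlt/my_dataset/...
    file_path = urlparse(file_url).path.lstrip("/")
    # Prefix already covered by the stage's own URL, e.g. dlt
    stage_prefix = urlparse(stage_bucket_url).path.strip("/")
    if stage_prefix and file_path.startswith(stage_prefix + "/"):
        # Drop the part of the path the stage URL already points at.
        file_path = file_path[len(stage_prefix) + 1:]
    return file_path

print(path_relative_to_stage(
    "s3://my-bucket-name/dlt/my_dataset/jaffle_shop/1740333014.947686.5fc1448d6d.jsonl",
    "s3://my-bucket-name/dlt/",
))
# -> my_dataset/jaffle_shop/1740333014.947686.5fc1448d6d.jsonl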

Steps to reproduce

  1. Create a Snowflake external S3 stage whose location is a bucket name plus a prefix (s3://my-bucket-name/dlt/); a sketch of creating such a stage follows these steps.
  2. Run the example code shown above (ending with pipeline.run(jaffle_shop)).
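For step 1, a rough sketch of creating such a stage from Python with snowflake-connector-python (the identifiers and key-based CREDENTIALS are illustrative; a STORAGE_INTEGRATION works just as well, and the STAGING schema is assumed to exist):

import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role=os.environ["SNOWFLAKE_ROLE"],
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    database=os.environ["SNOWFLAKE_DATABASE"],
)
# External stage whose location is bucket name + prefix, matching MY_STAGE_PATH.
conn.cursor().execute(
    "CREATE STAGE IF NOT EXISTS STAGING.DLT "
    "URL = 's3://my-bucket-name/dlt/' "
    f"CREDENTIALS = (AWS_KEY_ID = '{os.environ['AWS_ACCESS_KEY_ID']}' "
    f"AWS_SECRET_KEY = '{os.environ['AWS_SECRET_ACCESS_KEY']}')"
)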

Operating system

macOS

Runtime environment

Local

Python version

3.10

dlt data source

No response

dlt destination

Snowflake

Other deployment details

No response

Additional information

No response

@sh-rp sh-rp added the bug Something isn't working label Feb 24, 2025
@sh-rp sh-rp assigned sh-rp and unassigned sh-rp Feb 24, 2025
@sh-rp
Collaborator

sh-rp commented Feb 24, 2025

Hey @stevenayers, thanks for raising this. I will move it to planned. Would you be up to providing a PR for this?

@sh-rp sh-rp moved this from Todo to Planned in dlt core library Feb 24, 2025
@stevenayers
Author

> Hey @stevenayers, thanks for raising this. I will move it to planned. Would you be up to providing a PR for this?

Sure 👍

stevenayers added a commit to stevenayers/dlt that referenced this issue Feb 26, 2025
@stevenayers stevenayers linked a pull request Feb 26, 2025 that will close this issue
@stevenayers
Author

@sh-rp PR raised 👍 #2354

@rudolfix rudolfix self-assigned this Mar 1, 2025
stevenayers added a commit to stevenayers/dlt that referenced this issue Mar 4, 2025
@rudolfix rudolfix moved this from Planned to In Progress in dlt core library Mar 9, 2025