Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snowflake state growing too fast #2186

Open
julesmga opened this issue Jan 3, 2025 · 1 comment
Open

Snowflake state growing too fast #2186

julesmga opened this issue Jan 3, 2025 · 1 comment

Comments

@julesmga
Copy link

julesmga commented Jan 3, 2025

dlt version

1.5.0

Describe the problem

Our pipeline state tables have grown upwards of 20 gigabytes after only a month of operation, which leads to heavy IO on snowflake and considerable slowdowns in our pipelines (restoring the state takes at least 15 minutes across the board). We haven't found a way to pass restore_from_destination in an Airflow context, and it is unclear whether that's the way forward, or if it's a bug that needs to be addressed. We don't add anything to the state, to the best of our understanding, it should only contain incremental information, which we may not even need since we are looking into using the airflow context anyway.

Expected behavior

The state should be relatively small if it only contains the incremental ranges.

Steps to reproduce

Use DLT with Snowflake as destination with an incremental strategy.

Operating system

Linux

Runtime environment

Google Cloud Composer

Python version

3.11

dlt data source

SQLAlchemy (Snowflake)

dlt destination

Snowflake

Other deployment details

We are using a GCS external stage for loading to the Snowflake destination.

Additional information

No response

@julesmga
Copy link
Author

julesmga commented Jan 3, 2025

Example query:

SELECT "VERSION", "ENGINE_VERSION", "PIPELINE_NAME", "STATE", "CREATED_AT", "_DLT_LOAD_ID" FROM "MY_SCHEMA"."_DLT_PIPELINE_STATE" AS s JOIN "MY_SCHEMA"."_DLT_LOADS" AS l ON l."LOAD_ID" = s."_DLT_LOAD_ID" WHERE "PIPELINE_NAME" = 'my_pipeline' AND l."STATUS" = 0 ORDER BY "LOAD_ID" DESC

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

No branches or pull requests

1 participant