exp: pygwalker data explorer in streamlit #2216

Draft
wants to merge 1 commit into
base: devel

@zilto (Collaborator) commented Jan 14, 2025

Description

Add a page to the streamlit dashboard available via dlt pipeline NAME show. It uses the recent Pipeline.dataset interface to pass a dataframe to pygwalker.

2025-01-14.16-15-45.mp4

Goals

When first loading a pipeline, I want to know the properties of the data (nulls, distributions, outliers). This page reduces the friction of having to create temporary notebooks and install ipykernel/jupyter dependencies.

Implementation

  • pygwalker is an open-source Python library that produces an interactive, Vega-based widget to explore dataframes (works in Jupyter, VSCode, Streamlit, Marimo).

  • The integration surface area is small; it only assumes that you can pull the data locally via the Pipeline.dataset feature.

    • Deeper integration could use Pipeline.dataset to query data or pass an sqlalchemy connection to pygwalker to handle data loading.
    • pygwalker can use duckdb for efficient processing of data

TODO

  • add useful message for optional/missing dependencies
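One possible shape for that TODO, as a minimal sketch rather than the PR's implementation: check for the optional packages with the standard library before rendering the page, and build a readable message if any are missing. The helper names (`missing_dependencies`, `missing_dependency_message`) and the install hint are assumptions made for illustration.

```python
import importlib.util


def missing_dependencies(modules):
    """Return the top-level module names in `modules` that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in modules if importlib.util.find_spec(name) is None]


def missing_dependency_message(modules, install_hint="pip install pygwalker"):
    """Build a user-facing message, or return None if everything is installed.

    `install_hint` is a hypothetical default; adjust it to the project's extras.
    """
    missing = missing_dependencies(modules)
    if not missing:
        return None
    return (
        "The data explorer page needs optional dependencies that are not installed: "
        + ", ".join(missing)
        + f". Install them with: {install_hint}"
    )
```

In the Streamlit page, the returned message could be shown with `st.error(...)` before returning early, so the rest of the dashboard keeps working when pygwalker is absent.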

netlify bot commented Jan 14, 2025

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit d748340
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6786d2f89dfa550008c73f7f
😎 Deploy Preview https://deploy-preview-2216--dlt-hub-docs.netlify.app

@rudolfix requested a review from sh-rp on January 20, 2025 11:19
def pygwalker_renderer(pipeline_name: str, table_name: str) -> StreamlitRenderer:
    pipeline = dlt.attach(pipeline_name)
    dataset = pipeline.dataset()
    df = dataset[table_name].df()
Collaborator

This will load the full table into memory. Is this desired? On a large table this will take a long time and at some point kill the host. Is there some other way we could do this?
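
One option to bound memory, sketched under assumptions: cap the number of rows before materializing the dataframe. This assumes the relation returned by `dataset[table_name]` exposes a `limit(n)` method that is applied at the destination before `df()` is called; that should be verified against the dlt version in use. `load_preview` and `max_rows` are hypothetical names, not part of this PR.

```python
def load_preview(relation, max_rows=10_000):
    """Return at most `max_rows` rows from `relation` as a dataframe.

    Assumption: `relation` offers `.limit(n)` returning a narrowed relation,
    and `.df()` materializing it, so only the preview is pulled locally.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    return relation.limit(max_rows).df()
```

With this, the renderer would call `load_preview(dataset[table_name])` instead of `dataset[table_name].df()`. A row cap gives a preview rather than the full distribution, so the page should probably say that the view is truncated.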

from dlt.helpers.streamlit_app.utils import render_with_pipeline


def show(pipeline: dlt.Pipeline) -> None:
Collaborator

we should have at least simple tests for this page to make sure it renders.

2 participants