docs: add advanced project tutorial #2338
base: devel
Conversation
datasets:
  my_pipeline_dataset:
    destination:
      - bigquery
duckdb
Overall I loved the experience. What stood out for me:
Next steps in my head: figure out how I can get a rudimentary DAG from this so I can turn it into a deployment. Questions I have that were not answered during the experience:

- was super easy to use, and the experience (of using the
- unsure of what I'm doing wrong with the catalog commands: I can access the data objects (iterables) but can't quite load dataframes from it; had to go check if the runner worked. It did! The data was loaded right, but I had trouble loading the dataframes themselves.
- was unsure of how to run a particular source (in the case of multiple sources) when running the runner, because I think it runs all sources mentioned in the yaml file. So I had to make those changes in the yaml file: declare a particular source if I only want to run that, and not all of them, and then use the runner. It would be easier if the `run_pipeline()` function could also take a source name alongside a pipeline name (see the sketch below).
- all in all, a super easy experience, especially if we're already used to dlt 🤩
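A rough sketch of that suggestion (hypothetical signature; `run_pipeline` does not currently accept a source name, and the source name used here is made up):

```py
# hypothetical: run only the named source instead of every source in dlt.yml
# (neither the "source" argument nor the "my_source" name exist today)
current.runner().run_pipeline("my_pipeline", source="my_source")
```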
## Implicit entities

By default, dlt+ will create implicit entities when they are requested by the user or the executed code. Let's create an example project again with `dlt project init arrow duckdb --package my_dlt_project`.
It'd be easier for the reader to copy the command if you use a code block for `dlt project init arrow duckdb --package my_dlt_project`.
- you're familiar with the [core concepts of dlt](../../reference/explainers/how-dlt-works.md)
- you have completed the [basic project tutorial](./tutorial.md)

## Implicit entities

Even though "entities" were mentioned in the basic tutorial, it would be better to elaborate what exactly they are or give some clarification (e.g. sources, destinations).
## Implicit entities

By default, dlt+ will create implicit entities when they are requested by the user or the executed code. Let's create an example project again with `dlt project init arrow duckdb --package my_dlt_project`.

> Let's create an example project again

An improvement would be to explicitly link the previous tutorial, e.g. something like "Let's create the same project as in the basic tutorial (link) with ..."
dataset_name: my_pipeline_dataset
```

If you run the pipeline with `dlt pipeline my_pipeline run`, dlt+ will use the source and the destination referenced by the pipeline configuration. Datasets are not defined in this example, so dlt+ will create an implicit dataset with the name as defined in the pipeline configuration. For this example, you do not even need the duckdb destination defined, as this would also be implicitely created for you.
Consider adding `dlt[duckdb]` and `pyarrow` to Prerequisites. Otherwise, when I run `dlt pipeline my_pipeline run` on a clean venv I get:

First run:

Pipeline execution failed at stage sync with exception:
<class 'dlt.common.exceptions.MissingDependencyException'>
You must install additional dependencies to run duckdb destination. If you use pip you may do the following:
pip install "dlt[duckdb]"
Dependencies for specific destinations are available as extras of dlt

After I install `dlt[duckdb]` and run the pipeline again:

Pipeline execution failed at stage extract when processing package 1740762841.0444548 with exception:
<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe items: extraction of resource items in generator items caused an exception: No module named 'pyarrow'
## Managing Datasets and Destinations

If you add explicit dataset entries to the `dlt.yml` file, as we did above, you need to defined on which destinations the dataset may be materialized. This constraint will also be enforced when `allow_undefined_entities` is set to `true`. Change your dlt.yml file to the following:

Suggested change:
If you add explicit dataset entries to the `dlt.yml` file, as we did above, you need to define on which destinations the dataset may be materialized. This constraint will also be enforced when `allow_undefined_entities` is set to `true`. Change your dlt.yml file to the following:
Note that the destination property is an array of destinations on which this dataset may be materialized. More on that in the next section.

## Managing Datasets and Destinations

Suggested change:
## Managing datasets and destinations

Consider using sentence case for headers.
```

:::tip
If you run dataset cli commands without providing a destination name, dlt+ will always select the first destination in the list by default. If you allow implicit entities, dlt+ can also discover datasets only defined on pipelines and will use the destination of that pipeline for instantiating the dataset. The same is true when retrieving datasets from the catalog in python, ore on that below.

Suggested change:
If you run dataset cli commands without providing a destination name, dlt+ will always select the first destination in the list by default. If you allow implicit entities, dlt+ can also discover datasets only defined on pipelines and will use the destination of that pipeline for instantiating the dataset. The same is true when retrieving datasets from the catalog in Python, more on that below.
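For illustration, retrieving a dataset from the catalog in Python could look roughly like this (a sketch; it assumes `current` is imported from `dlt_plus` as in the code snippets discussed later in this thread):

```py
from dlt_plus import current  # assumed import path

# no destination passed: per the tip above, dlt+ falls back to the first destination
# in the dataset's list, or to the pipeline's destination for an implicit dataset
dataset = current.catalog().dataset("my_pipeline_dataset")
```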
## Interacting with the project context in code

You can also interact with your dlt+ project directly in code as opposed to running cli commands. Let's use the same starting point as above by running `dlt project init arrow duckdb`. Now add a new python file in the root of your project called `do_something.py` and add the following code:
The above starting point was `dlt project init arrow duckdb --package my_dlt_project`, so when I run `dlt project init arrow duckdb` in the same directory I see that new files are added to the current dir and not to `my_dlt_project`.

So it's best either to instruct the user to clean the directory or to use the same init commands.
Run the above with `python do_something.py` and see the output.

The current module will allow you to interact with the currently active project. You can retrieve the project configuration with `current.project()`, get an instantiatend version of all defined entities from the entities factory at `current.entities()`, access all defined datasets in the catalog: `current.catalog()`, and run pipelines with the runner at `current.runner()`.

Suggested change:
The `current` module will allow you to interact with the currently active project. You can retrieve the project configuration with `current.project()`, get an instantiated version of all defined entities from the entities factory at `current.entities()`, access all defined datasets in the catalog: `current.catalog()`, and run pipelines with the runner at `current.runner()`.

Mark `current` as code if we're talking about what we imported from `dlt_plus`.
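For reference, a minimal `do_something.py` along the lines of the snippets quoted in this thread (a sketch; the exact import of `current` from `dlt_plus` is assumed from the comment above):

```py
from dlt_plus import current  # assumed import path

# project configuration and location
print(current.project().project_dir)
print(current.project().config)

# instantiated entities and the dataset catalog defined in dlt.yml
entities = current.entities()
catalog = current.catalog()

# run a pipeline defined in dlt.yml through the project runner
current.runner().run_pipeline("my_pipeline")
```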
Run the above with `python do_something.py` and see the output.

The current module will allow you to interact with the currently active project. You can retrieve the project configuration with `current.project()`, get an instantiatend version of all defined entities from the entities factory at `current.entities()`, access all defined datasets in the catalog: `current.catalog()`, and run pipelines with the runner at `current.runner()`.

I think it would be more elegant if `project()` in `current.project()` was an attribute rather than a function: `current.project`. I know it's an implementation detail, but it is most likely not relevant for the end user. (Same for the other functions in `current`.)
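i.e. something like this (hypothetical attribute-style access, not the current API):

```py
# hypothetical: attribute access instead of calling current.project()
print(current.project.project_dir)
```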
print(current.project().project_dir)
# show the project config
print(current.project().config)
# list explicitely defined datasets (also works with destinations, sources, pipelines etc.)

Suggested change:
# list explicitly defined datasets (also works with destinations, sources, pipelines etc.)
### Accessing entities

Accessing entities works in the same way as when accessing entities in the `dlt.yml`. If possible and allowed, implicit entities are created and returned back to you, if you, an error will be raised.

Suggested change:
Accessing entities works in the same way as when accessing entities in the `dlt.yml`. If possible and allowed, implicit entities are created and returned back to you; if you specify `allow_undefined_entities: false` in `dlt.yml`, an error will be raised.

Incomplete sentence?
### Running pipelines with the predefined source with the runner

`dlt+` provides a pipeline runner that is used when you run pipelines from the cli as described in the basic tutorial. The project context also provides a runner instance that you can use to run your pipelines in code.

Suggested change:
`dlt+` provides a pipeline runner that is used when you run pipelines from the cli as described in the project tutorial. The project context also provides a runner instance that you can use to run your pipelines in code.

It could be unclear for the reader what the basic tutorial is. Let's explicitly call it "project tutorial" (or "basic project tutorial"). Also need to link it.
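For context, running a pipeline through the runner in code boils down to something like this (a sketch assuming the `current` import discussed above):

```py
from dlt_plus import current  # assumed import path

# run the pipeline named "my_pipeline" from dlt.yml via the project runner
current.runner().run_pipeline("my_pipeline")
```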
### Accessing the catalog

The catalog provides access to all explicitely defined datasets

Suggested change:
The catalog provides access to all explicitly defined datasets:
# for this to work this dataset must already exist physically
dataset = current.catalog().dataset("my_pipeline_dataset")
# get the row counts of all tables in the dataset as a dataframe
print(dataset.row_counts().df())
Got an error here:
Traceback (most recent call last):
File ".../hackathon-0228/do_something.py", line 26, in <module>
print(dataset.row_counts().df())
^^^^^^^^^^^^^^^^^^^^^^^^^
File "./hackathon-0228/.venv/lib/python3.11/site-packages/dlt/destinations/dataset/relation.py", line 89, in _wrap
return getattr(cursor, func_name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./hackathon-0228/.venv/lib/python3.11/site-packages/dlt/destinations/sql_client.py", line 360, in df
return next(self.iter_df(chunk_size=chunk_size))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./hackathon-0228/.venv/lib/python3.11/site-packages/dlt/destinations/impl/duckdb/sql_client.py", line 40, in iter_df
yield self.native_cursor.fetch_df()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'numpy'
# get a dataset from the catalog (this dataset must already exist physically and be explicitely defined in the dlt.yml)
dataset = current.catalog().dataset("my_pipeline_dataset")
# write a dataframe to the dataset into the table "my_table"
dataset.write(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")
We need to tell the reader to install and import pandas as `pd` here, otherwise this sample won't work.
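Something like this at the top of the snippet (a sketch that only adds the missing install/import; the `dataset.write` call is taken verbatim from the tutorial draft and, as noted below, may not work on every version):

```py
# pip install pandas
import pandas as pd

from dlt_plus import current  # assumed import path

# dataset must already exist physically and be defined in dlt.yml
dataset = current.catalog().dataset("my_pipeline_dataset")
# write a dataframe into the table "my_table", as shown in the tutorial draft
dataset.write(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")
```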
After installing pandas and importing it as `pd`, I get this error when running this line:
Traceback (most recent call last):
File "./hackathon-0228/do_something.py", line 31, in <module>
dataset.write(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")
TypeError: 'ReadableDBAPIRelation' object is not callable
yield df

# write the dataframe to the dataset into the table "my_new_table"
dataset.write(transform_frames, table_name="my_new_table")
Apparently I'm doing something wrong, because here `dataset.write` also gives me an error:
dataset.write(transform_frames, table_name="my_new_table")
TypeError: 'ReadableDBAPIRelation' object is not callable
dlt project init arrow duckdb --package my_dlt_project
```

This will provide you with the same project as used in the basic tutorial but places in a module directory named `my_dlt_project` and provided with a simple PEP compatible `pyproject.toml` file. You will also receive a default package `__init__.py` file to make the module accssible from the outside when installed:

Suggested change:
This will provide you with the same project as used in the basic project tutorial but placed in a module directory named `my_dlt_project` and provided with a simple PEP compatible `pyproject.toml` file. You will also receive a default package `__init__.py` file to make the module accessible from the outside when installed:
└── pyproject.toml # the main project manifest
```

Your `dlt.yml` will look and work exactly the same as in the non-packaged version. The only difference is the module structure and the presence of the `pyproject.toml` file. The pyproject.toml has a special entry point setting that makes the dlt+ project discoverable by dlt+:

Suggested change:
Your `dlt.yml` will look and work exactly the same as in the non-packaged version. The only difference is the module structure and the presence of the `pyproject.toml` file. The `pyproject.toml` has a special entry point setting that makes the dlt+ project discoverable by dlt+:
### Using the packaged project

Using this project in its default state, let's simulate installing it into a new python project like a data-scientist in your organization might do, to demonstrate how this could be used in practice. For this example we use the uv package manager, but this is also possible with poetry or pip. Let's assume you have been developing your dlt+ packaged project at `/Volumes/my_drive/my_folder/pyproject.toml`. Navigate to a new directory and initialize your project:

Suggested change:
Using this project in its default state, let's simulate installing it into a new Python project like a data scientist in your organization might do, to demonstrate how this could be used in practice. For this example we use the uv package manager, but this is also possible with Poetry or pip. Let's assume you have been developing your dlt+ packaged project at `/Volumes/my_drive/my_folder/pyproject.toml`. Navigate to a new directory and initialize your project:
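Once installed, the consuming script could look roughly like this (a sketch assembled from the `my_dlt_project` snippets quoted later in this thread):

```py
import my_dlt_project  # the packaged dlt+ project, installed into the new environment

if __name__ == "__main__":
    # the active profile as defined in the dlt package, e.g. "access"
    print(my_dlt_project.config().current_profile)
    # run a pipeline defined in the packaged dlt.yml
    my_dlt_project.runner().run_pipeline("my_pipeline")
```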
Good overall structure and tutorial.
Great experience! Easy to use, everything went smoothly. I love that we can work with Python. At the same time, it feels like we initially tried to avoid Python with YAML, and now we're coming back to it; I personally don't mind, just found it funny :) Also, the tutorial is clear and very well written. Every time I had a question, I found the answer in the next section. Some notes and problems are described below.
Overall everything looks good + works, had a few comments on specific sections
:::tip
Read the docs on how to access data in python in dlt datasets [here](../../../general-usage/dataset-access/dataset) to learn more about the available data access methods. You can browse and filter tables and get the data in various formats.
I'd suggest we don't link to open source pages from dlt+, that's just shooting ourselves in the foot. We could link to "Secure data access and sharing" instead, but we might need some more content on there?
## Implicit entities

By default, dlt+ will create implicit entities when they are requested by the user or the executed code. Let's create an example project again with `dlt project init arrow duckdb --package my_dlt_project`.

Do we need to be using the `--package` flag at the top here, actually? We don't talk about packaging it up until later on in the tutorial.
Note that the destination property is an array of destinations on which this dataset may be materialized. More on that in the next section.

## Managing Datasets and Destinations
I think Violetta mentioned this to you already but I don't think this section (managing datasets and destinations) or the previous one (implicit entities) belongs in this tutorial
dataset_name: my_pipeline_dataset
```

If you try to run the pipeline "my_pipeline" now, you will receive an error that the dataset `my_pipeline_dataset` does not exist on the bigquery destination, you may allow this by adding `bigquery` to the allowed destinations of `my_pipeline_dataset`.
> If you try to run the pipeline "my_pipeline" now, you will receive an error that the dataset my_pipeline_dataset does not exist on the bigquery destination, you may allow this by adding bigquery to the allowed destinations of my_pipeline_dataset.
Actually I get:
Dataset my_pipeline_dataset is not available on destination duckdb. Available destinations: ['bigquery'].
I’m guessing in your instructions you meant to change the pipeline destination to bigquery? Also don’t we need to define the bigquery destination in the “destinations:” field as well? Even if I do that it keeps just loading to duckdb 🤔
my_dlt_project.runner().run_pipeline("my_pipeline")
```

Now you can run the script within the uv virtual environment:
We should probably link to "secure data access and sharing" or something from this section to show how data scientists would use the package finally.
<Link/>

This tutorial will introduce you to advanced dlt project features. You will learn how to:

Suggested change:
This tutorial will introduce you to advanced dlt project features. You will:
In general, the tutorial is pretty straight-forward and understandable. But I did hit a few snags:
if __name__ == "__main__":
    # should print "access" as defined in your dlt package
    print(my_dlt_project.config().current_profile)
Is this supposed to be `print(current.config().current_profile)`?
Description
This PR adds an advanced documentation page for the dlt project and packaging features.