docs: add advanced project tutorial #2338

Draft
wants to merge 2 commits into
base: devel
305 changes: 305 additions & 0 deletions docs/website/docs/plus/getting-started/advanced_tutorial.md
@@ -0,0 +1,305 @@
---
title: Advanced Project tutorial
description: Using the dlt+ cli commands to create and manage dlt+ Project
keywords: [command line interface, cli, dlt init, dlt+, project]
---

import Link from '../../_plus_admonition.md';

<Link/>

This tutorial will introduce you to advanced dlt project features. You will learn how to:
Contributor

Suggested change
This tutorial will introduce you to advanced dlt project features. You will learn how to:
This tutorial will introduce you to advanced dlt project features. You will:


* Understand how dlt+ creates implicit dlt entities if they are not defined and how to prevent this
* Learn how to use dlt+ datasets to restrict access to certain datasets and destinations
* Learn how to interact with the dlt+ project context and entities from code
* Create a packaged dlt+ project which you can distribute to other stakeholders in your company

## Prerequisites

To follow this tutorial, make sure:

- dlt+ is set up according to the [installation guide](./installation.md)
- you're familiar with the [core concepts of dlt](../../reference/explainers/how-dlt-works.md)
- you have completed the [basic project tutorial](./tutorial.md)
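
The examples in this tutorial also need a few optional packages at run time: `duckdb` and `pyarrow` for the pipeline, and `numpy` and `pandas` for the dataset examples. All four surfaced as import errors in review on a clean virtual environment. A minimal pre-flight check using only the standard library could look like this; the helper name `missing_modules` is illustrative and not part of dlt+:

```python
from importlib.util import find_spec

# Modules that were reported missing on a clean venv; adjust to your setup.
REQUIRED_MODULES = ["duckdb", "pyarrow", "numpy", "pandas"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All optional dependencies found.")
```

Running this before `dlt pipeline my_pipeline run` shows which packages still need to be installed, for example with `pip install "dlt[duckdb]"` as suggested by the error message quoted in the review.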

## Implicit entities
Collaborator

Even though "entities" were mentioned in basic tutorial it would be better to elaborate what exactly they are or give some clarification (e.g. sources, destinations)


By default, dlt+ will create implicit entities when they are requested by the user or the executed code. Let's create an example project again with `dlt project init arrow duckdb --package my_dlt_project`.
Collaborator

It'd be easier to copy the command for the reader if you use code block for dlt project init arrow duckdb --package my_dlt_project

Collaborator

Let's create an example project again

An improvement would be to explicitly link the previous tutorial e.g. something like "Let's create the same project as in basic tutorial (link) with ..."

Contributor

Do we need to be using the --package flag at the top here actually? We don't talk about packaging it up until later on in the tutorial.


```yaml
# your sources are the data sources you want to load from
sources:
  arrow:
    type: sources.arrow.source

# your destinations are the databases where your data will be saved
destinations:
  duckdb:
    type: duckdb

# your pipelines orchestrate data loading actions
pipelines:
  my_pipeline:
    source: arrow
    destination: duckdb
    dataset_name: my_pipeline_dataset
```

If you run the pipeline with `dlt pipeline my_pipeline run`, dlt+ will use the source and the destination referenced by the pipeline configuration. Datasets are not defined in this example, so dlt+ will create an implicit dataset with the name defined in the pipeline configuration. For this example, you do not even need to define the duckdb destination, as it would also be implicitly created for you.
Collaborator

Consider adding dlt[duckdb] and pyarrow to Prerequisites
Otherwise, when I run dlt pipeline my_pipeline run on a clean venv I get:

First run:

Pipeline execution failed at stage sync with exception:

<class 'dlt.common.exceptions.MissingDependencyException'>

You must install additional dependencies to run duckdb destination. If you use pip you may do the following:

pip install "dlt[duckdb]"

Dependencies for specific destinations are available as extras of dlt

After I install dlt[duckdb], and run the pipeline again:

Pipeline execution failed at stage extract when processing package 1740762841.0444548 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe items: extraction of resource items in generator items caused an exception: No module named 'pyarrow'


If you want to prevent dlt+ from creating implicit entities, you can set the `allow_undefined_entities` option to `false` in the project configuration by adding the following to your `dlt.yml` file:

```yaml
project:
  allow_undefined_entities: false
```

If you try to run the pipeline "my_pipeline" now, you will receive an error that the dataset `my_pipeline_dataset` does not exist. To fix this, you can create the dataset explicitly in the `dlt.yml` file:

```yaml
datasets:
  my_pipeline_dataset:
    destination:
      - duckdb
```

Note that the destination property is an array of destinations on which this dataset may be materialized. More on that in the next section.
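
To make the lookup rules concrete, here is a small conceptual model in plain Python. It is a sketch of the behaviour described in this section, not the dlt+ implementation: `resolve_dataset` and `UndefinedEntityError` are hypothetical names, and the configuration is a plain dict that mirrors the `dlt.yml` above.

```python
class UndefinedEntityError(Exception):
    """Raised when an entity is not declared and implicit creation is disabled."""


def resolve_dataset(config, pipeline_name):
    """Return (dataset_name, allowed_destinations) for a pipeline."""
    pipeline = config["pipelines"][pipeline_name]
    name = pipeline["dataset_name"]
    datasets = config.get("datasets", {})
    if name in datasets:
        # explicitly declared: use the declared list of destinations
        return name, list(datasets[name]["destination"])
    # implicit entities are allowed unless the project says otherwise
    allow = config.get("project", {}).get("allow_undefined_entities", True)
    if not allow:
        raise UndefinedEntityError(f"dataset {name} is not defined")
    # implicit dataset: bound to the destination of the pipeline
    return name, [pipeline["destination"]]


config = {
    "project": {"allow_undefined_entities": False},
    "pipelines": {
        "my_pipeline": {
            "source": "arrow",
            "destination": "duckdb",
            "dataset_name": "my_pipeline_dataset",
        }
    },
}

try:
    resolve_dataset(config, "my_pipeline")
except UndefinedEntityError as exc:
    print("error:", exc)

# declaring the dataset explicitly makes the lookup succeed
config["datasets"] = {"my_pipeline_dataset": {"destination": ["duckdb"]}}
print(resolve_dataset(config, "my_pipeline"))
```

The model defaults `allow_undefined_entities` to `True` because the text above states that implicit entities are created by default.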


## Managing Datasets and Destinations
Collaborator

Suggested change
## Managing Datasets and Destinations
## Managing datasets and destinations

Consider using the sentence case for headers

Contributor

I think Violetta mentioned this to you already but I don't think this section (managing datasets and destinations) or the previous one (implicit entities) belongs in this tutorial


If you add explicit dataset entries to the `dlt.yml` file, as we did above, you need to defined on which destinations the dataset may be materialized. This constraint will also be enforced when `allow_undefined_entities` is set to `true`. Change your dlt.yml file to the following:
Collaborator

Suggested change
If you add explicit dataset entries to the `dlt.yml` file, as we did above, you need to defined on which destinations the dataset may be materialized. This constraint will also be enforced when `allow_undefined_entities` is set to `true`. Change your dlt.yml file to the following:
If you add explicit dataset entries to the `dlt.yml` file, as we did above, you need to define on which destinations the dataset may be materialized. This constraint will also be enforced when `allow_undefined_entities` is set to `true`. Change your dlt.yml file to the following:



```yaml
# your sources are the data sources you want to load from
sources:
  arrow:
    type: sources.arrow.source

# your destinations are the databases where your data will be saved
destinations:
  duckdb:
    type: duckdb

datasets:
  my_pipeline_dataset:
    destination:
      - bigquery
Contributor

duckdb


# your pipelines orchestrate data loading actions
pipelines:
  my_pipeline:
    source: arrow
    destination: duckdb
    dataset_name: my_pipeline_dataset
```

If you try to run the pipeline "my_pipeline" now, you will receive an error that the dataset `my_pipeline_dataset` does not exist on the bigquery destination, you may allow this by adding `bigquery` to the allowed destinations of `my_pipeline_dataset`.
Contributor

If you try to run the pipeline "my_pipeline" now, you will receive an error that the dataset my_pipeline_dataset does not exist on the bigquery destination, you may allow this by adding bigquery to the allowed destinations of my_pipeline_dataset.

Actually I get:

Dataset my_pipeline_dataset is not available on destination duckdb. Available destinations: ['bigquery'].

I’m guessing in your instructions you meant to change the pipeline destination to bigquery? Also don’t we need to define the bigquery destination in the “destinations:” field as well? Even if I do that it keeps just loading to duckdb 🤔


```yaml
my_pipeline_dataset:
  destination:
    - duckdb
    - bigquery
```

:::tip
If you run dataset cli commands without providing a destination name, dlt+ will always select the first destination in the list by default. If you allow implicit entities, dlt+ can also discover datasets only defined on pipelines and will use the destination of that pipeline for instantiating the dataset. The same is true when retrieving datasets from the catalog in python, ore on that below.
Collaborator

Suggested change
If you run dataset cli commands without providing a destination name, dlt+ will always select the first destination in the list by default. If you allow implicit entities, dlt+ can also discover datasets only defined on pipelines and will use the destination of that pipeline for instantiating the dataset. The same is true when retrieving datasets from the catalog in python, ore on that below.
If you run dataset cli commands without providing a destination name, dlt+ will always select the first destination in the list by default. If you allow implicit entities, dlt+ can also discover datasets only defined on pipelines and will use the destination of that pipeline for instantiating the dataset. The same is true when retrieving datasets from the catalog in Python, more on that below.

:::
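
The two rules in this section (a dataset may only be used on its listed destinations, and the first entry is the default) can be modelled in a few lines. This is a sketch under stated assumptions: `select_destination` is a hypothetical helper, not dlt+ API, and the error text simply mirrors the message quoted in the review comment above.

```python
def select_destination(dataset_name, allowed, requested=None):
    """Return the destination to use for a dataset, or raise if not allowed."""
    if requested is None:
        # no destination given: fall back to the first entry in the list
        return allowed[0]
    if requested not in allowed:
        raise ValueError(
            f"Dataset {dataset_name} is not available on destination "
            f"{requested}. Available destinations: {allowed}."
        )
    return requested


# the dataset lists only bigquery, but the pipeline targets duckdb
try:
    select_destination("my_pipeline_dataset", ["bigquery"], requested="duckdb")
except ValueError as exc:
    print(exc)

# after adding duckdb to the list, the same request is accepted
print(select_destination("my_pipeline_dataset", ["duckdb", "bigquery"], requested="duckdb"))
```

Seen this way, the error comes from the pipeline's `duckdb` destination not being in the dataset's list, which matches what the reviewer observed.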


## Interacting with the project context in code

You can also interact with your dlt+ project directly in code as opposed to running cli commands. Let's use the same starting point as above by running `dlt project init arrow duckdb`. Now add a new python file in the root of your project called `do_something.py` and add the following code:
Collaborator

the above starting point was dlt project init arrow duckdb --package my_dlt_project so when I run dlt project init arrow duckdb in the same directory I see that new files are added to the current dir and not to my_dlt_project.

Collaborator

so it's best either to instruct the user to clean the directory or to use the same init commands


```py
from dlt_plus import current

if __name__ == "__main__":
    # this will get the currently active project and print its name
    print(current.project().name)
```

Run the above with `python do_something.py` and see the output.

The current module will allow you to interact with the currently active project. You can retrieve the project configuration with `current.project()`, get an instantiatend version of all defined entities from the entities factory at `current.entities()`, access all defined datasets in the catalog: `current.catalog()`, and run pipelines with the runner at `current.runner()`.
Collaborator

Suggested change
The current module will allow you to interact with the currently active project. You can retrieve the project configuration with `current.project()`, get an instantiatend version of all defined entities from the entities factory at `current.entities()`, access all defined datasets in the catalog: `current.catalog()`, and run pipelines with the runner at `current.runner()`.
The `current` module will allow you to interact with the currently active project. You can retrieve the project configuration with `current.project()`, get an instantiatend version of all defined entities from the entities factory at `current.entities()`, access all defined datasets in the catalog: `current.catalog()`, and run pipelines with the runner at `current.runner()`.

Mark current as code if we're talking about what we imported from dlt_plus

Collaborator @burnash, Feb 28, 2025

I think it would be more elegant if project() in current.project() was an attribute rather than a function: current.project. I know it's an implementation detail but it is most likely not relevant for the end user. (same for the other functions in current)
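
For illustration, the difference between the two spellings can be sketched with a property. `ProjectContext` below is a hypothetical stand-in, not the dlt+ implementation; it only shows that an attribute-style accessor can still resolve the project on every access, just like the function call does.

```python
class ProjectContext:
    """Hypothetical stand-in for the object behind `current`."""

    def __init__(self, name):
        self._name = name
        self.loads = 0  # counts how often the project is resolved

    def _load_project(self):
        # stands in for resolving the currently active project
        self.loads += 1
        return {"name": self._name}

    def project_fn(self):
        """Function style, as in the tutorial: current.project()."""
        return self._load_project()

    @property
    def project(self):
        """Attribute style, as proposed: current.project."""
        return self._load_project()


ctx = ProjectContext("my_dlt_project")
print(ctx.project_fn()["name"])
print(ctx.project["name"])
print(ctx.loads)
```

Since `current` is imported as a module, the attribute style would likely need a module-level `__getattr__` (PEP 562) rather than a class property.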


### Accessing project settings

A few examples of what you have access to on the project object:

```py
# show the currently active profile
print(current.project().current_profile)
# show the main project dir
print(current.project().project_dir)
# show the project config
print(current.project().config)
# list explicitely defined datasets (also works with destinations, sources, pipelines etc.)
Collaborator

Suggested change
# list explicitely defined datasets (also works with destinations, sources, pipelines etc.)
# list explicitly defined datasets (also works with destinations, sources, pipelines etc.)

print(current.project().datasets)
```

### Accessing entities

Accessing entities works in the same way as when accessing entities in the `dlt.yml`. If possible and allowed, implicit entities are created and returned back to you, if you, an error will be raised.
Collaborator

Suggested change
Accessing entities works in the same way as when accessing entities in the `dlt.yml`. If possible and allowed, implicit entities are created and returned back to you, if you, an error will be raised.
Accessing entities works in the same way as when accessing entities in the `dlt.yml`. If possible and allowed, implicit entities are created and returned back to you, if you specify `allow_undefined_entities: false` in `dlt.yml`, an error will be raised.

incomplete sentence?


```py
# get a pipeline instance
pipeline = current.entities().create_pipeline("my_pipeline")
# get a destination instance
destination = current.entities().create_destination("duckdb")
```

### Running pipelines with the predefined source with the runner

`dlt+` provides a pipeline runner that is used when you run pipelines from the cli as described in the basic tutorial. The project context also provides a runner instance that you can use to run your pipelines in code.
Collaborator

Suggested change
`dlt+` provides a pipeline runner that is used when you run pipelines from the cli as described in the basic tutorial. The project context also provides a runner instance that you can use to run your pipelines in code.
`dlt+` provides a pipeline runner that is used when you run pipelines from the cli as described in the project tutorial. The project context also provides a runner instance that you can use to run your pipelines in code.

It could be unclear to the reader what the basic tutorial is. Let's explicitly call it "project tutorial" (or "basic project tutorial"). We also need to link it.


```py
# get the runner
runner = current.runner()
# run the pipeline "my_pipeline" from the currently active project
runner.run_pipeline("my_pipeline")
```

### Accessing the catalog

The catalog provides access to all explicitely defined datasets
Collaborator @burnash, Feb 28, 2025

Suggested change
The catalog provides access to all explicitely defined datasets
The catalog provides access to all explicitly defined datasets:


```py
# get a dataset instance pointing to the default destination (first in dataset destinations list) and access data inside of it
# for this to work this dataset must already exist physically
dataset = current.catalog().dataset("my_pipeline_dataset")
# get the row counts of all tables in the dataset as a dataframe
print(dataset.row_counts().df())
Collaborator

Got an error here:

Traceback (most recent call last):
  File ".../hackathon-0228/do_something.py", line 26, in <module>
    print(dataset.row_counts().df())
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./hackathon-0228/.venv/lib/python3.11/site-packages/dlt/destinations/dataset/relation.py", line 89, in _wrap
    return getattr(cursor, func_name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./hackathon-0228/.venv/lib/python3.11/site-packages/dlt/destinations/sql_client.py", line 360, in df
    return next(self.iter_df(chunk_size=chunk_size))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./hackathon-0228/.venv/lib/python3.11/site-packages/dlt/destinations/impl/duckdb/sql_client.py", line 40, in iter_df
    yield self.native_cursor.fetch_df()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'numpy'

```

:::tip
Read the docs on how to access data in python in dlt datasets [here](../../../general-usage/dataset-access/dataset) to learn more about the available data access methods. You can browse and filter tables and get the data in various formats.
Comment on lines +178 to +179
Contributor

I'd suggest we don't link to open source pages from dlt+, that's just shooting ourselves in the foot. We could link to "Secure data access and sharing" instead, but we might need some more content on there?

:::

### Writing data back to the catalog

Datasets in the dlt+ catalog may also be written to. Each dataset provides a `write` method that can be used to write data back to the catalog. In the future, you will be able to control which datasets may be written to with contracts. Under the hood, `dlt+` uses an ad-hoc dlt pipeline to run this operation.

:::warning
Writing data back to the catalog is an experimental feature at this time and should be used with caution until it is fully stable.
:::

```py
# requires pandas (for example `pip install pandas`)
import pandas as pd

# get a dataset from the catalog (this dataset must already exist physically and be explicitly defined in the dlt.yml)
dataset = current.catalog().dataset("my_pipeline_dataset")
# write a dataframe to the dataset into the table "my_table"
dataset.write(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")
Collaborator

We need to tell the reader to install and import Pandas as pd here, otherwise this sample won't work

Collaborator

After installing Pandas and importing it as pd I get this error when running this line:

Traceback (most recent call last):
  File "./hackathon-0228/do_something.py", line 31, in <module>
    dataset.write(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")
TypeError: 'ReadableDBAPIRelation' object is not callable

```

You can also iterate over dataframes or arrow tables in an existing table and write them to a new table in the same or another dataset:

```py
# get dataset from the catalog
dataset = current.catalog().dataset("my_pipeline_dataset")

# here we iterate over the existing items table and write each dataframe to the dataset into a new table
def transform_frames():
    # iterate over table `items` in the dataset in chunks of 1000 rows
    for df in dataset.items.iter_df(chunk_size=1000):
        # do something with the dataframe here
        yield df

# write the dataframe to the dataset into the table "my_new_table"
dataset.write(transform_frames, table_name="my_new_table")
Collaborator

Apparently I'm doing something wrong, because here dataset.write also gives me an error:

dataset.write(transform_frames, table_name="my_new_table")
TypeError: 'ReadableDBAPIRelation' object is not callable

```
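
One possible explanation for both `TypeError`s reported above, offered as a guess rather than a confirmed diagnosis: the tutorial itself uses attribute-style table access (`dataset.items`). If the dataset object that is returned has no real `write` method, then `dataset.write` would resolve to a relation for a table named `write`, and calling that relation fails. The classes below are hypothetical stand-ins that reproduce only this mechanism:

```python
class Relation:
    """Hypothetical stand-in for a table relation such as ReadableDBAPIRelation."""

    def __init__(self, table_name):
        self.table_name = table_name


class Dataset:
    """Hypothetical dataset with attribute-style table access and no write()."""

    def __getattr__(self, name):
        # only invoked when normal attribute lookup fails
        return Relation(name)


dataset = Dataset()
print(type(dataset.items).__name__)  # Relation

try:
    dataset.write([{"name": "John"}], table_name="my_table")
except TypeError as exc:
    print(exc)  # 'Relation' object is not callable
```

If this is what happens, a first check would be whether the installed dlt+ version returns a dataset type that implements `write`.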

### Switching profiles in code

By default, when you access the project, the default or pinned profile is active. You can switch the profile with the `switch_profile` function. Consider the following example:

```py
from dlt_plus import current
from dlt_plus.project.run_context import switch_profile

if __name__ == "__main__":
    # will show that the default profile is active
    print(current.project().current_profile)
    # switch to the tests profile
    switch_profile("tests")
    # now the tests profile is active and is merged with the project config
    print(current.project().current_profile)
```
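
What "merged with the project config" means can be pictured as an overlay of the profile's values on top of the base configuration. The sketch below assumes a recursive (deep) merge, which this tutorial does not confirm. `merge_profile` and the config keys are made up for illustration and are not dlt+ API.

```python
import copy


def merge_profile(base, profile):
    """Recursively overlay profile values on a copy of the base config."""
    merged = copy.deepcopy(base)
    for key, value in profile.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are mappings: merge them key by key
            merged[key] = merge_profile(merged[key], value)
        else:
            # scalars and lists from the profile replace the base value
            merged[key] = copy.deepcopy(value)
    return merged


# illustrative keys only
base = {
    "destinations": {"duckdb": {"type": "duckdb"}},
    "runtime": {"log_level": "WARNING"},
}
tests_profile = {"runtime": {"log_level": "DEBUG"}}

merged = merge_profile(base, tests_profile)
print(merged["runtime"]["log_level"])  # DEBUG
print(base["runtime"]["log_level"])    # WARNING (base is left untouched)
```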

## Packaging a project

`dlt+` also has tools to help you package a dlt+ project for distribution. This makes your dlt+ project pip-installable and easier to distribute across your organization. To start with a dlt+ packaged project, supply the `--package` argument to the project init command:

```sh
dlt project init arrow duckdb --package my_dlt_project
```

This will provide you with the same project as used in the basic tutorial but places in a module directory named `my_dlt_project` and provided with a simple PEP compatible `pyproject.toml` file. You will also receive a default package `__init__.py` file to make the module accssible from the outside when installed:
Collaborator @burnash, Feb 28, 2025

Suggested change
This will provide you with the same project as used in the basic tutorial but places in a module directory named `my_dlt_project` and provided with a simple PEP compatible `pyproject.toml` file. You will also receive a default package `__init__.py` file to make the module accssible from the outside when installed:
This will provide you with the same project as used in the basic project tutorial but places in a module directory named `my_dlt_project` and provided with a simple PEP compatible `pyproject.toml` file. You will also receive a default package `__init__.py` file to make the module accssible from the outside when installed:


```sh
.
├── my_dlt_project/       # your project module
│   ├── __init__.py       # your package entry point
│   ├── dlt.yml           # the dlt+ project manifest
│   └── ...               # more project files
├── .gitignore
└── pyproject.toml        # the main project manifest
```

Your `dlt.yml` will look and work exactly the same as in the non-packaged version. The only difference is the module structure and the presence of the `pyproject.toml` file. The pyproject.toml has a special entry point setting that makes the dlt+ project discoverable by dlt+:
Collaborator

Suggested change
Your `dlt.yml` will look and work exactly the same as in the non-packaged version. The only difference is the module structure and the presence of the `pyproject.toml` file. The pyproject.toml has a special entry point setting that makes the dlt+ project discoverable by dlt+:
Your `dlt.yml` will look and work exactly the same as in the non-packaged version. The only difference is the module structure and the presence of the `pyproject.toml` file. The `pyproject.toml` has a special entry point setting that makes the dlt+ project discoverable by dlt+:


```toml
[project.entry-points.dlt_package]
dlt-project = "my_project"
```

You can still run the pipeline the same way as before with the cli commands from the root folder:

```sh
dlt pipeline my_pipeline run
```

If you look at the module's `__init__.py` file that was created for you, you can see the full interface of your packaged project that users of your package will have access to. It is very similar to the `current` interface of the flat project, with one additional feature: it selects the `access` profile by default. It is up to you to adapt this `__init__.py` file to your needs.

### Using the packaged project

Using this project in its default state, let's simulate installing it into a new python project like a data-scientist in your organization might do, to demonstrate how this could be used in practice. For this example we use the uv package manager, but this is also possible with poetry or pip. Let's assume you have been developing your dlt+ packaged project at `/Volumes/my_drive/my_folder/pyproject.toml`. Navigate to a new directory and initialize your project:
Collaborator

Suggested change
Using this project in its default state, let's simulate installing it into a new python project like a data-scientist in your organization might do, to demonstrate how this could be used in practice. For this example we use the uv package manager, but this is also possible with poetry or pip. Let's assume you have been developing your dlt+ packaged project at `/Volumes/my_drive/my_folder/pyproject.toml`. Navigate to a new directory and initialize your project:
Using this project in its default state, let's simulate installing it into a new python project like a data-scientist in your organization might do, to demonstrate how this could be used in practice. For this example we use the uv package manager, but this is also possible with Poetry or pip. Let's assume you have been developing your dlt+ packaged project at `/Volumes/my_drive/my_folder/pyproject.toml`. Navigate to a new directory and initialize your project:


```sh
uv init
```

Now we can install our package directly from the directory into the uv virtual environment:

```sh
uv pip install /Volumes/my_drive/my_folder
```

Now your packaged dlt project should be available for you to use. Let's create a new Python file named `test_project.py` and use the packaged project:

```py
# import the packaged project (the module itself, so that `my_dlt_project.config()` below resolves)
import my_dlt_project

if __name__ == "__main__":
    # should print "access" as defined in your dlt package
    print(my_dlt_project.config().current_profile)
Contributor

Is this supposed to be print(current.config().current_profile)?

    # now let's run the pipeline in the packaged project
    my_dlt_project.runner().run_pipeline("my_pipeline")
```

Now you can run the script within the uv virtual environment:
Contributor @akelad, Mar 3, 2025

We should probably link to "secure data access and sharing" or something from this section to show how data scientists would use the package finally.


```sh
uv run python test_project.py
```


:::info
In a real-life scenario, your data scientist will probably not install this package from another directory, but from a PyPI repository or a Git URL.
:::


## Next steps
2 changes: 1 addition & 1 deletion docs/website/docs/plus/getting-started/tutorial.md
@@ -67,7 +67,7 @@ After running the command, the following folder structure is created:
│ ├── config.toml
│ ├── dev.secrets.toml
│ └── secrets.toml
├── _data/ # local storage for your project, excluded from git
├── _data/ # local storage for your project, excluded from git
├── sources/ # your sources, contains the code for the arrow source
│ └── arrow.py
├── .gitignore
1 change: 1 addition & 0 deletions docs/website/sidebars.js
@@ -70,6 +70,7 @@ const sidebars = {
items: [
'plus/getting-started/installation',
'plus/getting-started/tutorial',
'plus/getting-started/advanced_tutorial',
]
},
{