docs: add advanced project tutorial #2338
base: devel
Conversation
datasets:
  my_pipeline_dataset:
    destination:
      - bigquery
duckdb
Overall I loved the experience. What stood out for me:
Next steps in my head: figure out how I can get a rudimentary DAG from this so I can turn it into a deployment. Questions I have that were not answered during the experience:

- was super easy to use, and the experience (of using the
- unsure of what I'm doing wrong with the catalog commands: I can access the data objects (iterables) but can't quite load dataframes from it; had to go check if the runner worked. It did! The data was loaded right, but I had trouble loading the dataframes themselves.
- was unsure of how to run a particular source (in the case of multiple sources) when running the runner, because I think it runs all sources mentioned in the yaml file. So I had to make those changes in the yaml file: declare a particular source if I only want to run that, and not all of them, and then use the runner. It would be easier if the `run_pipeline()` function could also take a source name alongside a pipeline name (see the sketch below).
- all in all, a super easy experience, especially if we're already used to dlt 🤩
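A rough sketch of that suggestion (hypothetical signature; `run_pipeline` does not currently accept a source name, and the source name used here is made up):

```py
# hypothetical: run only the named source instead of every source in dlt.yml
# (neither the "source" argument nor the "my_source" name exist today)
current.runner().run_pipeline("my_pipeline", source="my_source")
```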
## Implicit entities

By default, dlt+ will create implicit entities when they are requested by the user or the executed code. Let's create an example project again with `dlt project init arrow duckdb --package my_dlt_project`.
It'd be easier for the reader to copy the command if you use a code block for `dlt project init arrow duckdb --package my_dlt_project`.
- you're familiar with the [core concepts of dlt](../../reference/explainers/how-dlt-works.md)
- you have completed the [basic project tutorial](./tutorial.md)

## Implicit entities

Even though "entities" were mentioned in the basic tutorial, it would be better to elaborate what exactly they are or give some clarification (e.g. sources, destinations).
## Implicit entities

By default, dlt+ will create implicit entities when they are requested by the user or the executed code. Let's create an example project again with `dlt project init arrow duckdb --package my_dlt_project`.

> Let's create an example project again

An improvement would be to explicitly link the previous tutorial, e.g. something like "Let's create the same project as in the basic tutorial (link) with ..."
dataset_name: my_pipeline_dataset
```

If you run the pipeline with `dlt pipeline my_pipeline run`, dlt+ will use the source and the destination referenced by the pipeline configuration. Datasets are not defined in this example, so dlt+ will create an implicit dataset with the name as defined in the pipeline configuration. For this example, you do not even need the duckdb destination defined, as this would also be implicitely created for you.
Consider adding `dlt[duckdb]` and `pyarrow` to Prerequisites. Otherwise, when I run `dlt pipeline my_pipeline run` on a clean venv I get:

First run:

Pipeline execution failed at stage sync with exception:
<class 'dlt.common.exceptions.MissingDependencyException'>
You must install additional dependencies to run duckdb destination. If you use pip you may do the following:
pip install "dlt[duckdb]"
Dependencies for specific destinations are available as extras of dlt

After I install `dlt[duckdb]` and run the pipeline again:

Pipeline execution failed at stage extract when processing package 1740762841.0444548 with exception:
<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe items: extraction of resource items in generator items caused an exception: No module named 'pyarrow'
## Managing Datasets and Destinations

If you add explicit dataset entries to the `dlt.yml` file, as we did above, you need to defined on which destinations the dataset may be materialized. This constraint will also be enforced when `allow_undefined_entities` is set to `true`. Change your dlt.yml file to the following:

Suggested change:
If you add explicit dataset entries to the `dlt.yml` file, as we did above, you need to define on which destinations the dataset may be materialized. This constraint will also be enforced when `allow_undefined_entities` is set to `true`. Change your dlt.yml file to the following:
Note that the destination property is an array of destinations on which this dataset may be materialized. More on that in the next section.

## Managing Datasets and Destinations

Suggested change:
## Managing datasets and destinations

Consider using sentence case for headers.
```

:::tip
If you run dataset cli commands without providing a destination name, dlt+ will always select the first destination in the list by default. If you allow implicit entities, dlt+ can also discover datasets only defined on pipelines and will use the destination of that pipeline for instantiating the dataset. The same is true when retrieving datasets from the catalog in python, ore on that below.

Suggested change:
If you run dataset cli commands without providing a destination name, dlt+ will always select the first destination in the list by default. If you allow implicit entities, dlt+ can also discover datasets only defined on pipelines and will use the destination of that pipeline for instantiating the dataset. The same is true when retrieving datasets from the catalog in Python, more on that below.
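For illustration, retrieving a dataset from the catalog in Python could look roughly like this (a sketch; it assumes `current` is imported from `dlt_plus` as in the code snippets discussed later in this thread):

```py
from dlt_plus import current  # assumed import path

# no destination passed: per the tip above, dlt+ falls back to the first destination
# in the dataset's list, or to the pipeline's destination for an implicit dataset
dataset = current.catalog().dataset("my_pipeline_dataset")
```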
## Interacting with the project context in code

You can also interact with your dlt+ project directly in code as opposed to running cli commands. Let's use the same starting point as above by running `dlt project init arrow duckdb`. Now add a new python file in the root of your project called `do_something.py` and add the following code:
The above starting point was `dlt project init arrow duckdb --package my_dlt_project`, so when I run `dlt project init arrow duckdb` in the same directory I see that new files are added to the current dir and not to `my_dlt_project`.

So it's best either to instruct the user to clean the directory or to use the same init commands.
Run the above with `python do_something.py` and see the output.

The current module will allow you to interact with the currently active project. You can retrieve the project configuration with `current.project()`, get an instantiatend version of all defined entities from the entities factory at `current.entities()`, access all defined datasets in the catalog: `current.catalog()`, and run pipelines with the runner at `current.runner()`.

Suggested change:
The `current` module will allow you to interact with the currently active project. You can retrieve the project configuration with `current.project()`, get an instantiated version of all defined entities from the entities factory at `current.entities()`, access all defined datasets in the catalog: `current.catalog()`, and run pipelines with the runner at `current.runner()`.

Mark `current` as code if we're talking about what we imported from `dlt_plus`.
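For reference, a minimal `do_something.py` along the lines of the snippets quoted in this thread (a sketch; the exact import of `current` from `dlt_plus` is assumed from the comment above):

```py
from dlt_plus import current  # assumed import path

# project configuration and location
print(current.project().project_dir)
print(current.project().config)

# instantiated entities and the dataset catalog defined in dlt.yml
entities = current.entities()
catalog = current.catalog()

# run a pipeline defined in dlt.yml through the project runner
current.runner().run_pipeline("my_pipeline")
```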
Run the above with `python do_something.py` and see the output.

The current module will allow you to interact with the currently active project. You can retrieve the project configuration with `current.project()`, get an instantiatend version of all defined entities from the entities factory at `current.entities()`, access all defined datasets in the catalog: `current.catalog()`, and run pipelines with the runner at `current.runner()`.

I think it would be more elegant if `project()` in `current.project()` was an attribute rather than a function: `current.project`. I know it's an implementation detail, but it is most likely not relevant for the end user. (Same for the other functions in `current`.)
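i.e. something like this (hypothetical attribute-style access, not the current API):

```py
# hypothetical: attribute access instead of calling current.project()
print(current.project.project_dir)
```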
print(current.project().project_dir)
# show the project config
print(current.project().config)
# list explicitely defined datasets (also works with destinations, sources, pipelines etc.)

Suggested change:
# list explicitly defined datasets (also works with destinations, sources, pipelines etc.)
### Accessing entities

Accessing entities works in the same way as when accessing entities in the `dlt.yml`. If possible and allowed, implicit entities are created and returned back to you, if you, an error will be raised.

Suggested change:
Accessing entities works in the same way as when accessing entities in the `dlt.yml`. If possible and allowed, implicit entities are created and returned back to you; if you specify `allow_undefined_entities: false` in `dlt.yml`, an error will be raised.

Incomplete sentence?
### Running pipelines with the predefined source with the runner

`dlt+` provides a pipeline runner that is used when you run pipelines from the cli as described in the basic tutorial. The project context also provides a runner instance that you can use to run your pipelines in code.

Suggested change:
`dlt+` provides a pipeline runner that is used when you run pipelines from the cli as described in the project tutorial. The project context also provides a runner instance that you can use to run your pipelines in code.

It could be unclear for the reader what the basic tutorial is. Let's explicitly call it "project tutorial" (or "basic project tutorial"). Also need to link it.
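For context, running a pipeline through the runner in code boils down to something like this (a sketch assuming the `current` import discussed above):

```py
from dlt_plus import current  # assumed import path

# run the pipeline named "my_pipeline" from dlt.yml via the project runner
current.runner().run_pipeline("my_pipeline")
```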
### Accessing the catalog

The catalog provides access to all explicitely defined datasets

Suggested change:
The catalog provides access to all explicitly defined datasets:
# for this to work this dataset must already exist physically
dataset = current.catalog().dataset("my_pipeline_dataset")
# get the row counts of all tables in the dataset as a dataframe
print(dataset.row_counts().df())
Got an error here:
Traceback (most recent call last):
File ".../hackathon-0228/do_something.py", line 26, in <module>
print(dataset.row_counts().df())
^^^^^^^^^^^^^^^^^^^^^^^^^
File "./hackathon-0228/.venv/lib/python3.11/site-packages/dlt/destinations/dataset/relation.py", line 89, in _wrap
return getattr(cursor, func_name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./hackathon-0228/.venv/lib/python3.11/site-packages/dlt/destinations/sql_client.py", line 360, in df
return next(self.iter_df(chunk_size=chunk_size))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./hackathon-0228/.venv/lib/python3.11/site-packages/dlt/destinations/impl/duckdb/sql_client.py", line 40, in iter_df
yield self.native_cursor.fetch_df()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'numpy'
# get a dataset from the catalog (this dataset must already exist physically and be explicitely defined in the dlt.yml)
dataset = current.catalog().dataset("my_pipeline_dataset")
# write a dataframe to the dataset into the table "my_table"
dataset.write(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")
We need to tell the reader to install and import pandas as `pd` here, otherwise this sample won't work.
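Something like this at the top of the snippet (a sketch that only adds the missing install/import; the `dataset.write` call is taken verbatim from the tutorial draft and, as noted below, may not work on every version):

```py
# pip install pandas
import pandas as pd

from dlt_plus import current  # assumed import path

# dataset must already exist physically and be defined in dlt.yml
dataset = current.catalog().dataset("my_pipeline_dataset")
# write a dataframe into the table "my_table", as shown in the tutorial draft
dataset.write(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")
```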
After installing pandas and importing it as `pd`, I get this error when running this line:
Traceback (most recent call last):
File "./hackathon-0228/do_something.py", line 31, in <module>
dataset.write(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")
TypeError: 'ReadableDBAPIRelation' object is not callable
yield df

# write the dataframe to the dataset into the table "my_new_table"
dataset.write(transform_frames, table_name="my_new_table")
Apparently I'm doing something wrong, because here `dataset.write` also gives me an error:
dataset.write(transform_frames, table_name="my_new_table")
TypeError: 'ReadableDBAPIRelation' object is not callable
dlt project init arrow duckdb --package my_dlt_project
```

This will provide you with the same project as used in the basic tutorial but places in a module directory named `my_dlt_project` and provided with a simple PEP compatible `pyproject.toml` file. You will also receive a default package `__init__.py` file to make the module accssible from the outside when installed:

Suggested change:
This will provide you with the same project as used in the basic project tutorial but placed in a module directory named `my_dlt_project` and provided with a simple PEP compatible `pyproject.toml` file. You will also receive a default package `__init__.py` file to make the module accessible from the outside when installed:
└── pyproject.toml # the main project manifest
```

Your `dlt.yml` will look and work exactly the same as in the non-packaged version. The only difference is the module structure and the presence of the `pyproject.toml` file. The pyproject.toml has a special entry point setting that makes the dlt+ project discoverable by dlt+:

Suggested change:
Your `dlt.yml` will look and work exactly the same as in the non-packaged version. The only difference is the module structure and the presence of the `pyproject.toml` file. The `pyproject.toml` has a special entry point setting that makes the dlt+ project discoverable by dlt+:
### Using the packaged project

Using this project in its default state, let's simulate installing it into a new python project like a data-scientist in your organization might do, to demonstrate how this could be used in practice. For this example we use the uv package manager, but this is also possible with poetry or pip. Let's assume you have been developing your dlt+ packaged project at `/Volumes/my_drive/my_folder/pyproject.toml`. Navigate to a new directory and initialize your project:

Suggested change:
Using this project in its default state, let's simulate installing it into a new Python project like a data scientist in your organization might do, to demonstrate how this could be used in practice. For this example we use the uv package manager, but this is also possible with Poetry or pip. Let's assume you have been developing your dlt+ packaged project at `/Volumes/my_drive/my_folder/pyproject.toml`. Navigate to a new directory and initialize your project:
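Once installed, the consuming script could look roughly like this (a sketch assembled from the `my_dlt_project` snippets quoted later in this thread):

```py
import my_dlt_project  # the packaged dlt+ project, installed into the new environment

if __name__ == "__main__":
    # the active profile as defined in the dlt package, e.g. "access"
    print(my_dlt_project.config().current_profile)
    # run a pipeline defined in the packaged dlt.yml
    my_dlt_project.runner().run_pipeline("my_pipeline")
```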
Good overall structure and tutorial.
Great experience! Easy to use, everything went smoothly. I love that we can work with Python. At the same time, it feels like we initially tried to avoid Python with YAML, and now we're coming back to it; I personally don't mind, just found it funny :) Also, the tutorial is clear and very well written. Every time I had a question, I found the answer in the next section. Some notes and problems are described below.
Overall everything looks good + works, had a few comments on specific sections
:::tip
Read the docs on how to access data in python in dlt datasets [here](../../../general-usage/dataset-access/dataset) to learn more about the available data access methods. You can browse and filter tables and get the data in various formats.
I'd suggest we don't link to open source pages from dlt+, that's just shooting ourselves in the foot. We could link to "Secure data access and sharing" instead, but we might need some more content on there?
## Implicit entities

By default, dlt+ will create implicit entities when they are requested by the user or the executed code. Let's create an example project again with `dlt project init arrow duckdb --package my_dlt_project`.

Do we need to be using the `--package` flag at the top here, actually? We don't talk about packaging it up until later on in the tutorial.
Note that the destination property is an array of destinations on which this dataset may be materialized. More on that in the next section.

## Managing Datasets and Destinations
I think Violetta mentioned this to you already but I don't think this section (managing datasets and destinations) or the previous one (implicit entities) belongs in this tutorial
dataset_name: my_pipeline_dataset
```

If you try to run the pipeline "my_pipeline" now, you will receive an error that the dataset `my_pipeline_dataset` does not exist on the bigquery destination, you may allow this by adding `bigquery` to the allowed destinations of `my_pipeline_dataset`.
> If you try to run the pipeline "my_pipeline" now, you will receive an error that the dataset my_pipeline_dataset does not exist on the bigquery destination, you may allow this by adding bigquery to the allowed destinations of my_pipeline_dataset.
Actually I get:
Dataset my_pipeline_dataset is not available on destination duckdb. Available destinations: ['bigquery'].
I’m guessing in your instructions you meant to change the pipeline destination to bigquery? Also don’t we need to define the bigquery destination in the “destinations:” field as well? Even if I do that it keeps just loading to duckdb 🤔
my_dlt_project.runner().run_pipeline("my_pipeline")
```

Now you can run the script within the uv virtual environment:
We should probably link to "secure data access and sharing" or something from this section to show how data scientists would use the package finally.
<Link/>

This tutorial will introduce you to advanced dlt project features. You will learn how to:

Suggested change:
This tutorial will introduce you to advanced dlt project features. You will:
In general, the tutorial is pretty straight-forward and understandable. But I did hit a few snags:
if __name__ == "__main__":
    # should print "access" as defined in your dlt package
    print(my_dlt_project.config().current_profile)
Is this supposed to be `print(current.config().current_profile)`?
Description
This PR adds an advanced documentation page for the dlt project and packaging features.