Updated the adjust a schema and sql configuration docs #2387

Open
wants to merge 4 commits into base: devel
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/postgres.md
@@ -70,7 +70,7 @@ To pass credentials directly, use the [explicit instance of the destination](../
pipeline = dlt.pipeline(
pipeline_name='chess',
destination=dlt.destinations.postgres("postgresql://loader:<password>@localhost/dlt_data"),
dataset_name='chess_data'  # your destination schema name
)
```

@@ -18,6 +18,10 @@ import Header from '../_source-info-header.md';

Read more about sources and resources here: [General usage: source](../../../general-usage/source.md) and [General usage: resource](../../../general-usage/resource.md).

:::note NOTE
To see the complete list of source arguments for `sql_database`, [refer to this section](#arguments-for-sql_database-source).
:::

### Example usage:

:::tip
@@ -344,3 +348,55 @@
```
With the dataset above and a local PostgreSQL instance, the `ConnectorX` backend is 2x faster than the `PyArrow` backend.

### Arguments for `sql_database` source
The following arguments can be used with the `sql_database` source (a usage sketch follows the list):

`credentials` (Union[ConnectionStringCredentials, Engine, str]): Database credentials or an `sqlalchemy.Engine` instance.

`schema` (Optional[str]): Name of the database schema to load (if different from default).

`metadata` (Optional[MetaData]): Optional `sqlalchemy.MetaData` instance. The `schema` argument is ignored when this is used.

`table_names` (Optional[List[str]]): A list of table names to load. By default, all tables in the schema are loaded.

`chunk_size` (int): Number of rows yielded in one batch. SQLAlchemy will create an additional internal row buffer of twice the chunk size.

`backend` (TableBackend): Type of backend to generate table data. One of "sqlalchemy", "pyarrow", "pandas", or "connectorx".

- "sqlalchemy" yields batches as lists of Python dictionaries, "pyarrow" and "connectorx" yield batches as Arrow tables, and "pandas" yields pandas DataFrames.

- "sqlalchemy" is the default and does not require additional dependencies.

- "pyarrow" creates stable destination schemas with correct data types.

- "connectorx" is typically the fastest but ignores `chunk_size`, so you must deal with large tables yourself.

`detect_precision_hints` (bool): Deprecated. Use `reflection_level`. Set column precision and scale hints for supported data types in the target schema based on the columns in the source tables. This is disabled by default.

`reflection_level` (ReflectionLevel): Specifies how much information should be reflected from the source database schema.

- "minimal": Only table names, nullability, and primary keys are reflected. Data types are inferred from the data. This is the default option.

- "full": Data types will be reflected on top of "minimal". `dlt` will coerce the data into reflected types if necessary.

- "full_with_precision": Sets precision and scale on supported data types (i.e., decimal, text, binary). Creates big and regular integer types.

`defer_table_reflect` (bool): Connects and reflects the table schema only when yielding data. Requires `table_names` to be explicitly passed.
Enable this option when running on Airflow. Available on dlt 0.4.4 and later.

`table_adapter_callback` (Callable): Receives each reflected table. May be used to modify the list of columns that will be selected.

`backend_kwargs` (**kwargs): kwargs passed to the table backend, e.g., "conn" is used to pass a specialized connection string to the ConnectorX backend.

`include_views` (bool): Reflect views as well as tables. Note that views named in `table_names` are always included regardless of this setting. This is set to `False` by default.

`type_adapter_callback` (Optional[Callable]): Callable to override type inference when reflecting columns.
The argument is a single sqlalchemy data type (`TypeEngine` instance), and it should return another sqlalchemy data type or `None` (the type will be inferred from the data).

`query_adapter_callback` (Optional[Callable[[Select, Table], Select]]): Callable to override the SELECT query used to fetch data from the table. The callback receives the sqlalchemy `Select` and the corresponding `Table`, `Incremental`, and `Engine` objects, and should return the modified `Select` or `Text`.

`resolve_foreign_keys` (bool): Translate foreign keys in the same schema to `references` table hints.
May incur additional database calls as all referenced tables are reflected.

`engine_adapter_callback` (Callable[[Engine], Engine]): Callback to configure or modify the Engine instance that will be used to open a connection, e.g., to set the transaction isolation level.
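
Below is a minimal usage sketch showing how several of these arguments fit together. The connection string, schema, table names, chunk size, and destination are illustrative placeholders, not values taken from this page:

```py
import dlt
from dlt.sources.sql_database import sql_database

# Minimal sketch: all connection details, schema, and table names are placeholders.
source = sql_database(
    credentials="postgresql://loader:<password>@localhost/dlt_data",
    schema="public",                    # load from a non-default schema if needed
    table_names=["family", "genome"],   # omit to load all tables in the schema
    chunk_size=50_000,                  # rows yielded per batch
    backend="pyarrow",                  # stable destination schemas with correct types
    reflection_level="full",            # reflect data types, not just names and keys
)

pipeline = dlt.pipeline(
    pipeline_name="sql_database_example",
    destination="duckdb",
    dataset_name="sql_data",
)
print(pipeline.run(source))
```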

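The adapter callbacks are plain functions. The sketch below assumes the two-argument form of `query_adapter_callback` described above and that a table adapter may modify the reflected `Table` in place; the `updated` and `raw_payload` column names are invented for illustration:

```py
import sqlalchemy as sa
from sqlalchemy.sql import Select

from dlt.sources.sql_database import sql_database


def select_recent_rows(query: Select, table: sa.Table) -> Select:
    # Hypothetical filter: only fetch rows changed since 2024.
    # The "updated" column is an invented example.
    if "updated" in table.c:
        return query.where(table.c.updated >= "2024-01-01")
    return query


def drop_large_columns(table: sa.Table) -> sa.Table:
    # Hypothetical adapter: exclude a bulky column from reflection so it is
    # never selected. Note this touches SQLAlchemy's internal column collection.
    if "raw_payload" in table.c:
        table._columns.remove(table.c["raw_payload"])
    return table


source = sql_database(
    credentials="postgresql://loader:<password>@localhost/dlt_data",
    table_names=["family"],
    query_adapter_callback=select_recent_rows,
    table_adapter_callback=drop_large_columns,
)
```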
13 changes: 7 additions & 6 deletions docs/website/docs/walkthroughs/adjust-a-schema.md
@@ -36,8 +36,8 @@ schemas
|---export/
```

Rather than providing the paths in the `dlt.pipeline` function, you can also set them at
the beginning of the `config.toml` file:

```toml
export_schema_path="schemas/export"
@@ -74,10 +74,11 @@ You should keep the import schema as simple as possible and let `dlt` do the rest
In the next steps, we'll experiment a lot; you will be warned to set `dev_mode=True` until we are done experimenting.

:::caution
`dlt` does **not modify** existing columns in a table after creation. While new columns can be added, changes to existing
columns (such as altering data types or adding hints) will not take effect automatically.

If you modify a YAML schema file, you must either delete the dataset, enable `dev_mode=True`, or use one of the pipeline
[refresh options](../general-usage/pipeline#refresh-pipeline-data-and-state) to apply the changes:
```py
dlt.pipeline(
import_schema_path="schemas/import",