Releases: dlt-hub/dlt

1.5.0

17 Dec 18:49
e8c5e9b

Core Library

After several weeks of experimenting, we are releasing the dataset API. You can now read data in your destination through a unified interface that works the same way for warehouses, relational databases, SQLAlchemy dialects, local and remote files, and Iceberg and Delta tables.
You can use simple dot notation to access tables, execute SQL, or use data frame expressions (compiled to SQL with ibis). We materialize your data as pandas data frames, Arrow tables, or DB-API compatible records (also in batches). Here is the main intro:
https://dlthub.com/docs/general-usage/dataset-access/dataset
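
To picture what "DB-API compatible records (also in batches)" means, here is a minimal sketch that uses only the standard library's sqlite3 module. It is not dlt's dataset API (see the linked docs for that); the items table and the iter_batches helper are hypothetical names made up for this illustration.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_batches(cursor: sqlite3.Cursor, batch_size: int) -> Iterator[List[Tuple]]:
    """Yield lists of at most `batch_size` rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows


# Hypothetical "items" table standing in for a table loaded by a pipeline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item_{i}") for i in range(1, 8)],
)

# Read the seven rows back in batches of three.
cursor = conn.execute("SELECT id, name FROM items ORDER BY id")
batches = list(iter_batches(cursor, batch_size=3))
batch_sizes = [len(batch) for batch in batches]
```

Reading in batches keeps memory bounded when a table is too large to materialize at once.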

Alongside it we are releasing our backend-less, catalog-less Iceberg implementation (an ad hoc technical catalog is still created). You can use the append and replace write dispositions, create partitions, and write to a bucket. Be aware of the limitations: we are just getting started!
https://dlthub.com/docs/dlt-ecosystem/destinations/delta-iceberg

  • bump semver to minimum version 3.0.0 by @sh-rp in #2132
  • leverage ibis expression for getting readablerelations by @sh-rp in #2046
  • iceberg table format support for filesystem destination by @jorritsandbrink in #2067
  • fixes dlt init fails in Colab (userdata problem) by @rudolfix in #2117
  • Add open/closed range arguments for incremental by @steinitzu in #1991
  • Fix validation error for custom auth classes by @burnash in #2129
  • add databricks oauth authentication by @donotpush in #2138
  • make duckdb handle Iceberg table with nested types by @jorritsandbrink in #2141
  • refresh standalone resources (old columns were recreated) by @rudolfix in #2140
  • fix ibis az problems on linux by @sh-rp in #2135
  • does not raise if data type was changed manually in schema by @rudolfix in #2150
  • allows to --eject source code of the core sources (ie. sql_database) to allow hacking-in customizations by @rudolfix in #2150
  • convert add_limit to pipe step based limiting by @sh-rp in #2131
  • Enable datetime format for negative timezone by @hairrrrr in #2155
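
The "open/closed range arguments for incremental" item in the list above is about whether the boundary values of a cursor range are included. A minimal sketch of that idea in plain Python follows; the filter_range helper and its range_start / range_end parameter names are assumptions chosen to mirror the changelog wording, not a copy of dlt's signature.

```python
from typing import List


def filter_range(
    values: List[int],
    start: int,
    end: int,
    range_start: str = "closed",  # assumed name: "closed" includes `start`
    range_end: str = "open",      # assumed name: "open" excludes `end`
) -> List[int]:
    """Keep values inside [start, end], with each boundary open or closed."""

    def after_start(value: int) -> bool:
        return value >= start if range_start == "closed" else value > start

    def before_end(value: int) -> bool:
        return value <= end if range_end == "closed" else value < end

    return [v for v in values if after_start(v) and before_end(v)]


cursor_values = [1, 2, 3, 4, 5]

# Closed start, open end: 2 is included, 4 is excluded.
closed_open = filter_range(cursor_values, 2, 4)

# Open start, closed end: 2 is excluded, 4 is included.
open_closed = filter_range(cursor_values, 2, 4, range_start="open", range_end="closed")
```

With an open start, the boundary value already seen in a previous run is not read again.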

ℹ️ Note on add_limit: you can now use it to chunk large resources and load them in pieces. We support chunks based on a maximum number of rows or on a specified elapsed time. Please read the docs: your resource should return ordered rows or be able to resume from a checkpoint. Also note that add_limit is now applied after all processing steps (e.g. incremental); previously we limited the generator directly. This change was necessary to implement chunking. It is backward compatible with respect to the data produced, but your resource may be queried many times to get a "new" item, i.e. one that is not filtered out by incremental.
https://dlthub.com/docs/examples/backfill_in_chunks
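
The note above can be pictured with a small plain-Python sketch. It does not call dlt; ordered_rows, only_new and load_chunk are hypothetical stand-ins for a resource, an incremental filter step and a limit applied after that filter. The point is that the underlying generator can be pulled more often than the limit suggests.

```python
from itertools import islice
from typing import Dict, Iterator, List, Tuple


def ordered_rows(total: int, stats: Dict[str, int]) -> Iterator[Dict[str, int]]:
    """Hypothetical resource: yields ordered rows and counts every pull."""
    for row_id in range(1, total + 1):
        stats["pulled"] += 1
        yield {"id": row_id}


def only_new(rows: Iterator[Dict[str, int]], last_id: int) -> Iterator[Dict[str, int]]:
    """Stand-in for an incremental step: drop rows already loaded."""
    return (row for row in rows if row["id"] > last_id)


def load_chunk(total: int, last_id: int, max_items: int) -> Tuple[List[Dict[str, int]], int]:
    """Apply the limit AFTER filtering, as the note describes."""
    stats = {"pulled": 0}
    new_rows = only_new(ordered_rows(total, stats), last_id)
    chunk = list(islice(new_rows, max_items))
    return chunk, stats["pulled"]


last_id = 0          # checkpoint carried between runs
chunks = []          # ids loaded in each run
pull_counts = []     # how many rows the generator produced in each run
while True:
    chunk, pulled = load_chunk(total=7, last_id=last_id, max_items=3)
    if not chunk:
        break
    chunks.append([row["id"] for row in chunk])
    pull_counts.append(pulled)
    last_id = chunk[-1]["id"]
```

In the second run the generator is pulled six times to produce three new rows, because the first three rows are filtered out before the limit counts them. A resource that can resume from the checkpoint avoids that repeated work.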

Docs

  • prepare dataset release & docs updates by @sh-rp in #2126
  • Add missing mention of the required endpoint_url config in GCS by @trymzet in #2120
  • example how to use add_limit to do large backfills in steps by @sh-rp in #2131
  • Update auth info in databricks docs by @VioletM in #2153
  • improve how dlt works page by @sh-rp in #2152
  • explicitly adding docs for destination item size control by @HulmaNaseer in #2118
  • Docs: rest_api tutorial: update primary key in merge example by @burnash in #2147

Verified Sources

The code has been updated to dlt 1.x.x and the tests work again. We are accepting contributions again.
ℹ️ The 0.5 sources are on the 0.5 tag. If you are still on dlt 0.5.x, access this tag via dlt init sql_database duckdb --branch 0.5

New Contributors

Full Changelog: 1.4.1...1.5.0

1.4.1

02 Dec 21:54
f069071

Bugfixes

We are releasing an important bugfix and defining the identifier normalization behavior for compound identifiers. In practice:

  • identifiers that contain double underscores will be allowed
  • all existing schemas (ie. stored at destination) will be set to work in backward-compatible mode

Read more here: https://dlthub.com/docs/devel/general-usage/naming-convention#compound-flattened-identifiers

  • #2087 allows double underscores in identifiers by @rudolfix in #2098
  • Fixes the usage of escaped JSONPath in incremental cursors in sql_database by @burnash in #2077
  • Fix/2089 support sets for pyarrow backend by @karakanb in #2090
  • allow to increase total count on most progress bars, fixes incorrect output in load stage by @sh-rp in #2100

Core Library

  • Support custom Ollama Host by @Pipboyguy in #2044
  • feat(rest_api): custom client for specific resources by @joscha in #2082
  • Support Spatial Types for PostGIS by @Pipboyguy in #1927
  • Incremental table hints and incremental in resource decorator by @steinitzu in #2033
  • supports custom account host for azure (#2073) and fixes various edge cases for abfss by @rudolfix

Core Sources

adds engine adapter and passes incremental and engine to query adapter by @rudolfix in #2070

  • adds engine adapter callback to modify engine settings before connection is opened (hopefully fixes #1920)
  • allows to return a subquery for table adapter, adds example that fixes #2076
  • passes Incremental and Engine instances to query adapter callback
  • allows to return text query from engine adapter #1997
  • arrow backend now infers not reflected columns from the data

(Still) experimental interfaces

  • allow to select schema from pipeline dataset factory by @sh-rp in #2075
  • ibis support - hand over credentials to ibis backend for a number of destinations by @sh-rp in #2004

Docs

New Contributors

Full Changelog: 1.4.0...1.4.1

1.4.0

14 Nov 21:15
0fce1c8

Core Library

  • feat: add incremental lag (attribution window) for datetime, int, and float cursors by @donotpush in #1957
  • LanceDB - (1) support merge key to merge chunked documents correctly - removes orphaned chunks (2) huge performance upgrade by loading data via arrow by @Pipboyguy in #1620
  • Move exclude_keys() to dlt.common.utils by @burnash in #1966
  • Fix BigQueryLoadJob hiding root cause exception by @xneg in #1992
  • loads secrets from colab userdata and streamlit + bugfixes by @rudolfix in #1994
  • Fix pagination issue in JSONResponseCursorPaginator with empty string cursor value by @kang8 in #2016
  • fix: if name of distribution is None by @senickel in #2024
  • allows to pass default values when writing specs by @rudolfix in #2018
  • enable delta partitioning on arrow normalizer load id by @jorritsandbrink in #2022
  • add session token to duckdb s3 secret by @jorritsandbrink in #2007
  • Add user agent for Databricks by @VioletM in #1987
  • Fix an incorrect missing dependency error by @burnash in #2001
  • fix resource level max_table_nesting and normalizer performance tuning by @sh-rp in #2026
  • move default pipelines of core sources into source folders by @sh-rp in #1888
  • duckdb filesystem custom secrets by @sh-rp in #2017
  • allows for empty dataset clickhouse by @rudolfix in #2045
  • add GCP default credential handling for delta table format by @jorritsandbrink in #2048
  • enables merges for bigquery autodetect schema by @sh-rp in #2035
  • logs warning if deduplication state is large by @willi-mueller in #1877
  • Add core sources extras to requirements in dlt init by @burnash in #2028
  • Fix merge write disposition for pyarrow and ClickHouse by @burnash in #2042
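
The "incremental lag (attribution window)" item at the top of the list above means that a new run starts reading slightly before the last stored cursor value, so that records updated late are picked up again. A minimal sketch of the arithmetic follows; effective_start is a hypothetical helper, and measuring the lag in seconds is a choice made for this sketch.

```python
from datetime import datetime, timedelta


def effective_start(last_value: datetime, lag_seconds: float) -> datetime:
    """Move the start of the next run back by the lag (attribution window)."""
    return last_value - timedelta(seconds=lag_seconds)


# Cursor value stored by the previous run.
last_value = datetime(2024, 11, 14, 12, 0, 0)
start = effective_start(last_value, lag_seconds=3600)

# Hypothetical records with an updated_at cursor column.
events = [
    {"id": 1, "updated_at": datetime(2024, 11, 14, 10, 30)},
    {"id": 2, "updated_at": datetime(2024, 11, 14, 11, 15)},  # late update
    {"id": 3, "updated_at": datetime(2024, 11, 14, 12, 30)},
]

without_lag = [e["id"] for e in events if e["updated_at"] >= last_value]
with_lag = [e["id"] for e in events if e["updated_at"] >= start]
```

Records inside the window are read a second time, so a lag is usually paired with a merge or deduplication step downstream.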

Experimental interfaces

dlt dataset public interface and docs coming next week.

  • 1990 - dataset columns select and limit by @sh-rp in #2000

Docs

New Contributors

Full Changelog: 1.3.0...1.4.0

1.3.0

22 Oct 08:53
1893860

Core Library

  • Fix try/except in from_reference shadowing MissingDependencyException by @burnash in #1939
  • prefers uv over pip if found (when creating virtual envs) by @rudolfix in #1940
  • allows to plug new or updated dlt cli commands by @sh-rp in #1938
  • Feat/557 rest api add oauth2clientcredentials to built in auth methods by @willi-mueller in #1871
  • uses path normalize for columns in arrow tables by @rudolfix in #1947
  • Added extended jsonpath_ng parser (rest_api) by @francescomucio in #1941
  • Fix/1897 support https endpoints clickhouse by @sh-rp in #1931
  • Fix for multiple ignores is not working (rest_api) by @burnash in #1956
  • SQL Database: Support including/excluding NULL cursor values by @steinitzu in #1946
  • Add references table hint and reflect them in sql_database by @steinitzu in #1925
  • only truncate or delete from existing tables in refresh modes by @sh-rp in #1926
  • adds bigquery partition expiration and motherduck connection string by @rudolfix in #1968

Experimental interfaces

The PRs below expose the new pipeline._dataset and dlt._dataset interfaces, which provide unified access to data loaded into a destination. We also implement a duckdb-based SQL client on the filesystem destination to access data in data lakes. We'll add documentation once the dataset interface stabilizes. You can already benefit from the new sql_client cursor implementation, which lets you fetch data frames and Arrow tables, also in batches:

  • dataset factory by @sh-rp in #1945
  • expose readable datasets as dataframes and arrow tables by @sh-rp in #1507

The PRs below add pluggy and a first few plugin hooks. The idea is to make much of dlt's functionality pluggable. Currently you can plug in a new CLI command (or upgrade an existing one), and you can also plug in your own runtime environment (how dlt looks for data, secrets, etc.).

Docs

New Contributors

Full Changelog: 1.2.0...1.3.0

1.2.0

07 Oct 21:10
8798c17

Core Library

Docs

New Contributors

Full Changelog: 1.1.0...1.2.0

1.1.0

26 Sep 13:33
d2b6d05

What's Changed

Docs

Verified Sources

  • Custom filter clauses supported, pyarrow/arrowmongo requirement optional for Mongo by @Pipboyguy

New Contributors

Full Changelog: 1.0.0...1.1.0

1.0.0

16 Sep 15:07

This is a major dlt release. Please check the list of breaking changes and deprecations: #1778

Core Library

  • move rest_api, sql_database and filesystem sources to dlt core by @willi-mueller in #1728
  • drops foreign_key, adds nested references (row_key - parent_key) by @rudolfix in #1774
  • deprecates complex data type, changes to json by @rudolfix in #1792
  • Feat/1749 abort load package and raise exception on terminal errors in jobs by @willi-mueller in #1781
  • Feat/1492 extend timestamp config to handle naive timestamps (without timezone) by @donotpush in #1669
  • Fix/1571 Incremental: Optionally load or ignore/exclude/include records with cursor_path missing or None value by @willi-mueller in #1576
  • creates a single source in extract for all resource instances passed as list by @rudolfix in #1535
  • Enable BigQuery schema auto-detection with partitioning and clustering hints by @Pipboyguy in #1806
  • Sqlalchemy destination (merge support and docs still in progress) by @steinitzu in #1734
  • Feat/1730 extend filesystem sftp by @donotpush in #1769
  • Stops dumping secrets to dlt traces. by @willi-mueller in #1797
  • Don't use Custom Embedding Functions on LanceDB by @Pipboyguy in #1771
  • sets default concurrency for blob upload for adlfs to 1 to avoid massive memory usage on large files by @rudolfix in #1779
  • Fix/1790 support incremental load with arrow when cursor column is not nullable by @willi-mueller in #1791
  • controls row group size and empty tables in memory buffer when writing parquet by @rudolfix in #1782
  • fix installation command by @novica in #1741
  • skips tables without jobs when merging delta tables by @rudolfix in #1803

Docs

New Contributors

Full Changelog: 0.5.4...1.0.0

0.5.4

28 Aug 20:02
9857029

Core Library

Docs

New Contributors

Full Changelog: 0.5.3...0.5.4

0.5.3

13 Aug 00:20
19c41ea

Core Library

  • Add support for continuously starting load jobs as slots free up in the loader. This will significantly speed up loading packages with many files. by @sh-rp in #1494
  • Add get_delta_tables helper function to optimize and vacuum tables by @jorritsandbrink in #1664
  • Raise/warn on incomplete columns in normalize by @steinitzu in #1504
  • Add enable_dataset_name_normalization option by @VioletM in #1676
  • updates duckdb/motherduck load job to match parquet by column names by @rudolfix in #1674
  • updates duckdb/motherduck load job to fully allow jsonl file format by @rudolfix in #1674
  • removes internal locks when loading parquet from multiple threads (duckdb got fixed) #1674
  • enables multi transactions statements for Motherduck #1674
  • fixes dbt logs line endings
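
The first item in the list above describes starting load jobs continuously as slots free up, instead of waiting for a whole batch to finish. A rough sketch of that scheduling pattern with the standard library follows; it is not dlt's loader code, and run_jobs, load_job and the sleep-based job body are hypothetical.

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Iterable, List, Tuple


def run_jobs(job_ids: Iterable[int], slots: int) -> Tuple[List[int], int]:
    """Run jobs with at most `slots` in flight, refilling a slot as soon as it frees."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def load_job(job_id: int) -> int:
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01 * (job_id % 3 + 1))  # jobs take different amounts of time
        with lock:
            running -= 1
        return job_id

    pending = list(job_ids)
    active = set()
    finished: List[int] = []
    with ThreadPoolExecutor(max_workers=slots) as pool:
        while pending or active:
            # Fill every free slot before waiting.
            while pending and len(active) < slots:
                active.add(pool.submit(load_job, pending.pop(0)))
            # Wake up as soon as any single job completes, not when all do.
            done, active = wait(active, return_when=FIRST_COMPLETED)
            finished.extend(future.result() for future in done)
    return finished, peak


finished, peak = run_jobs(range(10), slots=3)
```

Waiting for the first completed job, rather than for the whole batch, is what keeps all slots busy when a package contains many files of uneven size.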

Docs

Verified Sources

  • Column selector added to sql_database @steinitzu

New Contributors

Full Changelog: 0.5.2...0.5.3

0.5.2

02 Aug 19:18
e00baa0

Core Library

  • Add upsert merge strategy for Postgres and Snowflake, by @jorritsandbrink in #1466
  • Add basic upsert support for delta table format in filesystem destination by @jorritsandbrink in #1600
  • query tagging for snowflake by @rudolfix in #1582
  • Support Open Source ClickHouse Deployments (MergeTree engine and more) by @Pipboyguy in #1496
  • allows nested types in BigQuery via native autodetect_schema by @rudolfix in #1591
  • Enable upsert merge strategy for more SQL destinations (Athena, BigQuery, Databricks, mssql) by @jorritsandbrink in #1628
  • Fix/1512 fixes current.pipeline() access by @rudolfix in #1581
  • feat: add config dataset_name_prefix to set custom staging dataset name by @donotpush in #1563
  • fix: add airflow db reset for all tests by @donotpush in #1559
  • Enable S3 compatible storage for delta table format by @jorritsandbrink in #1586
  • feat/1495 rest_client: renames JSONResponsePaginator to JSONLinkPaginator by @willi-mueller in #1558
  • Feat/1596 adds custom config providers + example of yaml config provider supporting profiles and jinja placeholders by @rudolfix in #1642
  • Feat/1583 rest client session timeout configuration by @willi-mueller in #1590
  • Add clarification for add_limit by @VioletM in #1594
  • Fix/1606 fixes validator incremental step order to keep it always last in the pipe by @rudolfix in #1641
  • Feat/1593 rest_client: allow setting of request kwargs by @willi-mueller in #1609
  • prevent accidental wrapping of sources in resources when using adapters by @sh-rp in #1645
  • Add empty source handling for delta table format on filesystem destination by @jorritsandbrink in #1617
  • Surface original err msg from pydantic as extended_info on DataValidationError by @codingcyclist in #1569
  • fix(dockerfile): remove extra spaces around equals sign in LABEL inst… by @thisisdope in #1573
  • Qdrant uncommitted state restore and test by @steinitzu in #1545
  • fix: suppress alembic logs for tests by @donotpush in #1578

Docs

New Contributors

Full Changelog: 0.5.1...0.5.2