Releases: dlt-hub/dlt
1.5.0
Core Library
After several weeks of experimenting we release the dataset API. You can now read data in your destination with a neat, unified interface that works the same way for warehouses, relational databases, SQLAlchemy dialects, local and remote files, Iceberg and Delta tables.
You can use simple dot notation to access tables, execute SQL or use data-frame expressions (compiled to SQL with ibis). We materialize your data as pandas frames, arrow tables or dbapi-compatible records (also in batches). Here's the main intro:
https://dlthub.com/docs/general-usage/dataset-access/dataset
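To make "dbapi-compatible records (also in batches)" concrete, here is a minimal sketch of the DB-API 2.0 batch-reading pattern. It uses the standard-library sqlite3 module as a stand-in destination, so it does not call dlt itself; the table name, the rows and the iter_batches helper are made up for the illustration.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of rows from a DB-API cursor until it is exhausted."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# In-memory database standing in for a destination (assumption, not dlt).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"item_{i}") for i in range(1, 6)],
)

# Read five rows in batches of two.
cursor = conn.execute("SELECT id, name FROM items ORDER BY id")
batches = list(iter_batches(cursor, batch_size=2))

for batch in batches:
    print(batch)
```

Reading in batches keeps memory bounded when a table is larger than what you want to hold at once; the linked intro describes the actual dataset interface.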
Together with this we release our backend-less, catalog-less (well, an ad hoc technical catalog is created) Iceberg implementation. You can use append and replace write dispositions, create partitions and write to the bucket. Be aware of limitations, we are just starting!
https://dlthub.com/docs/dlt-ecosystem/destinations/delta-iceberg
- bump semver to minimum version 3.0.0 by @sh-rp in #2132
- leverage ibis expression for getting readablerelations by @sh-rp in #2046
- iceberg table format support for filesystem destination by @jorritsandbrink in #2067
- fixes dlt init fails in Colab (userdata problem) by @rudolfix in #2117
- Add open/closed range arguments for incremental by @steinitzu in #1991
- Fix validation error for custom auth classes by @burnash in #2129
- add databricks oauth authentication by @donotpush in #2138
- make duckdb handle Iceberg table with nested types by @jorritsandbrink in #2141
- refresh standalone resources (old columns were recreated) by @rudolfix in #2140
- fix ibis az problems on linux by @sh-rp in #2135
- does not raise if data type was changed manually in schema by @rudolfix in #2150
- allows to --eject source code of the core sources (ie. sql_database) to allow hacking-in customizations by @rudolfix in #2150
- convert add_limit to pipe step based limiting by @sh-rp in #2131
- Enable datetime format for negative timezone by @hairrrrr in #2155
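The open/closed range arguments for incremental listed above decide whether a boundary value itself is included. A minimal sketch of that semantics in plain Python follows; the in_range helper, its argument names and its defaults are chosen for the illustration and are not taken from dlt's code.

```python
def in_range(value, start, end, range_start="closed", range_end="open"):
    """Return True if value lies between start and end.

    A "closed" boundary includes the boundary value, an "open" one excludes it.
    """
    above_start = value >= start if range_start == "closed" else value > start
    below_end = value <= end if range_end == "closed" else value < end
    return above_start and below_end


cursor_values = [1, 2, 3, 4, 5]

# Closed start, open end: the start value is kept, the end value is dropped.
closed_open = [v for v in cursor_values if in_range(v, 2, 4)]

# Open start, closed end: the start value is dropped, the end value is kept.
open_closed = [v for v in cursor_values if in_range(v, 2, 4, "open", "closed")]

print(closed_open, open_closed)
```

An open start is useful when the cursor value is strictly increasing, because the row that produced the last stored value never needs to be read again.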
ℹ️ Note on add_limit: now you can use it to chunk large resources and load them in pieces. We support chunks created based on a maximum number of rows or after a specified time. Please read the docs: your resource should return ordered rows or be able to get data from a checkpoint. Also note that we apply add_limit after all processing steps (ie. incremental); before, we were limiting the generator directly. This was a necessary change to implement chunking and is backward compatible regarding produced data, but your resource can be queried many times to get a "new" item that ie. is not filtered out by incremental.
https://dlthub.com/docs/examples/backfill_in_chunks
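The chunking idea from the note can be sketched in plain Python: an ordered resource that can resume from a checkpoint is read in pieces of at most a fixed number of rows until nothing is left. This is only an illustration of the pattern, not dlt's add_limit implementation; fetch_rows, backfill_in_chunks and TOTAL_ROWS are hypothetical names.

```python
from itertools import islice

TOTAL_ROWS = 10  # size of the hypothetical source table


def fetch_rows(after_id):
    """Hypothetical resource: yields rows ordered by id, resuming after a checkpoint."""
    for i in range(after_id + 1, TOTAL_ROWS + 1):
        yield {"id": i}


def backfill_in_chunks(max_rows):
    """Load the whole source in pieces of at most max_rows rows each."""
    checkpoint = 0
    chunks = []
    while True:
        # Take at most max_rows rows that come after the checkpoint.
        chunk = list(islice(fetch_rows(checkpoint), max_rows))
        if not chunk:
            break
        chunks.append(chunk)
        # Ordered rows make the last id a safe checkpoint for the next run.
        checkpoint = chunk[-1]["id"]
    return chunks


chunks = backfill_in_chunks(max_rows=4)
print([len(chunk) for chunk in chunks])
```

The sketch also shows why ordering matters: if rows arrived in arbitrary order, the last id of a chunk would not be a valid checkpoint and rows could be skipped.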
Docs
- prepare dataset release & docs updates by @sh-rp in #2126
- Add missing mention of the required endpoint_url config in GCS by @trymzet in #2120
- example how to use add_limit to do large backfills in steps by @sh-rp in #2131
- Update auth info in databricks docs by @VioletM in #2153
- improve how dlt works page by @sh-rp in #2152
- explicitly adding docs for destination item size control by @HulmaNaseer in #2118
- Docs: rest_api tutorial: update primary key in merge example by @burnash in #2147
Verified Sources
Code got updated to 1.x.x dlt and tests work again. We are accepting contributions again.
ℹ️ 0.5 sources are on the 0.5 tag. If you are still on dlt 0.5.x, access this tag via dlt init sql_database duckdb --branch 0.5
New Contributors
- @HulmaNaseer made their first contribution in #2118
- @hairrrrr made their first contribution in #2155
Full Changelog: 1.4.1...1.5.0
1.4.1
Bugfixes
We release an important bugfix and define identifier normalization behavior for compound identifiers. In practice:
- identifiers that contain double underscores will be allowed
- all existing schemas (ie. stored at destination) will be set to work in backward-compatible mode
Read more here: https://dlthub.com/docs/devel/general-usage/naming-convention#compound-flattened-identifiers
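Why compound identifiers need a defined behavior can be seen in a small sketch. It assumes that a double underscore is the separator used to join nested field names when they are flattened; make_path and break_path are illustrative helpers written for this example, not dlt's code.

```python
PATH_SEPARATOR = "__"  # assumed separator for flattened (compound) identifiers


def make_path(*identifiers):
    """Join nested field names into one flattened identifier."""
    return PATH_SEPARATOR.join(identifiers)


def break_path(path):
    """Split a flattened identifier back into its parts."""
    return [part for part in path.split(PATH_SEPARATOR) if part]


# A nested field customer -> address -> city flattened into one column name.
flattened = make_path("customer", "address", "city")

# A source column that already contains a double underscore looks exactly
# like a flattened path with two parts.
parts = break_path("net__amount")

print(flattened, parts)
```

Because the two cases cannot be told apart from the name alone, the release pins down how such identifiers are normalized and keeps existing schemas in backward-compatible mode.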
- #2087 allows double underscores in identifiers by @rudolfix in #2098
- Fixes the usage of escaped JSONPath in incremental cursors in sql_database by @burnash in #2077
- Fix/2089 support sets for pyarrow backend by @karakanb in #2090
- allow to increase total count on most progress bars, fixes incorrect output in load stage by @sh-rp in #2100
Core Library
- Support custom Ollama Host by @Pipboyguy in #2044
- feat(rest_api): custom client for specific resources by @joscha in #2082
- Support Spatial Types for PostGIS by @Pipboyguy in #1927
- Incremental table hints and incremental in resource decorator by @steinitzu in #2033
- supports custom account host for azure (#2073) and fixes various edge cases for abfss by @rudolfix
Core Sources
adds engine adapter and passes incremental and engine to query adapter by @rudolfix in #2070
- adds engine adapter callback to modify engine settings before connection is opened (hopefully fixes #1920)
- allows to return a subquery for table adapter, adds example that fixes #2076
- passes Incremental and Engine instances to query adapter callback
- allows to return text query from engine adapter #1997
- arrow backend now infers not reflected columns from the data
(Still) experimental interfaces
- allow to select schema from pipeline dataset factory by @sh-rp in #2075
- ibis support - hand over credentials to ibis backend for a number of destinations by @sh-rp in #2004
Docs
- Updated sql_database documentation for resource usage by @dat-a-man in #2072
- Docs: improve links visibility in light mode by @burnash in #2078
- edit snippet to make more runnable (#2066) by @AstrakhantsevaAA in #2079
- Docs: update deprecated paginator type in examples by @burnash in #2093
- Move "dlt in notebooks" by @AstrakhantsevaAA in #2096
- docs: document that path can also be a URL by @joscha in #2099
- Fix minor typo :3 by @jdbohrman in #2103
- 🐛 Fix parquet layout example in the docs by @trymzet in #2105
- docs(rest_client): note about data_selector by @joscha in #2101
New Contributors
- @joscha made their first contribution in #2099
- @karakanb made their first contribution in #2090
- @jdbohrman made their first contribution in #2103
- @trymzet made their first contribution in #2105
Full Changelog: 1.4.0...1.4.1
1.4.0
Core Library
- feat: add incremental lag (attribution window) for datetime, int, and float cursors by @donotpush in #1957
- LanceDB - (1) support merge key to merge chunked documents correctly - removes orphaned chunks (2) huge performance upgrade by loading data via arrow by @Pipboyguy in #1620
- Move exclude_keys() to dlt.common.utils by @burnash in #1966
- Fix BigQueryLoadJob hiding root cause exception by @xneg in #1992
- loads secrets from colab userdata and streamlit + bugfixes by @rudolfix in #1994
- Fix pagination issue in JSONResponseCursorPaginator with empty string cursor value by @kang8 in #2016
- fix: if name of distribution is None by @senickel in #2024
- allows to pass default values when writing specs by @rudolfix in #2018
- enable delta partitioning on arrow normalizer load id by @jorritsandbrink in #2022
- add session token to duckdb s3 secret by @jorritsandbrink in #2007
- Add user agent for Databricks by @VioletM in #1987
- Fix an incorrect missing dependency error by @burnash in #2001
- fix resource level max_table_nesting and normalizer performance tuning by @sh-rp in #2026
- move default pipelines of core sources into source folders by @sh-rp in #1888
- duckdb filesystem custom secrets by @sh-rp in #2017
- allows for empty dataset clickhouse by @rudolfix in #2045
- add GCP default credential handling for delta table format by @jorritsandbrink in #2048
- enables merges for bigquery autodetect schema by @sh-rp in #2035
- logs warning if deduplication state is large by @willi-mueller in #1877
- Add core sources extras to requirements in dlt init by @burnash in #2028
- Fix merge write disposition for pyarrow and ClickHouse by @burnash in #2042
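The incremental lag (attribution window) entry in the list above moves the start of the next incremental load back by a fixed amount, so that late-arriving or updated records are read again. A minimal sketch of that arithmetic follows; treating the lag as seconds for datetime cursors is an assumption of this example, and start_with_lag is a hypothetical helper, not dlt's code.

```python
from datetime import datetime, timedelta


def start_with_lag(last_value, lag):
    """Return the cursor value to resume from, shifted back by lag.

    Assumption: lag is in seconds for datetime cursors and in the cursor's
    own unit for int and float cursors.
    """
    if isinstance(last_value, datetime):
        return last_value - timedelta(seconds=lag)
    return last_value - lag


# Datetime cursor: re-read the last hour.
resume_dt = start_with_lag(datetime(2024, 1, 10, 12, 0, 0), 3600)

# Integer cursor: re-read the last 10 ids.
resume_int = start_with_lag(100, 10)

print(resume_dt, resume_int)
```

Rows inside the lag window are read more than once, so a merge write disposition or some other deduplication is typically needed to avoid duplicates.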
Experimental interfaces
dlt dataset public interface and docs coming next week.
Docs
- Updated databricks destination documentation by @dat-a-man in #1984
- Docs: fix capitalization of some terms, fix typos by @burnash in #1988
- fix typo by @mariarice15 in #1995
- Fix Zendesk example: make test resilient to data changes by @burnash in #1999
- fix s3 credentials environment variable names by @seunggs in #2010
- remove ga add tm by @alexanderfifefd in #2008
- Super fast snippet linting & type checking by @sh-rp in #2019
- Fix the deprecation warning in .common.configuration.container by @burnash in #2025
- Added deploy with modal. by @dat-a-man in #1805
- Updated google cloud function documentation by @dat-a-man in #2034
- add warning for large delta memory footprint on filesystem docs page by @sh-rp in #2036
- simplify advanced section by @kning in #2037
- Added docs on how to deploy a pipeline using Google Cloud run by @dat-a-man in #2038
- Format Delta table section in the filesystem destination by @burnash in #2057
- Docs: add table formats to the sidebar by @burnash in #2060
New Contributors
- @xneg made their first contribution in #1992
- @seunggs made their first contribution in #2010
- @alexanderfifefd made their first contribution in #2008
- @kang8 made their first contribution in #2016
- @senickel made their first contribution in #2024
- @kning made their first contribution in #2037
Full Changelog: 1.3.0...1.4.0
1.3.0
Core Library
- Fix try/except in from_reference shadowing MissingDependencyException by @burnash in #1939
- prefers uv over pip if found (when creating virtual envs) by @rudolfix in #1940
- allows to plug new or updated dlt cli commands by @sh-rp in #1938
- Feat/557 rest api add oauth2clientcredentials to built in auth methods by @willi-mueller in #1871
- uses path normalize for columns in arrow tables by @rudolfix in #1947
- Added extended jsonpath_ng parser (rest_api) by @francescomucio in #1941
- Fix/1897 support https endpoints clickhouse by @sh-rp in #1931
- Fix for multiple ignores is not working (rest_api) by @burnash in #1956
- SQL Database: Support including/excluding NULL cursor values by @steinitzu in #1946
- Add references table hint and reflect them in sql_database by @steinitzu in #1925
- only truncate or delete from existing tables in refresh modes by @sh-rp in #1926
- adds bigquery partition expiration and motherduck connection string by @rudolfix in #1968
Experimental interfaces
Below we expose new pipeline._dataset and dlt._dataset interfaces that provide unified access to data loaded into the destination. We also implement a duckdb-based SQL client on the filesystem destination to access data in data lakes. We'll add documentation once we stabilize the dataset interface. However, you can already benefit from the new cursor implementation of sql_client that allows you to take data frames and arrow tables, also in batches:
- dataset factory by @sh-rp in #1945
- expose readable datasets as dataframes and arrow tables by @sh-rp in #1507
The PRs below add pluggy and a few first plugin hooks. The idea is to make a lot of functionalities in dlt pluggable. Currently you can plug in a new cli command (or upgrade an existing one) and you can also plug in your own runtime environment (how dlt looks for data, secrets etc.)
- adds registries and plugins by @rudolfix in #1894
- unifies run configuration and run context by @rudolfix in #1944
Docs
- Update url in deploy-with-airflow-composer.md by @FriedrichtenHagen in #1942
- Added info about backend kwargs in pyarrow by @dat-a-man in #1903
- Docs: sync styles with dlthub by @burnash in #1936
- Docs: styles: remove underline for cards in dark mode by @burnash in #1967
New Contributors
- @FriedrichtenHagen made their first contribution in #1942
Full Changelog: 1.2.0...1.3.0
1.2.0
Core Library
- Sqlalchemy merge support by @steinitzu in #1842
- Fix config sections for synching destinations and accessing destination clients by @sh-rp in #1887
- incremental scd2 with merge_key by @jorritsandbrink in #1818
- fix: UUIDs are not an unknown data type (logging) by @neuromantik33 in #1914
- fix: PageNumberPaginator not reset when iterating through multiple pa… by @paul-godhouse in #1924
- Feat/1922 rest api source add multiple path parameters by @TheOneTrueAnt in #1923
- enables gcs staging for databricks by @rudolfix in #1933
Docs
- Update weaviate reference by @emmanuel-ferdman in #1896
- Docs: Add sftp option for filesystem source by @VioletM in #1845
- Update installation.md by @erikjamesmason in #1899
- Added troubleshooting section to filesystem docs by @dat-a-man in #1900
- Docs: make naming consistent in the cloud storage & file system source by @burnash in #1835
- Docs: add section on resolving multiple path parameters by @burnash in #1929
New Contributors
- @emmanuel-ferdman made their first contribution in #1896
- @erikjamesmason made their first contribution in #1899
- @neuromantik33 made their first contribution in #1914
- @paul-godhouse made their first contribution in #1924
Full Changelog: 1.1.0...1.2.0
1.1.0
What's Changed
- fix intermittent delta panic issue by @jorritsandbrink in #1832
- Sqlalchemy staging dataset support and docs by @steinitzu in #1841
- rest_api: allow specifying custom session (feat/1843) by @willi-mueller in #1844
- Allows any duckdb version, fixes databricks az credentials by @rudolfix in #1854
- Fix/1849 Do Not Parse Ignored Empty Responses by @TheOneTrueAnt in #1851
- feat: filesystem delete old pipeline state files by @donotpush in #1838
- supports adding DltResource in RESTAPIConfig dict by @willi-mueller in #1865
- Fix/1858 make all connection string credentials optional by @rudolfix in #1867
Docs
- sqlalchemy destination docs by @steinitzu in #1841
- Docs: move REST API helpers to the REST API category by @burnash in #1852
- Docs: rest_api: document processing_steps by @burnash in #1872
- Fix the paginator's doc heading by @burnash in #1869
Verified Sources
- Custom filter clauses supported, pyarrow/arrowmongo requirement optional for Mongo by @Pipboyguy
New Contributors
- @TheOneTrueAnt made their first contribution in #1851
Full Changelog: 1.0.0...1.1.0
1.0.0
This is a major dlt release. Please check the list of breaking changes and deprecations: #1778
Core Library
- move rest_api, sql_database and filesystem sources to dlt core by @willi-mueller in #1728
- drops foreign_key, adds nested references (row_key-parent_key) by @rudolfix in #1774
- deprecates complex data type, changes to json by @rudolfix in #1792
- Feat/1749 abort load package and raise exception on terminal errors in jobs by @willi-mueller in #1781
- Feat/1492 extend timestamp config to handle naive timestamps (without timezone) by @donotpush in #1669
- Fix/1571 Incremental: Optionally load or ignore/exclude/include records with cursor_path missing or None value by @willi-mueller in #1576
- creates a single source in extract for all resource instances passed as list by @rudolfix in #1535
- Enable BigQuery schema auto-detection with partitioning and clustering hints by @Pipboyguy in #1806
- Sqlalchemy destination (merge support and docs still in progress) by @steinitzu in #1734
- Feat/1730 extend filesystem sftp by @donotpush in #1769
- Stops dumping secrets to dlt traces. by @willi-mueller in #1797
- Don't use Custom Embedding Functions on LanceDB by @Pipboyguy in #1771
- sets default concurrency for blob upload for adlfs to 1 to avoid massive memory usage on large files by @rudolfix in #1779
- Fix/1790 support incremental load with arrow when cursor column is not nullable by @willi-mueller in #1791
- controls row group size and empty tables in memory buffer when writing parquet by @rudolfix in #1782
- fix installation command by @novica in #1741
- skips tables without jobs when merging delta tables by @rudolfix in #1803
Docs
- display past versions of the documentation (0.5.x / 1.0.0 / devel) by @sh-rp in #1770
- Refactor filesystem doc by @VioletM in #1745
- Update REST API docs by @akelad in #1795
- Add filesystem tutorial by @VioletM in #1775
- adding the sql_database tutorial by @rahuljo in #1796
- structural and content changes to the sql_database doc by @rahuljo in #1623
- Docs: update the introduction, add the rest_api tutorial by @burnash in #1729
- Docs/update deploy dagster by @mariarice15 in #1761
- Correct wrong code example for apply_hints( incremental(xx) ) by @w0ut0 in #1785
- Moves sources and destinations to the top level in docs navigation by @VioletM in #1750
- Fix typo "frequenly" by @ruudwelten in #1800
- Reorder sidebar by @mariarice15 in #1787
New Contributors
- @novica made their first contribution in #1741
- @mariarice15 made their first contribution in #1761
- @w0ut0 made their first contribution in #1785
- @ruudwelten made their first contribution in #1800
Full Changelog: 0.5.4...1.0.0
0.5.4
Core Library
- BigQuery project_id may be different from credentials project_id by @VioletM in #1680
- Enable schema evolution for merge write disposition with delta table format by @jorritsandbrink in #1742
- Add storage_options to DeltaTable.create by @jorritsandbrink in #1686
- Fix delta table dangling Parquet file bug by @jorritsandbrink in #1695
- Add delta table partitioning support by @jorritsandbrink in #1696
- fixes load job counter displayed in progress by @rudolfix in #1702
- RESTClient: stops pagination after empty page (Feat/1637) by @willi-mueller in #1677
- Enable scd2 record reinsert by @jorritsandbrink in #1707
- scd2 custom "valid from" / "valid to" value feature by @jorritsandbrink in #1709
- feat/1681 collects load job metrics and adds remote url to traces by @rudolfix in #1708
- locks trace format with a contract by @rudolfix in #1708
- Feat/1711 create with not exists for dlt tables to reduce racing conditions by @rudolfix in #1740
- provides detail exception messages when cursor stored value cannot be coerced to data by @rudolfix in #1748
- Allows to configure if staging destination is truncated or left intact to config by @VioletM in #1717
- enables external location and named credential in databricks, allows abfss://container@account Azure urls by @rudolfix in #1755
- fixes #1703 and #1754 by @rudolfix in #1755
Docs
- rest_api: documents pluggable custom auth by @willi-mueller in #1690
- Update Snowflake docs by @akelad in #1747
- Docs/issue 1661 add tip to source docs and update weaviate docs by @dat-a-man in #1662
- Add custom parent-child relationships example by @dat-a-man in #1678
- Correct the library name for mem stats to psutil by @deepyaman in #1733
- Replaced "full_refresh" with "dev_mode" by @dat-a-man in #1735
New Contributors
- @deepyaman made their first contribution in #1733
Full Changelog: 0.5.3...0.5.4
0.5.3
Core Library
- Add support for continuously starting load jobs as slots free up in the loader. This will significantly speed up loading packages with many files. by @sh-rp in #1494
- Add get_delta_tables helper function to optimize and vacuum tables by @jorritsandbrink in #1664
- Raise/warn on incomplete columns in normalize by @steinitzu in #1504
- Add enable_dataset_name_normalization option by @VioletM in #1676
- updates duckdb/motherduck load job to match parquet by column names by @rudolfix in #1674
- updates duckdb/motherduck load job to fully allow jsonl file format by @rudolfix in #1674
- removes internal locks when loading parquet from multiple threads (duckdb got fixed) #1674
- enables multi transactions statements for Motherduck #1674
- fixes dbt logs line endings
Docs
Verified Sources
- Column selector added to sql_database by @steinitzu
New Contributors
Full Changelog: 0.5.2...0.5.3
0.5.2
Core Library
- Add upsert merge strategy for Postgres and Snowflake by @jorritsandbrink in #1466
- Add basic upsert support for delta table format in filesystem destination by @jorritsandbrink in #1600
- query tagging for snowflake by @rudolfix in #1582
- Support Open Source ClickHouse Deployments (MergeTree engine and more) by @Pipboyguy in #1496
- allows nested types in BigQuery via native autodetect_schema by @rudolfix in #1591
- Enable upsert merge strategy for more SQL destinations (Athena, BigQuery, Databricks, mssql) by @jorritsandbrink in #1628
- Fix/1512 fixes current.pipeline() access by @rudolfix in #1581
- feat: add config dataset_name_prefix to set custom staging dataset name by @donotpush in #1563
- fix: add airflow db reset for all tests by @donotpush in #1559
- Enable S3 compatible storage for delta table format by @jorritsandbrink in #1586
- feat/1495 rest_client: renames JSONResponsePaginator to JSONLinkPaginator by @willi-mueller in #1558
- Feat/1596 adds custom config providers + example of yaml config provider supporting profiles and jinja placeholders by @rudolfix in #1642
- Feat/1583 rest client session timeout configuration by @willi-mueller in #1590
- Add clarification for add_limit by @VioletM in #1594
- Fix/1606 fixes validator incremental step order to keep it always last in the pipe by @rudolfix in #1641
- Feat/1593 rest_client: allow setting of request kwargs by @willi-mueller in #1609
- prevent accidental wrapping of sources in resources when using adapters by @sh-rp in #1645
- Add empty source handling for delta table format on filesystem destination by @jorritsandbrink in #1617
- Surface original err msg from pydantic as extended_info on DataValidationError by @codingcyclist in #1569
- fix(dockerfile): remove extra spaces around equals sign in LABEL inst… by @thisisdope in #1573
- Qdrant uncommitted state restore and test by @steinitzu in #1545
- fix: suppress alembic logs for tests by @donotpush in #1578
Docs
- Document sql source reflection level and type adapter by @steinitzu in #1467
- Add to docs configuring file format options by @VioletM in #1543
- Added how dlt uses arrow by jorrit by @dat-a-man in #1577
- docs/514 rest_api: docs on pluggable paginators by @willi-mueller in #1557
- docs: documents new convert parameter in rest_api source incremental config by @willi-mueller in #1649
- Docs/1571 docs on handling NULL values at incremental cursor path by @willi-mueller in #1650
- Add note that pg_replication doesn't support scd2 by @akelad in #1608
- docs/505 updates documentation on custom hooks in response_actions by @willi-mueller in #1524
New Contributors
- @donotpush made their first contribution in #1559
- @thisisdope made their first contribution in #1573
- @akelad made their first contribution in #1608
Full Changelog: 0.5.1...0.5.2