From b91fdd193546c686f57ddd6e7e0c8035c0a96329 Mon Sep 17 00:00:00 2001
From: dat-a-man <98139823+dat-a-man@users.noreply.github.com>
Date: Sat, 25 Jan 2025 08:50:20 +0000
Subject: [PATCH 1/2] Added dedup sort example

---
 .../docs/general-usage/incremental-loading.md | 33 +++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/docs/website/docs/general-usage/incremental-loading.md b/docs/website/docs/general-usage/incremental-loading.md
index 293a8a03c8..631b1a41ca 100644
--- a/docs/website/docs/general-usage/incremental-loading.md
+++ b/docs/website/docs/general-usage/incremental-loading.md
@@ -120,6 +120,39 @@ If you use the `merge` write disposition, but do not specify merge or primary ke
 The appended data will be inserted from a staging table in one transaction for most destinations in this case.
 :::
 
+Example: Dedup sort behavior example
+
+```py
+# Sample data
+data = [
+    {"id": 1, "metadata_modified": "2024-01-01", "value": "A"},
+    {"id": 1, "metadata_modified": "2024-01-02", "value": "B"},
+    {"id": 2, "metadata_modified": "2024-01-01", "value": "C"},
+    {"id": 2, "metadata_modified": "2024-01-01", "value": "D"},  # Same metadata_modified as above
+]
+
+# Define the resource with dedup_sort configuration
+@dlt.resource(
+    primary_key='id',
+    write_disposition='merge',
+    columns={
+        "metadata_modified": {"dedup_sort": "desc"}
+    }
+)
+def sample_data():
+    for item in data:
+        yield item
+```
+When this resource is executed, the following deduplication rules are applied:
+
+1. For records with different values in the dedup_sort column:
+   - The record with the highest value is kept when using `desc`
+   - For example, between records with id=1, the one with metadata_modified="2024-01-02" is kept
+
+2. For records with identical values in the dedup_sort column:
+   - The first occurrence encountered is kept
+   - For example, between records with id=2 and identical metadata_modified="2024-01-01", the first record (value="C") is kept
+
 #### Delete records
 The `hard_delete` column hint can be used to delete records from the destination dataset. The behavior of the delete mechanism depends on the data type of the column marked with the hint:
 1) `bool` type: only `True` leads to a delete—`None` and `False` values are disregarded.

From a212bb03830e3841c9c64c3b481f6da5f43e682b Mon Sep 17 00:00:00 2001
From: dat-a-man <98139823+dat-a-man@users.noreply.github.com>
Date: Sat, 25 Jan 2025 12:40:56 +0000
Subject: [PATCH 2/2] Updated formatting

---
 docs/website/docs/general-usage/incremental-loading.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/website/docs/general-usage/incremental-loading.md b/docs/website/docs/general-usage/incremental-loading.md
index 631b1a41ca..5094e56e4a 100644
--- a/docs/website/docs/general-usage/incremental-loading.md
+++ b/docs/website/docs/general-usage/incremental-loading.md
@@ -120,7 +120,7 @@ If you use the `merge` write disposition, but do not specify merge or primary ke
 The appended data will be inserted from a staging table in one transaction for most destinations in this case.
 :::
 
-Example: Dedup sort behavior example
+Example: Deduplication with Timestamp based sorting
 
 ```py
 # Sample data
@@ -145,13 +145,13 @@ def sample_data():
 ```
 When this resource is executed, the following deduplication rules are applied:
 
-1. For records with different values in the dedup_sort column:
+1. For records with different values in the `dedup_sort` column:
    - The record with the highest value is kept when using `desc`
-   - For example, between records with id=1, the one with metadata_modified="2024-01-02" is kept
+   - For example, between records with id=1, the one with `"metadata_modified"="2024-01-02"` is kept
 
 2. For records with identical values in the dedup_sort column:
    - The first occurrence encountered is kept
-   - For example, between records with id=2 and identical metadata_modified="2024-01-01", the first record (value="C") is kept
+   - For example, between records with id=2 and identical `"metadata_modified"="2024-01-01"`, the first record (value="C") is kept
 
 #### Delete records
 The `hard_delete` column hint can be used to delete records from the destination dataset. The behavior of the delete mechanism depends on the data type of the column marked with the hint:
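
The two deduplication rules documented by this patch series can be checked against its sample data with a plain-Python sketch. It does not call dlt and is not how dlt performs the merge at the destination. The helper name `dedup_desc` is hypothetical, introduced here only for illustration. It mirrors the rules as the added text states them: per `id`, the highest `metadata_modified` wins, and the first record seen wins a tie.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def dedup_desc(records: Iterable[Record], key: str, sort_col: str) -> List[Record]:
    """Hypothetical helper: keep one record per key, highest sort_col first.

    Illustrates the documented rules only; not dlt's implementation.
    """
    kept: Dict[Any, Record] = {}
    for rec in records:
        current = kept.get(rec[key])
        # Strict ">" means an equal sort value never displaces the record
        # that was encountered earlier (the tie rule from the docs).
        if current is None or rec[sort_col] > current[sort_col]:
            kept[rec[key]] = rec
    return list(kept.values())


# Same sample data as in the documented example.
data = [
    {"id": 1, "metadata_modified": "2024-01-01", "value": "A"},
    {"id": 1, "metadata_modified": "2024-01-02", "value": "B"},
    {"id": 2, "metadata_modified": "2024-01-01", "value": "C"},
    {"id": 2, "metadata_modified": "2024-01-01", "value": "D"},
]

result = dedup_desc(data, key="id", sort_col="metadata_modified")
for row in result:
    print(row["id"], row["metadata_modified"], row["value"])
```

For id=1 the sketch keeps value "B" (the later date), and for id=2 it keeps value "C" (the first of the tied records). The string comparison is chronological here only because the dates are ISO-formatted.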