diff --git a/docs/website/docs/general-usage/schema-evolution.md b/docs/website/docs/general-usage/schema-evolution.md
index 6ef638886d..897e2ada36 100644
--- a/docs/website/docs/general-usage/schema-evolution.md
+++ b/docs/website/docs/general-usage/schema-evolution.md
@@ -213,3 +213,80 @@ Demonstrating schema evolution without talking about schema and data contracts i
 Schema and data contracts can be applied to entities such as ‘tables’, ‘columns’, and ‘data_types’ using contract modes such as ‘evolve’, ‘freeze’, ‘discard_rows’, and ‘discard_columns’ to tell dlt how to apply contracts for a particular entity.
 
 To read more about **schema and data contracts**, read our [documentation](./schema-contracts).
+
+## Troubleshooting
+
+This section addresses common schema evolution issues.
+
+#### 1. Inconsistent data types
+
+Data sources whose fields change data type between pipeline runs produce additional variant columns and may require extra handling. For example, consider the following pipeline runs:
+
+```py
+# First pipeline run: "value" is an integer
+data_run_1 = [
+    {"id": 1, "value": 42},
+    {"id": 2, "value": 123}
+]
+
+# Second pipeline run: "value" changes to text
+data_run_2 = [
+    {"id": 3, "value": "high"},
+    {"id": 4, "value": "low"}
+]
+
+# Third pipeline run: mixed types in "value"
+data_run_3 = [
+    {"id": 5, "value": 789},       # back to integer
+    {"id": 6, "value": "medium"}   # text again
+]
+```
+
+As a result, the original column remains unchanged and a new variant column `value__v_text` is created for the text values, requiring downstream processes to handle both columns appropriately.
+
+**Recommended solutions:**
+
+- **Enforce type consistency**
+
+  You can enforce type consistency using the `apply_hints` method. This ensures that all values in the column adhere to a specified data type. For example:
+
+  ```py
+  # Assuming 'resource' is your data resource
+  resource.apply_hints(columns={
+      "value": {"data_type": "text"},  # enforce 'value' to be of type 'text'
+  })
+  ```
+
+  In this example, the `value` column is always treated as text, even if the original data contains integers or mixed types.
+
+- **Handle multiple types with separate columns**
+
+  dlt automatically handles mixed data types by creating variant columns: if a column contains different data types, dlt generates a separate column for each type. For example, if a column named `value` contains both integers and strings, dlt creates a new column called `value__v_text` for the string values. After processing the runs above, the schema is:
+
+  ```text
+  | name          | data_type | nullable |
+  |---------------|-----------|----------|
+  | id            | bigint    | true     |
+  | value         | bigint    | true     |
+  | value__v_text | text      | true     |
+  ```
+
+- **Apply type validation**
+
+  Validate incoming data to ensure that only expected types are processed. For example:
+
+  ```py
+  def validate_value(value):
+      if not isinstance(value, (int, str)):  # allow only integers and strings
+          raise TypeError(f"Invalid type: {type(value)}. Expected int or str.")
+      return str(value)  # convert all values to a consistent type (e.g., text)
+
+  # First pipeline run
+  data_run_1 = [{"id": 1, "value": validate_value(42)},
+                {"id": 2, "value": validate_value(123)}]
+
+  # Second pipeline run
+  data_run_2 = [{"id": 3, "value": validate_value("high")},
+                {"id": 4, "value": validate_value("low")}]
+
+  # Third pipeline run
+  data_run_3 = [{"id": 7, "value": validate_value([1, 2, 3])}]
+  ```
+
+  In this example, `data_run_3` contains an invalid value (a list) instead of an integer or string, so `validate_value` raises a `TypeError` when the pipeline runs with `data_run_3`.
+
+#### 2. Nested data challenges
+
+Issues can arise from deeply nested structures, inconsistent nesting between records, or nested types the destination does not support.
+
+To avoid this, simplify or preprocess nested structures (see [nested tables](../general-usage/destination-tables#nested-tables)), or limit the unnesting level with `max_table_nesting`, as shown in the sketch below.
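+
+For example, a minimal sketch of limiting the nesting level (the source and resource names and the sample payload below are illustrative):
+
+```py
+import dlt
+
+@dlt.resource
+def events():
+    # illustrative payload with two levels of nesting
+    yield [{"id": 1, "meta": {"source": "api", "details": {"region": "eu", "tags": ["a", "b"]}}}]
+
+@dlt.source(max_table_nesting=1)  # structures deeper than this level are loaded as JSON
+def event_source():
+    return events
+
+pipeline = dlt.pipeline(pipeline_name="nesting_demo", destination="duckdb")
+pipeline.run(event_source())
+```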
\ No newline at end of file
diff --git a/docs/website/docs/general-usage/schema.md b/docs/website/docs/general-usage/schema.md
index 32850699f1..94374ab544 100644
--- a/docs/website/docs/general-usage/schema.md
+++ b/docs/website/docs/general-usage/schema.md
@@ -441,3 +441,59 @@ def textual(nesting_level: int):
     return dlt.resource([])
 ```
 
+## Troubleshooting
+
+This section addresses common data type issues.
+
+### Unsupported timestamps and format issues
+
+Timestamp issues can occur when a format is incompatible with the destination or when formats change between pipeline runs.
+
+#### 1. Unsupported formats or features
+- Combining `precision` and `timezone` in timestamps causes errors in specific destinations (e.g., DuckDB).
+- You can simplify the timestamp format to exclude the unsupported feature. For example:
+
+  ```py
+  import dlt
+
+  @dlt.resource(
+      columns={"event_tstamp": {"data_type": "timestamp", "precision": 3, "timezone": False}},
+      primary_key="event_id",
+  )
+  def events():
+      yield [{"event_id": 1, "event_tstamp": "2024-07-30T10:00:00.123+00:00"}]
+
+  pipeline = dlt.pipeline(destination="duckdb")
+  pipeline.run(events())
+  ```
+
+#### 2. Inconsistent formats across runs
+- Different pipeline runs use varying timestamp formats (e.g., `YYYY-MM-DD HH:MM:SS` vs. ISO 8601 vs. non-standard formats).
+- The destination (e.g., BigQuery) might infer a timestamp column in the first run, while later runs with incompatible formats (such as `20-08-2024` or `04th of January 2024`) create variant columns (e.g., `end_date__v_text`).
+- It is best practice to standardize timestamp formats across all pipeline runs to keep column data type inference consistent.
+
+#### 3. Inconsistent formats for incremental loading
+- The data source returns string timestamps, but incremental loading is configured with an integer cursor value. For example:
+
+  ```py
+  # API response
+  data = [
+      {"id": 1, "name": "Item 1", "created_at": "2024-01-01 00:00:00"},
+  ]
+
+  # Incorrect configuration (type mismatch)
+  @dlt.resource(primary_key="id")
+  def my_data(
+      created_at=dlt.sources.incremental(
+          "created_at",
+          initial_value=9999
+      )
+  ):
+      yield data
+  ```
+- The pipeline fails with an `IncrementalCursorInvalidCoercion` error because it cannot compare an integer (the `initial_value` of 9999) with a string timestamp. The error indicates a type mismatch between the expected and actual data formats.
+- To solve this, you can:
+  - Use a string timestamp as the `initial_value` for incremental loading.
+  - Convert the source data using `add_map` (see the sketch below).
+  - If you need to use timestamps for comparison but want to preserve the original format, create a separate column.
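+
+A minimal sketch of these fixes, assuming the same illustrative `data` and resource as above (the `created_at_parsed` column name is also illustrative); `pendulum` is re-exported by `dlt.common`, and `add_map` transforms each record before it is loaded:
+
+```py
+import dlt
+from dlt.common import pendulum
+
+data = [
+    {"id": 1, "name": "Item 1", "created_at": "2024-01-01 00:00:00"},
+]
+
+# Correct configuration: the initial value matches the string format of the cursor field
+@dlt.resource(primary_key="id")
+def my_data(
+    created_at=dlt.sources.incremental(
+        "created_at",
+        initial_value="2024-01-01 00:00:00",
+    )
+):
+    yield data
+
+# Optionally keep the original string and add a separate, parsed column for comparisons
+def add_parsed_ts(record):
+    record["created_at_parsed"] = pendulum.parse(record["created_at"])
+    return record
+
+pipeline = dlt.pipeline(destination="duckdb")
+pipeline.run(my_data().add_map(add_parsed_ts))
+```
+
+Because the string `initial_value` matches the exact format returned by the API, the cursor comparison stays consistent, while `created_at` keeps its original format and the parsed column provides a proper timestamp for downstream comparisons.
+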
diff --git a/docs/website/docs/reference/performance.md b/docs/website/docs/reference/performance.md
index f7773ff83f..23834f04d7 100644
--- a/docs/website/docs/reference/performance.md
+++ b/docs/website/docs/reference/performance.md
@@ -122,10 +122,20 @@ Below, we set files to rotate after 100,000 items written or when the filesize e
 
 
+:::note
+When a single resource handles a very large dataset, memory exhaustion may occur during processing. To mitigate this, enable file rotation by configuring `file_max_items` or `file_max_bytes` so the data is split into smaller chunks, and consider increasing the number of parallel workers. Read more about [parallel processing](#parallelism-within-a-pipeline).
+:::
+
 ### Disabling and enabling file compression
 
 Several [text file formats](../dlt-ecosystem/file-formats/) have `gzip` compression enabled by default. If you wish that your load packages have uncompressed files (e.g., to debug the content easily), change `data_writer.disable_compression` in config.toml. The entry below will disable the compression of the files processed in the `normalize` stage.
 
+### Handling insufficient RAM for in-memory operations
+
+If your available RAM is not sufficient for in-memory operations, consider these optimizations:
+
+- Adjust the `buffer_max_items` setting to fine-tune the size of in-memory buffers and prevent memory overconsumption when processing large datasets. For more details, see [controlling in-memory buffers](#controlling-in-memory-buffers).
+- Process your data in **chunks** rather than loading it entirely into memory. This batching approach allows for more effective resource management and can significantly reduce memory usage.
 
 ### Freeing disk space after loading
 
@@ -197,7 +207,7 @@ The default is to not parallelize normalization and to perform it in the main pr
 :::
 
 :::note
-Normalization is CPU-bound and can easily saturate all your cores. Never allow `dlt` to use all cores on your local machine.
+Normalization is CPU-bound and can easily saturate all your cores if not configured properly: too many workers may exhaust resources, while too few may underutilize capacity. Never allow `dlt` to use all available cores on your local machine; adjust the worker settings in `config.toml` accordingly.
 :::
 
 :::caution
diff --git a/docs/website/docs/walkthroughs/run-a-pipeline.md b/docs/website/docs/walkthroughs/run-a-pipeline.md
index 49b5cb33e1..208a52f747 100644
--- a/docs/website/docs/walkthroughs/run-a-pipeline.md
+++ b/docs/website/docs/walkthroughs/run-a-pipeline.md
@@ -282,6 +282,28 @@ should tell you what went wrong.
 
 The most probable cause of the failed job is **the data in the job file**. You can inspect the file using the **JOB file path** provided.
 
+### Exceeding API rate limits
+
+If your pipeline fails with an HTTP `429` error, the API has temporarily blocked your requests because you exceeded its rate limits. Here are some steps to help you troubleshoot and resolve the issue:
+
+- Ensure that your API credentials are set up correctly so that your requests are properly authenticated.
+
+- Check the API’s guidelines on rate limits. Look for headers such as `Retry-After` in the response to determine how long you should wait before retrying.
+
+- Use tools like `time.sleep()` or libraries such as `ratelimiter` to introduce delays between requests. This helps you stay within the allowed limits.
+
+- Incorporate exponential backoff: if a request fails with a `429`, wait for a short period and try again, increasing the wait time after each failure (see the sketch after this list).
+
+- Consider batching requests or caching results to reduce the number of API calls needed during your data load process.
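+
+A minimal sketch of the delay-and-backoff approach, honoring `Retry-After` when the API provides it (the URL, retry limits, and helper name below are illustrative):
+
+```py
+import time
+import requests
+
+def get_with_backoff(url, max_retries=5, base_delay=1.0):
+    """GET a URL, retrying on HTTP 429 with Retry-After or exponential backoff."""
+    for attempt in range(max_retries):
+        response = requests.get(url)
+        if response.status_code != 429:
+            response.raise_for_status()
+            return response.json()
+        # Prefer the server's hint; otherwise back off exponentially
+        retry_after = response.headers.get("Retry-After")
+        delay = float(retry_after) if retry_after else base_delay * 2 ** attempt
+        time.sleep(delay)
+    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
+```
+
+You can call such a helper from your resource function. dlt also ships a drop-in `requests` wrapper, `dlt.sources.helpers.requests`, with built-in retries that you may prefer if your dlt version includes it.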
+
+### Connection failures
+
+Data loading can be interrupted by connection issues or database downtime. When this happens, some tables might be partially loaded or even empty, and the pipeline stops.
+
+- Once the connection is restored, you can resume the load using the `pipeline.load()` method. It picks up where the previous load stopped and loads the remaining load packages.
+
+- If data was only partially loaded, check the `_dlt_loads` table. If a specific `load_id` is missing from this table, the corresponding load failed. You can then remove the partially loaded data by deleting records whose `load_id` does not exist in `_dlt_loads`. More details can be found in the [destination tables documentation](../general-usage/destination-tables#load-packages-and-load-ids).
+
 ## Further readings
 
 - [Beef up your script for production](../running-in-production/running.md), easily add alerting,