From eea10e448c2afc9a7aad42a8561e8744aacd6120 Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:22:34 +0000 Subject: [PATCH 1/7] Added new doc on troubleshooting --- .../website/docs/reference/troubleshooting.md | 417 ++++++++++++++++++ docs/website/sidebars.js | 1 + 2 files changed, 418 insertions(+) create mode 100644 docs/website/docs/reference/troubleshooting.md diff --git a/docs/website/docs/reference/troubleshooting.md b/docs/website/docs/reference/troubleshooting.md new file mode 100644 index 0000000000..c70964df22 --- /dev/null +++ b/docs/website/docs/reference/troubleshooting.md @@ -0,0 +1,417 @@ +--- +title: Pipeline common failure scenarios and mitigation measures +description: Doc explaining the common failure scenarios in extract, transform and load stage and their mitigation measures +keywords: [faq, usage information, technical help] +--- + +This guide outlines common failure scenarios during the Extract, Normalize, and Load stages of a data pipeline. + +## Extract stage + +Failures during the **Extract** stage often stem from source errors, memory limitations, or timestamp-related issues. Below are common scenarios and possible solutions. + +### Source errors + +Source errors typically result from rate limits, invalid credentials, or misconfigured settings. + +### Common scenarios and possible solutions + +1. **Rate limits (Error 429):** + + - **Scenario:** + - Exceeding API rate limits triggers `Error 429`. + + - **Possible solution:** + - Verify that authentication is functioning as expected. + - Increase the API rate limit if permissible. + - Review the API documentation to understand rate limits and examine headers such as `Retry-After`. + - Implement request delays using functions like `time.sleep()` or libraries such as `ratelimiter` to ensure compliance with rate limits. + - Handle "Too Many Requests" (`429`) responses by implementing retry logic with exponential backoff strategies. + - Optimize API usage by batching requests when possible and caching results to reduce the number of calls. + +2. **Invalid credentials (Error 401, 403, or `ConfigFieldMissingException`):** + + - **Scenario:** + - Missing or invalid credentials cause these errors. + + - **Possible solution:** + - Verify credentials and ensure proper scopes/permissions are enabled. For more on how to set up credentials: [Read our docs](../general-usage/credentials/setup). + - If dlt expects a configuration of secrets value but cannot find it, it will output the `ConfigFieldMissingException`. [Read more about the exceptions here.](../general-usage/credentials/setup#understanding-the-exceptions) + +3. **Source configuration errors (`DictValidationException`):** + + - **Scenario:** + - Incorrect field placement (e.g., `params` outside the `endpoint` field). + - Unexpected fields in the configuration. 
+ - For example, this is the incorrect configuration: + ```py + # ERROR 2: Method outside endpoint + source = rest_api_source( + config={ + "client": { + "base_url": "https://jsonplaceholder.typicode.com/" + }, + "resources": [ + { + "name": "posts", + # Wrong: method should be inside endpoint + "method": "GET", + "endpoint": { + "path": "posts", + "params": { + "_limit": 5 + } + } + } + ] + } + ) + ``` + - Correct configuration: + + ```py + # Create the source first + source = rest_api_source( + config={ + "client": { + "base_url": "https://jsonplaceholder.typicode.com/" + }, + "resources": [ # Add this line + { + "name": "posts", + "endpoint": { + "path": "posts", + "method": "GET", + "params": { + "_limit": 5 + } + } + } + ] + } + ) + ``` + - **Possible solution:** + - Review and validate the code configuration structure against the source documentation. + + Read [REST API’s source here.](../dlt-ecosystem/verified-sources/rest_api/) + +## Memory errors + +Memory issues can disrupt extraction processes. + +### Common scenarios and possible solutions + + 1. **RAM exhaustion:** + + - **Scenario:** + - Available RAM is insufficient for in-memory operations. + + - **Possible solution:** + + 1. **Buffer Size Management:** + - Adjust `max_buffer_items` to limit buffer size. [Learn about buffer configuration.](./performance#controlling-in-memory-buffers) + 2. Streaming Processing + - Big data should be processed in chunks for efficient handling. + + 2. **Storage memory shortages:** + + - **Scenario:** + + - Intermediate files exceed available storage space. + + - **Possible solution:** + + - If your storage reaches its limit, you can mount an external cloud storage location and set the `DLT_DATA_DIR` environment variable to point to it. This ensures that dlt uses the mounted storage as its data directory instead of local disk space. [Read more here.](./performance) + +## Unsupported timestamps + +Timestamp issues occur when formats are incompatible with the destination or inconsistent across pipeline runs. + +### Common scenarios and possible solutions + +1. **Unsupported formats or features:** + + - **Scenario:** + + - Combining `precision` and `timezone` in timestamps causes errors in specific destinations (e.g., DuckDB). + + - **Possible solution:** + + - Simplify the timestamp format to exclude unsupported features. Example: + + ```py + import dlt + + @dlt.resource( + columns={"event_tstamp": {"data_type": "timestamp", "precision": 3, "timezone": False}}, + primary_key="event_id", + ) + def events(): + yield [{"event_id": 1, "event_tstamp": "2024-07-30T10:00:00.123+00:00"}] + + pipeline = dlt.pipeline(destination="duckdb") + pipeline.run(events()) + ``` + +2. **Inconsistent formats across runs:** + + - **Scenario:** + + - Different pipeline runs use varying timestamp formats, affecting column datatype inference at the destination. + + - **Impact:** + + - For instance: + - **1st pipeline run:** `{"id": 1, "end_date": "2024-02-28 00:00:00"}` + - **2nd pipeline run:** `{"id": 2, "end_date": "2024/02/28"}` + - **3rd pipeline run:** `{"id": 3, "end_date": "2024-07-30T10:00:00.123456789"}` + + - If the first run uses a timestamp-compatible format (e.g., `YYYY-MM-DD HH:MM:SS`), the destination (BigQuery) infers the column as a `TIMESTAMP`. Subsequent runs using compatible formats are automatically converted to this type. 
+ + - However, introducing incompatible formats later, such as: + - **4th pipeline run:** `{"id": 4, "end_date": "20-08-2024"}` (DD-MM-YYYY) + - **5th pipeline run:** `{"id": 5, "end_date": "04th of January 2024"}` + + - BigQuery will interpret these as text and create a new variant column (`end_date__v_text`) to store the incompatible values. This preserves the schema consistency while accommodating all data. + + - **Possible solution:** + + - Standardize timestamp formats across all runs to maintain consistent schema inference and avoid the creation of variant columns. + +3. Inconsistent formats for incremental loading + + - **Scenario:** + + - Data source returns string timestamps but incremental loading is configured with an integer timestamp value. + - Example: + ```py + # API response + data = [ + {"id": 1, "name": "Item 1", "created_at": "2024-01-01 00:00:00"}, + ] + + # Incorrect configuration (type mismatch) + @dlt.resource(primary_key="id") + def my_data( + created_at=dlt.sources.incremental( + "created_at", + initial_value= 9999 + ) + ): + yield data + ``` + + - **Impact:** + + - Pipeline fails with `IncrementalCursorInvalidCoercion` error + - Error message indicates comparison failure between integer and string types + - Unable to perform incremental loading until type mismatch is resolved + + - **Possible Solutions:** + + - Use string timestamp for incremental loading. + - Convert source data using “add_map”. + - If you need to use timestamps for comparison but want to preserve the original format, create a separate column. + +## Normalize stage + +Failures during the **Normalize** stage commonly arise from memory limitations, parallelization issues, or schema inference errors. + +### Memory errors + +Memory-intensive operations may fail during normalization. + +### Common scenarios and possible solutions + +1. **Large dataset in one resource:** + + - **Scenario:** + + - Large datasets exhaust memory during processing. + + - **Possible solution:** + + - Enable file rotation using `file_max_items` or `file_max_bytes`. + - Increase parallel workers for better processing. [Read more about parallel processing.](./performance#parallelism-within-a-pipeline) + +2. **Storage memory shortages** + + - **Scenario:** + + - When lots of files are being processed, the available storage space might be insufficient. + + - **Possible solution:** + + - If your storage reaches its limit, you can mount an external cloud storage location and set the `DLT_DATA_DIR` environment variable to point to it. This ensures that dlt uses the mounted storage as its data directory instead of local disk space. [Read more here.](./performance#keep-pipeline-working-folder-in-a-bucket-on-constrained-environments) + +### Parallelization issues + +Improper configuration of workers may lead to inefficiencies or failures. + +### Common scenarios and possible solutions + +1. **Resource exhaustion or underutilization:** + + - **Scenario:** + + - Too many workers may exhaust resources; too few may underutilize capacity. + + - **Possible solution:** + + - Adjust worker settings in the `config.toml` file. [Read more about parallel processing.](./performance#parallelism-within-a-pipeline) + +2. **Threading conflicts:** + + - **Scenario:** + + - The `fork` process spawning method (default on Linux) conflicts with threaded libraries. + + - **Possible solution:** + + - Switch to the `spawn` method for process pool creation. 
[Learn more about process spawning.](./performance#normalize) + +### Schema inference errors + +Complex or inconsistent data structures can cause schema inference failures. + +### Common scenarios and possible solutions + +1. **Inconsistent data types:** + + - **Scenario:** + ```py + # First pipeline run + data_run_1 = [ + {"id": 1, "value": 42}, # value is integer + {"id": 2, "value": 123} + ] + + # Second pipeline run + data_run_2 = [ + {"id": 3, "value": "high"}, # value changes to text + {"id": 4, "value": "low"} + ] + + # Third pipeline run + data_run_3 = [ + {"id": 5, "value": 789}, # back to integer + {"id": 6, "value": "medium"} # mixed types + ] + ``` + + - **Impact:** + + - Original column remains as is. + - New variant column `value__v_text` created for text values. + - May require additional data handling in downstream processes. + + - **Possible solutions:** + + - Enforce Type Consistency + - You can enforce type consistency using the `apply_hints` method. This ensures all values in a column follow the specified data type. + + ```python + # Assuming 'resource' is your data resource + resource.apply_hints(columns={ + "value": {"data_type": "text"}, # Enforce 'value' to be of type 'text' + }) + ``` + + - In this example, the `value` column is always treated as text, even if the original data contains integers or mixed types. + + - Handle multiple types with separate columns. + - The `dlt` library automatically handles mixed data types by creating variant columns. If a column contains different data types, `dlt` generates a separate column for each type. + - For example, if a column named `value` contains both integers and strings, `dlt` creates a new column called `value__v_text` for the string values. + - After processing multiple runs, the schema will be: + + ```python + | name | data_type | nullable | + |---------------|---------------|----------| + | id | bigint | true | + | value | bigint | true | + | value__v_text | text | true | + ``` + + - Use Type validation **to Ensure Consistency** + - When processing pipeline runs with mixed data types, type validation can be applied to enforce strict type rules. + + - **Example:** + ```py + def validate_value(value): + if not isinstance(value, (int, str)): # Allow only integers and strings + raise TypeError(f"Invalid type: {type(value)}. Expected int or str.") + return str(value) # Convert all values to a consistent type (e.g., text) + + # First pipeline run + data_run_1 = [{"id": 1, "value": validate_value(42)}, + {"id": 2, "value": validate_value(123)}] + + # Second pipeline run + data_run_2 = [{"id": 3, "value": validate_value("high")}, + {"id": 4, "value": validate_value("low")}] + + # Third pipeline run + data_run_3 = [{"id": 7, "value": validate_value([1, 2, 3])}] + ``` + + In this example, data_run_4 contains an invalid value (a list) instead of an integer or string. When the pipeline runs with data_run_4, the validate_value function raises a TypeError. + +2. **Nested data challenges:** + + - **Scenario:** + + - Issues arise due to deep nesting, inconsistent nesting, or unsupported types. + + - **Possible solution:** + + - Simplify nested structures or preprocess data. [Read about nested tables.](../general-usage/destination-tables#nested-tables) + - You can limit unnesting level with `max_table_nesting`. + +## Load stage + +Failures in the **Load** stage often relate to authentication issues, schema changes, datatype mismatches or memory problems. 
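For example, if a load is interrupted partway through, the pending load packages can usually be completed by re-running the load from the same pipeline working directory, as described in the sections below. A minimal sketch, assuming placeholder pipeline and destination names:

```py
import dlt

# Re-create the pipeline with the same name so dlt picks up the existing working directory
pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")

# Complete any load packages left over from the interrupted run
load_info = pipeline.load()
print(load_info)
```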
+ +### Authentication and connection failures + +### Common scenarios and possible solutions + +- **Scenario:** + + - Incorrect credentials. + - Data loading is interrupted due to connection issues or database downtime. This may leave some tables partially loaded or completely empty, halting the pipeline process. + +- **Possible solution:** + + - Verify credentials and follow proper setup instructions. [Credential setup guide.](../general-usage/credentials/setup) + - If the connection is restored, you can resume the load process using the `pipeline.load()` method. This ensures the pipeline picks up from where it stopped, reloading any remaining data packages. + - If data was **partially loaded**, check the `dlt_loads` table. If a `load_id` is missing from this table, it means the corresponding load **failed**. You can then remove partially loaded data by deleting any records associated with `load_id` values that do not exist in `dlt_loads`. [More details here](../general-usage/destination-tables#load-packages-and-load-ids). + +### Schema changes (e.g., column renaming, Datatype mismatches) + +### Common scenarios and possible solutions + +- **Scenario:** + + - Renamed columns create variant columns in the destination schema. + - Incoming datatypes that the destination doesn’t support result in variant columns. + +- **Possible solution:** + + - Use schema evolution to handle column renaming. [Read more about schema evolution.](../general-usage/schema-evolution#evolving-the-schema) + +### Memory management issues + +- **Scenario:** + + - Loading large datasets without file rotation enabled. This would make dlt try to upload a huge data set into destination at once. *(Note: Rotation is disabled by default.)* + +- **Impact:** + + - Pipeline failures due to out-of-memory errors. + +- **Solution:** + + - Enable file rotation. [Read more about it here.](./performance#controlling-intermediary-file-size-and-rotation) + +By identifying potential failure scenarios and applying the suggested mitigation strategies, you can ensure reliable and efficient pipeline performance. \ No newline at end of file diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index d4fd9d4341..8432834706 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -463,6 +463,7 @@ const sidebars = { 'dlt-ecosystem/table-formats/iceberg', ] }, + 'reference/troubleshooting', 'reference/frequently-asked-questions', ], }, From 2a269b875253898fe84fb45122b4aa4af76c9ec9 Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Sun, 16 Feb 2025 03:20:42 +0000 Subject: [PATCH 2/7] Updated doc --- docs/website/docs/reference/troubleshooting.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/website/docs/reference/troubleshooting.md b/docs/website/docs/reference/troubleshooting.md index c70964df22..087b0f3efa 100644 --- a/docs/website/docs/reference/troubleshooting.md +++ b/docs/website/docs/reference/troubleshooting.md @@ -311,7 +311,7 @@ Complex or inconsistent data structures can cause schema inference failures. - Enforce Type Consistency - You can enforce type consistency using the `apply_hints` method. This ensures all values in a column follow the specified data type. - ```python + ```py # Assuming 'resource' is your data resource resource.apply_hints(columns={ "value": {"data_type": "text"}, # Enforce 'value' to be of type 'text' @@ -325,7 +325,7 @@ Complex or inconsistent data structures can cause schema inference failures. 
- For example, if a column named `value` contains both integers and strings, `dlt` creates a new column called `value__v_text` for the string values. - After processing multiple runs, the schema will be: - ```python + ```text | name | data_type | nullable | |---------------|---------------|----------| | id | bigint | true | From 7c939e9318a23476a4501a0a1da514e352075fd7 Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Tue, 18 Feb 2025 13:09:03 +0000 Subject: [PATCH 3/7] Updated as per comments --- docs/website/docs/reference/troubleshooting.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/website/docs/reference/troubleshooting.md b/docs/website/docs/reference/troubleshooting.md index 087b0f3efa..a974d696b7 100644 --- a/docs/website/docs/reference/troubleshooting.md +++ b/docs/website/docs/reference/troubleshooting.md @@ -1,9 +1,11 @@ --- -title: Pipeline common failure scenarios and mitigation measures +title: Troubleshooting description: Doc explaining the common failure scenarios in extract, transform and load stage and their mitigation measures keywords: [faq, usage information, technical help] --- +# Pipeline common failure scenarios and mitigation measures + This guide outlines common failure scenarios during the Extract, Normalize, and Load stages of a data pipeline. ## Extract stage @@ -96,7 +98,7 @@ Source errors typically result from rate limits, invalid credentials, or misconf Read [REST API’s source here.](../dlt-ecosystem/verified-sources/rest_api/) -## Memory errors +### Memory errors Memory issues can disrupt extraction processes. @@ -124,7 +126,7 @@ Memory issues can disrupt extraction processes. - If your storage reaches its limit, you can mount an external cloud storage location and set the `DLT_DATA_DIR` environment variable to point to it. This ensures that dlt uses the mounted storage as its data directory instead of local disk space. [Read more here.](./performance) -## Unsupported timestamps +### Unsupported timestamps Timestamp issues occur when formats are incompatible with the destination or inconsistent across pipeline runs. From 4b4bb68a18b7a830e058e637608eefa37933cc07 Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Sun, 23 Feb 2025 09:33:23 +0000 Subject: [PATCH 4/7] Updated for filenotfound error --- .../website/docs/reference/troubleshooting.md | 35 +++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-) diff --git a/docs/website/docs/reference/troubleshooting.md b/docs/website/docs/reference/troubleshooting.md index a974d696b7..4be54ab679 100644 --- a/docs/website/docs/reference/troubleshooting.md +++ b/docs/website/docs/reference/troubleshooting.md @@ -181,7 +181,7 @@ Timestamp issues occur when formats are incompatible with the destination or inc - Standardize timestamp formats across all runs to maintain consistent schema inference and avoid the creation of variant columns. -3. Inconsistent formats for incremental loading +3. **Inconsistent formats for incremental loading** - **Scenario:** @@ -402,6 +402,37 @@ Failures in the **Load** stage often relate to authentication issues, schema cha - Use schema evolution to handle column renaming. 
[Read more about schema evolution.](../general-usage/schema-evolution#evolving-the-schema) +### **`FileNotFoundError` for 'schema_updates.json' in parallel runs** + +- **Scenario** + When running the same pipeline name multiple times in parallel (e.g., via Airflow), `dlt` may fail at the load stage with an error like: + + > `FileNotFoundError: schema_updates.json not found` + + This happens because `schema_updates.json` is generated during normalization. Concurrent runs using the same pipeline name may overwrite or lock access to this file, causing failures. + +- **Possible Solutions** + + 1. **Use unique pipeline names for each parallel run** + + If calling `pipeline.run()` multiple times within the same workflow (e.g., once per resource), assign a unique `pipeline_name` for each run. This ensures separate working directories, preventing file conflicts. + + 2. **Leverage dlt’s concurrency management or Airflow helpers** + + dlt’s Airflow integration “serializes” resources into separate tasks while safely handling concurrency. To parallelize resource extraction without file conflicts, use: + ```py + decompose="serialize" + ``` + More details are available in the [Airflow documentation](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer#2-valueerror-can-only-decompose-dlt-source). + + 3. **Disable dev mode to prevent multiple destination datasets** + + When `dev_mode=True`, dlt generates unique dataset names (`_`) for each run. To maintain a consistent dataset, set: + ```py + dev_mode=False + ``` + Read more about this in the [dev mode documentation](../general-usage/pipeline#do-experiments-with-dev-mode). + ### Memory management issues - **Scenario:** @@ -412,7 +443,7 @@ Failures in the **Load** stage often relate to authentication issues, schema cha - Pipeline failures due to out-of-memory errors. -- **Solution:** +- **Possible Solution:** - Enable file rotation. [Read more about it here.](./performance#controlling-intermediary-file-size-and-rotation) From 94dfe8202c6c084617ac211a34d3806db9d32048 Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Mon, 24 Feb 2025 13:59:19 +0000 Subject: [PATCH 5/7] Updated --- docs/website/docs/reference/troubleshooting.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/docs/reference/troubleshooting.md b/docs/website/docs/reference/troubleshooting.md index 4be54ab679..2fe8065378 100644 --- a/docs/website/docs/reference/troubleshooting.md +++ b/docs/website/docs/reference/troubleshooting.md @@ -402,7 +402,7 @@ Failures in the **Load** stage often relate to authentication issues, schema cha - Use schema evolution to handle column renaming. [Read more about schema evolution.](../general-usage/schema-evolution#evolving-the-schema) -### **`FileNotFoundError` for 'schema_updates.json' in parallel runs** +### `FileNotFoundError` for 'schema_updates.json' in parallel runs - **Scenario** When running the same pipeline name multiple times in parallel (e.g., via Airflow), `dlt` may fail at the load stage with an error like: From 72dcc99354cc39a09a3337bf0661cf63018cf0fb Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Mon, 3 Mar 2025 13:02:49 +0000 Subject: [PATCH 6/7] Added sections of troubleshooting to performance, schema and schema evolution. 
--- .../docs/general-usage/schema-evolution.md | 77 +++++++++++++++++++ docs/website/docs/general-usage/schema.md | 56 ++++++++++++++ docs/website/docs/reference/performance.md | 12 ++- .../docs/walkthroughs/run-a-pipeline.md | 22 ++++++ 4 files changed, 166 insertions(+), 1 deletion(-) diff --git a/docs/website/docs/general-usage/schema-evolution.md b/docs/website/docs/general-usage/schema-evolution.md index 6ef638886d..897e2ada36 100644 --- a/docs/website/docs/general-usage/schema-evolution.md +++ b/docs/website/docs/general-usage/schema-evolution.md @@ -213,3 +213,80 @@ Demonstrating schema evolution without talking about schema and data contracts i Schema and data contracts can be applied to entities such as ‘tables’, ‘columns’, and ‘data_types’ using contract modes such as ‘evolve’, ‘freeze’, ‘discard_rows’, and ‘discard_columns’ to tell dlt how to apply contracts for a particular entity. To read more about **schema and data contracts**, read our [documentation](./schema-contracts). +## Troubleshooting +This section addresses common schema evolution issues. + +1. #### Inconsistent data types: + - Data sources that vary in data type between pipeline runs may result in additional variant columns and may require extra handling. For example, consider the following pipeline runs: + ```py + # First pipeline run: "value" is an integer + data_run_1 = [ + {"id": 1, "value": 42}, + {"id": 2, "value": 123} + ] + + # Second pipeline run: "value" changes to text + data_run_2 = [ + {"id": 3, "value": "high"}, + {"id": 4, "value": "low"} + ] + + # Third pipeline run: Mixed types in "value" + data_run_3 = [ + {"id": 5, "value": 789}, # back to integer + {"id": 6, "value": "medium"} # mixed types + ] + ``` + + - As a result, the original column remains unchanged and a new variant column value__v_text is created for text values, requiring downstream processes to handle both columns appropriately. + + - **Recommended solutions:** + - **Enforce Type consistency** + - You can enforce type consistency using the `apply_hints` method. This ensure that all values in the column adhere to a specified data type. For example: + + ```py + # Assuming 'resource' is your data resource + resource.apply_hints(columns={ + "value": {"data_type": "text"}, # Enforce 'value' to be of type 'text' + }) + ``` + In this example, the `value` column is always treated as text, even if the original data contains integers or mixed types. + + - **Handle multiple types with separate columns** + - The dlt library automatically handles mixed data types by creating variant columns. If a column contains different data types, dlt generates a separate column for each type. + - For example, if a column named `value` contains both integers and strings, dlt creates a new column called `value__v_text` for the string values. + - After processing multiple runs, the schema will be: + ```text + | name | data_type | nullable | + |--------------|--------------|----------| + | id | bigint | true | + | value | bigint | true | + | value__v_text| text | true | + ``` + + - **Apply Type validation** + - Validate incoming data to ensure that only expected types are processed. For example: + ```py + def validate_value(value): + if not isinstance(value, (int, str)): # Allow only integers and strings + raise TypeError(f"Invalid type: {type(value)}. 
Expected int or str.") + return str(value) # Convert all values to a consistent type (e.g., text) + + # First pipeline run + data_run_1 = [{"id": 1, "value": validate_value(42)}, + {"id": 2, "value": validate_value(123)}] + + # Second pipeline run + data_run_2 = [{"id": 3, "value": validate_value("high")}, + {"id": 4, "value": validate_value("low")}] + + # Third pipeline run + data_run_3 = [{"id": 7, "value": validate_value([1, 2, 3])}] + ``` + + In this example, `data_run_3` contains an invalid value (a list) instead of an integer or string. When the pipeline runs with `data_run_3`, the `validate_value` function raises a `TypeError`. + +#### 2. Nested data challenges: +- Issues arise due to deep nesting, inconsistent nesting, or unsupported types. + +- To avoid this, you can simplify nested structures or preprocess data [see nested tables](../general-usage/destination-tables#nested-tables) or limit the unnesting level with max_table_nesting. \ No newline at end of file diff --git a/docs/website/docs/general-usage/schema.md b/docs/website/docs/general-usage/schema.md index 32850699f1..94374ab544 100644 --- a/docs/website/docs/general-usage/schema.md +++ b/docs/website/docs/general-usage/schema.md @@ -441,3 +441,59 @@ def textual(nesting_level: int): return dlt.resource([]) ``` +## Troubleshooting + +This section addresses common datatype issues. + +### Unsupported timestamps and format issues + +Timestamp issues can occur when the formats are incompatible with the destination or when they change inconsistently between pipeline runs. + +#### 1. Unsupported formats or features +- Combining `precision` and `timezone` in timestamps causes errors in specific destinations (e.g., DuckDB). +- You can simplify the timestamp format to exclude unsupported features. For example: + + ```py + import dlt + + @dlt.resource( + columns={"event_tstamp": {"data_type": "timestamp", "precision": 3, "timezone": False}}, + primary_key="event_id", + ) + def events(): + yield [{"event_id": 1, "event_tstamp": "2024-07-30T10:00:00.123+00:00"}] + + pipeline = dlt.pipeline(destination="duckdb") + pipeline.run(events()) + ``` + +#### 2. Inconsistent formats across runs +- Different pipeline runs use varying timestamp formats (e.g., `YYYY-MM-DD HH:MM:SS` vs. ISO 8601 vs. non-standard formats). +- As a result, the destination (e.g., BigQuery) might infer the timestamp column in one run, but later runs with incompatible formats (like `20-08-2024` or `04th of January 2024`) result in the creation of variant columns (e.g., `end_date__v_text`). +- It is best practice to standardize timestamp formats across all pipeline runs to maintain consistent column datatype inference. + +#### 3. Inconsistent formats for incremental loading +- Data source returns string timestamps but incremental loading is configured with an integer timestamp value. + - Example: + ```py + # API response + data = [ + {"id": 1, "name": "Item 1", "created_at": "2024-01-01 00:00:00"}, + ] + + # Incorrect configuration (type mismatch) + @dlt.resource(primary_key="id") + def my_data( + created_at=dlt.sources.incremental( + "created_at", + initial_value= 9999 + ) + ): + yield data + ``` +- This makes the pipeline fails with an `IncrementalCursorInvalidCoercion` error because it cannot compare an integer (`initial_value` of 9999) with a string timestamp. The error indicates a type mismatch between the expected and actual data formats. +- To solve this, you can: + - Use string timestamp for incremental loading. + - Convert source data using “add_map”. 
+ - If you need to use timestamps for comparison but want to preserve the original format, create a separate column. + diff --git a/docs/website/docs/reference/performance.md b/docs/website/docs/reference/performance.md index f7773ff83f..23834f04d7 100644 --- a/docs/website/docs/reference/performance.md +++ b/docs/website/docs/reference/performance.md @@ -122,10 +122,20 @@ Below, we set files to rotate after 100,000 items written or when the filesize e +:::note NOTE +When working with a single resource that handles a very large dataset, memory exhaustion may occur during processing. To mitigate this, enable file rotation by configuring `file_max_items` or `file_max_bytes` to split the data into smaller chunks and consider increasing the number of parallel workers for better processing. Read more about [parallel processing.](#parallelism-within-a-pipeline) +::: + ### Disabling and enabling file compression Several [text file formats](../dlt-ecosystem/file-formats/) have `gzip` compression enabled by default. If you wish that your load packages have uncompressed files (e.g., to debug the content easily), change `data_writer.disable_compression` in config.toml. The entry below will disable the compression of the files processed in the `normalize` stage. +### Handling insufficient RAM for in-memory operations +If your available RAM is not sufficient for in-memory operations, consider these optimizations: + +Adjust the `buffer_max_items` setting to fine-tune the size of in-memory buffers. This helps to prevent memory overconsumption when processing large datasets. For more details, [see the buffer configuration guide.](#controlling-in-memory-buffers) + +For handling big data efficiently, process your data in **chunks** rather than loading it entirely into memory. This batching approach allows for more effective resource management and can significantly reduce memory usage. ### Freeing disk space after loading @@ -197,7 +207,7 @@ The default is to not parallelize normalization and to perform it in the main pr ::: :::note -Normalization is CPU-bound and can easily saturate all your cores. Never allow `dlt` to use all cores on your local machine. +Normalization is CPU-bound and can easily saturate all your cores if not configured properly. Too many workers may exhaust resources; too few may underutilize capacity. Never allow dlt to use all available cores on your local machine, adjust the worker settings in your `config.toml` accordingly. ::: :::caution diff --git a/docs/website/docs/walkthroughs/run-a-pipeline.md b/docs/website/docs/walkthroughs/run-a-pipeline.md index 49b5cb33e1..208a52f747 100644 --- a/docs/website/docs/walkthroughs/run-a-pipeline.md +++ b/docs/website/docs/walkthroughs/run-a-pipeline.md @@ -282,6 +282,28 @@ should tell you what went wrong. The most probable cause of the failed job is **the data in the job file**. You can inspect the file using the **JOB file path** provided. +### Exceeding API rate limits + +If your pipeline triggers an HTTP `Error 429`, this means that the API has temporarily blocked your requests due to exceeding the allowed rate limits. Here are some steps to help you troubleshoot and resolve the issue: + +- Ensure that your API credentials are set up correctly so that your requests are properly authenticated. + +- Check the API’s guidelines on rate limits. Look for headers such as `Retry-After` in the response to determine how long you should wait before retrying. 
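  A minimal sketch of honoring `Retry-After` with plain `requests`, falling back to exponential backoff when the header is missing (the URL is a placeholder); this also covers the delay and backoff points below:

  ```py
  import time
  import requests

  def get_with_retry(url, max_retries=5):
      for attempt in range(max_retries):
          response = requests.get(url)
          if response.status_code != 429:
              return response
          # Wait as long as the API asks, or back off exponentially if no header is sent
          time.sleep(int(response.headers.get("Retry-After", 2 ** attempt)))
      raise RuntimeError(f"Still rate limited after {max_retries} retries")

  response = get_with_retry("https://api.example.com/items")
  ```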
+ +- Use tools like `time.sleep()` or libraries such as `ratelimiter` to introduce delays between requests. This helps in staying within the allowed limits. + +- Incorporate exponential backoff strategies in your code. This means if a request fails with a `429`, you wait for a short period and then try again, increasing the wait time on subsequent failures. + +- Consider batching requests or caching results to reduce the number of API calls needed during your data load process. + +### Connection failures + +Data loading can be interrupted due to connection issues or database downtime. When this happens, some tables might be partially loaded or even empty, which halts the pipeline process. + +- If the connection is restored, you can resume the load process using the `pipeline.load()` method. This method will pick up from where the previous load stopped and will reload any remaining data packages. + +- In the event that data was partially loaded, check the `dlt_loads` table. If a specific `load_id` is missing from this table, it indicates that the corresponding load has failed. You can then remove any partially loaded data by deleting records associated with those `load_id` values that do not exist in `dlt_loads`. More details can be found in the destination [tables documentation.](../general-usage/destination-tables#load-packages-and-load-ids) + ## Further readings - [Beef up your script for production](../running-in-production/running.md), easily add alerting, From 8ca85f98f6210c43c1120190d0758d8a147d2bec Mon Sep 17 00:00:00 2001 From: dat-a-man <98139823+dat-a-man@users.noreply.github.com> Date: Mon, 3 Mar 2025 13:04:39 +0000 Subject: [PATCH 7/7] removed trouble shooting doc --- .../website/docs/reference/troubleshooting.md | 450 ------------------ docs/website/sidebars.js | 1 - 2 files changed, 451 deletions(-) delete mode 100644 docs/website/docs/reference/troubleshooting.md diff --git a/docs/website/docs/reference/troubleshooting.md b/docs/website/docs/reference/troubleshooting.md deleted file mode 100644 index 2fe8065378..0000000000 --- a/docs/website/docs/reference/troubleshooting.md +++ /dev/null @@ -1,450 +0,0 @@ ---- -title: Troubleshooting -description: Doc explaining the common failure scenarios in extract, transform and load stage and their mitigation measures -keywords: [faq, usage information, technical help] ---- - -# Pipeline common failure scenarios and mitigation measures - -This guide outlines common failure scenarios during the Extract, Normalize, and Load stages of a data pipeline. - -## Extract stage - -Failures during the **Extract** stage often stem from source errors, memory limitations, or timestamp-related issues. Below are common scenarios and possible solutions. - -### Source errors - -Source errors typically result from rate limits, invalid credentials, or misconfigured settings. - -### Common scenarios and possible solutions - -1. **Rate limits (Error 429):** - - - **Scenario:** - - Exceeding API rate limits triggers `Error 429`. - - - **Possible solution:** - - Verify that authentication is functioning as expected. - - Increase the API rate limit if permissible. - - Review the API documentation to understand rate limits and examine headers such as `Retry-After`. - - Implement request delays using functions like `time.sleep()` or libraries such as `ratelimiter` to ensure compliance with rate limits. - - Handle "Too Many Requests" (`429`) responses by implementing retry logic with exponential backoff strategies. 
- - Optimize API usage by batching requests when possible and caching results to reduce the number of calls. - -2. **Invalid credentials (Error 401, 403, or `ConfigFieldMissingException`):** - - - **Scenario:** - - Missing or invalid credentials cause these errors. - - - **Possible solution:** - - Verify credentials and ensure proper scopes/permissions are enabled. For more on how to set up credentials: [Read our docs](../general-usage/credentials/setup). - - If dlt expects a configuration of secrets value but cannot find it, it will output the `ConfigFieldMissingException`. [Read more about the exceptions here.](../general-usage/credentials/setup#understanding-the-exceptions) - -3. **Source configuration errors (`DictValidationException`):** - - - **Scenario:** - - Incorrect field placement (e.g., `params` outside the `endpoint` field). - - Unexpected fields in the configuration. - - For example, this is the incorrect configuration: - ```py - # ERROR 2: Method outside endpoint - source = rest_api_source( - config={ - "client": { - "base_url": "https://jsonplaceholder.typicode.com/" - }, - "resources": [ - { - "name": "posts", - # Wrong: method should be inside endpoint - "method": "GET", - "endpoint": { - "path": "posts", - "params": { - "_limit": 5 - } - } - } - ] - } - ) - ``` - - Correct configuration: - - ```py - # Create the source first - source = rest_api_source( - config={ - "client": { - "base_url": "https://jsonplaceholder.typicode.com/" - }, - "resources": [ # Add this line - { - "name": "posts", - "endpoint": { - "path": "posts", - "method": "GET", - "params": { - "_limit": 5 - } - } - } - ] - } - ) - ``` - - **Possible solution:** - - Review and validate the code configuration structure against the source documentation. - - Read [REST API’s source here.](../dlt-ecosystem/verified-sources/rest_api/) - -### Memory errors - -Memory issues can disrupt extraction processes. - -### Common scenarios and possible solutions - - 1. **RAM exhaustion:** - - - **Scenario:** - - Available RAM is insufficient for in-memory operations. - - - **Possible solution:** - - 1. **Buffer Size Management:** - - Adjust `max_buffer_items` to limit buffer size. [Learn about buffer configuration.](./performance#controlling-in-memory-buffers) - 2. Streaming Processing - - Big data should be processed in chunks for efficient handling. - - 2. **Storage memory shortages:** - - - **Scenario:** - - - Intermediate files exceed available storage space. - - - **Possible solution:** - - - If your storage reaches its limit, you can mount an external cloud storage location and set the `DLT_DATA_DIR` environment variable to point to it. This ensures that dlt uses the mounted storage as its data directory instead of local disk space. [Read more here.](./performance) - -### Unsupported timestamps - -Timestamp issues occur when formats are incompatible with the destination or inconsistent across pipeline runs. - -### Common scenarios and possible solutions - -1. **Unsupported formats or features:** - - - **Scenario:** - - - Combining `precision` and `timezone` in timestamps causes errors in specific destinations (e.g., DuckDB). - - - **Possible solution:** - - - Simplify the timestamp format to exclude unsupported features. 
Example: - - ```py - import dlt - - @dlt.resource( - columns={"event_tstamp": {"data_type": "timestamp", "precision": 3, "timezone": False}}, - primary_key="event_id", - ) - def events(): - yield [{"event_id": 1, "event_tstamp": "2024-07-30T10:00:00.123+00:00"}] - - pipeline = dlt.pipeline(destination="duckdb") - pipeline.run(events()) - ``` - -2. **Inconsistent formats across runs:** - - - **Scenario:** - - - Different pipeline runs use varying timestamp formats, affecting column datatype inference at the destination. - - - **Impact:** - - - For instance: - - **1st pipeline run:** `{"id": 1, "end_date": "2024-02-28 00:00:00"}` - - **2nd pipeline run:** `{"id": 2, "end_date": "2024/02/28"}` - - **3rd pipeline run:** `{"id": 3, "end_date": "2024-07-30T10:00:00.123456789"}` - - - If the first run uses a timestamp-compatible format (e.g., `YYYY-MM-DD HH:MM:SS`), the destination (BigQuery) infers the column as a `TIMESTAMP`. Subsequent runs using compatible formats are automatically converted to this type. - - - However, introducing incompatible formats later, such as: - - **4th pipeline run:** `{"id": 4, "end_date": "20-08-2024"}` (DD-MM-YYYY) - - **5th pipeline run:** `{"id": 5, "end_date": "04th of January 2024"}` - - - BigQuery will interpret these as text and create a new variant column (`end_date__v_text`) to store the incompatible values. This preserves the schema consistency while accommodating all data. - - - **Possible solution:** - - - Standardize timestamp formats across all runs to maintain consistent schema inference and avoid the creation of variant columns. - -3. **Inconsistent formats for incremental loading** - - - **Scenario:** - - - Data source returns string timestamps but incremental loading is configured with an integer timestamp value. - - Example: - ```py - # API response - data = [ - {"id": 1, "name": "Item 1", "created_at": "2024-01-01 00:00:00"}, - ] - - # Incorrect configuration (type mismatch) - @dlt.resource(primary_key="id") - def my_data( - created_at=dlt.sources.incremental( - "created_at", - initial_value= 9999 - ) - ): - yield data - ``` - - - **Impact:** - - - Pipeline fails with `IncrementalCursorInvalidCoercion` error - - Error message indicates comparison failure between integer and string types - - Unable to perform incremental loading until type mismatch is resolved - - - **Possible Solutions:** - - - Use string timestamp for incremental loading. - - Convert source data using “add_map”. - - If you need to use timestamps for comparison but want to preserve the original format, create a separate column. - -## Normalize stage - -Failures during the **Normalize** stage commonly arise from memory limitations, parallelization issues, or schema inference errors. - -### Memory errors - -Memory-intensive operations may fail during normalization. - -### Common scenarios and possible solutions - -1. **Large dataset in one resource:** - - - **Scenario:** - - - Large datasets exhaust memory during processing. - - - **Possible solution:** - - - Enable file rotation using `file_max_items` or `file_max_bytes`. - - Increase parallel workers for better processing. [Read more about parallel processing.](./performance#parallelism-within-a-pipeline) - -2. **Storage memory shortages** - - - **Scenario:** - - - When lots of files are being processed, the available storage space might be insufficient. 
- - - **Possible solution:** - - - If your storage reaches its limit, you can mount an external cloud storage location and set the `DLT_DATA_DIR` environment variable to point to it. This ensures that dlt uses the mounted storage as its data directory instead of local disk space. [Read more here.](./performance#keep-pipeline-working-folder-in-a-bucket-on-constrained-environments) - -### Parallelization issues - -Improper configuration of workers may lead to inefficiencies or failures. - -### Common scenarios and possible solutions - -1. **Resource exhaustion or underutilization:** - - - **Scenario:** - - - Too many workers may exhaust resources; too few may underutilize capacity. - - - **Possible solution:** - - - Adjust worker settings in the `config.toml` file. [Read more about parallel processing.](./performance#parallelism-within-a-pipeline) - -2. **Threading conflicts:** - - - **Scenario:** - - - The `fork` process spawning method (default on Linux) conflicts with threaded libraries. - - - **Possible solution:** - - - Switch to the `spawn` method for process pool creation. [Learn more about process spawning.](./performance#normalize) - -### Schema inference errors - -Complex or inconsistent data structures can cause schema inference failures. - -### Common scenarios and possible solutions - -1. **Inconsistent data types:** - - - **Scenario:** - ```py - # First pipeline run - data_run_1 = [ - {"id": 1, "value": 42}, # value is integer - {"id": 2, "value": 123} - ] - - # Second pipeline run - data_run_2 = [ - {"id": 3, "value": "high"}, # value changes to text - {"id": 4, "value": "low"} - ] - - # Third pipeline run - data_run_3 = [ - {"id": 5, "value": 789}, # back to integer - {"id": 6, "value": "medium"} # mixed types - ] - ``` - - - **Impact:** - - - Original column remains as is. - - New variant column `value__v_text` created for text values. - - May require additional data handling in downstream processes. - - - **Possible solutions:** - - - Enforce Type Consistency - - You can enforce type consistency using the `apply_hints` method. This ensures all values in a column follow the specified data type. - - ```py - # Assuming 'resource' is your data resource - resource.apply_hints(columns={ - "value": {"data_type": "text"}, # Enforce 'value' to be of type 'text' - }) - ``` - - - In this example, the `value` column is always treated as text, even if the original data contains integers or mixed types. - - - Handle multiple types with separate columns. - - The `dlt` library automatically handles mixed data types by creating variant columns. If a column contains different data types, `dlt` generates a separate column for each type. - - For example, if a column named `value` contains both integers and strings, `dlt` creates a new column called `value__v_text` for the string values. - - After processing multiple runs, the schema will be: - - ```text - | name | data_type | nullable | - |---------------|---------------|----------| - | id | bigint | true | - | value | bigint | true | - | value__v_text | text | true | - ``` - - - Use Type validation **to Ensure Consistency** - - When processing pipeline runs with mixed data types, type validation can be applied to enforce strict type rules. - - - **Example:** - ```py - def validate_value(value): - if not isinstance(value, (int, str)): # Allow only integers and strings - raise TypeError(f"Invalid type: {type(value)}. 
Expected int or str.") - return str(value) # Convert all values to a consistent type (e.g., text) - - # First pipeline run - data_run_1 = [{"id": 1, "value": validate_value(42)}, - {"id": 2, "value": validate_value(123)}] - - # Second pipeline run - data_run_2 = [{"id": 3, "value": validate_value("high")}, - {"id": 4, "value": validate_value("low")}] - - # Third pipeline run - data_run_3 = [{"id": 7, "value": validate_value([1, 2, 3])}] - ``` - - In this example, data_run_4 contains an invalid value (a list) instead of an integer or string. When the pipeline runs with data_run_4, the validate_value function raises a TypeError. - -2. **Nested data challenges:** - - - **Scenario:** - - - Issues arise due to deep nesting, inconsistent nesting, or unsupported types. - - - **Possible solution:** - - - Simplify nested structures or preprocess data. [Read about nested tables.](../general-usage/destination-tables#nested-tables) - - You can limit unnesting level with `max_table_nesting`. - -## Load stage - -Failures in the **Load** stage often relate to authentication issues, schema changes, datatype mismatches or memory problems. - -### Authentication and connection failures - -### Common scenarios and possible solutions - -- **Scenario:** - - - Incorrect credentials. - - Data loading is interrupted due to connection issues or database downtime. This may leave some tables partially loaded or completely empty, halting the pipeline process. - -- **Possible solution:** - - - Verify credentials and follow proper setup instructions. [Credential setup guide.](../general-usage/credentials/setup) - - If the connection is restored, you can resume the load process using the `pipeline.load()` method. This ensures the pipeline picks up from where it stopped, reloading any remaining data packages. - - If data was **partially loaded**, check the `dlt_loads` table. If a `load_id` is missing from this table, it means the corresponding load **failed**. You can then remove partially loaded data by deleting any records associated with `load_id` values that do not exist in `dlt_loads`. [More details here](../general-usage/destination-tables#load-packages-and-load-ids). - -### Schema changes (e.g., column renaming, Datatype mismatches) - -### Common scenarios and possible solutions - -- **Scenario:** - - - Renamed columns create variant columns in the destination schema. - - Incoming datatypes that the destination doesn’t support result in variant columns. - -- **Possible solution:** - - - Use schema evolution to handle column renaming. [Read more about schema evolution.](../general-usage/schema-evolution#evolving-the-schema) - -### `FileNotFoundError` for 'schema_updates.json' in parallel runs - -- **Scenario** - When running the same pipeline name multiple times in parallel (e.g., via Airflow), `dlt` may fail at the load stage with an error like: - - > `FileNotFoundError: schema_updates.json not found` - - This happens because `schema_updates.json` is generated during normalization. Concurrent runs using the same pipeline name may overwrite or lock access to this file, causing failures. - -- **Possible Solutions** - - 1. **Use unique pipeline names for each parallel run** - - If calling `pipeline.run()` multiple times within the same workflow (e.g., once per resource), assign a unique `pipeline_name` for each run. This ensures separate working directories, preventing file conflicts. - - 2. 
**Leverage dlt’s concurrency management or Airflow helpers** - - dlt’s Airflow integration “serializes” resources into separate tasks while safely handling concurrency. To parallelize resource extraction without file conflicts, use: - ```py - decompose="serialize" - ``` - More details are available in the [Airflow documentation](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer#2-valueerror-can-only-decompose-dlt-source). - - 3. **Disable dev mode to prevent multiple destination datasets** - - When `dev_mode=True`, dlt generates unique dataset names (`_`) for each run. To maintain a consistent dataset, set: - ```py - dev_mode=False - ``` - Read more about this in the [dev mode documentation](../general-usage/pipeline#do-experiments-with-dev-mode). - -### Memory management issues - -- **Scenario:** - - - Loading large datasets without file rotation enabled. This would make dlt try to upload a huge data set into destination at once. *(Note: Rotation is disabled by default.)* - -- **Impact:** - - - Pipeline failures due to out-of-memory errors. - -- **Possible Solution:** - - - Enable file rotation. [Read more about it here.](./performance#controlling-intermediary-file-size-and-rotation) - -By identifying potential failure scenarios and applying the suggested mitigation strategies, you can ensure reliable and efficient pipeline performance. \ No newline at end of file diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index 8432834706..d4fd9d4341 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -463,7 +463,6 @@ const sidebars = { 'dlt-ecosystem/table-formats/iceberg', ] }, - 'reference/troubleshooting', 'reference/frequently-asked-questions', ], },