Added doc on troubleshooting #2304

Status: Open. Wants to merge 7 commits into base branch `devel`.
77 changes: 77 additions & 0 deletions docs/website/docs/general-usage/schema-evolution.md
@@ -213,3 +213,80 @@ Demonstrating schema evolution without talking about schema and data contracts i

Schema and data contracts can be applied to entities such as ‘tables’, ‘columns’, and ‘data_types’ using contract modes such as ‘evolve’, ‘freeze’, ‘discard_rows’, and ‘discard_columns’ to tell dlt how to apply contracts for a particular entity. To read more about **schema and data contracts**, read our [documentation](./schema-contracts).

## Troubleshooting
This section addresses common schema evolution issues.

#### 1. Inconsistent data types
- When a field's data type varies between pipeline runs, dlt creates additional variant columns that may require extra handling downstream. For example, consider the following pipeline runs:
```py
# First pipeline run: "value" is an integer
data_run_1 = [
{"id": 1, "value": 42},
{"id": 2, "value": 123}
]

# Second pipeline run: "value" changes to text
data_run_2 = [
{"id": 3, "value": "high"},
{"id": 4, "value": "low"}
]

# Third pipeline run: Mixed types in "value"
data_run_3 = [
{"id": 5, "value": 789}, # back to integer
{"id": 6, "value": "medium"} # mixed types
]
```

- As a result, the original column remains unchanged and a new variant column `value__v_text` is created for the text values, requiring downstream processes to handle both columns appropriately.

- **Recommended solutions:**
- **Enforce type consistency**
- You can enforce type consistency using the `apply_hints` method. This ensures that all values in the column adhere to a specified data type. For example:

```py
# Assuming 'resource' is your data resource
resource.apply_hints(columns={
"value": {"data_type": "text"}, # Enforce 'value' to be of type 'text'
})
```
In this example, the `value` column is always treated as text, even if the original data contains integers or mixed types.

- **Handle multiple types with separate columns**
- The dlt library automatically handles mixed data types by creating variant columns. If a column receives values that cannot be coerced to its existing data type, dlt generates a separate variant column for each additional type.
- For example, if a column named `value` contains both integers and strings, dlt creates a new column called `value__v_text` for the string values.
- After processing multiple runs, the schema will be:
```text
| name | data_type | nullable |
|--------------|--------------|----------|
| id | bigint | true |
| value | bigint | true |
| value__v_text| text | true |
```
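- As a rough sketch (the pipeline, dataset, and table names below are placeholders), loading the three example runs above into DuckDB produces the schema shown:
```py
import dlt

# Sketch assuming a local DuckDB destination; data_run_1/2/3 are the lists
# defined earlier in this section.
pipeline = dlt.pipeline(
    pipeline_name="variant_demo", destination="duckdb", dataset_name="demo"
)

pipeline.run(data_run_1, table_name="items")  # "value" is inferred as bigint
pipeline.run(data_run_2, table_name="items")  # text values land in "value__v_text"
pipeline.run(data_run_3, table_name="items")  # rows are split across both columns
```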

- **Apply type validation**
- Validate incoming data to ensure that only expected types are processed. For example:
```py
def validate_value(value):
    if not isinstance(value, (int, str)):  # Allow only integers and strings
        raise TypeError(f"Invalid type: {type(value)}. Expected int or str.")
    return str(value)  # Convert all values to a consistent type (e.g., text)

# First pipeline run
data_run_1 = [{"id": 1, "value": validate_value(42)},
{"id": 2, "value": validate_value(123)}]

# Second pipeline run
data_run_2 = [{"id": 3, "value": validate_value("high")},
{"id": 4, "value": validate_value("low")}]

# Third pipeline run
data_run_3 = [{"id": 7, "value": validate_value([1, 2, 3])}]
```

In this example, `data_run_3` contains an invalid value (a list) instead of an integer or string, so `validate_value` raises a `TypeError` as soon as the record is built, before the data reaches the pipeline.

#### 2. Nested data challenges
- Issues arise due to deep nesting, inconsistent nesting, or unsupported types.

- To avoid this, simplify nested structures or preprocess the data (see [nested tables](../general-usage/destination-tables#nested-tables)), or limit the unnesting level with `max_table_nesting`, as sketched below.
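For illustration, a minimal sketch that caps unnesting at one level (the resource name, sample data, and DuckDB destination are placeholders, and it assumes your dlt version accepts `max_table_nesting` on the resource decorator):

```py
import dlt

@dlt.resource(max_table_nesting=1)  # keep anything deeper than one level as JSON
def customers():
    yield [
        {
            "id": 1,
            "address": {"city": "Berlin", "geo": {"lat": 52.52, "lon": 13.405}},
        }
    ]

pipeline = dlt.pipeline(pipeline_name="nesting_demo", destination="duckdb")
pipeline.run(customers())
```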
56 changes: 56 additions & 0 deletions docs/website/docs/general-usage/schema.md
@@ -441,3 +441,59 @@ def textual(nesting_level: int):
return dlt.resource([])
```

## Troubleshooting

This section addresses common datatype issues.

### Unsupported timestamps and format issues

Timestamp issues can occur when the formats are incompatible with the destination or when they change inconsistently between pipeline runs.

#### 1. Unsupported formats or features
- Combining `precision` and `timezone` in timestamps causes errors in specific destinations (e.g., DuckDB).
- You can simplify the timestamp format to exclude unsupported features. For example:

```py
import dlt

@dlt.resource(
columns={"event_tstamp": {"data_type": "timestamp", "precision": 3, "timezone": False}},
primary_key="event_id",
)
def events():
    yield [{"event_id": 1, "event_tstamp": "2024-07-30T10:00:00.123+00:00"}]

pipeline = dlt.pipeline(destination="duckdb")
pipeline.run(events())
```

#### 2. Inconsistent formats across runs
- Different pipeline runs use varying timestamp formats (e.g., `YYYY-MM-DD HH:MM:SS` vs. ISO 8601 vs. non-standard formats).
- As a result, the destination (e.g., BigQuery) might infer a timestamp column in one run, while later runs with incompatible formats (like `20-08-2024` or `04th of January 2024`) result in the creation of variant columns (e.g., `end_date__v_text`).
- It is best practice to standardize timestamp formats across all pipeline runs to maintain consistent column data type inference, for example by normalizing values with `add_map` as sketched below.
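For example, a sketch that normalizes every incoming value to ISO 8601 with `add_map` before loading (the column name `end_date`, the sample data, and the DuckDB destination are placeholders):

```py
import dlt
import pendulum

@dlt.resource(primary_key="id")
def events():
    # Raw records arriving with mixed timestamp formats.
    yield [
        {"id": 1, "end_date": "2024-08-20 10:00:00"},
        {"id": 2, "end_date": "2024-08-21T10:00:00+00:00"},
    ]

def normalize_end_date(record):
    # Parse whatever format arrives and emit ISO 8601 so inference stays stable.
    record["end_date"] = pendulum.parse(record["end_date"], strict=False).isoformat()
    return record

pipeline = dlt.pipeline(pipeline_name="timestamp_demo", destination="duckdb")
pipeline.run(events().add_map(normalize_end_date))
```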

#### 3. Inconsistent formats for incremental loading
- The data source returns string timestamps, but incremental loading is configured with an integer `initial_value`.
- Example:
```py
# API response
data = [
{"id": 1, "name": "Item 1", "created_at": "2024-01-01 00:00:00"},
]

# Incorrect configuration (type mismatch)
@dlt.resource(primary_key="id")
def my_data(
    created_at=dlt.sources.incremental(
        "created_at",
        initial_value=9999
    )
):
    yield data
```
- This makes the pipeline fail with an `IncrementalCursorInvalidCoercion` error because dlt cannot compare an integer (the `initial_value` of 9999) with a string timestamp. The error indicates a type mismatch between the expected and actual data formats.
- To solve this, you can:
  - Use a string timestamp as the `initial_value` so it matches the cursor field, as sketched below.
  - Convert the source data using `add_map`.
  - If you need to use timestamps for comparison but want to preserve the original format, create a separate column.
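A minimal corrected sketch of the configuration above, assuming the source keeps returning string timestamps in the same format:

```py
import dlt

data = [
    {"id": 1, "name": "Item 1", "created_at": "2024-01-01 00:00:00"},
]

# Correct configuration: initial_value is a string timestamp in the same
# format as the cursor field, so the comparison succeeds.
@dlt.resource(primary_key="id")
def my_data(
    created_at=dlt.sources.incremental(
        "created_at",
        initial_value="2024-01-01 00:00:00",
    )
):
    yield data
```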

12 changes: 11 additions & 1 deletion docs/website/docs/reference/performance.md
@@ -122,10 +122,20 @@ Below, we set files to rotate after 100,000 items written or when the filesize e

<!--@@@DLT_SNIPPET ./performance_snippets/toml-snippets.toml::file_size_toml-->

:::note NOTE
When working with a single resource that handles a very large dataset, memory exhaustion may occur during processing. To mitigate this, enable file rotation by configuring `file_max_items` or `file_max_bytes` to split the data into smaller chunks, and consider increasing the number of parallel workers. Read more about [parallel processing](#parallelism-within-a-pipeline).
:::
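As an illustration, the same rotation settings can also be supplied as environment variables, which dlt resolves like `config.toml` entries (sections joined with double underscores); the values below are placeholders:

```py
import os

# A sketch: rotate files after 100,000 items or ~1 MB; set before the pipeline runs.
os.environ["DATA_WRITER__FILE_MAX_ITEMS"] = "100000"
os.environ["DATA_WRITER__FILE_MAX_BYTES"] = "1000000"
```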

### Disabling and enabling file compression
Several [text file formats](../dlt-ecosystem/file-formats/) have `gzip` compression enabled by default. If you wish that your load packages have uncompressed files (e.g., to debug the content easily), change `data_writer.disable_compression` in config.toml. The entry below will disable the compression of the files processed in the `normalize` stage.
<!--@@@DLT_SNIPPET ./performance_snippets/toml-snippets.toml::compression_toml-->

### Handling insufficient RAM for in-memory operations
If your available RAM is not sufficient for in-memory operations, consider these optimizations:

Adjust the `buffer_max_items` setting to fine-tune the size of in-memory buffers. This helps prevent memory overconsumption when processing large datasets. For more details, see the [buffer configuration guide](#controlling-in-memory-buffers).

For handling big data efficiently, process your data in **chunks** rather than loading it entirely into memory. This batching approach allows for more effective resource management and can significantly reduce memory usage, as sketched below.
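As a sketch, a resource that yields data in chunks instead of materializing everything at once (the chunk size and the generated rows are placeholders for your own reader):

```py
import dlt

@dlt.resource(table_name="big_table")
def chunked_rows(chunk_size: int = 10_000):
    rows = ({"id": i, "payload": f"row {i}"} for i in range(1_000_000))  # placeholder source
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            yield chunk  # only `chunk_size` rows are held in memory at a time
            chunk = []
    if chunk:
        yield chunk
```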

### Freeing disk space after loading

@@ -197,7 +207,7 @@ The default is to not parallelize normalization and to perform it in the main pr
:::

:::note
Normalization is CPU-bound and can easily saturate all your cores. Never allow `dlt` to use all cores on your local machine.
Normalization is CPU-bound and can easily saturate all your cores if not configured properly. Too many workers may exhaust resources; too few may underutilize capacity. Never allow `dlt` to use all available cores on your local machine; adjust the worker settings in your `config.toml` accordingly.
:::
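For example, the normalize worker count can be capped via configuration; a hedged sketch using the environment-variable form of the `[normalize]` section (the value is a placeholder):

```py
import os

# Equivalent to `workers = 3` under `[normalize]` in config.toml.
os.environ["NORMALIZE__WORKERS"] = "3"
```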

:::caution
22 changes: 22 additions & 0 deletions docs/website/docs/walkthroughs/run-a-pipeline.md
@@ -282,6 +282,28 @@ should tell you what went wrong.
The most probable cause of the failed job is **the data in the job file**. You can inspect the file
using the **JOB file path** provided.

### Exceeding API rate limits

If your pipeline receives an HTTP `429` error, the API has temporarily blocked your requests because you exceeded its rate limits. Here are some steps to help you troubleshoot and resolve the issue:

- Ensure that your API credentials are set up correctly so that your requests are properly authenticated.

- Check the API’s guidelines on rate limits. Look for headers such as `Retry-After` in the response to determine how long you should wait before retrying.

- Use tools like `time.sleep()` or libraries such as `ratelimiter` to introduce delays between requests. This helps you stay within the allowed limits.

- Incorporate exponential backoff strategies in your code: if a request fails with a `429`, wait for a short period and try again, increasing the wait time on each subsequent failure, as sketched after this list.

- Consider batching requests or caching results to reduce the number of API calls needed during your data load process.
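A small sketch of such a backoff loop (the URL, retry count, and use of `requests` are placeholders, not part of dlt):

```py
import time
import requests

def get_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    wait = 1.0
    for _ in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Prefer the server's hint, otherwise double the wait each time.
        retry_after = response.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else wait)
        wait *= 2
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
```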

### Connection failures

Data loading can be interrupted due to connection issues or database downtime. When this happens, some tables might be partially loaded or even empty, which halts the pipeline process.

- If the connection is restored, you can resume the load process using the `pipeline.load()` method. It picks up where the previous load stopped and loads any remaining load packages, as sketched after this list.

- In the event that data was partially loaded, check the `_dlt_loads` table. If a specific `load_id` is missing from this table, the corresponding load has failed. You can then remove the partially loaded data by deleting records whose load id does not appear in `_dlt_loads`. More details can be found in the [destination tables documentation](../general-usage/destination-tables#load-packages-and-load-ids).
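A minimal sketch of resuming after connectivity returns (pipeline, destination, and dataset names are placeholders):

```py
import dlt

# Re-create the pipeline with the same name so dlt finds the pending
# load packages from the interrupted run.
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline", destination="duckdb", dataset_name="my_data"
)

# Load whatever packages are still pending and inspect the result.
load_info = pipeline.load()
print(load_info)
```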

## Further readings

- [Beef up your script for production](../running-in-production/running.md), easily add alerting,