Added doc on troubleshooting #2304

Status: Open. Wants to merge 7 commits into base branch `devel`.
77 changes: 77 additions & 0 deletions docs/website/docs/general-usage/schema-evolution.md
@@ -213,3 +213,80 @@ Demonstrating schema evolution without talking about schema and data contracts i

Schema and data contracts can be applied to entities such as ‘tables’, ‘columns’, and ‘data_types’ using contract modes such as ‘evolve’, ‘freeze’, ‘discard_rows’, and ‘discard_columns’ to tell dlt how to apply contracts for a particular entity. To read more about **schema and data contracts**, read our [documentation](./schema-contracts).

## Troubleshooting
This section addresses common schema evolution issues.

#### 1. Inconsistent data types
- When a field's data type varies between pipeline runs, dlt creates additional variant columns that may require extra handling downstream. For example, consider the following pipeline runs:
```py
# First pipeline run: "value" is an integer
data_run_1 = [
{"id": 1, "value": 42},
{"id": 2, "value": 123}
]

# Second pipeline run: "value" changes to text
data_run_2 = [
{"id": 3, "value": "high"},
{"id": 4, "value": "low"}
]

# Third pipeline run: Mixed types in "value"
data_run_3 = [
{"id": 5, "value": 789}, # back to integer
{"id": 6, "value": "medium"} # mixed types
]
```

- As a result, the original column remains unchanged and a new variant column `value__v_text` is created for the text values, requiring downstream processes to handle both columns appropriately.

- **Recommended solutions:**
- **Enforce type consistency**
- You can enforce type consistency using the `apply_hints` method. This ensures that all values in the column adhere to a specified data type. For example:

```py
# Assuming 'resource' is your data resource
resource.apply_hints(columns={
"value": {"data_type": "text"}, # Enforce 'value' to be of type 'text'
})
```
In this example, the `value` column is always treated as text, even if the original data contains integers or mixed types.

- **Handle multiple types with separate columns**
- The dlt library automatically handles mixed data types by creating variant columns. If a column receives values that cannot be coerced to its existing data type, dlt generates a separate variant column for each additional type.
- For example, if a column named `value` contains both integers and strings, dlt creates a new column called `value__v_text` for the string values.
- After processing multiple runs, the schema will be:
```text
| name | data_type | nullable |
|--------------|--------------|----------|
| id | bigint | true |
| value | bigint | true |
| value__v_text| text | true |
```
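- As a rough sketch (the pipeline, dataset, and table names below are placeholders), loading the three example runs above into DuckDB produces the schema shown:
```py
import dlt

# Sketch assuming a local DuckDB destination; data_run_1/2/3 are the lists
# defined earlier in this section.
pipeline = dlt.pipeline(
    pipeline_name="variant_demo", destination="duckdb", dataset_name="demo"
)

pipeline.run(data_run_1, table_name="items")  # "value" is inferred as bigint
pipeline.run(data_run_2, table_name="items")  # text values land in "value__v_text"
pipeline.run(data_run_3, table_name="items")  # rows are split across both columns
```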

- **Apply type validation**
- Validate incoming data to ensure that only expected types are processed. For example:
```py
def validate_value(value):
    if not isinstance(value, (int, str)):  # Allow only integers and strings
        raise TypeError(f"Invalid type: {type(value)}. Expected int or str.")
    return str(value)  # Convert all values to a consistent type (e.g., text)

# First pipeline run
data_run_1 = [{"id": 1, "value": validate_value(42)},
{"id": 2, "value": validate_value(123)}]

# Second pipeline run
data_run_2 = [{"id": 3, "value": validate_value("high")},
{"id": 4, "value": validate_value("low")}]

# Third pipeline run
data_run_3 = [{"id": 7, "value": validate_value([1, 2, 3])}]
```

In this example, `data_run_3` contains an invalid value (a list) instead of an integer or string, so `validate_value` raises a `TypeError` as soon as the record is built, before the data reaches the pipeline.

#### 2. Nested data challenges
- Issues arise due to deep nesting, inconsistent nesting, or unsupported types.

- To avoid this, simplify nested structures or preprocess the data (see [nested tables](../general-usage/destination-tables#nested-tables)), or limit the unnesting level with `max_table_nesting`, as sketched below.
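For illustration, a minimal sketch that caps unnesting at one level (the resource name, sample data, and DuckDB destination are placeholders, and it assumes your dlt version accepts `max_table_nesting` on the resource decorator):

```py
import dlt

@dlt.resource(max_table_nesting=1)  # keep anything deeper than one level as JSON
def customers():
    yield [
        {
            "id": 1,
            "address": {"city": "Berlin", "geo": {"lat": 52.52, "lon": 13.405}},
        }
    ]

pipeline = dlt.pipeline(pipeline_name="nesting_demo", destination="duckdb")
pipeline.run(customers())
```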
56 changes: 56 additions & 0 deletions docs/website/docs/general-usage/schema.md
@@ -441,3 +441,59 @@ def textual(nesting_level: int):
return dlt.resource([])
```

## Troubleshooting

This section addresses common datatype issues.

### Unsupported timestamps and format issues

Timestamp issues can occur when the formats are incompatible with the destination or when they change inconsistently between pipeline runs.

#### 1. Unsupported formats or features
- Combining `precision` and `timezone` in timestamps causes errors in specific destinations (e.g., DuckDB).
- You can simplify the timestamp format to exclude unsupported features. For example:

```py
import dlt

@dlt.resource(
columns={"event_tstamp": {"data_type": "timestamp", "precision": 3, "timezone": False}},
primary_key="event_id",
)
def events():
    yield [{"event_id": 1, "event_tstamp": "2024-07-30T10:00:00.123+00:00"}]

pipeline = dlt.pipeline(destination="duckdb")
pipeline.run(events())
```

#### 2. Inconsistent formats across runs
- Different pipeline runs use varying timestamp formats (e.g., `YYYY-MM-DD HH:MM:SS` vs. ISO 8601 vs. non-standard formats).
- As a result, the destination (e.g., BigQuery) might infer a timestamp column in one run, while later runs with incompatible formats (like `20-08-2024` or `04th of January 2024`) result in the creation of variant columns (e.g., `end_date__v_text`).
- It is best practice to standardize timestamp formats across all pipeline runs to maintain consistent column data type inference, for example by normalizing values with `add_map` as sketched below.
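For example, a sketch that normalizes every incoming value to ISO 8601 with `add_map` before loading (the column name `end_date`, the sample data, and the DuckDB destination are placeholders):

```py
import dlt
import pendulum

@dlt.resource(primary_key="id")
def events():
    # Raw records arriving with mixed timestamp formats.
    yield [
        {"id": 1, "end_date": "2024-08-20 10:00:00"},
        {"id": 2, "end_date": "2024-08-21T10:00:00+00:00"},
    ]

def normalize_end_date(record):
    # Parse whatever format arrives and emit ISO 8601 so inference stays stable.
    record["end_date"] = pendulum.parse(record["end_date"], strict=False).isoformat()
    return record

pipeline = dlt.pipeline(pipeline_name="timestamp_demo", destination="duckdb")
pipeline.run(events().add_map(normalize_end_date))
```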

#### 3. Inconsistent formats for incremental loading
- The data source returns string timestamps, but incremental loading is configured with an integer `initial_value`.
- Example:
```py
# API response
data = [
{"id": 1, "name": "Item 1", "created_at": "2024-01-01 00:00:00"},
]

# Incorrect configuration (type mismatch)
@dlt.resource(primary_key="id")
def my_data(
    created_at=dlt.sources.incremental(
        "created_at",
        initial_value=9999
    )
):
    yield data
```
- This makes the pipeline fail with an `IncrementalCursorInvalidCoercion` error because dlt cannot compare an integer (the `initial_value` of 9999) with a string timestamp. The error indicates a type mismatch between the expected and actual data formats.
- To solve this, you can:
  - Use a string timestamp as the `initial_value` so it matches the cursor field, as sketched below.
  - Convert the source data using `add_map`.
  - If you need to use timestamps for comparison but want to preserve the original format, create a separate column.
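A minimal corrected sketch of the configuration above, assuming the source keeps returning string timestamps in the same format:

```py
import dlt

data = [
    {"id": 1, "name": "Item 1", "created_at": "2024-01-01 00:00:00"},
]

# Correct configuration: initial_value is a string timestamp in the same
# format as the cursor field, so the comparison succeeds.
@dlt.resource(primary_key="id")
def my_data(
    created_at=dlt.sources.incremental(
        "created_at",
        initial_value="2024-01-01 00:00:00",
    )
):
    yield data
```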

12 changes: 11 additions & 1 deletion docs/website/docs/reference/performance.md
@@ -122,10 +122,20 @@ Below, we set files to rotate after 100,000 items written or when the filesize e

<!--@@@DLT_SNIPPET ./performance_snippets/toml-snippets.toml::file_size_toml-->

:::note NOTE
When working with a single resource that handles a very large dataset, memory exhaustion may occur during processing. To mitigate this, enable file rotation by configuring `file_max_items` or `file_max_bytes` to split the data into smaller chunks, and consider increasing the number of parallel workers. Read more about [parallel processing](#parallelism-within-a-pipeline).
:::
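As an illustration, the same rotation settings can also be supplied as environment variables, which dlt resolves like `config.toml` entries (sections joined with double underscores); the values below are placeholders:

```py
import os

# A sketch: rotate files after 100,000 items or ~1 MB; set before the pipeline runs.
os.environ["DATA_WRITER__FILE_MAX_ITEMS"] = "100000"
os.environ["DATA_WRITER__FILE_MAX_BYTES"] = "1000000"
```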

### Disabling and enabling file compression
Several [text file formats](../dlt-ecosystem/file-formats/) have `gzip` compression enabled by default. If you wish that your load packages have uncompressed files (e.g., to debug the content easily), change `data_writer.disable_compression` in config.toml. The entry below will disable the compression of the files processed in the `normalize` stage.
<!--@@@DLT_SNIPPET ./performance_snippets/toml-snippets.toml::compression_toml-->

### Handling insufficient RAM for in-memory operations
If your available RAM is not sufficient for in-memory operations, consider these optimizations:

Adjust the `buffer_max_items` setting to fine-tune the size of in-memory buffers. This helps prevent memory overconsumption when processing large datasets. For more details, see the [buffer configuration guide](#controlling-in-memory-buffers).

For handling big data efficiently, process your data in **chunks** rather than loading it entirely into memory. This batching approach allows for more effective resource management and can significantly reduce memory usage, as sketched below.
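As a sketch, a resource that yields data in chunks instead of materializing everything at once (the chunk size and the generated rows are placeholders for your own reader):

```py
import dlt

@dlt.resource(table_name="big_table")
def chunked_rows(chunk_size: int = 10_000):
    rows = ({"id": i, "payload": f"row {i}"} for i in range(1_000_000))  # placeholder source
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            yield chunk  # only `chunk_size` rows are held in memory at a time
            chunk = []
    if chunk:
        yield chunk
```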

### Freeing disk space after loading

@@ -197,7 +207,7 @@ The default is to not parallelize normalization and to perform it in the main pr
:::

:::note
Normalization is CPU-bound and can easily saturate all your cores. Never allow `dlt` to use all cores on your local machine.
Normalization is CPU-bound and can easily saturate all your cores if not configured properly. Too many workers may exhaust resources; too few may underutilize capacity. Never allow `dlt` to use all available cores on your local machine; adjust the worker settings in your `config.toml` accordingly.
:::
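For example, the normalize worker count can be capped via configuration; a hedged sketch using the environment-variable form of the `[normalize]` section (the value is a placeholder):

```py
import os

# Equivalent to `workers = 3` under `[normalize]` in config.toml.
os.environ["NORMALIZE__WORKERS"] = "3"
```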

:::caution
22 changes: 22 additions & 0 deletions docs/website/docs/walkthroughs/run-a-pipeline.md
@@ -282,6 +282,28 @@ should tell you what went wrong.
The most probable cause of the failed job is **the data in the job file**. You can inspect the file
using the **JOB file path** provided.

### Exceeding API rate limits

If your pipeline receives an HTTP `429` error, the API has temporarily blocked your requests because you exceeded its rate limits. Here are some steps to help you troubleshoot and resolve the issue:

- Ensure that your API credentials are set up correctly so that your requests are properly authenticated.

- Check the API’s guidelines on rate limits. Look for headers such as `Retry-After` in the response to determine how long you should wait before retrying.

- Use tools like `time.sleep()` or libraries such as `ratelimiter` to introduce delays between requests. This helps you stay within the allowed limits.

- Incorporate exponential backoff strategies in your code: if a request fails with a `429`, wait for a short period and try again, increasing the wait time on each subsequent failure, as sketched after this list.

- Consider batching requests or caching results to reduce the number of API calls needed during your data load process.
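A small sketch of such a backoff loop (the URL, retry count, and use of `requests` are placeholders, not part of dlt):

```py
import time
import requests

def get_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    wait = 1.0
    for _ in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Prefer the server's hint, otherwise double the wait each time.
        retry_after = response.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else wait)
        wait *= 2
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
```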

### Connection failures

Data loading can be interrupted due to connection issues or database downtime. When this happens, some tables might be partially loaded or even empty, which halts the pipeline process.

- If the connection is restored, you can resume the load process using the `pipeline.load()` method. It picks up where the previous load stopped and loads any remaining load packages, as sketched after this list.

- In the event that data was partially loaded, check the `_dlt_loads` table. If a specific `load_id` is missing from this table, the corresponding load has failed. You can then remove the partially loaded data by deleting records whose load id does not appear in `_dlt_loads`. More details can be found in the [destination tables documentation](../general-usage/destination-tables#load-packages-and-load-ids).
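A minimal sketch of resuming after connectivity returns (pipeline, destination, and dataset names are placeholders):

```py
import dlt

# Re-create the pipeline with the same name so dlt finds the pending
# load packages from the interrupted run.
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline", destination="duckdb", dataset_name="my_data"
)

# Load whatever packages are still pending and inspect the result.
load_info = pipeline.load()
print(load_info)
```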

## Further readings

- [Beef up your script for production](../running-in-production/running.md), easily add alerting,