From 09dc5107696cd4b09bf5cb38c9f7e252d8cf15c9 Mon Sep 17 00:00:00 2001
From: dat-a-man <98139823+dat-a-man@users.noreply.github.com>
Date: Sun, 23 Feb 2025 09:33:23 +0000
Subject: [PATCH] Updated for FileNotFoundError

---
 .../website/docs/reference/troubleshooting.md | 35 +++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/docs/website/docs/reference/troubleshooting.md b/docs/website/docs/reference/troubleshooting.md
index a974d696b7..4be54ab679 100644
--- a/docs/website/docs/reference/troubleshooting.md
+++ b/docs/website/docs/reference/troubleshooting.md
@@ -181,7 +181,7 @@ Timestamp issues occur when formats are incompatible with the destination or inc
 
 - Standardize timestamp formats across all runs to maintain consistent schema inference and avoid the creation of variant columns.
 
-3. Inconsistent formats for incremental loading
+3. **Inconsistent formats for incremental loading**
 
 - **Scenario:**
@@ -402,6 +402,37 @@ Failures in the **Load** stage often relate to authentication issues, schema cha
 
 - Use schema evolution to handle column renaming. [Read more about schema evolution.](../general-usage/schema-evolution#evolving-the-schema)
 
+### `FileNotFoundError` for `schema_updates.json` in parallel runs
+
+- **Scenario:**
+
+  When the same pipeline name is run multiple times in parallel (e.g., via Airflow), `dlt` may fail at the load stage with an error like:
+
+  > `FileNotFoundError: schema_updates.json not found`
+
+  This happens because `schema_updates.json` is generated during normalization. Concurrent runs that use the same pipeline name share its working directory, so they may overwrite or lock access to this file, causing failures.
+
+- **Possible Solutions:**
+
+  1. **Use unique pipeline names for each parallel run**
+
+     If you call `pipeline.run()` multiple times within the same workflow (e.g., once per resource), assign a unique `pipeline_name` to each run. This gives every run its own working directory, preventing file conflicts.
+
+  2. 
**Leverage dlt’s concurrency management or Airflow helpers**
+
+     dlt’s Airflow integration can decompose a source into separate per-resource tasks that run sequentially, so they never compete for the pipeline’s working directory. To enable this, use:
+     ```py
+     decompose="serialize"
+     ```
+     More details are available in the [Airflow documentation](../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer#2-valueerror-can-only-decompose-dlt-source).
+
+  3. **Disable dev mode to prevent multiple destination datasets**
+
+     When `dev_mode=True`, dlt creates a new dataset for every run by appending a datetime-based suffix to the dataset name. To keep loading into one consistent dataset, set:
+     ```py
+     dev_mode=False
+     ```
+     Read more about this in the [dev mode documentation](../general-usage/pipeline#do-experiments-with-dev-mode).
+
 ### Memory management issues
 
 - **Scenario:**
 
@@ -412,7 +443,7 @@ Failures in the **Load** stage often relate to authentication issues, schema cha
 
 - Pipeline failures due to out-of-memory errors.
 
-- **Solution:**
+- **Possible Solution:**
 
 - Enable file rotation. [Read more about it here.](./performance#controlling-intermediary-file-size-and-rotation)
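
The unique-pipeline-name workaround in the patch above can be sketched in plain Python. This is a minimal illustration only: the `pipeline_name_for` helper and the source/resource names are hypothetical, not part of dlt's API.

```python
def pipeline_name_for(base: str, resource: str) -> str:
    """Derive a distinct pipeline name for each parallel branch.

    Every distinct pipeline_name gets its own working directory,
    so concurrent runs never contend for the same schema_updates.json.
    (Hypothetical helper; names are illustrative.)
    """
    return f"{base}_{resource}"


# Each parallel branch would then build its own pipeline, e.g.:
#   dlt.pipeline(pipeline_name=pipeline_name_for("my_source", "users"), ...)
print(pipeline_name_for("my_source", "users"))   # my_source_users
print(pipeline_name_for("my_source", "orders"))  # my_source_orders
```

Because the names differ per resource, each run writes its normalization artifacts to a separate directory instead of racing for a shared one.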