Tasks can stick in the "queue" state due to the re-launch of DagFileProcessorManager #36093
Replies: 7 comments
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
-
Question: Are you talking about standalone DAG File Processor or the one that runs in Scheduler?
-
I am talking about But I looked into
-
yes. I was just checking.
Yeah, I think this edge case needs to be investigated and fixed.
-
So, can we divide this problem into two parts?
-
Could be. Possibly someone who knows that part better can chime in and comment.
-
If this is impacting you, you can use stalled_task_timeout if you're using CeleryExecutor or worker_pods_pending_timeout if you're using KubernetesExecutor. Or upgrade to 2.6+ and use task_queued_timeout regardless of executor.
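For anyone wiring up that workaround, here is a minimal sketch of how the three options map to environment variables. It relies on Airflow's `AIRFLOW__{SECTION}__{KEY}` naming convention. The config sections used here (`celery`, `kubernetes`, `scheduler`) are my assumption from memory and are not stated in this thread, so check them against the configuration reference for your version. The helper names `airflow_env_var` and `queued_timeout_setting` are hypothetical.

```python
from __future__ import annotations


def airflow_env_var(section: str, key: str) -> str:
    """Build an environment variable name using AIRFLOW__{SECTION}__{KEY}."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def queued_timeout_setting(airflow_version: tuple[int, int], executor: str) -> str:
    """Return the env var for the timeout option suggested in the comment above.

    The section names are assumptions; verify them for your Airflow release.
    """
    if airflow_version >= (2, 6):
        # 2.6+ has a single option that applies regardless of executor.
        return airflow_env_var("scheduler", "task_queued_timeout")
    if executor == "CeleryExecutor":
        return airflow_env_var("celery", "stalled_task_timeout")
    if executor == "KubernetesExecutor":
        # Assumed section name; it may differ between 2.x releases.
        return airflow_env_var("kubernetes", "worker_pods_pending_timeout")
    raise ValueError(f"No queued-task timeout option known for {executor!r}")


if __name__ == "__main__":
    # Example: Airflow 2.4 with CeleryExecutor, as in the reported setup's version.
    print(queued_timeout_setting((2, 4), "CeleryExecutor"))
```

These timeouts only clean up tasks that are already stuck; they do not stop callbacks from being lost in the first place.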
-
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
Airflow 2.4.2 (but I suppose other versions are affected too)
I noticed that DAG and task callbacks sometimes do not fire. When that happens, some tasks hang in the "queued" state and need a manual restart. After some investigation, I determined that this happens when DagFileProcessorManager is re-launched.
Here's what I think happens:
What you think should happen instead
At a minimum, the scheduler could itself mark tasks that define on_failure_callback or on_retry_callback as failed, as it already does for all other tasks. That would prevent tasks from getting stuck.
It would also be great if the other callbacks were not lost when DagFileProcessorManager is re-launched.
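To make the failure mode concrete, here is a toy model of it. This is not Airflow's code. It assumes that pending callbacks live only in the manager's memory, so a re-launch that builds a fresh manager starts with an empty queue. `ToyProcessorManager`, `relaunch`, and `relaunch_preserving` are hypothetical names, and carrying callbacks over is just one possible approach, not a proposed patch.

```python
from collections import defaultdict


class ToyProcessorManager:
    """Hypothetical stand-in for a manager that queues callbacks in memory."""

    def __init__(self):
        # Maps a DAG file path to the callbacks waiting to run for that file.
        self.callbacks_to_execute = defaultdict(list)

    def add_callback(self, file_path, callback_name):
        self.callbacks_to_execute[file_path].append(callback_name)

    def pending(self):
        return sum(len(cbs) for cbs in self.callbacks_to_execute.values())


def relaunch(old_manager):
    """Naive re-launch: the new manager starts empty, pending callbacks are lost."""
    return ToyProcessorManager()


def relaunch_preserving(old_manager):
    """One possible alternative: carry pending callbacks over to the new manager."""
    new_manager = ToyProcessorManager()
    for file_path, callbacks in old_manager.callbacks_to_execute.items():
        new_manager.callbacks_to_execute[file_path].extend(callbacks)
    return new_manager


if __name__ == "__main__":
    manager = ToyProcessorManager()
    manager.add_callback("dags/example.py", "on_failure_callback")
    print("before re-launch:", manager.pending())
    print("after naive re-launch:", relaunch(manager).pending())
    print("after preserving re-launch:", relaunch_preserving(manager).pending())
```

In the naive case nothing is left to run the failure callback, which matches the symptom of tasks staying queued until someone restarts them.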
How to reproduce
It's very hard to reproduce, but you can do something like this:
Add here this code
Then run:
Then check the dag_processor_manager log to see that _callback_to_execute has had a callback and then became empty.
Operating System
CentOS Linux 7
Versions of Apache Airflow Providers
No response
Deployment
Virtualenv installation
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct