Tasks Stuck in Running state after execution complete
#43174
I finally figured out that the issue was in our success_callback, which calls the PagerDuty API with a resolve event. We only want to do this if it's a successful retry. In version 2.6.3, the |
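For anyone wiring up something similar, here is a minimal sketch of a success callback that only sends a resolve event after a retry, and that puts a timeout on the HTTP call. It is not our actual callback. The function names, the dedup_key scheme, and the `try_number > 1` check are assumptions for illustration; check how try_number behaves inside callbacks on your Airflow version before relying on it.

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_resolve_event(routing_key, dedup_key):
    """Build the JSON body for an Events API v2 'resolve' event."""
    return {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    }


def is_successful_retry(try_number):
    # Assumption: a value above 1 means the task failed at least once first.
    # try_number semantics have differed between Airflow versions.
    return try_number > 1


def post_event(event, timeout_seconds):
    """Send the event with an explicit timeout so the call cannot block forever."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.status


def resolve_on_successful_retry(context, routing_key, send=post_event, timeout_seconds=10):
    """Hypothetical success callback body: resolve only after a retry."""
    ti = context["ti"]
    if not is_successful_retry(ti.try_number):
        return None  # First-attempt success: no alert was raised, nothing to resolve.
    # Hypothetical dedup_key; it must match whatever the trigger event used.
    event = build_resolve_event(routing_key, dedup_key=f"{ti.dag_id}.{ti.task_id}")
    send(event, timeout_seconds)
    return event
```

Airflow passes only the context to a callback, so the routing key would need to be bound first, for example with functools.partial(resolve_on_successful_retry, routing_key=...). The send parameter exists so the function can be exercised without calling PagerDuty.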
We just upgraded from version 2.6.3 to 2.10.2. After the upgrade we're seeing much longer run times for DAGs during peak hours (we have thousands of tasks that kick off every 4 hours). During non-peak hours, the DAGs seem to have comparable run times to pre-upgrade.
While the DAG is still running, we see a task that typically completes in <1 minute appear to be stuck in the running state for 40-50 minutes, even though the logs show that execution is complete. Eventually the task is marked as success, and it then shows a run time of the expected ~30 seconds.
Here are the logs for the task in the Post-Execution portion
You can see a 50 minute period between the logs in our success_callback and the local_task_job_runner marking the task as a success.
I can also see in the Event log that the Success log was written at 8:55.
This isn't happening for all tasks and it's not consistent which ones it's happening with, but we've seen the behavior with several tasks across several DAGs. This seems to be impacting concurrency across the instance because these tasks are taking up slots in the Pool while not actually doing anything.
I'm curious if anyone has tips on how to fix this or other places to look?
As a next step and band-aid, I'm planning to add execution_timeout values for these tasks, but that is only possible for a few types of tasks we have; the impacted tasks currently seem to fall into that bucket.
Details about deployment:
Executor: CeleryKubernetesExecutor - impacted tasks seem to all be running on Celery workers
Metadata Database: AWS RDS MySQL
Airflow Helm chart: 1.10 (we're upgrading to 1.15 now)
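For the execution_timeout stopgap mentioned above, a rough sketch of how a timeout could be derived from a task's expected run time. The helper name, the safety factor, and the floor are made-up values for illustration, not settings from our deployment. This only builds the value; it does not show whether the timeout would cut short time spent in a callback after execution.

```python
from datetime import timedelta

# Hypothetical numbers: tasks here normally finish in about 30 seconds.
EXPECTED_RUNTIME = timedelta(seconds=30)
SAFETY_FACTOR = 10
MINIMUM_TIMEOUT = timedelta(minutes=2)


def timeout_for(expected, factor=SAFETY_FACTOR, floor=MINIMUM_TIMEOUT):
    """Return a generous timeout: expected run time times a factor, never below the floor."""
    return max(expected * factor, floor)


# execution_timeout is a standard BaseOperator argument that takes a timedelta,
# so it can be set per task or shared through a DAG's default_args.
default_args = {
    "execution_timeout": timeout_for(EXPECTED_RUNTIME),
}
```

Passing default_args to a DAG would apply the same timeout to every task in it, so tasks that legitimately run longer would need their own execution_timeout override.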