Background

Using the KubernetesExecutor on AWS EKS, pods are left running despite successful task completion and {taskinstance.py:1345} INFO - Marking task as SUCCESS being logged. The majority of completed tasks log {local_task_job_runner.py:225} INFO - Task exited with return code 0 and the pod shuts down, but some don't and the pod never dies.

We are seeing two processes in the base container of the affected pods: PID 7 is airflow and PID 45 is airflow task ru, of which the latter is a zombie. Running kill -9 7 shuts the pod down, but kill -9 45 does nothing (as expected for a zombie, which is already dead and only disappears once its parent reaps it). PID 7 is the parent of PID 45 and it is in a sleeping state.
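For reference, here is a minimal sketch of how the process states inside the base container can be listed; the script name is made up, it relies only on the Python standard library and a standard Linux /proc layout, and can be run via kubectl exec into the pod:

```python
# list_proc_states.py (name is just for illustration) - run inside the stuck
# worker pod, e.g. via `kubectl exec` into the base container. It walks /proc
# and prints each process with its parent PID and state, so the zombie child
# ("Z") and the sleeping supervisor ("S") described above are easy to spot.
import os

def read_stat(pid: int):
    """Return (command, state, ppid) parsed from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        raw = f.read()
    # The command name is wrapped in parentheses and may contain spaces
    # (e.g. "airflow task ru"), so split around the last closing parenthesis.
    lpar, rpar = raw.index("("), raw.rindex(")")
    comm = raw[lpar + 1:rpar]
    state, ppid = raw[rpar + 2:].split()[:2]
    return comm, state, ppid

if __name__ == "__main__":
    for pid in sorted(int(d) for d in os.listdir("/proc") if d.isdigit()):
        try:
            comm, state, ppid = read_stat(pid)
        except OSError:
            continue  # the process exited while we were iterating
        print(f"pid={pid:<6} ppid={ppid:<6} state={state}  comm={comm}")
```

On a stuck pod this should show the situation described above: PID 45 in state Z with PPID 7, and PID 7 in state S, i.e. the finished task process is never reaped by its parent.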
Logs

A successful task run:

[2023-08-17, 06:19:56 BST] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-17T04:00:16.123917+00:00 [queued]>
[2023-08-17, 06:19:56 BST] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-17T04:00:16.123917+00:00 [queued]>
[2023-08-17, 06:19:56 BST] {taskinstance.py:1308} INFO - Starting attempt 1 of 1
[2023-08-17, 06:19:56 BST] {taskinstance.py:1327} INFO - Executing <Task(S3ToRedshiftOperator): staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3> on 2023-08-17 04:00:16.123917+00:00
[2023-08-17, 06:19:56 BST] {standard_task_runner.py:57} INFO - Started process 45 to run task
[2023-08-17, 06:19:56 BST] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'ep-etl', 'staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3', 'manual__2023-08-17T04:00:16.123917+00:00', '--job-id', '517640', '--raw', '--subdir', 'DAGS_FOLDER/ep_etl.py', '--cfg-path', '/tmp/tmpx_hvh8by']
[2023-08-17, 06:19:56 BST] {standard_task_runner.py:85} INFO - Job 517640: Subtask staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3
[2023-08-17, 06:19:56 BST] {task_command.py:410} INFO - Running <TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-17T04:00:16.123917+00:00 [running]> on host ep-etl-staging-load-for-ep-tables-with-chunking-import-ep-sdi-s
[2023-08-17, 06:19:56 BST] {pod_generator.py:529} WARNING - Model file /opt/airflow/pod_templates/pod_template_file.yaml does not exist
[2023-08-17, 06:19:56 BST] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='ep-etl' AIRFLOW_CTX_TASK_ID='staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3' AIRFLOW_CTX_EXECUTION_DATE='2023-08-17T04:00:16.123917+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2023-08-17T04:00:16.123917+00:00'
[2023-08-17, 06:19:56 BST] {base.py:73} INFO - Using connection ID 's3_conn' for task execution.
[2023-08-17, 06:19:56 BST] {base.py:73} INFO - Using connection ID 's3_conn' for task execution.
[2023-08-17, 06:19:56 BST] {connection_wrapper.py:340} INFO - AWS Connection (conn_id='s3_conn', conn_type='aws') credentials retrieved from login and password.
[2023-08-17, 06:19:56 BST] {s3_to_redshift.py:192} INFO - Executing COPY command...
[2023-08-17, 06:19:56 BST] {base.py:73} INFO - Using connection ID 'redshift' for task execution.
[2023-08-17, 06:19:56 BST] {sql.py:374} INFO - Running statement: BEGIN;, parameters: None
[2023-08-17, 06:19:56 BST] {sql.py:374} INFO - Running statement: DELETE FROM **TABLE**;, parameters: None
[2023-08-17, 06:19:57 BST] {sql.py:383} INFO - Rows affected: x
[2023-08-17, 06:19:57 BST] {sql.py:374} INFO - Running statement:
***SQL STATEMENT***
, parameters: None
[2023-08-17, 06:20:33 BST] {sql.py:374} INFO - Running statement: COMMIT, parameters: None
[2023-08-17, 06:20:35 BST] {s3_to_redshift.py:197} INFO - COPY command complete...
[2023-08-17, 06:20:35 BST] {taskinstance.py:1345} INFO - Marking task as SUCCESS. dag_id=ep-etl, task_id=staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3, execution_date=20230817T040016, start_date=20230817T051956, end_date=20230817T052035
[2023-08-17, 06:20:35 BST] {local_task_job_runner.py:225} INFO - Task exited with return code 0
[2023-08-17, 06:20:35 BST] {taskinstance.py:2653} INFO - 1 downstream tasks scheduled from follow-on schedule check
The following day, the same task ran:
[2023-08-18, 06:18:49 BST] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-18T04:00:16.597146+00:00 [queued]>
[2023-08-18, 06:18:49 BST] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-18T04:00:16.597146+00:00 [queued]>
[2023-08-18, 06:18:49 BST] {taskinstance.py:1308} INFO - Starting attempt 1 of 1
[2023-08-18, 06:18:49 BST] {taskinstance.py:1327} INFO - Executing <Task(S3ToRedshiftOperator): staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3> on 2023-08-18 04:00:16.597146+00:00
[2023-08-18, 06:18:49 BST] {standard_task_runner.py:57} INFO - Started process 45 to run task
[2023-08-18, 06:18:49 BST] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'ep-etl', 'staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3', 'manual__2023-08-18T04:00:16.597146+00:00', '--job-id', '520839', '--raw', '--subdir', 'DAGS_FOLDER/ep_etl.py', '--cfg-path', '/tmp/tmpvv0hxe1f']
[2023-08-18, 06:18:49 BST] {standard_task_runner.py:85} INFO - Job 520839: Subtask staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3
[2023-08-18, 06:18:49 BST] {task_command.py:410} INFO - Running <TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-18T04:00:16.597146+00:00 [running]> on host ep-etl-staging-load-for-ep-tables-with-chunking-import-ep-sdi-s
[2023-08-18, 06:18:49 BST] {pod_generator.py:529} WARNING - Model file /opt/airflow/pod_templates/pod_template_file.yaml does not exist
[2023-08-18, 06:18:49 BST] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='ep-etl' AIRFLOW_CTX_TASK_ID='staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3' AIRFLOW_CTX_EXECUTION_DATE='2023-08-18T04:00:16.597146+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2023-08-18T04:00:16.597146+00:00'
[2023-08-18, 06:18:49 BST] {base.py:73} INFO - Using connection ID 's3_conn' for task execution.
[2023-08-18, 06:18:49 BST] {base.py:73} INFO - Using connection ID 's3_conn' for task execution.
[2023-08-18, 06:18:49 BST] {connection_wrapper.py:340} INFO - AWS Connection (conn_id='s3_conn', conn_type='aws') credentials retrieved from login and password.
[2023-08-18, 06:18:49 BST] {s3_to_redshift.py:192} INFO - Executing COPY command...
[2023-08-18, 06:18:49 BST] {base.py:73} INFO - Using connection ID 'redshift' for task execution.
[2023-08-18, 06:18:50 BST] {sql.py:374} INFO - Running statement: BEGIN;, parameters: None
[2023-08-18, 06:18:50 BST] {sql.py:374} INFO - Running statement: DELETE FROM **TABLE**;, parameters: None
[2023-08-18, 06:18:50 BST] {sql.py:383} INFO - Rows affected: x
[2023-08-18, 06:18:50 BST] {sql.py:374} INFO - Running statement:
***SQL STATEMENT***
, parameters: None
[2023-08-18, 06:19:24 BST] {sql.py:374} INFO - Running statement: COMMIT, parameters: None
[2023-08-18, 06:19:27 BST] {s3_to_redshift.py:197} INFO - COPY command complete...
[2023-08-18, 06:19:27 BST] {taskinstance.py:1345} INFO - Marking task as SUCCESS. dag_id=ep-etl, task_id=staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3, execution_date=20230818T040016, start_date=20230818T051849, end_date=20230818T051927
The exact same task performing the same work, but this time there is no "Task exited with return code 0" and the pod is still running 8 hours after completion.
We see the same behaviour sporadically across all the different operators we run: SQLExecuteQueryOperator, S3ToRedshiftOperator, SqlToS3Operator, TriggerDagRunOperator, etc.
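As an illustration of how the stuck pods can be spotted from outside the container, here is a rough sketch using the kubernetes Python client; the airflow namespace and the 8-hour threshold are assumptions to adapt to your deployment:

```python
# find_stale_worker_pods.py (illustrative only) - list pods that have been
# Running for longer than a threshold, which is how the stuck workers show up.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

NAMESPACE = "airflow"           # assumption: namespace used by the deployment
THRESHOLD = timedelta(hours=8)  # these tasks normally finish within minutes

def main() -> None:
    config.load_kube_config()  # use load_incluster_config() when run in-cluster
    cutoff = datetime.now(timezone.utc) - THRESHOLD
    for pod in client.CoreV1Api().list_namespaced_pod(NAMESPACE).items:
        if pod.status.phase == "Running" and pod.metadata.creation_timestamp < cutoff:
            print(pod.metadata.name, pod.metadata.creation_timestamp)

if __name__ == "__main__":
    main()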
Versions
Airflow: 2.6.3
EKS: v1.25.12-eks-2d98532
Libraries:
apache-airflow-providers-amazon==8.5.0
apache-airflow-providers-celery==3.2.1
apache-airflow-providers-cncf-kubernetes==7.1.0
apache-airflow-providers-common-sql==1.5.2
apache-airflow-providers-docker==3.7.1
apache-airflow-providers-elasticsearch==4.5.1
apache-airflow-providers-ftp==3.4.2
apache-airflow-providers-google==10.2.0
apache-airflow-providers-grpc==3.2.1
apache-airflow-providers-hashicorp==3.4.1
apache-airflow-providers-http==4.4.2
apache-airflow-providers-imap==3.2.2
apache-airflow-providers-microsoft-azure==6.1.2
apache-airflow-providers-mysql==5.1.1
apache-airflow-providers-odbc==4.0.0
apache-airflow-providers-postgres==5.5.1
apache-airflow-providers-redis==3.2.1
apache-airflow-providers-sendgrid==3.2.1
apache-airflow-providers-sftp==4.3.1
apache-airflow-providers-slack==7.3.1
apache-airflow-providers-snowflake==4.2.0
apache-airflow-providers-sqlite==3.4.2
apache-airflow-providers-ssh==3.7.1
apache-airflow-providers-tableau==4.2.1
OS: Debian GNU/Linux 11 (bullseye)
Deployment: Official helm chart 1.10
Any ideas or suggestions for further debugging would be greatly appreciated!