Background

Using the KubernetesExecutor on AWS EKS, pods are left running despite successful task completion and {taskinstance.py:1345} INFO - Marking task as SUCCESS being logged. The majority of completed tasks log {local_task_job_runner.py:225} INFO - Task exited with return code 0 and the pod shuts down, but some don't and the pod never dies.

We are seeing two processes in the base container of the affected pods: PID 7 is airflow and PID 45 is airflow task ru, of which the latter is a zombie. Running kill -9 7 shuts the pod down, but kill -9 45 does nothing (as expected for a zombie, which is already dead and only disappears once its parent reaps it). PID 7 is the parent of PID 45 and it is in a sleeping state.
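For reference, here is a minimal sketch of how the process states inside the base container can be listed; the script name is made up, it relies only on the Python standard library and a standard Linux /proc layout, and can be run via kubectl exec into the pod:

```python
# list_proc_states.py (name is just for illustration) - run inside the stuck
# worker pod, e.g. via `kubectl exec` into the base container. It walks /proc
# and prints each process with its parent PID and state, so the zombie child
# ("Z") and the sleeping supervisor ("S") described above are easy to spot.
import os

def read_stat(pid: int):
    """Return (command, state, ppid) parsed from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        raw = f.read()
    # The command name is wrapped in parentheses and may contain spaces
    # (e.g. "airflow task ru"), so split around the last closing parenthesis.
    lpar, rpar = raw.index("("), raw.rindex(")")
    comm = raw[lpar + 1:rpar]
    state, ppid = raw[rpar + 2:].split()[:2]
    return comm, state, ppid

if __name__ == "__main__":
    for pid in sorted(int(d) for d in os.listdir("/proc") if d.isdigit()):
        try:
            comm, state, ppid = read_stat(pid)
        except OSError:
            continue  # the process exited while we were iterating
        print(f"pid={pid:<6} ppid={ppid:<6} state={state}  comm={comm}")
```

On a stuck pod this should show the situation described above: PID 45 in state Z with PPID 7, and PID 7 in state S, i.e. the finished task process is never reaped by its parent.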
Logs

A successful task run:

[2023-08-17, 06:19:56 BST] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-17T04:00:16.123917+00:00 [queued]>
[2023-08-17, 06:19:56 BST] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-17T04:00:16.123917+00:00 [queued]>
[2023-08-17, 06:19:56 BST] {taskinstance.py:1308} INFO - Starting attempt 1 of 1
[2023-08-17, 06:19:56 BST] {taskinstance.py:1327} INFO - Executing <Task(S3ToRedshiftOperator): staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3> on 2023-08-17 04:00:16.123917+00:00
[2023-08-17, 06:19:56 BST] {standard_task_runner.py:57} INFO - Started process 45 to run task
[2023-08-17, 06:19:56 BST] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'ep-etl', 'staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3', 'manual__2023-08-17T04:00:16.123917+00:00', '--job-id', '517640', '--raw', '--subdir', 'DAGS_FOLDER/ep_etl.py', '--cfg-path', '/tmp/tmpx_hvh8by']
[2023-08-17, 06:19:56 BST] {standard_task_runner.py:85} INFO - Job 517640: Subtask staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3
[2023-08-17, 06:19:56 BST] {task_command.py:410} INFO - Running <TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-17T04:00:16.123917+00:00 [running]> on host ep-etl-staging-load-for-ep-tables-with-chunking-import-ep-sdi-s
[2023-08-17, 06:19:56 BST] {pod_generator.py:529} WARNING - Model file /opt/airflow/pod_templates/pod_template_file.yaml does not exist
[2023-08-17, 06:19:56 BST] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='ep-etl' AIRFLOW_CTX_TASK_ID='staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3' AIRFLOW_CTX_EXECUTION_DATE='2023-08-17T04:00:16.123917+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2023-08-17T04:00:16.123917+00:00'
[2023-08-17, 06:19:56 BST] {base.py:73} INFO - Using connection ID 's3_conn' for task execution.
[2023-08-17, 06:19:56 BST] {base.py:73} INFO - Using connection ID 's3_conn' for task execution.
[2023-08-17, 06:19:56 BST] {connection_wrapper.py:340} INFO - AWS Connection (conn_id='s3_conn', conn_type='aws') credentials retrieved from login and password.
[2023-08-17, 06:19:56 BST] {s3_to_redshift.py:192} INFO - Executing COPY command...
[2023-08-17, 06:19:56 BST] {base.py:73} INFO - Using connection ID 'redshift' for task execution.
[2023-08-17, 06:19:56 BST] {sql.py:374} INFO - Running statement: BEGIN;, parameters: None
[2023-08-17, 06:19:56 BST] {sql.py:374} INFO - Running statement: DELETE FROM **TABLE**;, parameters: None
[2023-08-17, 06:19:57 BST] {sql.py:383} INFO - Rows affected: x
[2023-08-17, 06:19:57 BST] {sql.py:374} INFO - Running statement:
***SQL STATEMENT***
, parameters: None
[2023-08-17, 06:20:33 BST] {sql.py:374} INFO - Running statement: COMMIT, parameters: None
[2023-08-17, 06:20:35 BST] {s3_to_redshift.py:197} INFO - COPY command complete...
[2023-08-17, 06:20:35 BST] {taskinstance.py:1345} INFO - Marking task as SUCCESS. dag_id=ep-etl, task_id=staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3, execution_date=20230817T040016, start_date=20230817T051956, end_date=20230817T052035
[2023-08-17, 06:20:35 BST] {local_task_job_runner.py:225} INFO - Task exited with return code 0
[2023-08-17, 06:20:35 BST] {taskinstance.py:2653} INFO - 1 downstream tasks scheduled from follow-on schedule check
The following day, the same task ran:
[2023-08-18, 06:18:49 BST] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-18T04:00:16.597146+00:00 [queued]>
[2023-08-18, 06:18:49 BST] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-18T04:00:16.597146+00:00 [queued]>
[2023-08-18, 06:18:49 BST] {taskinstance.py:1308} INFO - Starting attempt 1 of 1
[2023-08-18, 06:18:49 BST] {taskinstance.py:1327} INFO - Executing <Task(S3ToRedshiftOperator): staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3> on 2023-08-18 04:00:16.597146+00:00
[2023-08-18, 06:18:49 BST] {standard_task_runner.py:57} INFO - Started process 45 to run task
[2023-08-18, 06:18:49 BST] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'ep-etl', 'staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3', 'manual__2023-08-18T04:00:16.597146+00:00', '--job-id', '520839', '--raw', '--subdir', 'DAGS_FOLDER/ep_etl.py', '--cfg-path', '/tmp/tmpvv0hxe1f']
[2023-08-18, 06:18:49 BST] {standard_task_runner.py:85} INFO - Job 520839: Subtask staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3
[2023-08-18, 06:18:49 BST] {task_command.py:410} INFO - Running <TaskInstance: ep-etl.staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3 manual__2023-08-18T04:00:16.597146+00:00 [running]> on host ep-etl-staging-load-for-ep-tables-with-chunking-import-ep-sdi-s
[2023-08-18, 06:18:49 BST] {pod_generator.py:529} WARNING - Model file /opt/airflow/pod_templates/pod_template_file.yaml does not exist
[2023-08-18, 06:18:49 BST] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='ep-etl' AIRFLOW_CTX_TASK_ID='staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3' AIRFLOW_CTX_EXECUTION_DATE='2023-08-18T04:00:16.597146+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2023-08-18T04:00:16.597146+00:00'
[2023-08-18, 06:18:49 BST] {base.py:73} INFO - Using connection ID 's3_conn' for task execution.
[2023-08-18, 06:18:49 BST] {base.py:73} INFO - Using connection ID 's3_conn' for task execution.
[2023-08-18, 06:18:49 BST] {connection_wrapper.py:340} INFO - AWS Connection (conn_id='s3_conn', conn_type='aws') credentials retrieved from login and password.
[2023-08-18, 06:18:49 BST] {s3_to_redshift.py:192} INFO - Executing COPY command...
[2023-08-18, 06:18:49 BST] {base.py:73} INFO - Using connection ID 'redshift' for task execution.
[2023-08-18, 06:18:50 BST] {sql.py:374} INFO - Running statement: BEGIN;, parameters: None
[2023-08-18, 06:18:50 BST] {sql.py:374} INFO - Running statement: DELETE FROM **TABLE**;, parameters: None
[2023-08-18, 06:18:50 BST] {sql.py:383} INFO - Rows affected: x
[2023-08-18, 06:18:50 BST] {sql.py:374} INFO - Running statement:
***SQL STATEMENT***
, parameters: None
[2023-08-18, 06:19:24 BST] {sql.py:374} INFO - Running statement: COMMIT, parameters: None
[2023-08-18, 06:19:27 BST] {s3_to_redshift.py:197} INFO - COPY command complete...
[2023-08-18, 06:19:27 BST] {taskinstance.py:1345} INFO - Marking task as SUCCESS. dag_id=ep-etl, task_id=staging-load-for-ep-tables-with-chunking.import-ep-sdi-scan-data-into-redshift-from-s3, execution_date=20230818T040016, start_date=20230818T051849, end_date=20230818T051927
The exact same task performing the same work, but this time there is no "Task exited with return code 0" and the pod is still running 8 hours after completion.
We see the same behaviour sporadically across all the different operators we run: SQLExecuteQueryOperator, S3ToRedshiftOperator, SqlToS3Operator, TriggerDagRunOperator, etc.
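As an illustration of how the stuck pods can be spotted from outside the container, here is a rough sketch using the kubernetes Python client; the airflow namespace and the 8-hour threshold are assumptions to adapt to your deployment:

```python
# find_stale_worker_pods.py (illustrative only) - list pods that have been
# Running for longer than a threshold, which is how the stuck workers show up.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

NAMESPACE = "airflow"           # assumption: namespace used by the deployment
THRESHOLD = timedelta(hours=8)  # these tasks normally finish within minutes

def main() -> None:
    config.load_kube_config()  # use load_incluster_config() when run in-cluster
    cutoff = datetime.now(timezone.utc) - THRESHOLD
    for pod in client.CoreV1Api().list_namespaced_pod(NAMESPACE).items:
        if pod.status.phase == "Running" and pod.metadata.creation_timestamp < cutoff:
            print(pod.metadata.name, pod.metadata.creation_timestamp)

if __name__ == "__main__":
    main()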
Versions
Airflow: 2.6.3
EKS: v1.25.12-eks-2d98532
Libraries:
apache-airflow-providers-amazon==8.5.0
apache-airflow-providers-celery==3.2.1
apache-airflow-providers-cncf-kubernetes==7.1.0
apache-airflow-providers-common-sql==1.5.2
apache-airflow-providers-docker==3.7.1
apache-airflow-providers-elasticsearch==4.5.1
apache-airflow-providers-ftp==3.4.2
apache-airflow-providers-google==10.2.0
apache-airflow-providers-grpc==3.2.1
apache-airflow-providers-hashicorp==3.4.1
apache-airflow-providers-http==4.4.2
apache-airflow-providers-imap==3.2.2
apache-airflow-providers-microsoft-azure==6.1.2
apache-airflow-providers-mysql==5.1.1
apache-airflow-providers-odbc==4.0.0
apache-airflow-providers-postgres==5.5.1
apache-airflow-providers-redis==3.2.1
apache-airflow-providers-sendgrid==3.2.1
apache-airflow-providers-sftp==4.3.1
apache-airflow-providers-slack==7.3.1
apache-airflow-providers-snowflake==4.2.0
apache-airflow-providers-sqlite==3.4.2
apache-airflow-providers-ssh==3.7.1
apache-airflow-providers-tableau==4.2.1
OS: Debian GNU/Linux 11 (bullseye)
Deployment: Official helm chart 1.10
Any ideas or suggestions for further debugging would be greatly appreciated!