KubernetesOperator jobs fail -- [probably NOT] because "pod_template_file.yaml does not exist" #30805
-
Airflow 2.5.1; Helm chart 1.8.0

I recently started working with KubernetesOperator to check it out, and initially my jobs worked as I expected, but recently they all fail, and it looks like it's because the airflow pod_template is not being applied to the container when it's spun up. The main symptom was that anything relying on a k8s Secret failed -- and then I noticed the error from the title ("pod_template_file.yaml does not exist") in the logs of the failed jobs.
Of course, I don't think I changed anything that would have affected this, but it's certainly possible that I did. The big change I made recently (in the last few weeks) was realizing that running airflow as uid/gid 1000/1000 was incorrect, so I switched my image and my airflow values.yaml to the default 50000/0 -- but I don't think that would have stopped the pod_template_file from being included. The file does exist on the scheduler pod at the path the config points to. The relevant part of my values.yaml is:

config:
  __stuff_deleted__: 'deleted irrelevant config attributes'
  celery_kubernetes_executor:
    kubernetes_queue: 'kubernetes'
  kubernetes_executor:
    namespace: '{{ .Release.Namespace }}'
    airflow_configmap: '{{ include "airflow_config" . }}'
    airflow_local_settings_configmap: '{{ include "airflow_config" . }}'
    pod_template_file: '{{ include "airflow_pod_template_file" . }}/pod_template_file.yaml'
    worker_container_repository: '{{ .Values.images.airflow.repository | default .Values.defaultAirflowRepository }}'
    worker_container_tag: '{{ .Values.images.airflow.tag | default .Values.defaultAirflowTag }}'
    multi_namespace_mode: '{{ ternary "True" "False" .Values.multiNamespaceMode }}'

So the pod_template_file exists at the path the config points to, yet the task log says it does not. I checked the git history, and the only change I've made since the K8sOperator jobs last ran successfully was changing config.kubernetes to config.kubernetes_executor somewhere in there (to stop getting the error messages about the section having been renamed); otherwise I think the values.yaml is pretty vanilla (no customization for the template, etc.). Does this issue ring a bell with anyone?
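[Editor's note: a minimal sketch of one way to sanity-check the above, not part of the original post. The namespace "airflow", Deployment "airflow-scheduler", container "scheduler", and the /opt/airflow/pod_templates mount path are assumptions based on the official chart's defaults; adjust to your deployment.]

# List the pods so you can target the real scheduler pod (assumed namespace "airflow")
kubectl get pods -n airflow
# Does the rendered Airflow config actually point at the template file?
# (section name "kubernetes_executor" assumed, matching the config above)
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- \
  airflow config get-value kubernetes_executor pod_template_file
# Is the file present at that path inside the pod? (path assumed from chart defaults)
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- \
  ls -l /opt/airflow/pod_templates/pod_template_file.yaml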
-
Likely this user/group has no access to the file or to one of the folders on the path.
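[Editor's note: a quick way to test that hypothesis, added as a sketch and not part of the original reply. The namespace, Deployment, container, and file path below are assumptions; swap in your own.]

# What uid/gid does the container actually run as after the 50000/0 change?
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- id
# Show numeric ownership and permissions along the whole path
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- \
  ls -ldn /opt/airflow /opt/airflow/pod_templates /opt/airflow/pod_templates/pod_template_file.yaml
# If that uid/gid cannot read the file, this read fails with "Permission denied"
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- \
  cat /opt/airflow/pod_templates/pod_template_file.yaml > /dev/null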
-
I forgot to update this thread, I apologize. After @potiuk answered, I decided to stop assuming the errors in the task log were the cause of the problem I was trying to solve ("KubernetesOperator jobs fail") and dug deeper. It started looking like the failures were not really failures so much as the scheduler and triggerer getting tripped up by something, and further digging made me suspect that choosing SQL Server for my meta db wasn't the masterstroke I thought it was. That turned out to be correct as far as I can tell: I changed my deployment to Postgres using Postgres Operator, and since then a number of the weird problems I was seeing appear to have been resolved.

I never did resolve the source of the error itself. I'm guessing it has to come from one of them (the scheduler or triggerer), because the error occurs before the pod is built. This is the output from one of the kubernetes operator examples -- I had put it up back when I was having this problem to try to find my issue -- but other than the error message in the task log, I can't find errors in any other logs.
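[Editor's note: for anyone hitting the same metadata-db issue, a minimal sketch of pointing the official chart at an external Postgres instead of SQL Server. The host, credentials, and database name are placeholders, and the exact keys should be verified against your chart version's values.yaml.]

postgresql:
  enabled: false                                  # do not deploy the chart's bundled Postgres
data:
  metadataConnection:
    protocol: postgresql
    host: airflow-pg.airflow.svc.cluster.local    # placeholder: service created by Postgres Operator
    port: 5432
    db: airflow                                   # placeholder database name
    user: airflow                                 # placeholder credentials
    pass: change-me
    sslmode: disable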
I guess that's normal. I do not know it all by heart -- normally I'd have to go to the source code to check. I have no idea what's wrong, but you have not mentioned which component prints the error in its log. Maybe that will give you a clue -- you will be able to see whether that component has the file or not. I think the best way to see what's going on is to use
helm install --dry-run
if in doubt (and there is also a helm-diff plugin). That's the easiest way to see the k8s resources that get created. It will print all the resources, and you will be able to see what gets mounted where.
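[Editor's note: a concrete example of the above, assuming the release is named airflow, comes from the official apache-airflow/airflow chart, and lives in the airflow namespace; adjust names to your deployment.]

# Render everything the chart would apply, without touching the cluster
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow -f values.yaml --dry-run --debug > rendered.yaml
# Then look for where the pod template gets mounted
grep -n "pod_template" rendered.yaml
# Or compare against what is currently deployed, using the helm-diff plugin
helm plugin install https://github.com/databus23/helm-diff
helm diff upgrade airflow apache-airflow/airflow --namespace airflow -f values.yaml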