KubernetesOperator jobs fail -- [probably NOT] because "pod_template_file.yaml does not exist" #30805
-
Airflow 2.5.1; Helm chart 1.8.0

I recently started working with KubernetesOperator to check it out, and initially my jobs worked as I expected, but recently they all fail, and it looks like it's because the airflow pod_template is not being applied to the container when it's spun up. The main symptom was that anything relying on a k8s Secret failed -- and then I noticed the error from the title ("pod_template_file.yaml does not exist") in the logs of the failed jobs.
Of course, I don't think I changed anything that would have affected this, but it's certainly possible that I did. The big change I made recently (in the last few weeks) was realizing that running airflow as uid/gid 1000/1000 was incorrect, so I switched my image and my airflow values.yaml to the default 50000/0 -- but I don't think that would have stopped the pod_template_file from being included. The file does exist on the scheduler pod at the path the config points to. The relevant part of my values.yaml is:

config:
  __stuff_deleted__: 'deleted irrelevant config attributes'
  celery_kubernetes_executor:
    kubernetes_queue: 'kubernetes'
  kubernetes_executor:
    namespace: '{{ .Release.Namespace }}'
    airflow_configmap: '{{ include "airflow_config" . }}'
    airflow_local_settings_configmap: '{{ include "airflow_config" . }}'
    pod_template_file: '{{ include "airflow_pod_template_file" . }}/pod_template_file.yaml'
    worker_container_repository: '{{ .Values.images.airflow.repository | default .Values.defaultAirflowRepository }}'
    worker_container_tag: '{{ .Values.images.airflow.tag | default .Values.defaultAirflowTag }}'
    multi_namespace_mode: '{{ ternary "True" "False" .Values.multiNamespaceMode }}'

So the pod_template_file exists at the path the config points to, yet the task log says it does not. I checked the git history, and the only change I've made since the K8sOperator jobs last ran successfully was changing config.kubernetes to config.kubernetes_executor somewhere in there (to stop getting the error messages about the section having been renamed); otherwise I think the values.yaml is pretty vanilla (no customization for the template, etc.). Does this issue ring a bell with anyone?
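[Editor's note: a minimal sketch of one way to sanity-check the above, not part of the original post. The namespace "airflow", Deployment "airflow-scheduler", container "scheduler", and the /opt/airflow/pod_templates mount path are assumptions based on the official chart's defaults; adjust to your deployment.]

# List the pods so you can target the real scheduler pod (assumed namespace "airflow")
kubectl get pods -n airflow
# Does the rendered Airflow config actually point at the template file?
# (section name "kubernetes_executor" assumed, matching the config above)
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- \
  airflow config get-value kubernetes_executor pod_template_file
# Is the file present at that path inside the pod? (path assumed from chart defaults)
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- \
  ls -l /opt/airflow/pod_templates/pod_template_file.yaml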
-
Likely this user/group has no access to the file or to one of the folders on the path.
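[Editor's note: a quick way to test that hypothesis, added as a sketch and not part of the original reply. The namespace, Deployment, container, and file path below are assumptions; swap in your own.]

# What uid/gid does the container actually run as after the 50000/0 change?
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- id
# Show numeric ownership and permissions along the whole path
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- \
  ls -ldn /opt/airflow /opt/airflow/pod_templates /opt/airflow/pod_templates/pod_template_file.yaml
# If that uid/gid cannot read the file, this read fails with "Permission denied"
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- \
  cat /opt/airflow/pod_templates/pod_template_file.yaml > /dev/null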
-
I forgot to update this thread, I apologize. After @potiuk answered, I decided to stop assuming the errors in the task log were the cause of the problem I was trying to solve ("KubernetesOperator jobs fail") and dug deeper. It started looking like the failures were not really failures so much as the scheduler and triggerer getting tripped up by something, and further digging made me suspect that choosing SQL Server for my meta db wasn't the masterstroke I thought it was. That turned out to be correct as far as I can tell: I changed my deployment to Postgres using Postgres Operator, and since then a number of the weird problems I was seeing appear to have been resolved.

I never did resolve the source of the error itself. I'm guessing it has to come from one of them (the scheduler or triggerer), because the error occurs before the pod is built. This is the output from one of the kubernetes operator examples -- I had put it up back when I was having this problem to try to find my issue -- but other than the error message in the task log, I can't find errors in any other logs.
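[Editor's note: for anyone hitting the same metadata-db issue, a minimal sketch of pointing the official chart at an external Postgres instead of SQL Server. The host, credentials, and database name are placeholders, and the exact keys should be verified against your chart version's values.yaml.]

postgresql:
  enabled: false                                  # do not deploy the chart's bundled Postgres
data:
  metadataConnection:
    protocol: postgresql
    host: airflow-pg.airflow.svc.cluster.local    # placeholder: service created by Postgres Operator
    port: 5432
    db: airflow                                   # placeholder database name
    user: airflow                                 # placeholder credentials
    pass: change-me
    sslmode: disable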
I guess that's normal. I do not know it all by heart -- normally I'd have to go to the source code to check. I have no idea what's wrong, but you have not mentioned which component prints the error in its log. Maybe that will give you a clue -- you will be able to see whether that component has the file or not. I think the best way to see what's going on is to use
helm install --dry-run
if in doubt (and there is also a helm-diff plugin). That's the easiest way to see the k8s resources that get created. It will print all the resources, and you will be able to see what gets mounted where.
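[Editor's note: a concrete example of the above, assuming the release is named airflow, comes from the official apache-airflow/airflow chart, and lives in the airflow namespace; adjust names to your deployment.]

# Render everything the chart would apply, without touching the cluster
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow -f values.yaml --dry-run --debug > rendered.yaml
# Then look for where the pod template gets mounted
grep -n "pod_template" rendered.yaml
# Or compare against what is currently deployed, using the helm-diff plugin
helm plugin install https://github.com/databus23/helm-diff
helm diff upgrade airflow apache-airflow/airflow --namespace airflow -f values.yaml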