-
Hi. We have been running a few Airflow instances for the past few years, and they have been very stable. Our needs are lightweight, and the instances remained stable through several Airflow upgrades until this issue. Over the past couple of months, all of the instances have gone into a CPU-overload condition in various Airflow subprocesses; however, the issue has occurred at different times on each server, seemingly at random, and we have not been able to identify a trigger for the condition. Any ideas on further troubleshooting would be appreciated.

Environment
All servers: Red Hat Enterprise Linux Server 7.9, LocalExecutor.

Issue
Airflow subprocesses (all of them: the scheduler's DagFileProcessor, the task runner, and the task supervisor) begin to consume a full core of the machine, or as much as they can get. There are no errors, and tasks still complete successfully, but they get slower and slower until the machine is overtaken. A task that generally takes 10 seconds to run begins to slow down until it takes several minutes; machine CPU usage climbs similarly, while memory usage on the server stays flat and unchanging. On a healthy instance, the Airflow subprocesses consume only a percent or two of a core. (We run with all scheduler parameters at their default settings.) The issue cannot be resolved by stopping and restarting the Airflow instance. However, if the Linux server is restarted, the issue is resolved for days or weeks before it arises again.

Things observed
The fact that restarting the Linux server resolves the issue while restarting the Airflow instance does not is a puzzlement to me, and I have run out of ideas. As such, I thought I would reach out in hopes that someone may have additional insight or ideas that could help find the cause of the issue. Thanks for any ideas or input you may have.
-
It's really hard to say, but a good idea would be to use a profiler and see what Airflow is doing when it goes into "overdrive". Tools like https://github.com/benfred/py-spy can attach to a running Python interpreter and dump information on what's going on. Maybe you could try it and provide more information? Without it, this is shooting in the dark.
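For reference, a typical py-spy session against a runaway Airflow subprocess might look like the sketch below (`<PID>` is a placeholder you fill in; `sudo` may or may not be required depending on your ptrace settings):

```shell
# Install py-spy -- it attaches to an already-running interpreter,
# so no Airflow restart is needed
pip install py-spy

# Find the PID of the hot Airflow subprocess (sorted by CPU usage)
ps aux | grep -i airflow | sort -k3 -nr | head

# Dump the current Python stack traces of that process
sudo py-spy dump --pid <PID>

# Or sample it for 30 seconds and write a flame graph of hot functions
sudo py-spy record --pid <PID> --duration 30 -o airflow-profile.svg
```

Comparing a dump from a healthy instance against one taken while the CPU is pegged should show which Python frames the extra cycles are being spent in.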
-
Thanks for the suggestion @potiuk. We will give it a try and let you know what we find.
-
While I have not discovered the root OS-level issue that progressively degrades and causes database connection establishment to hang and become slow, PgBouncer did indeed resolve the issue. Thanks for that prompt.
Thanks for being so diligent! I would love to hear what it is.
Q: Are you using PgBouncer between Airflow and your Postgres? Because if you DON'T, then THIS is the most probable reason. Airflow opens a lot of connections to the database, and that can cause Postgres memory usage to go ballistic if you have a lot of parallel tasks and DAG file processor parsing tasks. Postgres spawns a new process for every connection rather than multiplexing them the way MySQL does, so a well-configured PgBouncer is pretty much a MUST if you want to use Postgres.
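As an illustration only (hostnames, ports, database names, and pool sizes are placeholders; the right values depend on your deployment), a minimal PgBouncer setup in front of Airflow's Postgres might look like:

```ini
; /etc/pgbouncer/pgbouncer.ini -- sketch, adjust paths and sizes to your setup
[databases]
; PgBouncer listens locally and forwards to the real Postgres
airflow = host=127.0.0.1 port=5432 dbname=airflow

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling keeps the number of backend Postgres processes small
pool_mode = transaction
max_client_conn = 200
default_pool_size = 20
```

Airflow is then pointed at PgBouncer instead of Postgres directly, e.g. a `sql_alchemy_conn` of the form `postgresql+psycopg2://airflow:***@127.0.0.1:6432/airflow` in `airflow.cfg`, so every Airflow subprocess shares a small pool of real Postgres backends.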
If Postgres eats all the memory and starts swapping, then you might indeed see very slow connection establis…