Does anyone know why Airflow creates repeating logs in the scheduler log? #33714
-
Hi, as the discussion title says, does anyone here know why this happens? Here is a snapshot of my scheduler log:
As you can see in the snapshot above, Airflow keeps repeating the same 8 log lines. This happens in the scheduler log for every DAG I have. Here is an example of one of my DAGs:
At the moment each log grows by roughly 5 MB per DAG per day, and IMO 5 MB for a log file is too much, especially when the cause is repeated lines like that. I know I can silence them by raising AIRFLOW__LOGGING__LOGGING_LEVEL to WARNING or above, but I need to keep the logging level at INFO. Does anyone know how to adjust this?
-
Because the scheduler/dag_processor continuously re-parses DAGs from .py files every min_file_process_interval (30 seconds by default).
I would start with these:
- Fine-tuning your Scheduler performance
- Advanced Logging
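If the 30-second cadence produces more log volume than you need, a minimal sketch of slowing it down (the option and its default come from the scheduler configuration reference; the 300-second value is just an illustration):

```ini
# airflow.cfg — sketch: re-parse DAG files every 5 minutes instead of the default 30s
[scheduler]
min_file_process_interval = 300
```

The same setting can be supplied as the AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL environment variable. Note the trade-off: less parsing noise, but changes to DAG files are picked up more slowly.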
I would start worrying about this point: it might indicate an issue in your DAG code, so better check that your DAGs follow Best Practices.
This is how your logs should look; pay attention to the time to process, the log frequency, and next_dagrun:
```
[2023-08-23T00:00:33.588+0000] {processor.py:157} INFO - Started process (PID=57137) to work on /files/dags/ecs_sample.py
[2023-08-23T00:00:33.589+0000] {processor.py:820} INFO - Processing file /files/dags/ecs_sample.py for tasks to queue
[2023-08-23T00:00:33.590+0000] {logging_mixin.py:151} INFO - [2023-08-23T00:00:33.589+0000] {dagbag.py:539} INFO - Filling up the DagBag from /files/dags/ecs_sample.py
[2023-08-23T00:00:33.594+0000] {processor.py:830} INFO - DAG(s) dict_keys(['example_ecs']) retrieved from /files/dags/ecs_sample.py
[2023-08-23T00:00:33.608+0000] {logging_mixin.py:151} INFO - [2023-08-23T00:00:33.608+0000] {dag.py:2906} INFO - Sync 1 DAGs
[2023-08-23T00:00:33.619+0000] {logging_mixin.py:151} INFO - [2023-08-23T00:00:33.619+0000] {dag.py:3687} INFO - Setting next_dagrun for example_ecs to 2021-01-01T00:00:00+00:00, run_after=2021-01-01T00:00:00+00:00
[2023-08-23T00:00:33.629+0000] {processor.py:179} INFO - Processing /files/dags/ecs_sample.py took 0.044 seconds
[2023-08-23T00:01:04.029+0000] {processor.py:157} INFO - Started process (PID=57502) to work on /files/dags/ecs_sample.py
[2023-08-23T00:01:04.032+0000] {processor.py:820} INFO - Processing file /files/dags/ecs_sample.py for tasks to queue
[2023-08-23T00:01:04.033+0000] {logging_mixin.py:151} INFO - [2023-08-23T00:01:04.033+0000] {dagbag.py:539} INFO - Filling up the DagBag from /files/dags/ecs_sample.py
[2023-08-23T00:01:04.042+0000] {processor.py:830} INFO - DAG(s) dict_keys(['example_ecs']) retrieved from /files/dags/ecs_sample.py
[2023-08-23T00:01:04.058+0000] {logging_mixin.py:151} INFO - [2023-08-23T00:01:04.058+0000] {dag.py:2906} INFO - Sync 1 DAGs
[2023-08-23T00:01:04.070+0000] {logging_mixin.py:151} INFO - [2023-08-23T00:01:04.069+0000] {dag.py:3687} INFO - Setting next_dagrun for example_ecs to 2021-01-01T00:00:00+00:00, run_after=2021-01-01T00:00:00+00:00
[2023-08-23T00:01:04.080+0000] {processor.py:179} INFO - Processing /files/dags/ecs_sample.py took 0.053 seconds
```
-
Yes. The scheduler writes logs for every action it performs, and it scans and re-parses the files in your DAG folder regularly (strictly speaking, the dag file processor does the parsing). You can see all the scheduler parameters in https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#scheduler but the one that controls the frequency is https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#min-file-process-interval. There are multiple reasons why just "scanning once" is not enough: DAG files can change on disk, new files can appear, and dynamically generated DAGs can produce different results between parses, so Airflow has to re-parse regularly to pick up the changes.
Regarding the log entries: Airflow provides a basic configuration of loggers and handlers following the standard Python logging framework and assumes that it's up to your deployment to do something with the logs. For example, in our Helm chart there is a log rotation script that simply cleans up old logs. This is a pretty standard approach in modern applications: they just produce the logs, and whatever happens with them is up to the deployment. Generally, Airflow does not make an opinionated decision on whether to avoid similar duplicated lines; it just logs whatever happens and leaves the decision of what to do with it to you. So you can do whatever you want there, and you have two choices: raise the logging level for the noisy loggers, or plug in your own logging configuration (see the Advanced Logging docs referenced above) and filter or rotate the logs yourself. A sketch of the filtering option follows.
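For illustration, here is a minimal sketch of that second option: a custom logging config that drops the repetitive per-parse INFO lines while keeping everything else at INFO. The module name (log_config.py), the filter class, and the matched substrings are placeholders made up for this example; the general mechanism (copying DEFAULT_LOGGING_CONFIG and pointing the logging config option at your copy) is the documented Advanced Logging approach.

```python
# log_config.py — a sketch, not a drop-in solution.
# Activate with: AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS=log_config.LOGGING_CONFIG
# (the module must be importable, e.g. placed on PYTHONPATH).
import logging
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG


class DropDagParsingNoise(logging.Filter):
    """Drop the repetitive per-parse INFO lines; let everything else through."""

    # Substrings taken from the log excerpt above — adjust to your own logs.
    NOISY_SUBSTRINGS = (
        "Filling up the DagBag from",
        "retrieved from",
        "Setting next_dagrun",
    )

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        return not any(s in message for s in self.NOISY_SUBSTRINGS)


# Start from Airflow's default logging dict and graft the filter onto it.
LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
LOGGING_CONFIG.setdefault("filters", {})["drop_dag_parsing_noise"] = {
    "()": DropDagParsingNoise,  # standard dictConfig factory syntax
}

# Attach the filter to the handlers that emit the scheduler/processor lines.
for handler_name in ("console", "processor"):
    handler = LOGGING_CONFIG["handlers"].get(handler_name)
    if handler is not None:
        handler.setdefault("filters", []).append("drop_dag_parsing_noise")
```

Filtering this way keeps the rest of the scheduler output at INFO, unlike raising AIRFLOW__LOGGING__LOGGING_LEVEL to WARNING, which was the constraint in the original question.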
-
I see our posts with @Taragolis crossed - and we gave you similar answers :)