Replies: 6 comments
-
Thanks for opening your first issue here! Be sure to follow the issue template!
-
Yes, this would be a good feature! What most people seem to do right now as a workaround is to have a special "backfill dag" that does the batching. We (collectively) will need to spend some time designing an interface for this, and then likely raise it as an Airflow Improvement Proposal: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals. I'll happily help you with this process.
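For readers unfamiliar with the workaround, the batching such a "backfill dag" performs can be sketched in plain Python. The helper name `batch_windows` and the dates are invented for illustration; this is not an Airflow API:

```python
from datetime import datetime, timedelta

def batch_windows(start, end, batch):
    """Split a catch-up period into larger windows that a dedicated
    'backfill dag' could process, instead of one run per schedule
    interval. Purely illustrative sketch."""
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + batch, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows

# e.g. catch up 18 hours of data in 6-hour batches: 3 runs instead of 18
wins = batch_windows(datetime(2021, 1, 1, 0),
                     datetime(2021, 1, 1, 18),
                     timedelta(hours=6))
```

Each window would then be handed to a single task run, which is where the efficiency gain over per-interval backfill runs comes from.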
-
This sounds good! I wonder if we could mix it with triggering backfill externally #11302 |
-
Yeah absolutely. I'll raise an AIP for backfill improvements when I get the chance and we can discuss what the scope of that should be there. |
-
@jward-bw I did a small draft last week so feel free to use it: |
-
Very old. If this is still an issue, let's move it into a discussion. |
-
Description

Make it possible to merge multiple backfills into a single run, by extending the `start_date` of a single dagrun to cover a time period inclusive of all backfills.

Use case / motivation
There are cases where running multiple backfills is less efficient than having a single run, for example where tasks in successive runs would do duplicate work.
An example: `execution_date` and the `next_execution_date` macro.

At this point we have 18 hours of data to catch up on. Assuming the external issue has been fixed, this would take on average 12 hours to process, meaning further delays to processing future jobs. If instead we could merge these runs into a single backfill, this would reduce the processing time from 12 hours to something like 6 hours, greatly reducing the impact of delayed processing and also resource usage on Airflow and HBase (in this case, but in general other external services).
This issue of inefficient processing is one that I (and I'm sure others) have a need to solve. There are obviously other workarounds one could use, but I don't think they are in line with Airflow good practices. For example:
All of these have their own pitfalls and invariably involve some other manual intervention in Airflow to ensure the database is kept accurate and/or future runs aren't affected.
If there is some other solution to this problem that I am unaware of, please let me know. I have raised this as an RFC as any change that implements this feature would touch many areas of the code base, so would require some planning.