Replies: 12 comments
-
I find no value in the current SLA feature. I don't know many people who are using it, due to its limitations.
-
Yeah. I think we should have a more comprehensive/better/complete/usable SLA implementation replacing the current one.
-
Seems like this could use the new triggerer service. Essentially, it is like branching out and having a suspended task with a trigger that activates at the deadline.
-
Hi @malthe, could you elaborate a bit more? What's the "triggerer service" and how can it be used to warn about the deadline?
-
@yuqian90 what I'm referring to are the new deferrable operators. There's a framework in there which allows us to set up future actions such as reacting to a "missed deadline". It might need a little reworking in order to implement SLAs but I think it's pretty close since you could also just branch out and use the new DateTimeSensorAsync. Note that this framework is available only from Airflow 2.2 onwards.
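To make the "trigger that activates at the deadline" idea a bit more concrete, here is a rough sketch in plain asyncio. It is an analogy only and does not use Airflow's triggerer or deferrable operator API. The names `run_with_deadline`, `slow_task` and `deadline_miss_callback` are made up for illustration. A watcher sleeps until the deadline and calls the callback if the work is still running, without cancelling it:

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_deadline(
    task: Awaitable[str],
    seconds_until_deadline: float,
    deadline_miss_callback: Callable[[], None],
) -> str:
    """Run `task`; call the callback at the deadline if it is still running."""
    work = asyncio.ensure_future(task)

    async def watcher() -> None:
        # Sleep until the deadline, then notify only if the work is unfinished.
        await asyncio.sleep(seconds_until_deadline)
        if not work.done():
            deadline_miss_callback()

    watch = asyncio.ensure_future(watcher())
    try:
        # The work itself is never cancelled: a missed deadline only notifies.
        return await work
    finally:
        # Stop the watcher once the work has finished (no-op if it already ran).
        watch.cancel()


async def slow_task() -> str:
    # Hypothetical stand-in for a task that overruns its deadline.
    await asyncio.sleep(0.2)
    return "done"


def demo() -> tuple[str, list[str]]:
    events: list[str] = []
    result = asyncio.run(
        run_with_deadline(slow_task(), 0.05, lambda: events.append("deadline missed"))
    )
    return result, events
```

The point of the sketch is that the callback fires at the deadline while the task is still running, which is the behaviour the current SLA check reportedly lacks.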
-
Yeah, this feels like a good use for async sensors indeed. I think it's even possible to do it with the old sensors (with obvious resource issues, but possible if you have practically unlimited worker resources).
-
Maybe I am not understanding the ticket, but I don't find this to be true:
if I have a DAG that should be hourly and, for example, I set a 1-hour SLA, I will get an SLA miss email if a single DAG run is still running past 1 hour. If a DAG finished and it failed it will use I do agree that there should be better control for task-specific deadlines, but this can also be accomplished partially today by putting the part of the DAG that needs a deadline by itself in a separate DAG with an SLA in place, and then the remainder of a DAG will be in a 2nd task and use
-
I tried to use
-
I believe the intention of using ExternalTaskSensor with a deadline planned is to set a timeout and have no retries, so if it doesn't complete by the timeout it fails rather than running forever.
-
Also as of Airflow 2.2.0 I think sensors will no longer retry if the sensor times out: #12058
-
Thank you for the information.
-
I think this is a wider effort: improving SLAs. We know the current SLAs are not really useful, and there should be an effort to redefine SLA behaviour basically from scratch. However, it requires a devlist discussion and an AIP added here, as this is an important, big feature to add to Airflow. https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals Converting it to a discussion, but it would be great if someone starts an effort around creating a "New SLA AIP".
-
Description
We often need to be notified if a task is not finished by a certain deadline. This may sound very similar to the existing SLA concept, but unfortunately, the SLA implementation is not useful for such cases for a few reasons:
sla_miss_callback is not fired.
execution_date. But we may have deadlines specified in various timezones that are difficult to define as a simple timedelta relative to the execution_date.
Some DAGs are triggered externally, meaning they don't have a fixed schedule or fixed start time, so dag.following_schedule(), and thus SLA, does not work.
Given all these shortcomings of SLA, I'm proposing to create a new task-level concept called deadline and its corresponding deadline_miss_callback, where deadline is defined as a jinja-template str that can be converted to a pendulum.DateTime object, and deadline_miss_callback is a callable to be called if the task is not finished by the given deadline.
The alternative is to revamp SLA to become a timezone-aware datetime object rather than a timedelta, and to make sure sla_miss_callback is called at the deadline rather than after the task finishes. These two changes may make the SLA concept very different from what it currently is.
Use case/motivation
For example, given the simple DAG shown below, we need to know at 20210904 05:00 America/New_York if the generate_model task of the 20210903 DagRun is not yet finished. So if either download_file or generate_model takes too long, causing generate_model not to be finished by this deadline, the users should be notified.
Related issues
I see a few attempts to revamp/improve SLA. There's some overlap, but none of them does exactly what's needed here.
#12008
#16389
#8545
Are you willing to submit a PR?
Code of Conduct
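To illustrate the use case above (a 05:00 America/New_York deadline on the day after the run date), here is a small standard-library sketch of why a timezone-aware deadline is hard to express as a fixed timedelta. It uses `zoneinfo` rather than pendulum, and the helpers `deadline_for` and `deadline_missed` are hypothetical names, not part of Airflow:

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def deadline_for(run_date: date, days_after: int, local_time: time, tz_name: str) -> datetime:
    """Return the timezone-aware deadline for the run dated `run_date`."""
    local_day = run_date + timedelta(days=days_after)
    return datetime.combine(local_day, local_time, tzinfo=ZoneInfo(tz_name))


def deadline_missed(deadline: datetime, now: datetime, finished: bool) -> bool:
    """A deadline is missed once `now` reaches it and the task is unfinished."""
    return not finished and now >= deadline


# Same local deadline (05:00 New York, the day after the run) in summer and winter.
summer = deadline_for(date(2021, 9, 3), 1, time(5, 0), "America/New_York")
winter = deadline_for(date(2021, 12, 3), 1, time(5, 0), "America/New_York")

print(summer.astimezone(timezone.utc))  # 2021-09-04 09:00:00+00:00
print(winter.astimezone(timezone.utc))  # 2021-12-04 10:00:00+00:00
```

The UTC offset of the same local deadline shifts by an hour across daylight saving time, so a single timedelta relative to a UTC execution_date cannot represent it for the whole year.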