Retry for TaskGroup #21333

sartyukhov · 2022-02-04T18:10:10Z

sartyukhov
Feb 4, 2022

Description

Hello!

Previously, a SubDag was used to organize tasks into groups. Now you've introduced TaskGroups to the world.
It's nice and very clever. But it has one big disadvantage compared to SubDag: it can't be repeated.

Use case/motivation

For example:

In a project I have two tasks (A >> B):
A - collect data (PythonOperator)
B - update a materialized view in Postgres (PostgresOperator)

'A' could collect only part of the data and mark itself as failed (there is no "half-failed" status as far as I know). But task 'B' should run regardless of A's result (trigger_rule="all_done", for example) to update the matview with the partial data.
In about an hour I would like to repeat that process (A >> B).

With SubDag I could do that:

initiate the SubDag with parameter retries=10
add DummyTask 'C' with trigger_rule="all_success"
change the flow to A >> B >> C and A >> C

and that's it: C marks the SubDag as failed and triggers it to retry.
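The flow described above can be modelled in plain Python. This is only a sketch of the intended behaviour, not Airflow code: `run_group_once`, `run_group_with_retries`, and the two callables are hypothetical names, and task A is assumed to report failure by returning False.

```python
from typing import Callable, Tuple


def run_group_once(collect: Callable[[], bool],
                   update_view: Callable[[], None]) -> bool:
    """One pass over A >> B >> C with A >> C (hypothetical model)."""
    # Task A: may collect only part of the data and report failure.
    a_ok = collect()
    # Task B behaves like trigger_rule="all_done": it runs whatever A did.
    try:
        update_view()
        b_ok = True
    except Exception:
        b_ok = False
    # Task C behaves like trigger_rule="all_success": it succeeds only
    # if both A and B succeeded, so its state stands for the whole group.
    return a_ok and b_ok


def run_group_with_retries(collect: Callable[[], bool],
                           update_view: Callable[[], None],
                           retries: int) -> Tuple[bool, int]:
    """Repeat the whole group until C succeeds or retries run out."""
    passes = 0
    for _ in range(retries + 1):
        passes += 1
        if run_group_once(collect, update_view):
            return True, passes
    return False, passes
```

Note that in this model the view is refreshed on every pass, including the passes where A fails, which is the partial-update behaviour asked for.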

But TaskGroup does not have a retry parameter.
I also can't retry the whole DAG, because it's big.
I also don't want to update the materialized view inside task 'A', because that way I can't do [A0, A1..An] >> B (update the materialized view just once for several collects).

I hope it's possible. Or maybe it could be done some other way.
Thanks in advance.

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

potiuk · 2022-02-04T19:06:51Z

potiuk
Feb 4, 2022
Collaborator

This is a discussion, not an issue - please open a "GitHub Discussion" next time. I hope someone who has experience with operating Airflow will help you with answers.

2 replies

sartyukhov Feb 5, 2022
Author

Issue description says

Airflow feature request
Suggest an idea for this project

I suggested an idea: add retry to TaskGroup. What's wrong?

potiuk Feb 5, 2022
Collaborator

There is nothing wrong. A discussion is not "worse" - it's just "let's discuss if this is a feature, if it is missing, or maybe can be done differently". That's what "discussions" are about.

You are unsure yourself if this is something missing or maybe it can already be done. I do not know either. But maybe there are people that will suggest some solutions for that. If it turns out that it is possible, you might learn it here. If some other people have experience with it - they might share it here.

Be patient, and ping if you do not see other people discussing and explaining how they have done it.

BTW, creating a feature issue does not mean much. Almost nothing, actually, and unless it is very clear that it is needed, it's mostly noise. That's why we turn such "maybe features" into discussions - to distinguish the "real features that we are sure we want" from "maybe, maybe, we are not sure if this is a feature at all".

Airflow is created by almost 2000 contributors. Feature requests like this will usually only be implemented if there are people convinced that things should be done. So no matter what, in the case of such a feature you would have to convince others that it is needed (or most likely propose a PR yourself if you are interested in getting it - and then it would have to pass review and approval).

Discussion is a great way to see:

if there is interest at all
if there is someone from the committers who might be interested in implementing it
or maybe whether you yourself will have to take the lead and implement it if no-one else does (but checking via discussion might be a great way to avoid unnecessary work if it turns out to be not needed or not welcomed by the community at all).

sartyukhov · 2022-02-14T07:58:56Z

sartyukhov
Feb 14, 2022
Author

Soo... no ideas? Am I doing it right?

1 reply

potiuk Feb 14, 2022
Collaborator

Maybe no-one is interested. This happens. You can send a message to the devlist if you want to discuss. Or propose a PR directly (discussion over code is best). The way it works is that you either get someone interested enough to implement it or (best and fastest) implement it on your own and contribute.

victorfuzaro · 2022-02-25T17:06:55Z

victorfuzaro
Feb 25, 2022

+1 this feature would be very useful.

0 replies

turbaszek · 2022-02-26T18:43:17Z

turbaszek
Feb 26, 2022
Collaborator

Why not use retries on the task that you expect may fail? I think retries at the task level are best in terms of flow control: retry A until it succeeds, retry B until it succeeds and so on. It's probably also cheapest in terms of the amount of work that is executed on retry.

4 replies

victorfuzaro Feb 28, 2022

I have a specific use case where this feature would be useful. It is like:

There is a task to do one thing
There is a second task (which depends on the first one) that does another thing; if this one fails, I'll need to re-run the entire dag. I can't do both processes in the same task due to some limitations (I work with different Java drivers in each one), and retrying the same task doesn't solve the problem, because the result of this task determines whether or not the first dag needs a re-execution.

Clearing the previous task(s) also isn't good because it'll cause an infinite loop until everything succeeds, which is not exactly good; at least for me, I would need only some 3-5 retries before it stays in a failed state.

My workaround for this was creating a dag that triggers this dag, so if the triggered dag's state is failed, it re-executes it the number of times I set. However, as you can see, this requires creating 2 dags to solve the problem.

potiuk Feb 28, 2022
Collaborator

That's an interesting one. I will turn it back into a feature. I think it has value in the scenario you described @victorfuzaro and maybe someone will be interested in picking that one up (or maybe you could take a stab at it :) )

sartyukhov Feb 28, 2022
Author

@turbaszek
Hi!

I described why it's bad to retry the whole dag in the original post (because A -> B is just a small part of it).
Retry A till success? But I need to make a partial update (run B) after it.

turbaszek Mar 11, 2022
Collaborator

@victorfuzaro that's a really interesting problem, thanks for describing the details 👍

turbaszek · 2022-03-11T22:46:34Z

turbaszek
Mar 11, 2022
Collaborator

@victorfuzaro @sartyukhov I'm wondering how you expect TaskGroup retries to work. Let's say:

# Hypothetical API: TaskGroup does not accept `retries` today.
with TaskGroup(..., retries=4) as tg:
    A = DummyOperator(task_id="A", retries=3)
    B = DummyOperator(task_id="B")
    A >> B

The simplest scenario, where A succeeds (on the first or a later try) and B fails, would be handled by retrying the whole tg. However, if A fails all 3 retries - what do we do?

a) Should we allow 3 retries of A and then retry the whole tg? (task.retries = task.retries + tg.retries)
b) Should we ignore task-level retries if the tg has non-default retries? (Possibly raise a warning/exception when using retries within a tg)

Probably a) sounds reasonable, but it may affect the db state of TaskInstances because we will end up with try_number bigger than max_tries.
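Option a) can be read in at least two ways, and they give different worst-case counts for A. The sketch below just does the arithmetic for both readings; the function names are hypothetical and nothing here reflects how Airflow actually counts tries.

```python
def additive_attempts(task_retries: int, group_retries: int) -> int:
    """Reading 1: the group's retries are added to the task's own budget,
    as in task.retries = task.retries + tg.retries."""
    return 1 + task_retries + group_retries


def reset_per_pass_attempts(task_retries: int, group_retries: int) -> int:
    """Reading 2: every pass over the group gives the task a fresh budget."""
    return (1 + task_retries) * (1 + group_retries)


# With retries=3 on A and retries=4 on the group:
print(additive_attempts(3, 4))        # 8 attempts in the worst case
print(reset_per_pass_attempts(3, 4))  # 20 attempts in the worst case
```

Under either reading A can run more often than the 1 + 3 attempts its own retries=3 allows, which is the try_number versus max_tries concern raised above.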

@kaxil @XD-DENG @potiuk what do you think?

4 replies

sartyukhov Mar 12, 2022
Author

In my mind it should be as transparent as possible.
In your example it looks like A can retry a maximum of 12 times.

victorfuzaro Mar 12, 2022

I think your example is pretty good. In my opinion we should keep everything as it is now, therefore allowing tg retries + task retries. I believe that ignoring task retries probably would be a problem for some people, so let the users choose what is better for them :)

sartyukhov Mar 12, 2022
Author

@turbaszek do you have any ideas on how to add this functionality to the TaskGroup?

turbaszek Mar 13, 2022
Collaborator

@sartyukhov I think the process should be handled in TaskInstance.handle_failure. If the task failed and is not eligible to retry (not is_eligible_to_retry) but belongs to a task group (we should be able to access it via self.task.task_group, where self is a TaskInstance), then we should somehow enqueue again all upstream tasks in this task group. I say "somehow" because I'm not sure how we should update the database state.
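The "enqueue again all upstream tasks in this task group" step could look roughly like the traversal below. It is a sketch over plain dicts and sets, not Airflow's TaskInstance or TaskGroup objects; `tasks_to_rerun` and its arguments are hypothetical, and the open question of how to update the database state is left out entirely.

```python
from typing import Dict, Set


def tasks_to_rerun(failed: str,
                   upstream: Dict[str, Set[str]],
                   group: Set[str]) -> Set[str]:
    """Return the failed task plus every task upstream of it that
    belongs to the same group. Tasks outside the group are never
    followed, so the walk stops at the group boundary."""
    selected = {failed}
    stack = [failed]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, set()):
            if parent in group and parent not in selected:
                selected.add(parent)
                stack.append(parent)
    return selected


# "start" feeds A from outside the group; A >> B inside the group.
upstream_map = {"A": {"start"}, "B": {"A"}}
print(sorted(tasks_to_rerun("B", upstream_map, {"A", "B"})))  # ['A', 'B']
```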

maxyousif15 · 2022-06-17T10:00:26Z

maxyousif15
Jun 17, 2022

Any news on this?

Would really love to see this. I have a peculiar use case for repeating entire TaskGroups.

I run a lot of scraping pipelines. Those are typically followed by parsing and translation steps, which fit the retrieved data into our own internal schema. Occasionally, the data itself is corrupted because the service is down intermittently. As a result, all downstream tasks should fail. However, we do not validate the data at the scraping step, because we essentially collect all the content into a WARC (internet archive) file so that we can validate the data downstream. In cases like these, the retry should retry the scraping as well, even though technically the scraping task was deemed a success.

There are two options here:

1. Expensive option - change the entire architecture of scraping so that we validate data immediately after scraping.
2. Allow retry on task groups, so that if the parsing or translating fails, we re-scrape before attempting to parse and translate again.

Option 2 is what I believe is being described above, and is the preferred option.

1 reply

potiuk Jun 17, 2022
Collaborator

The news is that no-one is working on it. But if you would like to make a PR, feel free - Airflow has more than 2000 contributors, so anyone might do it. If you do not feel able to implement it, you can at least start advocating, building consensus that Task Groups should be converted from a UI-only construct into something more, and possibly finding someone who would like to implement it. One thing I can suggest is to start a discussion on the devlist - this is a big and serious enough change that it should be discussed there IMHO.

This is how Open-Source works - if there is no-one leading a change, the change will not happen. Often the people who lead the change also implement it, but that's not necessary; however, actively leading a change is an absolute requirement for it to happen.

The devlist is here: https://lists.apache.org/[email protected] . You can see information on how to join it here: https://airflow.apache.org/community/ - if you really want to get it done, leading the change and finding support and consensus there is the fastest way to make the change happen.
