vine: separate blocked and ready tasks #4275

JinZhou5042 · 2025-11-06T21:28:04Z

Proposed Changes

A new data structure struct list *blocked_tasks is added to track tasks that cannot be run.

We rotate the list a bounded number of steps per scheduling pass, by default q->scheduling_depth (100), to give all blocked tasks an equal chance of being reconsidered. Tasks found eligible are moved into the priority queue for fast dispatch.
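As a rough illustration of the rotation described above, here is a minimal sketch. It does not use the TaskVine data structures: struct task, struct ring, and rotate_blocked are hypothetical names, the eligible flag stands in for the real eligibility check, and the ready side is shown as a plain FIFO rather than a priority queue.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 64

/* Hypothetical task record; the real task structure has many more fields. */
struct task {
    int id;
    int eligible; /* stand-in for "inputs present and a suitable worker exists" */
};

/* Fixed-capacity FIFO ring, used here for both the blocked and ready sides. */
struct ring {
    struct task *slot[MAX_TASKS];
    size_t head;
    size_t count;
};

static void ring_push_tail(struct ring *r, struct task *t)
{
    assert(r->count < MAX_TASKS);
    r->slot[(r->head + r->count) % MAX_TASKS] = t;
    r->count++;
}

static struct task *ring_pop_head(struct ring *r)
{
    if (r->count == 0)
        return NULL;
    struct task *t = r->slot[r->head];
    r->head = (r->head + 1) % MAX_TASKS;
    r->count--;
    return t;
}

/* Examine at most `depth` blocked tasks, and never the same task twice in
 * one pass. Eligible tasks move to `ready`; the others go back to the tail
 * of `blocked`. Returns the number of promoted tasks. */
static int rotate_blocked(struct ring *blocked, struct ring *ready, int depth)
{
    int promoted = 0;
    size_t limit = blocked->count;

    for (int i = 0; i < depth && limit > 0; i++, limit--) {
        struct task *t = ring_pop_head(blocked);
        if (t->eligible) {
            ring_push_tail(ready, t);
            promoted++;
        } else {
            ring_push_tail(blocked, t);
        }
    }
    return promoted;
}
```

Because unexamined tasks stay at the head and examined ones go to the tail, the next pass resumes where this one stopped, which is what gives every blocked task a turn.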

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

make test: Run local tests prior to pushing.
make format: Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python).
make lint: Run lint on source code prior to pushing.
Manual Update: Update the manual to reflect user-visible changes.
Type Labels: Select a github label for the type: bugfix, enhancement, etc.
Product Labels: Select a github label for the product: TaskVine, Makeflow, etc.
PR RTM: Mark your PR as ready to merge.

btovar · 2025-11-10T15:36:32Z

Can we implement this with the changes to the priority queue? That is, give blocked tasks a small priority and not change anything else in the code?

JinZhou5042 · 2025-11-10T17:03:02Z

I agree that implementing the whole thing with flexible priorities may mitigate the problem, but I don't see it as a solid abstraction for a long-term solution.

For ineligible tasks, demoting them to the blocked list or lowering their priority serves the same purpose, which is to delay execution and reconsider them later. What's elegant about demoting is that it gives us distinct states to track tasks, which allows us to decide which operations are appropriate for each state.

For ready tasks, we simply check whether they are runnable and whether a suitable worker exists. So the only operation needed is to keep peeking at the top of the queue until it becomes empty, which keeps it nimble and neat.

For blocked tasks, we check not only whether they are runnable but also why they can't be run. For example, whether their resource requirements exceed what any worker can provide, or whether they have fixed-location constraints. These checks can be expensive, so instead of performing them on all tasks by iterating over the priority queue as a list, maintaining a separate blocked list keeps things simple and lightweight.

As it stands, even if recovery tasks are removed in the near future, there are still other cases where a task can become ineligible and should be delayed, and for those I think adding a blocked list is more reasonable.

btovar · 2025-11-10T17:20:52Z

If a task has a 'blocked' priority, we can still check why it is blocked. It is not so much about efficiency, as I think both methods cost about the same, but rather about reducing the complexity of the code. In the future, every time we need to update the ready task queue we will have to do the same for the blocked task queue, and I would rather not increase the state that we have to keep.

JinZhou5042 · 2025-11-10T17:54:34Z

I might be wrong, but I think adding the blocked queue actually reduces the complexity of the code and makes task handling more maintainable.

On the one hand, distinguishing the two states allows us to apply different operations to each without mixing them up.

On the other hand, lowering the priority doesn't solve the fundamental problem: the blocked tasks are still left in the scheduling queue and might be reconsidered several times, and their scheduling frequency and order may drift beyond our control over time. Meanwhile, it brings two other problems: 1) tasks with positive and negative priorities have to be treated differently, and 2) it is unclear to what extent the priority value should be lowered.

If a future change is needed to the ready queue, we don't need to apply the same thing to the blocked queue, because blocked tasks will be rotated to the ready queue and handled there sooner or later.

btovar · 2025-11-10T21:21:11Z

The blocked queue is simply implementing lowest priority?

As the recent interaction with parsl+wq showed, priority is important for users, and we should have a system that works well with resetting the priority queue when resources become available (e.g. a worker joins or a task result is returned). Given this, if we assign the lowest priority to "blocked" tasks, they will get checked only when there is nothing else to do.

The blocked list helps when priority is not important so that we can rotate the priority queue. However, the operation of taskvine itself now depends on priorities (resource exhaustion and recovery tasks).

Thus, if we add a blocked list we should remove priorities from taskvine, as otherwise it is very hard to predict what taskvine is doing. Since priorities are important to users, I don't think this is the way we want to go.
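For comparison, the single-queue alternative discussed in this thread can be sketched as an ordering rule. This is only an illustration under stated assumptions: struct task, its blocked flag, and compare_tasks are hypothetical, and treating blocked tasks as a separate tier below every unblocked task is one possible reading of "lowest priority", not something the thread settles on.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical task record for illustration only. */
struct task {
    int id;
    double priority; /* user-assigned priority; higher runs first */
    int blocked;     /* hypothetical flag set when the task was found ineligible */
};

/* Unblocked tasks sort before blocked ones regardless of their priority
 * value. Within a tier, higher priority sorts first. */
static int compare_tasks(const void *pa, const void *pb)
{
    const struct task *a = pa;
    const struct task *b = pb;
    int tier_a = (a->blocked != 0);
    int tier_b = (b->blocked != 0);

    if (tier_a != tier_b)
        return tier_a - tier_b;
    if (a->priority > b->priority)
        return -1;
    if (a->priority < b->priority)
        return 1;
    return 0;
}

/* Sort an array of tasks into dispatch order. */
static void sort_queue(struct task *tasks, size_t n)
{
    assert(tasks || n == 0);
    qsort(tasks, n, sizeof(*tasks), compare_tasks);
}
```

Using a tier flag instead of subtracting from the priority value sidesteps the question of how far to lower it and works for negative priorities, at the cost of one more field per task.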

btovar · 2025-11-10T21:23:42Z

And also, I guess we are talking about different kinds of complexity... From my perspective, adding a second data structure increases the complexity of the code, as we need to make sure to maintain it correctly across all possible state transitions. We may be able to do it today while it is fresh in our minds, but not in a year when we have forgotten what the code was doing.

JinZhou5042 · 2025-11-11T14:10:18Z

That's a fair argument: adding an additional data structure would increase the maintenance complexity in the future... I think we can go with that. Then, if we lower the priority of tasks when they are found to be ineligible, the iterating cursor doesn't appear to be helpful. Do we want to get rid of that feature and stick with peeking at the top?

btovar · 2025-11-11T15:08:41Z

What if we add the blocked list but only for logical blocks? If an input file is not there, then the task is put in the blocked list.
The only responsibility of rotate_blocked_tasks is to see if the task can be moved to the ready list. Tasks without libraries, or with missing fixed inputs, should not be put in the blocked list; they should be returned immediately instead. Tasks without resources should not be put in the blocked list either, because when resources become available, tasks with lower priority may be scheduled instead.

If this makes sense, then the changes in this PR should be reduced, e.g., the expiring_tasks function should not be removed or changed. You may want to add different return values to consider_task with an enum, so you know whether it was a logical block, resources, etc. that prevented the task from being scheduled.
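A minimal sketch of what such an enum might look like follows. All names here (consider_result_t, struct task_view, consider_task_sketch) are hypothetical and chosen for illustration; the existing consider_task has a different signature, and the real checks involve much more than these few fields.

```c
#include <assert.h>

/* Hypothetical result codes; not taken from the TaskVine source. */
typedef enum {
    CONSIDER_OK = 0,          /* task can be dispatched now */
    CONSIDER_BLOCKED_LOGICAL, /* e.g. an input is not there yet: blocked list */
    CONSIDER_NO_RESOURCES,    /* stays in the ready queue, keeps its priority */
    CONSIDER_UNSATISFIABLE    /* missing library or fixed input: return to user */
} consider_result_t;

/* Hypothetical, reduced view of a task's scheduling-relevant state. */
struct task_view {
    int inputs_pending;   /* inputs that still have to be produced */
    int library_missing;  /* required library is not available */
    int fixed_input_lost; /* a fixed-location input can no longer be found */
    int cores_needed;
};

/* Classify why a task can or cannot be scheduled right now. Conditions that
 * can never resolve on their own are checked first, so such tasks are
 * returned to the user instead of waiting in a list. */
static consider_result_t consider_task_sketch(const struct task_view *t, int cores_available)
{
    assert(t);
    if (t->library_missing || t->fixed_input_lost)
        return CONSIDER_UNSATISFIABLE;
    if (t->inputs_pending > 0)
        return CONSIDER_BLOCKED_LOGICAL;
    if (t->cores_needed > cores_available)
        return CONSIDER_NO_RESOURCES;
    return CONSIDER_OK;
}
```

With distinct codes, the caller can route only CONSIDER_BLOCKED_LOGICAL tasks to the blocked list and leave resource-starved tasks in the priority queue.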

JinZhou5042 · 2025-11-11T15:26:25Z

I think this is a good approach, and I definitely understand that there are too many things going on in the scheduling function, so we need to be very circumspect! Returning tasks with missing inputs or a missing library seems to be the direction we'll have to take sooner or later. Given that the implementation requires adhering to the overall architecture and the coding philosophy behind it, I'm afraid that I couldn't get it right. How about you check it out later? That way we might have a clearer idea of what to do here!

btovar · 2025-11-11T15:43:04Z

Sounds good! Let me make a sketch to see if it does what you need!

dthain · 2025-11-11T18:14:05Z

Jin, I would like for you to keep the responsibility of implementing this. You and Ben are having a constructive discussion about how it should work. I want you guys to come to a shared understanding, and then Jin should implement it.

btovar · 2025-11-11T20:35:38Z

Got it!

JinZhou5042 · 2025-11-12T07:46:19Z

To start the ball rolling, this is the new runtime model I'm sketching out:

We have three participants:

Executor: The end user, who could be a pure TaskVine client or a higher-level graph executor.
Ready Queue: Stores runnable tasks, whose runtime requirements have all been satisfied, and for which at least one suitable worker is available. Tasks are maintained and considered for eligibility strictly according to their priority values.
Blocked List: Stores unrunnable tasks constrained either by a missing input or by the absence of any suitable worker. Tasks are maintained in a rotating list, and are popped one by one to verify both their expiry conditions and eligibility.

And four actions:

Submit: A user can submit a task directly.
Demote: A ready task fails to be dispatched to a worker, for whatever reason.
Promote: A blocked task becomes eligible when all requirements are met.
Expire: A blocked task expires if certain conditions are met, such as the absence of a library, a lost input file, or a missing fixed-location requirement. Expired tasks are then returned to the user.
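The four actions above can be read as a small state machine. The sketch below is one hedged interpretation: the state and action names are hypothetical, and it assumes that a submitted task enters the ready queue first, which the list above does not state explicitly.

```c
#include <assert.h>

/* Hypothetical states and actions for illustration only. */
typedef enum { ST_NEW, ST_READY, ST_BLOCKED, ST_RETURNED } task_state_t;
typedef enum { ACT_SUBMIT, ACT_DEMOTE, ACT_PROMOTE, ACT_EXPIRE } task_action_t;

/* Apply `act` to a task in state `from`. On a legal transition, write the
 * new state to *to and return 1. Otherwise leave *to untouched and return 0. */
static int apply_action(task_state_t from, task_action_t act, task_state_t *to)
{
    assert(to);
    switch (act) {
    case ACT_SUBMIT: /* assumption: new tasks start in the ready queue */
        if (from != ST_NEW)
            return 0;
        *to = ST_READY;
        return 1;
    case ACT_DEMOTE: /* ready task could not be dispatched */
        if (from != ST_READY)
            return 0;
        *to = ST_BLOCKED;
        return 1;
    case ACT_PROMOTE: /* blocked task became eligible */
        if (from != ST_BLOCKED)
            return 0;
        *to = ST_READY;
        return 1;
    case ACT_EXPIRE: /* blocked task is handed back to the user */
        if (from != ST_BLOCKED)
            return 0;
        *to = ST_RETURNED;
        return 1;
    }
    return 0;
}
```

Keeping every transition in one function makes it easier to see that only blocked tasks can expire and that the ready queue never has to reason about expiry.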

Some upsides of this design:

It's easy to implement the idea of returning tasks with missing inputs or a missing library to the user via the blocked queue. This also extends to other cases where we want to expire tasks under certain conditions, such as an exceeded start/end time.
Each component performs a specific function: the ready queue relentlessly dispatches tasks from the top of the queue, while the blocked queue handles task ineligibility and makes rational decisions.
There is no need to maintain cumbersome or confusing cursors, or to reset them for enumeration purposes.

Though I definitely have no objection if we decide to use the one-queue approach, the added complexity and future maintenance issues also need to be considered.

btovar · 2025-11-12T14:35:15Z

Jin, before you continue adding to this PR, please split it into smaller chunks. For example, the first chunk should only add the blocked list, with no other changes to the expiring list function. Since this is a structural change, let's split it into changes that we can more easily evaluate and think about.

JinZhou5042 · 2025-11-12T17:57:15Z

Sounds great!

JinZhou5042 and others added 26 commits May 1, 2025 18:20

vine: separate pending and ready tasks

f6e2bc8

rename func name

9abf3e5

aggressively send tasks

89fd467

just send one task

fbcb4f4

Merge branch 'cooperative-computing-lab:master' into separate_pending…

a47e419

…_and_ready_tasks

use vine_schedule_have_committable_resources

8f234f3

static

2333d65

vine_schedule_rotate_pending

904b08b

revert

96162e9

adjust location

de0d70d

fix int done

aff40f7

Merge branch 'cooperative-computing-lab:master' into separate_pending…

8961a15

…_and_ready_tasks

Merge branch 'cooperative-computing-lab:master' into separate_pending…

00c7300

…_and_ready_tasks

Merge branch 'cooperative-computing-lab:master' into separate_pending…

b269fa3

…_and_ready_tasks

vine_schedule_count_committable_cores

7f589f2

Merge branch 'cooperative-computing-lab:master' into separate_pending…

84c135c

…_and_ready_tasks

Merge branch 'cooperative-computing-lab:master' into separate_pending…

f944fc6

…_and_ready_tasks

merge

a217aa8

Merge remote-tracking branch 'origin/separate_pending_and_ready_tasks…

ed57fe0

…' into separate_pending_and_ready_tasks

Merge branch 'cooperative-computing-lab:master' into separate_pending…

8b9c37c

…_and_ready_tasks

Merge branch 'cooperative-computing-lab:master' into separate_pending…

c9ccf6a

…_and_ready_tasks

Merge branch 'cooperative-computing-lab:master' into separate_pending…

5d00612

…_and_ready_tasks

calculate time spent on scheduling

1575045

gpus

c107a9c

Merge branch 'cooperative-computing-lab:master' into separate_pending…

0345259

…_and_ready_tasks

update

8b8f392

JinZhou5042 changed the title ~~Separate blocked and ready tasks~~ vine: separate blocked and ready tasks Nov 6, 2025

JinZhou5042 self-assigned this Nov 6, 2025

Jin Zhou added 2 commits November 6, 2025 16:29

rename

00f68b2

rename

4631629

Jin Zhou added 7 commits November 6, 2025 22:20

no reset

5bd9eb4

no gpu count

e5f1709

revert vine_schedule_check_for_large_tasks

9a714b9

no skipped_tasks

671e14b

check large tasks on blocked_list

168f53f

check task and worker

0728453

support task group

1b6483e

btovar mentioned this pull request Nov 11, 2025

Blocked list #4280

Closed

7 tasks

Jin Zhou added 2 commits November 12, 2025 01:25

clean up pq cursors

ce15f7a

remove nothing_happened_in_the_last_cycle

5b2b5d3

Jin Zhou added 4 commits November 12, 2025 03:14

fix file missing

dc84ca4

fix comment

483670f

bug fix

8a473d8

remove file_neds_recovery

9f1062c

rename var

c7e6ac4
