@JinZhou5042
Member

Proposed Changes

A new data structure struct list *blocked_tasks is added to track tasks that cannot be run.

We rotate the list a fixed number of times, by default q->scheduling_depth (100), to give all tasks an equal chance of being reconsidered. The priority queue is populated only with eligible tasks, for fast dispatch.
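A minimal sketch of the bounded rotation described above, not the code in this PR. Every name here (`sketch_task`, `sketch_ring`, `rotate_blocked`) is hypothetical, a small ring buffer stands in for `struct list`, and the ready side is also a ring rather than a real priority queue, purely to keep the example self-contained.

```c
#include <stddef.h>

/* Hypothetical, simplified task record; not the TaskVine task struct. */
struct sketch_task {
    int id;
    int eligible; /* 1 once whatever blocked the task has cleared */
};

/* Fixed-capacity ring buffer standing in for struct list *blocked_tasks. */
#define SKETCH_CAP 256
struct sketch_ring {
    struct sketch_task *items[SKETCH_CAP];
    size_t head;
    size_t count;
};

static int ring_push_tail(struct sketch_ring *r, struct sketch_task *t)
{
    if (r->count == SKETCH_CAP)
        return 0;
    r->items[(r->head + r->count) % SKETCH_CAP] = t;
    r->count++;
    return 1;
}

static struct sketch_task *ring_pop_head(struct sketch_ring *r)
{
    if (r->count == 0)
        return NULL;
    struct sketch_task *t = r->items[r->head];
    r->head = (r->head + 1) % SKETCH_CAP;
    r->count--;
    return t;
}

/* Examine at most `depth` blocked tasks. Eligible tasks move to `ready`;
 * the rest go back to the tail so each one is eventually revisited.
 * Returns the number of tasks promoted. */
static int rotate_blocked(struct sketch_ring *blocked, struct sketch_ring *ready, int depth)
{
    int promoted = 0;
    size_t limit = blocked->count; /* never visit a task twice in one pass */
    for (int i = 0; i < depth && (size_t)i < limit; i++) {
        struct sketch_task *t = ring_pop_head(blocked);
        if (t->eligible && ring_push_tail(ready, t)) {
            promoted++;
        } else {
            ring_push_tail(blocked, t);
        }
    }
    return promoted;
}
```

Capping the loop at `depth` bounds the work done per scheduling pass, while re-queuing at the tail is what gives every blocked task the same chance of being reconsidered.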

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update: Update the manual to reflect user-visible changes.
  • Type Labels: Select a github label for the type: bugfix, enhancement, etc.
  • Product Labels: Select a github label for the product: TaskVine, Makeflow, etc.
  • PR RTM: Mark your PR as ready to merge.

JinZhou5042 and others added 26 commits May 1, 2025 18:20
@JinZhou5042 JinZhou5042 changed the title from "Separate blocked and ready tasks" to "vine: separate blocked and ready tasks" Nov 6, 2025
@JinZhou5042 JinZhou5042 self-assigned this Nov 6, 2025
@btovar
Member

btovar commented Nov 10, 2025

Can we implement this with the changes to the priority queue? That is, give blocked tasks a small priority and not change anything else in the code?

@JinZhou5042
Member Author

I agree that implementing the whole thing with flexible priorities may mitigate the problem, but I don't see it as a solid abstraction for a long-term solution.

For ineligible tasks, demoting them to the blocked list and lowering their priority serve the same purpose: delaying execution so the task can be reconsidered later. What's elegant about demoting is that it gives us distinct states for tracking tasks, which lets us decide which operations are appropriate in each state.

For ready tasks, we simply check whether they are runnable and whether a suitable worker exists. So the only operation needed is to keep peeking at the top of the queue until it is empty, which keeps it nimble and neat.
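To illustrate the "dispatch from the top" loop, here is a rough sketch under stated assumptions: a tiny binary max-heap stands in for the real priority queue, and `rq_push`, `rq_pop`, `dispatch_ready`, and the `free_slots` parameter are all hypothetical names invented for this example.

```c
#include <stddef.h>

/* Hypothetical ready-queue entry: only a priority and a task id. */
struct rq_entry {
    double priority;
    int task_id;
};

#define RQ_CAP 128
struct rq {
    struct rq_entry heap[RQ_CAP];
    size_t size;
};

static void rq_swap(struct rq_entry *a, struct rq_entry *b)
{
    struct rq_entry t = *a;
    *a = *b;
    *b = t;
}

/* Insert an entry and sift it up to restore the max-heap property. */
static int rq_push(struct rq *q, double priority, int task_id)
{
    if (q->size == RQ_CAP)
        return 0;
    size_t i = q->size++;
    q->heap[i].priority = priority;
    q->heap[i].task_id = task_id;
    while (i > 0 && q->heap[(i - 1) / 2].priority < q->heap[i].priority) {
        rq_swap(&q->heap[(i - 1) / 2], &q->heap[i]);
        i = (i - 1) / 2;
    }
    return 1;
}

/* Remove and return the highest-priority entry; caller ensures size > 0. */
static struct rq_entry rq_pop(struct rq *q)
{
    struct rq_entry top = q->heap[0];
    q->heap[0] = q->heap[--q->size];
    size_t i = 0;
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, big = i;
        if (l < q->size && q->heap[l].priority > q->heap[big].priority)
            big = l;
        if (r < q->size && q->heap[r].priority > q->heap[big].priority)
            big = r;
        if (big == i)
            break;
        rq_swap(&q->heap[i], &q->heap[big]);
        i = big;
    }
    return top;
}

/* Take tasks from the top until the queue is empty or no worker slot is
 * left. Writes dispatched ids to out[] and returns how many were taken. */
static int dispatch_ready(struct rq *q, int free_slots, int *out)
{
    int n = 0;
    while (q->size > 0 && n < free_slots) {
        out[n++] = rq_pop(q).task_id;
    }
    return n;
}
```

Because only eligible tasks are assumed to sit in this queue, the loop never needs a cursor or a skip list: it only ever looks at the top.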

For blocked tasks, we check not only whether each one is runnable but also why it can't run: for example, whether its resource requirements exceed what any worker can provide, or whether it has fixed-location constraints. These checks can be expensive, so instead of performing them on all tasks by iterating over the priority queue as a list, maintaining a separate blocked list keeps things simple and lightweight.

As it stands, even if recovery tasks are removed in the near future, there are still other cases where a task can become ineligible and should be delayed, and for those I think adding a blocked list is more reasonable.

@btovar
Member

btovar commented Nov 10, 2025

If a task has a 'blocked' priority, we can still check why it is blocked. This is not so much about efficiency, as I think both methods cost about the same, but about reducing the complexity of the code. In the future, every time we need to update the ready task queue we will have to do the same for the blocked task queue, and I would rather not increase the state that we have to keep.

@JinZhou5042
Member Author

I might be wrong, but I think adding the blocked queue actually reduces the complexity of the code, and it makes task handling more maintainable.

On the one hand, distinguishing the two states allows us to apply different operations to each without mixing them up.

On the other hand, lowering the priority doesn't solve the fundamental problem: the blocked tasks are still left in the scheduling queue and might be reconsidered several times, and their scheduling frequency and order may drift beyond our control over time. Meanwhile, it brings two other problems: 1) tasks with positive and negative priorities have to be treated differently, and 2) it is unclear to what extent the priority value should be lowered.
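The two problems can be made concrete with a small sketch. The three demotion rules below are hypothetical illustrations, not anything proposed in this PR; they only show how the sign of the priority and the choice of magnitude interact.

```c
/* Rule A (hypothetical): scale the priority by a factor in (0, 1).
 * This lowers a positive priority but RAISES a negative one. */
static double demote_by_scaling(double priority, double factor)
{
    return priority * factor;
}

/* Rule B (hypothetical): subtract a fixed penalty.
 * This always lowers the priority, but the penalty size is arbitrary:
 * too small and the task is reconsidered at once, too large and it
 * starves behind everything else. */
static double demote_by_penalty(double priority, double penalty)
{
    return priority - penalty;
}

/* Rule C (hypothetical): sign-aware scaling, which needs a separate
 * branch for negative values and still leaves a priority of 0 unchanged. */
static double demote_sign_aware(double priority, double factor)
{
    return priority >= 0 ? priority * factor : priority / factor;
}
```

With a factor of 0.5, rule A turns 10 into 5 but turns -10 into -5, which is a promotion; rule C fixes the sign issue at the cost of a special case, and none of the three answers how far a task should be pushed down.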

If a future change is needed to the ready queue, we don't need to apply the same thing to the blocked queue, because blocked tasks will be rotated to the ready queue and handled there sooner or later.

@btovar
Member

btovar commented Nov 10, 2025

The blocked queue is simply implementing lowest priority?

As the recent interaction with parsl+wq showed, priority is important for users, and we should have a system that works well with resetting the priority queue when resources become available (e.g., a worker joins or a task result is returned). Given this, if we assign the lowest priority to "blocked" tasks, they will get checked only when there is nothing else to do.

The blocked list helps when priority is not important so that we can rotate the priority queue. However, the operation of taskvine itself now depends on priorities (resource exhaustion and recovery tasks).

Thus, if we add a blocked list we should remove priorities from taskvine, as otherwise it is very hard to predict what taskvine is doing. Since priorities are important to users, I don't think this is the way we want to go.

@btovar
Member

btovar commented Nov 10, 2025

Also, I guess we are talking about different kinds of complexity... From my perspective, adding a second data structure increases the complexity of the code, as we need to make sure to maintain it correctly across all possible state transitions. We may be able to do it today while it is fresh in our minds, but not in a year when we have forgotten what the code was doing.

@JinZhou5042
Member Author

That's a fair argument: adding another data structure would increase the maintenance complexity in the future... I think we can go with that. Then, if we lower the priority of tasks when they are found to be ineligible, the iterating cursor doesn't appear to be helpful. Do we want to get rid of that feature and stick with peeking at the top?

@btovar
Member

btovar commented Nov 11, 2025

What if we add the blocked list but only for logical blocks? If an input file is not there, then the task is put in the blocked list.
The only responsibility of rotate_blocked_tasks is to see if the task can be moved to the ready list. Tasks without libraries, or with missing fixed inputs, should not be put in the blocked list; they should be returned immediately instead. Tasks without resources should not be put in the blocked list either, as when resources become available tasks with lower priority may be scheduled instead.

If this makes sense, then the changes in this PR should be reduced, e.g., the expiring_tasks function should not be removed or changed. You may want to add different return values to consider_task with an enum so you know whether it was a logical block, resources, etc. that prevented the task from being scheduled.
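One way the enum idea could look, as a sketch only. The enum names, the `place_task` helper, and the exact set of reasons are assumptions made for illustration; the real `consider_task` signature and return values are not shown in this thread.

```c
/* Hypothetical reasons a task was not scheduled. */
enum sketch_consider_result {
    SKETCH_CONSIDER_OK = 0,              /* task can be dispatched */
    SKETCH_CONSIDER_INPUT_NOT_READY,     /* logical block: an input file is not there */
    SKETCH_CONSIDER_LIBRARY_MISSING,     /* required library is absent */
    SKETCH_CONSIDER_FIXED_INPUT_MISSING, /* fixed-location input is missing */
    SKETCH_CONSIDER_NO_RESOURCES         /* no worker currently has room */
};

/* Hypothetical outcomes for a considered task. */
enum sketch_placement {
    SKETCH_PLACE_DISPATCH,
    SKETCH_PLACE_BLOCKED_LIST,
    SKETCH_PLACE_RETURN_TO_USER,
    SKETCH_PLACE_KEEP_READY
};

/* Map each reason to the handling suggested in the comment above:
 * only logical blocks go to the blocked list, missing libraries or
 * fixed inputs are returned at once, and resource shortages leave the
 * task in the ready queue so its priority is still honored. */
static enum sketch_placement place_task(enum sketch_consider_result r)
{
    switch (r) {
    case SKETCH_CONSIDER_OK:
        return SKETCH_PLACE_DISPATCH;
    case SKETCH_CONSIDER_INPUT_NOT_READY:
        return SKETCH_PLACE_BLOCKED_LIST;
    case SKETCH_CONSIDER_LIBRARY_MISSING:
    case SKETCH_CONSIDER_FIXED_INPUT_MISSING:
        return SKETCH_PLACE_RETURN_TO_USER;
    case SKETCH_CONSIDER_NO_RESOURCES:
    default:
        return SKETCH_PLACE_KEEP_READY;
    }
}
```

Keeping the reason separate from the placement means the policy lives in one switch statement, so changing where a given kind of failure ends up would not touch the eligibility checks themselves.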

@JinZhou5042
Member Author

JinZhou5042 commented Nov 11, 2025

I think this is a good approach, and I definitely understand that there are too many things going on in the scheduling function, so we need to be very circumspect! Returning tasks with missing inputs or a missing library seems to be the direction we'll have to take sooner or later. Given that the implementation requires adhering to the overall architecture and the coding philosophy behind it, I'm afraid I couldn't get it right. How about you check it out later? That way we might have a clearer idea of what to do here!

@btovar
Member

btovar commented Nov 11, 2025

Sounds good! Let me make a sketch to see if it does what you need!

@dthain
Member

dthain commented Nov 11, 2025

Jin, I would like for you to keep the responsibility of implementing this. You and Ben are having a constructive discussion about how it should work. I want you guys to come to a shared understanding, and then Jin should implement it.

@btovar btovar mentioned this pull request Nov 11, 2025
7 tasks
@btovar
Member

btovar commented Nov 11, 2025

Got it!

@JinZhou5042
Member Author

JinZhou5042 commented Nov 12, 2025

To start the ball rolling, this is the new runtime model I'm sketching out:

[image: sketch of the proposed runtime model]

We have three participants:

  • Executor: The end user, who could be a pure TaskVine client or a higher-level graph executor.
  • Ready Queue: Stores runnable tasks, whose runtime requirements have all been satisfied, and for which at least one suitable worker is available. Tasks are maintained and considered for eligibility strictly according to their priority values.
  • Blocked List: Stores unrunnable tasks constrained either by a missing input or by the absence of any suitable worker. Tasks are maintained in a rotating list, and are popped one by one to verify both their expiry conditions and eligibility.

And four actions:

  • Submit: A user can submit a task directly.
  • Demote: A ready task fails to be dispatched to a worker, for whatever reason.
  • Promote: A blocked task becomes eligible when all requirements are met.
  • Expire: A blocked task expires if certain conditions are met, such as the absence of a library, a lost input file, or a missing fixed-location requirement. Expired tasks are then returned to the user.
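The participants and actions above can be read as a small state machine. The sketch below is only an illustration: the state and action names are hypothetical, and it assumes that Submit places a task in the ready queue, which the list does not state explicitly.

```c
/* Hypothetical task states for the two-structure model. */
enum tm_state {
    TM_NEW,      /* created by the executor, not yet submitted */
    TM_READY,    /* in the ready queue */
    TM_BLOCKED,  /* in the blocked list */
    TM_RETURNED, /* handed back to the user */
    TM_INVALID   /* the action does not apply in this state */
};

/* The four actions from the list above. */
enum tm_action { TM_SUBMIT, TM_DEMOTE, TM_PROMOTE, TM_EXPIRE };

/* Return the next state, or TM_INVALID if the action is not allowed. */
static enum tm_state tm_apply(enum tm_state s, enum tm_action a)
{
    switch (a) {
    case TM_SUBMIT: /* assumption: submitted tasks start in the ready queue */
        return s == TM_NEW ? TM_READY : TM_INVALID;
    case TM_DEMOTE: /* ready task could not be dispatched */
        return s == TM_READY ? TM_BLOCKED : TM_INVALID;
    case TM_PROMOTE: /* blocked task became eligible */
        return s == TM_BLOCKED ? TM_READY : TM_INVALID;
    case TM_EXPIRE: /* blocked task can never run; return it to the user */
        return s == TM_BLOCKED ? TM_RETURNED : TM_INVALID;
    }
    return TM_INVALID;
}
```

Written this way, each action is legal from exactly one state, which is the "distinct states" property the blocked list is meant to provide.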

Some upsides of this design:

  • It's easy to implement the idea of returning tasks with missing inputs/library to the user via the blocked queue. This also extends to other cases where we want to expire tasks, such as an exceeded start/end time.
  • Each component performs a specific function: the ready queue relentlessly dispatches tasks from the top of the queue, while the blocked queue handles task ineligibility and makes rational decisions.
  • There is no need to maintain cumbersome or confusing cursors and resets for enumeration purposes.

Though I definitely have no objection if we decide to use the one-queue approach, the added complexity and future maintenance issues also need to be considered.

@btovar
Member

btovar commented Nov 12, 2025

Jin, before you continue adding to this PR, please split it into smaller chunks. For example, the first chunk should only add the blocked list with no other changes to the expiring list function. Since this is a structural change, let's split it into changes that we can more easily evaluate and think about.

@JinZhou5042
Member Author

Sounds great!


3 participants