feat: extend capability gate to all task types by kaiitunnz · Pull Request #76 · mlsys-io/FlowMesh

kaiitunnz · 2026-06-17T11:24:27Z

Purpose

Implements the follow-up noted in #74. That PR added a capability gate for the single SSH case; every other task type was still selected on hardware fit alone, so a worker that lacks an executor (no Docker for SSH, absent GPU/training/omni dependencies, or an init failure) remained a candidate for that type and the task failed inside the runner — falling back to the default executor or hitting a spec-type mismatch — instead of being routed to a worker that can actually serve it. This generalizes the gate to all task types so the dispatcher never hands a worker a task its executors can't run.

It also closes a related gap raised in review: a single inference taskType maps to multiple backends (vLLM vs HF transformers) chosen at runtime, which the capability gate alone can't tell apart. The server now validates inference specs at submit — and again at dispatch once template placeholders resolve — so a vLLM-only task can't slip onto a worker that would fail it.

Changes

Shared / SDK — WorkerCapabilities carries supported_task_types (replacing the SSH-only ssh flag), surfaced on the SDK Worker model so flowmesh worker list shows each worker's serviceable types.
Worker — each Executor declares the task types it serves; the worker advertises the union over the executors that initialized.
Server — capability_satisfies gates on taskType ∈ supported_task_types, replacing the SSH isinstance special-case.
Inference validation — a shared InferenceBackend classifier (transformers / vllm / auto) drives the worker's executor selection and a TaskSpec.validate_dispatchable() hook that rejects a vLLM-backed inference task without a GPU, enforce_cpu combined with a vLLM config, and an adapter that specifies none of path/url/task_id. The hook runs at submit and again pre-dispatch on the resolved spec.
Tests / Docs — coverage for the declarations, the union, the gate, and the inference validation; architecture and worker docs describe capability-gated dispatch.

Design

Capability is derived from the worker's initialized executors rather than config: an executor whose runtime dependency is missing is never constructed, so the advertised set reflects what the worker can actually run. Each executor declares its task types, the worker advertises the union, and the dispatcher does a single membership test. A worker reporting an empty set is treated as incapable, so during a rolling upgrade a task waits for an upgraded worker rather than risking a wrong dispatch.

For inference, the capability gate stays at taskType granularity; the GPU/CPU split is handled separately. The InferenceBackend classifier is the single source of truth for both the worker's executor selection and the validation, so the two can't drift. Only an explicit vLLM backend (a vllm config or adapters) is treated as GPU-required — the transformers executor runs on either device, and the unhinted auto case falls back to transformers, so neither is gated.

Spec validation lives on a validate_dispatchable() hook on the spec base — a no-op by default, overridden by inference for its own invariants. The parser runs it at submit (where a templated enforce_cpu defers, since the backend can't be determined until it resolves) and the dispatcher runs it again on the fully-resolved strict spec, so a deferred case is caught once resolved rather than failing late in the runner. The hook is scoped to spec-internal invariants; config-dependent checks like SSH access-mode (which depend on server feature flags) stay server-side.

Test Plan

uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
uv run pre-commit run --all-files

Two end-to-end suites on a freshly built stack (FLOWMESH_VERSION=cap-gate):

Capability gate — the executor-init → registration → registry → dispatch path a non-SSH type cannot reach in unit tests: on a CPU-only fleet a ppo task with no GPU hardware requirement fails fast (excluded on capability though its hardware fits) while a hardware-identical echo task runs; on a mixed CPU+GPU fleet the ppo task routes to the GPU worker.
Inference validation — over the REST submit endpoint (server only): a vLLM-backed inference task without a GPU, an enforce_cpu + vLLM task, and an adapter lacking a source are each rejected at submit; a vLLM task that declares a GPU is accepted.

Test Result

$ uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py   # All passed
$ uv run pre-commit run --all-files          # isort / black / ruff / mypy / codespell passed

Capability-gate e2e: 16/16 assertions passed — the CPU-only fleet failed the ppo task fast (never dispatched, no executor error) and ran the echo control to DONE, and the mixed fleet routed the ppo task to the GPU worker, never the idle CPU one. The CPU worker advertised inference/embedding despite its default executor being subprocess-wrapped, confirming the wrapped-executor path is accounted for.
Inference-validation e2e: 4/4 assertions passed — the three bad specs were rejected at submit with their validation messages; the GPU-declaring vLLM task was accepted.

Pre-submission Checklist

I have read the contribution guidelines.
I have run pre-commit on the changed files and fixed any issues.
I have added or updated tests covering my changes (if applicable).
I have verified that the worker and affected server/SDK tests pass locally.
If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
I have updated documentation or config examples if user-facing behavior changed.

#74 gated dispatch on a single SSH capability. Generalize it: each Executor declares the task types it serves, the worker advertises the union over the executors it initialized as WorkerCapabilities.supported_task_types, and the dispatcher routes a task only to workers advertising its type. A worker missing an executor (e.g. no Docker for SSH, or absent GPU/training deps) is no longer handed a task it would fail inside the runner. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

timzsu

One comment. PTAL

A vLLM-backed inference task could previously be accepted and then fail late in the runner. Classify the resolved backend once (InferenceBackend: TRANSFORMERS / VLLM / AUTO), shared by the worker's executor selection and the server parser, and reject at submit: a vLLM backend without a GPU, an enforce_cpu task that also configures vLLM (silently ignored otherwise), and an adapter that specifies neither path nor url. Templated enforce_cpu placeholders defer, and an unhinted model (AUTO) is left alone since the runner falls back to the transformers executor. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

timzsu

Two comments.

Generalize the inference checks into TaskSpec.validate_dispatchable() (a no-op on the base spec, overridden by inference), called both at submit (parser) and again after placeholder resolution at dispatch on the resolved strict spec. Submit-time validation defers when enforce_cpu is an unresolved template placeholder, so re-checking post-resolution catches a templated task that resolves into an invalid combination instead of failing late in the runner. Also accept an adapter's task_id (upstream-artifact reference) as a valid source, matching the vLLM LoRA executor. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

timzsu

LGTM. Please merge after the CI passes

Base automatically changed from kaiitunnz/chore/bump-trl to main June 18, 2026 03:56

kaiitunnz force-pushed the kaiitunnz/feat/extend-capability-gate branch from 4eeff4d to e73f90f Compare June 18, 2026 03:59

kaiitunnz marked this pull request as ready for review June 18, 2026 04:50

kaiitunnz requested a review from timzsu as a code owner June 18, 2026 04:51

timzsu requested changes Jun 18, 2026

View reviewed changes

Comment thread src/worker/executors/transformers_executor.py

kaiitunnz added 2 commits June 19, 2026 03:43

chore: bump crawl4ai to 0.9.0 to fix pip-audit warnings

8f43760

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

kaiitunnz requested a review from timzsu June 18, 2026 20:15

timzsu requested changes Jun 19, 2026

View reviewed changes

Comment thread src/server/task/parser.py Outdated

Comment thread src/server/task/parser.py Outdated

timzsu approved these changes Jun 19, 2026

View reviewed changes

kaiitunnz merged commit 49ff4e2 into main Jun 19, 2026
11 of 12 checks passed

kaiitunnz deleted the kaiitunnz/feat/extend-capability-gate branch June 19, 2026 05:50

kaiitunnz mentioned this pull request Jun 22, 2026

chore: bump version to v0.1.4 #80

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: extend capability gate to all task types#76

feat: extend capability gate to all task types#76
kaiitunnz merged 4 commits into
mainfrom
kaiitunnz/feat/extend-capability-gate

kaiitunnz commented Jun 17, 2026 •

edited

Loading

Uh oh!

timzsu left a comment

Uh oh!

Uh oh!

timzsu left a comment

Uh oh!

Uh oh!

Uh oh!

timzsu left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

kaiitunnz commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Changes

Design

Test Plan

Test Result

Uh oh!

timzsu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

timzsu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

timzsu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kaiitunnz commented Jun 17, 2026 •

edited

Loading