Skip to content

feat: extend capability gate to all task types#76

Merged
kaiitunnz merged 4 commits into
mainfrom
kaiitunnz/feat/extend-capability-gate
Jun 19, 2026
Merged

feat: extend capability gate to all task types#76
kaiitunnz merged 4 commits into
mainfrom
kaiitunnz/feat/extend-capability-gate

Conversation

@kaiitunnz

@kaiitunnz kaiitunnz commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Purpose

Implements the follow-up noted in #74. That PR added a capability gate for the single SSH case; every other task type was still selected on hardware fit alone, so a worker that lacks an executor (no Docker for SSH, absent GPU/training/omni dependencies, or an init failure) remained a candidate for that type and the task failed inside the runner — falling back to the default executor or hitting a spec-type mismatch — instead of being routed to a worker that can actually serve it. This generalizes the gate to all task types so the dispatcher never hands a worker a task its executors can't run.

It also closes a related gap raised in review: a single inference taskType maps to multiple backends (vLLM vs HF transformers) chosen at runtime, which the capability gate alone can't tell apart. The server now validates inference specs at submit — and again at dispatch once template placeholders resolve — so a vLLM-only task can't slip onto a worker that would fail it.

Changes

  • Shared / SDKWorkerCapabilities carries supported_task_types (replacing the SSH-only ssh flag), surfaced on the SDK Worker model so flowmesh worker list shows each worker's serviceable types.
  • Worker — each Executor declares the task types it serves; the worker advertises the union over the executors that initialized.
  • Servercapability_satisfies gates on taskType ∈ supported_task_types, replacing the SSH isinstance special-case.
  • Inference validation — a shared InferenceBackend classifier (transformers / vllm / auto) drives the worker's executor selection and a TaskSpec.validate_dispatchable() hook that rejects a vLLM-backed inference task without a GPU, enforce_cpu combined with a vLLM config, and an adapter that specifies none of path/url/task_id. The hook runs at submit and again pre-dispatch on the resolved spec.
  • Tests / Docs — coverage for the declarations, the union, the gate, and the inference validation; architecture and worker docs describe capability-gated dispatch.

Design

Capability is derived from the worker's initialized executors rather than config: an executor whose runtime dependency is missing is never constructed, so the advertised set reflects what the worker can actually run. Each executor declares its task types, the worker advertises the union, and the dispatcher does a single membership test. A worker reporting an empty set is treated as incapable, so during a rolling upgrade a task waits for an upgraded worker rather than risking a wrong dispatch.

For inference, the capability gate stays at taskType granularity; the GPU/CPU split is handled separately. The InferenceBackend classifier is the single source of truth for both the worker's executor selection and the validation, so the two can't drift. Only an explicit vLLM backend (a vllm config or adapters) is treated as GPU-required — the transformers executor runs on either device, and the unhinted auto case falls back to transformers, so neither is gated.

Spec validation lives on a validate_dispatchable() hook on the spec base — a no-op by default, overridden by inference for its own invariants. The parser runs it at submit (where a templated enforce_cpu defers, since the backend can't be determined until it resolves) and the dispatcher runs it again on the fully-resolved strict spec, so a deferred case is caught once resolved rather than failing late in the runner. The hook is scoped to spec-internal invariants; config-dependent checks like SSH access-mode (which depend on server feature flags) stay server-side.

Test Plan

uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
uv run pre-commit run --all-files

Two end-to-end suites on a freshly built stack (FLOWMESH_VERSION=cap-gate):

  • Capability gate — the executor-init → registration → registry → dispatch path a non-SSH type cannot reach in unit tests: on a CPU-only fleet a ppo task with no GPU hardware requirement fails fast (excluded on capability though its hardware fits) while a hardware-identical echo task runs; on a mixed CPU+GPU fleet the ppo task routes to the GPU worker.
  • Inference validation — over the REST submit endpoint (server only): a vLLM-backed inference task without a GPU, an enforce_cpu + vLLM task, and an adapter lacking a source are each rejected at submit; a vLLM task that declares a GPU is accepted.

Test Result

$ uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py   # All passed
$ uv run pre-commit run --all-files          # isort / black / ruff / mypy / codespell passed
  • Capability-gate e2e: 16/16 assertions passed — the CPU-only fleet failed the ppo task fast (never dispatched, no executor error) and ran the echo control to DONE, and the mixed fleet routed the ppo task to the GPU worker, never the idle CPU one. The CPU worker advertised inference/embedding despite its default executor being subprocess-wrapped, confirming the wrapped-executor path is accounted for.
  • Inference-validation e2e: 4/4 assertions passed — the three bad specs were rejected at submit with their validation messages; the GPU-declaring vLLM task was accepted.

Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit on the changed files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that the worker and affected server/SDK tests pass locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

Base automatically changed from kaiitunnz/chore/bump-trl to main June 18, 2026 03:56
#74 gated dispatch on a single SSH capability. Generalize it: each
Executor declares the task types it serves, the worker advertises the
union over the executors it initialized as WorkerCapabilities.supported_task_types,
and the dispatcher routes a task only to workers advertising its type.
A worker missing an executor (e.g. no Docker for SSH, or absent GPU/training
deps) is no longer handed a task it would fail inside the runner.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/feat/extend-capability-gate branch from 4eeff4d to e73f90f Compare June 18, 2026 03:59
@kaiitunnz kaiitunnz marked this pull request as ready for review June 18, 2026 04:50
@kaiitunnz kaiitunnz requested a review from timzsu as a code owner June 18, 2026 04:51

@timzsu timzsu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment. PTAL

Comment thread src/worker/executors/transformers_executor.py
A vLLM-backed inference task could previously be accepted and then fail
late in the runner. Classify the resolved backend once (InferenceBackend:
TRANSFORMERS / VLLM / AUTO), shared by the worker's executor selection and
the server parser, and reject at submit: a vLLM backend without a GPU, an
enforce_cpu task that also configures vLLM (silently ignored otherwise),
and an adapter that specifies neither path nor url. Templated enforce_cpu
placeholders defer, and an unhinted model (AUTO) is left alone since the
runner falls back to the transformers executor.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz requested a review from timzsu June 18, 2026 20:15

@timzsu timzsu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments.

Comment thread src/server/task/parser.py Outdated
Comment thread src/server/task/parser.py Outdated
Generalize the inference checks into TaskSpec.validate_dispatchable() (a
no-op on the base spec, overridden by inference), called both at submit
(parser) and again after placeholder resolution at dispatch on the resolved
strict spec. Submit-time validation defers when enforce_cpu is an unresolved
template placeholder, so re-checking post-resolution catches a templated
task that resolves into an invalid combination instead of failing late in
the runner. Also accept an adapter's task_id (upstream-artifact reference)
as a valid source, matching the vLLM LoRA executor.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

@timzsu timzsu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please merge after the CI passes

@kaiitunnz kaiitunnz merged commit 49ff4e2 into main Jun 19, 2026
11 of 12 checks passed
@kaiitunnz kaiitunnz deleted the kaiitunnz/feat/extend-capability-gate branch June 19, 2026 05:50
@kaiitunnz kaiitunnz mentioned this pull request Jun 22, 2026
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants