
feat: align with FlowMesh; add PermissionChecker + ResourceRegistrar #1

Draft
kaiitunnz wants to merge 19 commits into main from kaiitunnz/feat/permissions-and-registrar


@kaiitunnz kaiitunnz commented May 20, 2026

Purpose

Two coupled changes that together make the plugin's authorization story explicit and aligned with current FlowMesh.

  1. Docs + identity cleanup. Strips the stale "FlowMesh V2" framing, removes the dead V1 scope-vocabulary filter from LumidIdentityProvider (FlowMesh has no require_scope path; the filter was misleading), documents the FLOWMESH_API_KEY operational caveat that this plugin imposes when it's the sole IdentityProvider, and swaps the README's pip install loading snippet for the two canonical flowmesh stack deployment patterns.

  2. PermissionChecker + ResourceRegistrar. Implements the two hooks the plugin has been advertising. ResourceRegistrar mirrors FlowMesh's resource lifecycle into a SQLite ownership table; PermissionChecker reads it to gate access. Defines the concrete scope vocabulary lum.id PATs mint against — admin scopes bypass, kind-level scopes gate creation, ownership gates concrete-id access. The ACL DB lives under FlowMesh's FLOWMESH_PLUGIN_DATA_DIR mount so it survives restarts.

Depends on the matching FlowMesh PR adding FLOWMESH_PLUGIN_DATA_DIR to the stack compose template.

Changes

  • Identity + docs cleanup — drop the V1-vocabulary scope filter from LumidIdentityProvider, strip "V2" framing from the README and module docstring, document the FLOWMESH_API_KEY operational caveat, swap the loading snippet for the canonical flowmesh stack patterns, resolve hook deps from PyPI.
  • ACL storage (new acl.py) — SQLite-backed OwnershipStore keyed on (kind, id). Single-table schema; upsert semantics; startup TTL prune for stale rows.
  • ResourceRegistrar (new registrar.py) — mirror FlowMesh's resource lifecycle into the ACL.
  • PermissionChecker (new permissions.py) — admin-scope bypass; kind-level scope checks; concrete-id ownership lookup; SYSTEM gets a read-scope bypass.
  • install() becomes @asynccontextmanager — opens the engine against LUMID_ACL_DB_PATH, bootstraps schema, prunes stale rows, yields bindings, disposes on shutdown.

README gains a "Scope vocabulary" section enumerating the five scopes the PermissionChecker enforces, and rows for LUMID_ACL_DB_PATH / LUMID_ACL_TTL_DAYS.

Test Plan

uv sync --all-extras
uv run pytest
uv run ruff check src tests
uv run mypy src

End-to-end against a running FlowMesh stack: not run — requires the matching FlowMesh PR (adding FLOWMESH_PLUGIN_DATA_DIR) to land first so the default DB path is mountable. Will retest live once both PRs merge.

Test Result

63 passed in 0.47s     # pytest
All checks passed!     # ruff
Success: no issues found in 10 source files   # mypy --strict

Follow-ups

  • Upstream change to FlowMesh's hook surface so the plugin can do V1-style "diff against the authoritative resource registry" cleanup at boot. Either (a) replay register() at boot for every persisted resource, or (b) expose a "list live resource IDs by kind" hook. Until then, the TTL prune is the best we can do.
  • Live e2e test once the FlowMesh FLOWMESH_PLUGIN_DATA_DIR PR merges.

kaiitunnz and others added 19 commits May 20, 2026 15:21
The V1 lum.id host enforced a fixed scope vocabulary
(`workers:register`, `results:read`, etc.) via route guards. FlowMesh has
no scope-based gating — authorization runs through `PermissionChecker`
hooks instead, and nothing in the server reads `PrincipalContext.scopes`.
The `flowmesh:`-prefix mapping plus `ALLOWED_SCOPES` filter was therefore
dead code that also misled the README into promising a behavior FlowMesh
no longer has.

Drop both. lum.id scopes now flow onto `PrincipalContext.scopes`
verbatim, where any plugin-supplied `PermissionChecker` can read them.

The README still framed this as a "FlowMesh V2" plugin, a label that
stems from internal miscommunication and that FlowMesh's own docs never
use. The "Loading" section also told operators to `pip install` into an
unspecified Python env, which doesn't match the canonical `flowmesh
stack` deployment patterns.

Three substantive updates beyond the V2 cleanup:

- Add a `FLOWMESH_API_KEY` env-var row. Once this plugin is the sole
  `IdentityProvider`, that key must itself be a token we can resolve
  (lum.id JWT or `lm_pat_*`). Workers send it as their bearer on every
  server call, and the server resolves it at boot to obtain the system
  principal that drives `ResourceRegistrar` calls. An unresolvable key
  falls back to a synthetic admin and breaks worker auth.
- Replace the single `pip install` snippet with the two canonical
  patterns from `FlowMesh/docs/PLUGINS.md`: bind-mount via
  `FLOWMESH_PLUGIN_DIR`, then an overlay Dockerfile that bakes the wheel
  into a derived server image.
- Document the email-cache TTL (24 h) and capacity (10 k) on the
  `IdentityProvider` row, mirroring the introspect cache's annotation.

Both packages are now published; drop the `[tool.uv.sources]` git pins
so the existing `>=0.1.0` constraints resolve from PyPI like every other
dep. Lockfile regenerated at lumid-hooks==0.1.0 and flowmesh-hook==0.1.0.

PermissionChecker and ResourceRegistrar need a persistent (kind, id) ->
principal_id table to track who owns which resource. This adds the
storage layer in isolation; the hooks that read and write through it
follow.

`OwnershipStore` wraps an async SQLAlchemy sessionmaker with set/get/
delete/list_ids_for_principal/prune_older_than. `set` is an upsert so
re-registering a resource updates the owner. `prune_older_than` is the
startup cleanup for stale rows; FlowMesh does not replay register() at
boot, so this TTL is the best we can do without an upstream API for
listing live resource IDs.

`open_store` is the async ctx-manager `install()` will use — opens the
engine, bootstraps the schema, yields the store, disposes on exit.
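
The storage layer described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: it swaps in the stdlib `sqlite3` module for the async SQLAlchemy sessionmaker, and the class name, table name, column names, and the `now` parameter are all assumptions made for the example.

```python
import sqlite3
import time
from typing import List, Optional


class OwnershipStoreSketch:
    """Hypothetical (kind, id) -> principal_id table; all names are illustrative."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        with self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS ownership ("
                " kind TEXT NOT NULL, id TEXT NOT NULL,"
                " principal_id TEXT NOT NULL, updated_at REAL NOT NULL,"
                " PRIMARY KEY (kind, id))"
            )

    def set(self, kind: str, id_: str, principal_id: str,
            now: Optional[float] = None) -> None:
        # Upsert: re-registering a resource replaces the owner and timestamp.
        ts = time.time() if now is None else now
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO ownership VALUES (?, ?, ?, ?)",
                (kind, id_, principal_id, ts),
            )

    def get(self, kind: str, id_: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT principal_id FROM ownership WHERE kind = ? AND id = ?",
            (kind, id_),
        ).fetchone()
        return row[0] if row else None

    def delete(self, kind: str, id_: str) -> None:
        with self._conn:
            self._conn.execute(
                "DELETE FROM ownership WHERE kind = ? AND id = ?", (kind, id_)
            )

    def list_ids_for_principal(self, principal_id: str, kind: str) -> List[str]:
        rows = self._conn.execute(
            "SELECT id FROM ownership WHERE principal_id = ? AND kind = ?"
            " ORDER BY id",
            (principal_id, kind),
        ).fetchall()
        return [r[0] for r in rows]

    def prune_older_than(self, cutoff: float) -> int:
        # Startup cleanup: drop rows last written before the cutoff timestamp.
        with self._conn:
            cur = self._conn.execute(
                "DELETE FROM ownership WHERE updated_at < ?", (cutoff,)
            )
        return cur.rowcount
```

Because `set` is an upsert, calling it twice for the same `(kind, id)` leaves one row with the most recent owner, which is the re-registration behaviour the commit describes.
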
Listen to FlowMesh's resource lifecycle events and mirror them into the
ACL ownership table. `register` upserts (kind, id) -> principal_id;
`deregister` removes the row. Kind-level refs (id is None) are no-ops
with a logged warning — they shouldn't reach a registrar but we don't
want to crash if the server ever fires one.
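
A rough sketch of the register/deregister behaviour described above. The real hooks are presumably async and write through the ACL store; this version is synchronous and keeps owners in a dict purely for illustration. `ResourceRefSketch`, `RegistrarSketch`, and the logger name are hypothetical stand-ins, not the hook library's types.

```python
import logging
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

logger = logging.getLogger("lumid.registrar.sketch")  # hypothetical logger name


@dataclass(frozen=True)
class ResourceRefSketch:
    """Assumed shape of a resource reference: a kind plus an optional id."""
    kind: str
    id: Optional[str]  # None marks a kind-level ref


class RegistrarSketch:
    def __init__(self) -> None:
        # In-memory stand-in for the ACL ownership table.
        self.owners: Dict[Tuple[str, str], str] = {}

    def register(self, ref: ResourceRefSketch, principal_id: str) -> None:
        if ref.id is None:
            # Kind-level refs are a no-op with a warning rather than a crash.
            logger.warning("ignoring kind-level register for %s", ref.kind)
            return
        self.owners[(ref.kind, ref.id)] = principal_id  # upsert

    def deregister(self, ref: ResourceRefSketch) -> None:
        if ref.id is None:
            logger.warning("ignoring kind-level deregister for %s", ref.kind)
            return
        self.owners.pop((ref.kind, ref.id), None)
```
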
Concrete scope vocabulary (defined by this plugin, minted on lum.id
PATs):

  *, flowmesh:*, flowmesh:admin  -> admin bypass everything
  flowmesh:workflows:write       -> create workflows  (kind-level WRITE)
  flowmesh:nodes:write           -> register nodes
  flowmesh:workers:write         -> register workers
  flowmesh:system:read           -> read SYSTEM (cluster metrics)

For concrete resource ids, ownership is the gate — the principal who
created the resource (via ResourceRegistrar.register) is allowed; others
are denied. SYSTEM is the exception: `flowmesh:system:read` grants read
on any SYSTEM resource regardless of ownership.

TASK and RESULT have no kind-level scope because tasks are created via
workflow submission and result ownership is inferred from the owning
task — both reduce to concrete-id ownership checks.

`accessible_ids` returns the principal's owned ids for list endpoints,
or `None` (no filter) for admins.
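
The decision order above (admin bypass, SYSTEM read bypass, kind-level creation scope, then ownership) might look something like the sketch below. The scope strings come from the commit message; the function names, the kind strings (`"workflow"`, `"system"`, ...), the `"read"`/`"write"` action strings, and the dict-based ownership table are assumptions for illustration.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

ADMIN_SCOPES = frozenset({"*", "flowmesh:*", "flowmesh:admin"})
KIND_WRITE_SCOPES = {
    "workflow": "flowmesh:workflows:write",
    "node": "flowmesh:nodes:write",
    "worker": "flowmesh:workers:write",
}
SYSTEM_READ_SCOPE = "flowmesh:system:read"

Owners = Dict[Tuple[str, str], str]  # (kind, id) -> principal_id


def is_allowed(scopes: Iterable[str], principal_id: str, kind: str,
               resource_id: Optional[str], action: str, owners: Owners) -> bool:
    held = set(scopes)
    if held & ADMIN_SCOPES:
        return True  # admin scopes bypass everything
    if kind == "system" and action == "read" and SYSTEM_READ_SCOPE in held:
        return True  # SYSTEM read ignores ownership
    if resource_id is None:
        # Kind-level request: creation is gated by the kind's write scope.
        # Assumption: any other kind-level request is denied in this sketch;
        # the commit message only specifies the creation case.
        required = KIND_WRITE_SCOPES.get(kind)
        return action == "write" and required is not None and required in held
    # Concrete id: only the registering principal passes.
    return owners.get((kind, resource_id)) == principal_id


def accessible_ids(scopes: Iterable[str], principal_id: str, kind: str,
                   owners: Owners) -> Optional[Set[str]]:
    if set(scopes) & ADMIN_SCOPES:
        return None  # no filter for admins
    return {rid for (k, rid), owner in owners.items()
            if k == kind and owner == principal_id}
```

TASK and RESULT have no entry in `KIND_WRITE_SCOPES`, so in this sketch they are only ever reachable through the concrete-id ownership branch.
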
`install()` becomes an `@asynccontextmanager`: opens the ACL SQLite
engine, bootstraps the schema, prunes rows older than
LUMID_ACL_TTL_DAYS (default 90; 0 disables), yields a BaseBindings
carrying the existing identity / supplier / usage / submission hooks
plus the new permission_checker and resource_registrar, then disposes
the engine on FastAPI shutdown.

The default DB path is `/app/plugin-data/lumid_acl.sqlite` — the
writable mount FlowMesh exposes via FLOWMESH_PLUGIN_DATA_DIR. Operators
override via LUMID_ACL_DB_PATH; tests point at a tmp_path.
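
The lifecycle described above (open, bootstrap, prune, yield, dispose) can be sketched with `contextlib.asynccontextmanager`. This is an illustration under stated assumptions: it uses a blocking stdlib `sqlite3` connection instead of the async engine, a plain dict stands in for `BaseBindings`, and the table layout and `install_sketch` name are invented for the example. Only the env-var names, the default path, and the 90-day / 0-disables rule come from the commit message.

```python
import os
import sqlite3
import time
from contextlib import asynccontextmanager
from typing import AsyncIterator, Dict

DEFAULT_DB_PATH = "/app/plugin-data/lumid_acl.sqlite"


@asynccontextmanager
async def install_sketch() -> AsyncIterator[Dict[str, object]]:
    db_path = os.environ.get("LUMID_ACL_DB_PATH", DEFAULT_DB_PATH)
    ttl_days = int(os.environ.get("LUMID_ACL_TTL_DAYS", "90"))
    conn = sqlite3.connect(db_path)
    try:
        # Bootstrap the (hypothetical) schema.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ownership ("
            " kind TEXT NOT NULL, id TEXT NOT NULL,"
            " principal_id TEXT NOT NULL, updated_at REAL NOT NULL,"
            " PRIMARY KEY (kind, id))"
        )
        # Prune stale rows; a TTL of 0 disables the prune.
        if ttl_days > 0:
            cutoff = time.time() - ttl_days * 86400
            conn.execute("DELETE FROM ownership WHERE updated_at < ?", (cutoff,))
        conn.commit()
        # Stand-in for the bindings object carrying the hooks.
        yield {"conn": conn}
    finally:
        # Runs on shutdown, including when the body raised.
        conn.close()
```

A host would enter this with `async with install_sketch() as bindings:` and keep it open for the lifetime of the server; tests can point `LUMID_ACL_DB_PATH` at a temporary file.
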
Adds rows for the two new hooks in the "What it provides" table, a
"Scope vocabulary" section enumerating the five scopes lum.id PATs mint
against, and the LUMID_ACL_DB_PATH / LUMID_ACL_TTL_DAYS env vars. Also
notes that install() is now an async ctx-manager and that the default
ACL SQLite path lives under FlowMesh's FLOWMESH_PLUGIN_DATA_DIR mount.

A non-admin principal needs a corresponding `:read` scope to call a
kind-level READ endpoint (`flowmesh:workflows:read`,
`flowmesh:tasks:read`, `flowmesh:results:read`, `flowmesh:nodes:read`,
`flowmesh:workers:read`). `accessible_ids` still filters the returned
set to the principal's owned ids, and concrete-id access stays
owner-only — only admin sees resources they don't own.

The existing `flowmesh:system:read` is now a regular entry in the same
policy table, with the same kind-level semantics.
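
One way to picture the single policy table this commit describes is a mapping from `(kind, action)` to the required scope. The scope strings are the ones listed above; the table name, kind and action strings, and helper function are assumptions for illustration only.

```python
from typing import Dict, Iterable, Tuple

ADMIN_SCOPES = frozenset({"*", "flowmesh:*", "flowmesh:admin"})

# Hypothetical layout: (kind, action) -> scope required for kind-level access.
KIND_SCOPE_POLICY: Dict[Tuple[str, str], str] = {
    ("workflow", "write"): "flowmesh:workflows:write",
    ("node", "write"): "flowmesh:nodes:write",
    ("worker", "write"): "flowmesh:workers:write",
    ("workflow", "read"): "flowmesh:workflows:read",
    ("task", "read"): "flowmesh:tasks:read",
    ("result", "read"): "flowmesh:results:read",
    ("node", "read"): "flowmesh:nodes:read",
    ("worker", "read"): "flowmesh:workers:read",
    ("system", "read"): "flowmesh:system:read",
}


def kind_level_allowed(scopes: Iterable[str], kind: str, action: str) -> bool:
    held = set(scopes)
    if held & ADMIN_SCOPES:
        return True
    required = KIND_SCOPE_POLICY.get((kind, action))
    # No table entry means no scope can unlock it for a non-admin.
    return required is not None and required in held
```

Passing this check only admits the caller to the list endpoint; the returned set is still narrowed by `accessible_ids`, as the commit message notes.
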
The ACL is now keyed by (kind, id, principal_id), so multiple principals
can hold a grant on the same resource. The store gains `grant`, `revoke`,
`has_grant`, and `delete_resource` (the deregister path wipes every
grant on the resource). The PermissionChecker concrete-id branch becomes
a grant-membership check; `accessible_ids` returns the principal's
granted ids, including resources shared with them.

A composite `(principal_id, kind)` index replaces the standalone
`principal_id` index so `list_ids_for_principal` is fully covered.

`revoke()` is implemented but unwired — there is no grant/revoke API
yet; FlowMesh's `register()` is still the only writer today.
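
The multi-principal keying can be sketched with stdlib `sqlite3` as below. The method names match the commit message; the class name, table name, index name, and column names are assumptions, and the real store's schema may differ.

```python
import sqlite3
from typing import List


class GrantStoreSketch:
    """Hypothetical grant table keyed by (kind, id, principal_id)."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        with self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS grants ("
                " kind TEXT NOT NULL, id TEXT NOT NULL,"
                " principal_id TEXT NOT NULL,"
                " PRIMARY KEY (kind, id, principal_id))"
            )
            # Composite index serving lookups by principal and kind.
            self._conn.execute(
                "CREATE INDEX IF NOT EXISTS ix_grants_principal_kind"
                " ON grants (principal_id, kind)"
            )

    def grant(self, kind: str, id_: str, principal_id: str) -> None:
        with self._conn:
            self._conn.execute(
                "INSERT OR IGNORE INTO grants VALUES (?, ?, ?)",
                (kind, id_, principal_id),
            )

    def revoke(self, kind: str, id_: str, principal_id: str) -> None:
        with self._conn:
            self._conn.execute(
                "DELETE FROM grants WHERE kind = ? AND id = ? AND principal_id = ?",
                (kind, id_, principal_id),
            )

    def has_grant(self, kind: str, id_: str, principal_id: str) -> bool:
        row = self._conn.execute(
            "SELECT 1 FROM grants WHERE kind = ? AND id = ? AND principal_id = ?",
            (kind, id_, principal_id),
        ).fetchone()
        return row is not None

    def delete_resource(self, kind: str, id_: str) -> None:
        # Deregister path: wipe every principal's grant on the resource.
        with self._conn:
            self._conn.execute(
                "DELETE FROM grants WHERE kind = ? AND id = ?", (kind, id_)
            )

    def list_ids_for_principal(self, principal_id: str, kind: str) -> List[str]:
        rows = self._conn.execute(
            "SELECT id FROM grants WHERE principal_id = ? AND kind = ? ORDER BY id",
            (principal_id, kind),
        ).fetchall()
        return [r[0] for r in rows]
```
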
FLOWMESH_API_KEY is FlowMesh's own concern, not this plugin's, so it
shouldn't appear in the plugin's env-var table or the Loading example.

The Loading section is rewritten to match what actually works:
`flowmesh stack up` auto-imports anything under `${FLOWMESH_PLUGIN_DIR}`
named in `FLOWMESH_PLUGINS`, so the bind-mount path is just "drop the
source tree in" — no thin loader. The overlay image path is unchanged.

Also drops a redundant email-cache annotation on the IdentityProvider
row and tightens the LUMID_ACL_TTL_DAYS note.

A long-running worker (or workflow) used to lose its grant on the next
FlowMesh restart past LUMID_ACL_TTL_DAYS — the wall-clock prune dropped
the row even though the resource was still live. The host-driven
reconcile sweep replaces that with a stronger guarantee: FlowMesh
batches every live ResourceRef into a single `refresh` call, then
`purge_stale` drops whatever the sweep didn't touch.

- `GrantStore.touch_resources(refs)` does a single bulk UPDATE keyed by
  `(kind, id)`, refreshing every principal's grant on the listed
  resources — multi-principal-safe.
- `GrantStore.delete_unrefreshed(session_start)` clears rows whose
  `granted_at` predates the sweep.
- `LumidResourceRegistrar` takes `session_start` (captured in
  `install()` after schema bootstrap). `refresh` translates the batch
  into a `touch_resources` call; `purge_stale` calls
  `delete_unrefreshed`.
- `LUMID_ACL_TTL_DAYS` and `prune_older_than` are gone.

Requires lumid-hooks 0.2.0 for the new Protocol methods. The
`tool.uv.sources` entry pointing at `../lumid.hooks` is temporary —
drop it once 0.2.0 is on PyPI.

Review findings on the reconcile work:

- `GrantStore.touch_resources(refs)` -> `touch_resources(pairs)` — the
  parameter takes `(kind, id)` tuples, not `ResourceRef` instances; the
  old name implied otherwise.
- `LumidResourceRegistrar.refresh` switches to `Collection[ResourceRef]`
  to match the tightened lumid-hooks 0.2.0 Protocol signature, and logs a
  debug line when it drops kind-level refs (parity with the warnings on
  `register`/`deregister`).
- Test helper `_backdate(principal_id: str | None)` split into
  `_backdate_one` and `_backdate_all`; the implicit branching on a
  None-overloaded arg was a smell.
- Add coverage for two reconcile shapes the existing tests didn't hit:
  sweep against an empty store is a no-op, and a second sweep within
  the same boot doesn't drop grants the first sweep just refreshed.

Surfaces the temporary override in grep when 0.2.0 ships to PyPI.

Match lumid-hooks 0.2.0's single-method Protocol: one atomic
`reconcile(resources, logger)` replaces the two-call sweep so a
mid-sweep failure can't half-wipe the ACL.

- `GrantStore.reconcile(pairs, session_start)` runs the UPDATE
  (touch refreshed grants) and the DELETE (drop anything older than
  `session_start`) in a single transaction. On error the transaction
  rolls back, leaving the store unchanged. Replaces `touch_resources`
  and `delete_unrefreshed`.
- `session_start` stays on the registrar so it's captured at plugin
  load time, not when the host invokes `reconcile`. Grants written by
  other startup paths (e.g. supervisor registration) between load and
  the sweep have `granted_at > session_start` and survive.
- `LumidResourceRegistrar.reconcile(resources, logger)` flattens
  refs to `(kind, id)` pairs, logs kind-level drops, and reports
  touched/deleted counts at INFO.
- Tests cover: live grants survive (long-running resources), stale
  grants drop, empty batch wipes pre-session rows, grants written
  after `session_start` survive (the host-race protection), and a
  mid-transaction failure rolls back.
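
The single-transaction reconcile can be illustrated with stdlib `sqlite3` and explicit `BEGIN`/`COMMIT`/`ROLLBACK`. This is a sketch under assumptions: the `grants` table layout, the `open_grants` helper, the `now` parameter, and the returned tuple are invented for the example, and the real `GrantStore.reconcile` may issue a single bulk UPDATE rather than a loop.

```python
import sqlite3
from typing import Iterable, Tuple


def open_grants(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode, so the
    # transaction boundaries below are the only ones in play.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS grants ("
        " kind TEXT NOT NULL, id TEXT NOT NULL, principal_id TEXT NOT NULL,"
        " granted_at REAL NOT NULL, PRIMARY KEY (kind, id, principal_id))"
    )
    return conn


def reconcile(conn: sqlite3.Connection, pairs: Iterable[Tuple[str, str]],
              session_start: float, now: float) -> Tuple[int, int]:
    """Touch grants on live resources, then drop pre-session rows, atomically."""
    conn.execute("BEGIN")
    try:
        touched = 0
        for kind, rid in pairs:
            # Refreshes every principal's grant on the resource.
            cur = conn.execute(
                "UPDATE grants SET granted_at = ? WHERE kind = ? AND id = ?",
                (now, kind, rid),
            )
            touched += cur.rowcount
        cur = conn.execute(
            "DELETE FROM grants WHERE granted_at < ?", (session_start,)
        )
        deleted = cur.rowcount
        conn.execute("COMMIT")
    except Exception:
        # Any failure undoes both the touches and the deletes.
        conn.execute("ROLLBACK")
        raise
    return touched, deleted
```

The sketch expects `now >= session_start`, so refreshed rows clear the cutoff. Rows written after `session_start` but absent from the batch also survive, because their `granted_at` is already past the cutoff.
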
lumid-hooks 0.2.0 is released on PyPI, so the editable path override
from `[tool.uv.sources]` is no longer needed. uv now resolves the pin
from the registry.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

The grant store's only persistence need is a single-table SQLite file,
which the stdlib `sqlite3` module covers directly. Dropping the
SQLAlchemy and aiosqlite deps means the plugin's runtime deps
(`httpx`, `pydantic`, `fastapi`, `lumid-hooks`, `flowmesh-hook`) are
all already present in the FlowMesh server image, so the bind-mount
deployment path no longer needs an overlay Dockerfile.

`GrantStore` keeps its public API. One `sqlite3.Connection` is opened
in WAL + autocommit and shared across all ops; an `asyncio.Lock`
serialises access and queries run in `asyncio.to_thread`. `reconcile`
uses explicit `BEGIN`/`COMMIT`/`ROLLBACK` for the same atomic-on-
failure contract. The README's Loading section collapses to the single
bind-mount path.
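
The concurrency arrangement described above (one shared connection, an `asyncio.Lock` for serialisation, queries run via `asyncio.to_thread`) could look roughly like this. The `SerializedSqlite` class and its `execute` method are hypothetical names for illustration; `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import sqlite3
from typing import Any, List, Sequence, Tuple


class SerializedSqlite:
    """One shared sqlite3 connection; calls are serialised and run off-loop."""

    def __init__(self, path: str = ":memory:") -> None:
        # check_same_thread=False because to_thread may use a different
        # worker thread per call; the lock guarantees one call at a time.
        self._conn = sqlite3.connect(
            path, isolation_level=None, check_same_thread=False
        )
        self._lock = asyncio.Lock()

    async def execute(self, sql: str, params: Sequence[Any] = ()) -> List[Tuple]:
        async with self._lock:
            # The blocking sqlite3 call runs in a worker thread so the
            # event loop stays responsive.
            return await asyncio.to_thread(self._run, sql, params)

    def _run(self, sql: str, params: Sequence[Any]) -> List[Tuple]:
        return self._conn.execute(sql, params).fetchall()
```

In this sketch the instance should be created from inside the running event loop (for example within the `install()` context manager), so the lock is bound to the loop that uses it.
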

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

`GrantStore` already serialises every operation through an
`asyncio.Lock`, so SQLite never sees concurrent access and WAL's
benefit (readers that don't block on the single writer) is moot.
Defaulting to
rollback-journal mode keeps a single file at rest — no `-wal`/`-shm`
sidecars to back up or trip up external readers.
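
For reference, a connection opened with stdlib `sqlite3` and no journal pragma reports SQLite's default rollback-journal mode, and WAL has to be requested explicitly. The helper names below are hypothetical; the pragma itself is standard SQLite.

```python
import sqlite3


def journal_mode(path: str) -> str:
    """Return the journal mode SQLite reports for a database file."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA journal_mode").fetchone()[0]
    finally:
        conn.close()


def enable_wal(path: str) -> str:
    """Switch a database file to WAL and return the mode SQLite reports."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    finally:
        conn.close()
```

WAL mode is persistent once set on a database file, so a plugin that wants rollback-journal behaviour only needs to avoid issuing the second pragma.
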

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Trim docstrings + README to direct declarative statements. Cut
justification of absent design choices ("no SQLAlchemy / aiosqlite
dependency", "No overlay image needed", "No locks needed" — for the
WAL paragraph in `acl.py`), narrative deliberation ("With this plugin
as the sole IdentityProvider, every authenticated principal came
through our resolve path…"), and contrastive rebuttals ("so a partial
sweep can't wipe live grants", "(admin aside)", "they shouldn't reach
a registrar in practice, but…"). Keep the load-bearing invariants —
single-atomic-transaction reconcile, asyncio.Lock + to_thread for the
SQLite connection, kind-level scope fallback policy — stated once
each.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>