Add multi-tenant deployment and self-service portal #13
Merged
atsyplikhin merged 24 commits into main on Apr 9, 2026
Enable NATS JWT per-tenant subject scoping so multiple groups can share one Device Connect infrastructure without cross-tenant interference.
- Fix gen_creds.sh: replace incorrect `fabric.>` prefix with `device-connect.>`, add `--tenant`/`--privileged`/`--nats-host` flags
- Add setup_deployment.sh for one-time infra bootstrap
- Add manage_tenants.sh CLI (create, create-batch, add-device, list, reload-nats) with distributable credential bundles
- Add docker-compose-multitenant.yml (NATS JWT + etcd + multi-tenant registry)
- Add verify_tenants.sh smoke test for cross-tenant isolation
- Add security_infra/README.md guide
- Update server README with multi-tenant section and updated account model diagram
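The scoping above depends on NATS subject wildcards, where `*` matches exactly one token and `>` matches one or more trailing tokens. The following is a minimal pure-Python sketch of that matching rule, not code from this PR. It assumes tenant subjects are laid out as `device-connect.<tenant>.…`; that layout and the helper names `subject_matches` and `tenant_pattern` are hypothetical.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style pattern matches a concrete subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last token and needs at least one token left to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    # Without a trailing '>', both sides must have the same number of tokens.
    return len(p_tokens) == len(s_tokens)


def tenant_pattern(tenant: str) -> str:
    # Assumed layout: everything a tenant may touch sits under its own prefix.
    return f"device-connect.{tenant}.>"


if __name__ == "__main__":
    print(subject_matches(tenant_pattern("acme"), "device-connect.acme.sensor1.cmd"))
    print(subject_matches(tenant_pattern("acme"), "device-connect.other.sensor1.cmd"))
```

Under this layout a credential scoped to one tenant's pattern cannot match another tenant's subjects, which is the property verify_tenants.sh is described as checking.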
Replace shell-script workflow with a web UI for multi-tenant management. Users self-register (1 user = 1 tenant), create device credentials, and observe devices coming online via live polling. Admin can bootstrap the platform, view any tenant's dashboard, run health checks, and reload NATS.
- aiohttp + Jinja2 + Tailwind CSS (CDN) + htmx for server-rendered UI
- Session auth with bcrypt passwords, user accounts stored in etcd
- nsc CLI wrapper (asyncio subprocess) for JWT credential generation
- Live device panel with 3-second htmx polling from etcd registry
- Admin "View as User" for read-only tenant inspection
- Persistent etcd volume for data survival across container restarts
- Docker Compose portal service on port 8080
…er script
- Fix signup/login rendering both dashboard and form by returning 200 with HX-Redirect header instead of 302 (which fetch follows silently)
- Show actual server IP in connection instructions instead of Docker hostname, remove DEVICE_CONNECT_ALLOW_INSECURE from instructions
- Add downloadable starter device script with commented-out @rpc and @periodic sections
- Add "Getting Started" section with venv setup, pip install, and run instructions
- Auto-detect tenant from credentials file in DeviceRuntime so devices publish to the correct namespace without manual TENANT config
The mismatch validation ran before auto-detection could override the random placeholder device_id. Now only validates when device_id was explicitly provided by the caller.
…tails
- Fix registry_client to read nested etcd fields (status.location, identity.device_type, status.availability) instead of top-level
- Switch registry service to wildcard subscriptions (device-connect.*) so dynamically created tenants work without restart
- Add expandable device detail rows in live devices table showing functions, events, identity, and raw JSON
- Pause auto-refresh while detail rows are expanded to prevent flicker
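To illustrate the nested-field fix, here is a small sketch of reading dotted paths such as `status.location` from a device record. The record shape and the helper name `get_nested` are assumptions made for illustration; only the three field paths come from the commit message.

```python
from typing import Any


def get_nested(record: dict, dotted_path: str, default: Any = None) -> Any:
    """Walk a dotted path through nested dicts, returning default if any step is missing."""
    node: Any = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Hypothetical record, shaped only to match the field paths named above.
record = {
    "identity": {"device_type": "soil_sensor"},
    "status": {"location": "bay-3", "availability": "online"},
}

if __name__ == "__main__":
    for path in ("status.location", "identity.device_type", "status.availability"):
        print(path, "=", get_nested(record, path, default="unknown"))
```

Reading `record.get("location")` at the top level of such a record would return nothing, which matches the symptom the fix addresses.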
- Fix function parameter display to read JSON Schema properties instead of iterating the schema object (was showing extra commas)
- Add "Try" button on each RPC function with inline param form and response display
- New nats_rpc service for portal-to-device JSON-RPC via NATS request-reply using registry privileged credentials
- Add @emit event example to starter script alongside @rpc and @periodic
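The parameter-display fix can be pictured as follows: iterating a JSON Schema object yields its top-level keys (`type`, `properties`, `required`), whereas the parameters live under `properties`. This is a hedged sketch, not the portal's template code; `describe_params` and the sample schema are hypothetical.

```python
from typing import Dict, List


def describe_params(schema: Dict) -> List[str]:
    """Build display strings from a JSON Schema's 'properties', not its top-level keys."""
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    lines = []
    for name, spec in props.items():
        type_name = spec.get("type", "any")
        suffix = "" if name in required else " (optional)"
        lines.append(f"{name}: {type_name}{suffix}")
    return lines


# Hypothetical schema for an RPC function on an irrigation pump.
schema = {
    "type": "object",
    "properties": {
        "duration_s": {"type": "integer"},
        "zone": {"type": "string"},
    },
    "required": ["duration_s"],
}

if __name__ == "__main__":
    print(", ".join(describe_params(schema)))
```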
- SSE endpoint streams device events in real-time via NATS subscription
- Clickable "Live log" button on each event opens a streaming log panel
with expandable JSON details, auto-scroll, capped at 100 entries
- Extract shared NATS connect() helper from nats_rpc for reuse
- Last Seen column now shows relative time ("just now", "5s ago") from
heartbeat timestamps instead of static registration time
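A relative "last seen" label of the kind described above can be derived from a heartbeat timestamp as in this sketch. The thresholds and the minute/hour formats are assumptions; the commit message only gives "just now" and "5s ago" as examples.

```python
def relative_time(last_seen: float, now: float) -> str:
    """Format the age of a heartbeat (both arguments in epoch seconds) as a short label."""
    delta = max(0, int(now - last_seen))
    if delta < 2:  # assumed cutoff for "just now"
        return "just now"
    if delta < 60:
        return f"{delta}s ago"
    if delta < 3600:
        return f"{delta // 60}m ago"
    return f"{delta // 3600}h ago"


if __name__ == "__main__":
    import time

    now = time.time()
    print(relative_time(now, now))
    print(relative_time(now - 5, now))
```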
- Smart Greenhouse demo: 3-device bundle (soil sensor, irrigation pump, greenhouse controller) showing @periodic, @emit, @on, @rpc, and cross-device invoke_remote
- Live event log via SSE: click events to stream them in real-time
- Fix credential bundle to use public server IP instead of Docker hostname
- Demo README uses real tenant name, credential filenames, and server IP
- Relative "last seen" timestamps from heartbeats
- Fix greenhouse controller to use list_devices + invoke_remote pattern
- Admin RPC/event endpoints now accept ?tenant= query param override
- Admin tenant detail template passes viewed tenant to all API calls
- Fix count_all_devices to filter out user records from etcd prefix
- Add auto-refreshing tenants table to admin dashboard
- Add full interactive JS (expand, invoke, event log) to admin tenant view
Admins can now choose NATS or Zenoh when bootstrapping multi-tenant infrastructure via the portal setup wizard. Zenoh tenant isolation uses mTLS client certificates with the Zenoh 1.0 ACL plugin: each device gets a cert with CN used for broker-enforced key-expression rules, providing isolation analogous to NATS JWT subject permissions.
Key changes:
- MessagingBackendService abstraction with NatsBackend/ZenohBackend strategies
- Zenoh PKI: CA + server + per-device client cert generation via openssl
- Zenoh ACL config management for tenant-scoped key-expression rules
- All portal views refactored to use backend abstraction (zero NATS regression)
- Backend-aware credential bundles, setup UI with backend selector
- Docker Compose for Zenoh multi-tenant deployment
Extends the portal bootstrap to support NATS, Zenoh, or MQTT backend selection. MQTT tenant isolation uses Mosquitto password file + ACL (broker-enforced), with auto-generated per-device credentials. New services: mqtt_acl (password/ACL management), mqtt_admin (SIGHUP reload), mqtt_rpc (RPC via edge MQTTAdapter), mqtt_backend (strategy implementation). Includes Docker Compose for multi-tenant MQTT infra.
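As a rough picture of broker-enforced isolation with a Mosquitto ACL file, the sketch below renders `user` and `topic readwrite` entries per device. The `user`/`topic` line syntax and the `#` wildcard are standard Mosquitto ACL features, but the `<tenant>-<device>` username scheme and the `device-connect/<tenant>/#` topic layout are assumptions, and `render_acl` is a hypothetical helper rather than the mqtt_acl service.

```python
from typing import Dict, List


def render_acl(devices_by_tenant: Dict[str, List[str]]) -> str:
    """Render a Mosquitto-style ACL granting each device access to its tenant's topics only."""
    lines: List[str] = []
    for tenant in sorted(devices_by_tenant):
        for device in sorted(devices_by_tenant[tenant]):
            # Assumed naming: one broker user per device, prefixed by tenant.
            lines.append(f"user {tenant}-{device}")
            # Assumed topic layout: everything for a tenant lives under one prefix.
            lines.append(f"topic readwrite device-connect/{tenant}/#")
            lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_acl({"acme": ["pump", "sensor"], "globex": ["sensor"]}))
```

After the file is rewritten, the broker has to re-read it, which is where the SIGHUP reload mentioned for mqtt_admin comes in.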
Devices running on laptops that sleep lose their NATS connection. After wake, registry queries time out, etcd leases expire silently, and devices disappear from the portal. This adds multi-layer resilience:
SDK (device-connect-edge):
- RegistryClient._request(): retry 3x on RequestTimeoutError with exponential backoff (1s→2s→4s), backend-agnostic
- DeviceRuntime heartbeat: exponential backoff on reconnect wait loop, trust _register() internal infinite retry on reconnect
- DeviceRuntime on_reconnect: re-establish @on event subscriptions via teardown + setup with exponential backoff retry
- Extract _build_registration_params() shared by _register() and new requestRegistration built-in RPC handler
Registry service (device-connect-server):
- Registry-initiated re-registration: when heartbeat arrives for a device with no lease, registry pulls full registration via requestRegistration RPC to the device's .cmd subject
- refresh() recovers lost lease handles after service restart by recreating lease from existing etcd data
- Heartbeat handler passes tracked TTL to refresh() for lease recovery
Greenhouse demo:
- Increase TTL to 60s (survives brief laptop sleep)
- Add retry with backoff to _find_pump() discovery
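The retry policy described for `RegistryClient._request()` (three retries, waiting 1s, 2s, then 4s) can be sketched with plain asyncio as below. This is an illustration under assumptions, not the SDK code: `RequestTimeoutError` is redefined locally as a stand-in, and `request_with_retry` with its injectable `sleep` parameter is a hypothetical helper.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RequestTimeoutError(Exception):
    """Local stand-in for the SDK's timeout error (hypothetical definition)."""


async def request_with_retry(
    send: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call send(); on timeout, retry up to `retries` times, doubling the wait each time."""
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return await send()
        except RequestTimeoutError:
            if attempt == retries:
                raise  # retries exhausted, surface the timeout to the caller
            await sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter is only a convenience for exercising the backoff schedule without real waiting; the default remains `asyncio.sleep`.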
Address all code review findings from the resilience commit:
- Extract _do_register() shared helper to deduplicate registration logic between _make_register_handler and _make_hb_handler
- Validate pull-registration response through RegisterParams schema
- Add has_lease() to DeviceRegistry for proper encapsulation (replaces direct _REGISTRY.leases access)
- Add _subscription_lock to prevent concurrent resubscription
- Add named constants _DEFAULT_TTL and _PULL_REGISTRATION_TIMEOUT
- Remove unnecessary retries=0 guard in RegistryClient._request
- Add tests for: RegistryClient retry, requestRegistration RPC, _build_registration_params, pull-registration path, has_lease, refresh lease recovery (10 new tests)
Replace the weak default password (qwe123) and session secret with auto-generated random values via secrets module. The generated admin password is logged at WARNING level on first boot. Rename docker-compose-multitenant.yml to docker-compose-multitenant-nats.yml to match the -zenoh and -mqtt siblings.
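Generating the admin password and session secret with the standard `secrets` module might look like the sketch below. The function names and byte lengths are assumptions; only the use of `secrets` comes from the commit message.

```python
import secrets


def generate_admin_password(nbytes: int = 18) -> str:
    """Return a URL-safe random password built from nbytes of entropy (length is an assumption)."""
    return secrets.token_urlsafe(nbytes)


def generate_session_secret(nbytes: int = 32) -> bytes:
    """Return random bytes suitable as a session signing/encryption key (size is an assumption)."""
    return secrets.token_bytes(nbytes)


if __name__ == "__main__":
    # Shown once at first boot in this sketch; persisting and displaying it is up to the caller.
    print("generated admin password:", generate_admin_password())
```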
- Fix cross-tenant credential theft: download_credential and download_bundle now verify the requesting user's tenant matches the resource (admins bypass)
- Fix XSS in HTML error responses: escape exception messages with html.escape()
- Log tenant creation failures during signup instead of silently swallowing
- Add SESSION_SECURE_COOKIE env var to enable secure flag on session cookies
- Avoid persisting admin password in log aggregation (print to stdout instead)
- Cache per-tenant handlers in registry wildcard subscriptions to avoid re-creating closures on every heartbeat/register/discovery message
…l prevention
- Add validate_name() for tenant/device names used in subjects/ACLs/certs/paths
- Escape HTML in admin views to prevent XSS via backend results/error messages
- Prevent path traversal in credential file lookups via resolve + is_relative_to
- Validate certificate CNs in zenoh_pki to prevent OpenSSL subject injection
- Set 0o600 permissions on credential files (NATS, Zenoh, MQTT backends)
- Validate port number and backend type in admin setup
- Bound event stream queue to 256 to prevent memory exhaustion
- Fix tenant resolution in live_devices_fragment to use _resolve_tenant helper
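Two of the hardening steps above, name validation and the `resolve` + `is_relative_to` path check, can be sketched together. The allowed-character rule in `_NAME_RE` is an assumption (the PR does not state the actual rule), and `safe_credential_path` is a hypothetical helper; `Path.is_relative_to` requires Python 3.9 or later.

```python
import re
from pathlib import Path

# Assumed rule: lowercase alphanumerics plus '-' and '_', 1 to 63 characters,
# which excludes '.', '/', '*', '>' and whitespace.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]{0,62}")


def validate_name(name: str) -> str:
    """Reject names that could alter subjects, ACL entries, certificate CNs, or paths."""
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid name: {name!r}")
    return name


def safe_credential_path(base_dir: Path, relative: str) -> Path:
    """Resolve a requested file under base_dir, refusing anything that escapes it."""
    base = base_dir.resolve()
    candidate = (base / relative).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError("path escapes credential directory")
    return candidate
```

Resolving before comparing matters because `..` segments and symlinks are collapsed first, so the check is made against the location that would actually be opened.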
…review findings
- Enforce tenant ownership check in device_detail_page (prevents cross-tenant access)
- Fix TOCTOU race in _subscription_lock: use acquire_nowait() instead of locked() check
- Reorder signup flow: provision tenant before creating user, fail early on error
- Validate tenant_override query param in admin view-as-user
- Use atomic os.open() with 0o600 for credential file writes (all 3 backends)
- Add validate_name() call in bundle creation
- Fix shell script injection: read JSON via stdin instead of embedding filename in python -c
- Mount security_infra as read-only in all docker-compose files
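Creating a credential file with `os.open()` and mode `0o600` avoids the window in which a file written with default permissions is briefly readable by others before a later `chmod`. A minimal sketch, assuming a POSIX system; `write_private_file` is a hypothetical helper name.

```python
import os


def write_private_file(path: str, data: bytes) -> None:
    """Create (or truncate) a file that is owner read/write only from the moment it exists."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    # The mode applies when the file is created; an existing file keeps its current mode.
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
```

If pre-existing files with looser permissions are a concern, adding `os.O_EXCL` (fail if the file exists) or an explicit `os.fchmod` on the descriptor are possible variations.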
soupat previously approved these changes on Apr 8, 2026
HTMX polls /api/devices/live every 3s; when the session expires, the 302 redirect to /login was followed by HTMX and the login page HTML was injected into the live-devices div. Use HX-Redirect header instead so HTMX performs a full-page navigation to the login page.
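The behaviour described above can be sketched as a small decision function: htmx marks its requests with an `HX-Request: true` header and performs a full-page navigation when a response carries `HX-Redirect`. Whether the portal branches on `HX-Request` in exactly this way, and the use of status 200, are assumptions; `login_redirect` is a hypothetical helper that returns a status and headers rather than a framework response object.

```python
from typing import Dict, Tuple


def login_redirect(
    request_headers: Dict[str, str], login_url: str = "/login"
) -> Tuple[int, Dict[str, str]]:
    """Choose how to send an unauthenticated client to the login page."""
    if request_headers.get("HX-Request") == "true":
        # htmx would swap a followed 302's body into the target element;
        # HX-Redirect asks it to navigate the whole page instead.
        return 200, {"HX-Redirect": login_url}
    # Ordinary browser navigation: a normal redirect is fine.
    return 302, {"Location": login_url}


if __name__ == "__main__":
    print(login_redirect({"HX-Request": "true"}))
    print(login_redirect({}))
```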
## Summary
- Adds **Step 4 — Run all devices and orchestrate using an agent** to the portal's `/devices` page, with a one-click `run_agent.py` download.
- New endpoint `GET /api/devices/agent-script` serves a self-contained Strands + OpenAI agent script that connects to the user's tenant, discovers devices, batches incoming events in 12s windows, and lets the LLM call `list_devices`, `get_device_functions`, and `invoke_device` to react.
- Defaults inference to Arm's internal OpenAI proxy. Includes an `httpx` monkey-patch (gated on `OPENAI_INSECURE=1`) so users don't have to wrangle the proxy's internal CA bundle.
- Step 4 description links to the proxy portal for API key generation and notes the Arm VPN requirement.

## Why
The portal already walks users through credentials → starter device script → `python my_device.py`. The missing piece is the agent loop that closes the system: an LLM that observes events and calls back into devices. This adds that final step end-to-end with a copy-paste flow.

## Test plan
- [ ] Visit `/devices`, confirm Step 4 renders with the purple `run_agent.py` button.
- [ ] Click the button, confirm `run_agent.py` downloads.
- [ ] In a venv, `pip install` the three deps from the snippet, set env vars (creds + Arm proxy token + `OPENAI_INSECURE=1`), run `python run_agent.py`.
- [ ] Verify `Agent ready — discovered N devices` log line.
- [ ] Start a soil sensor in another terminal; verify the agent batches the `soil_reading` events and calls a tool on the next batch.
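The 12-second event batching mentioned in the summary could work along the lines of this sketch, which groups events by arrival time and hands back a completed batch once the window has elapsed. It is not the code in `run_agent.py`: the `EventBatcher` class is hypothetical, and it flushes only when the next event arrives, whereas a real agent loop might flush on a timer.

```python
from typing import Any, List, Optional


class EventBatcher:
    """Collect events into fixed-length time windows (sketch; flushes on next arrival)."""

    def __init__(self, window_s: float = 12.0) -> None:
        self.window_s = window_s
        self._window_start: Optional[float] = None
        self._pending: List[Any] = []

    def add(self, event: Any, now: float) -> Optional[List[Any]]:
        """Add an event; return the previous batch if its window has elapsed, else None."""
        if self._window_start is None:
            self._window_start = now
        if now - self._window_start >= self.window_s:
            batch, self._pending = self._pending, [event]
            self._window_start = now  # the new event opens the next window
            return batch
        self._pending.append(event)
        return None
```

Batching keeps the LLM from being invoked once per sensor reading; it sees one grouped prompt per window instead.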
setup_deployment.sh writes host-absolute paths into nsc.json, but the container mount point differs, causing nsc to fail to find the store. Rewrite store_root to the expected container path before each nsc call.
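Rewriting `store_root` before each nsc call can be pictured as a small JSON edit. The `store_root` key is named in the commit message; the container path used in the example and the helper name `rewrite_store_root` are assumptions.

```python
import json
from pathlib import Path


def rewrite_store_root(nsc_json: Path, container_store: str) -> bool:
    """Point nsc.json's store_root at the container path; return True if the file changed."""
    config = json.loads(nsc_json.read_text())
    if config.get("store_root") == container_store:
        return False  # already correct, avoid a needless write
    config["store_root"] = container_store
    nsc_json.write_text(json.dumps(config, indent=2))
    return True


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        cfg = Path(d) / "nsc.json"
        cfg.write_text(json.dumps({"store_root": "/home/user/security_infra/stores"}))
        # "/nsc/stores" is a hypothetical container mount point.
        print(rewrite_store_root(cfg, "/nsc/stores"), cfg.read_text())
```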
Add root `/` route that redirects to /dashboard, /admin, or /login based on session state. Redirect logged-in users away from /login and /signup pages. Add duplicate-user early check and improved tenant provisioning error messages. Remove :ro from security_infra volume mounts to allow runtime writes.
soupat previously approved these changes on Apr 9, 2026
Every heartbeat was writing to etcd even when device status hadn't changed, causing unbounded revision growth that hit the 2GB default quota within days. Strip transient `ts` field from heartbeat data before update_status(), and skip etcd writes when status is unchanged. Also add auto-compaction (1h periodic) and 8GB quota to all etcd docker-compose configs as a safety net.
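The write-suppression idea can be sketched with an in-memory stand-in for etcd: drop the transient `ts` field, compare against the last stored status, and write only on change. `StatusStore` and its `writes` counter are hypothetical and exist only to make the effect visible; the real code calls `update_status()` against etcd.

```python
from typing import Any, Dict


class StatusStore:
    """In-memory stand-in for the etcd-backed registry (illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, Any]] = {}
        self.writes = 0

    def update_status(self, device_id: str, heartbeat: Dict[str, Any]) -> bool:
        """Store the heartbeat's status unless it is unchanged; return True if written."""
        # 'ts' changes on every heartbeat, so it must not take part in the comparison.
        status = {k: v for k, v in heartbeat.items() if k != "ts"}
        if self._data.get(device_id) == status:
            return False
        self._data[device_id] = status
        self.writes += 1
        return True


if __name__ == "__main__":
    store = StatusStore()
    store.update_status("pump-1", {"availability": "online", "ts": 1})
    store.update_status("pump-1", {"availability": "online", "ts": 2})
    print(store.writes)
```

Since every etcd write creates a new revision, skipping no-op updates is what stops revision growth at the source; compaction and the larger quota remain a backstop.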
Summary
Test plan
- `docker compose -f infra/docker-compose-multitenant.yml up -d` and verify tenant isolation with NATS accounts
- `docker compose -f infra/docker-compose-multitenant-zenoh.yml up -d` and verify Zenoh ACL-based isolation
- `docker compose -f infra/docker-compose-multitenant-mqtt.yml up -d` and verify MQTT ACL-based isolation
- `python -m device_connect_server.portal`), complete admin setup, create tenant, signup as user
- `cd packages/device-connect-server && python3 -m pytest tests/ -v`