The System Status service is a read-only FastAPI endpoint that runs inside the guest VM to expose operational state from a tightly scoped set of systemd units and NVIDIA GPU telemetry commands. It is intended to de-risk the "black-box" nature of the VM by providing authenticated components (admission controller, attestation proxy, etc.) with structured status information without granting shell access or generic command execution capabilities.
| Capability | Description |
|---|---|
| Service inventory | Enumerate the fixed allowlist of managed systemd units (admission controller, system manager, attestation service, k3s server, nvidia-persistenced, nvidia-fabricmanager, and infiniband-config). |
| Service status | Return summarized health derived from systemctl show for an allowlisted unit. |
| Service logs | Tail the latest N log lines (journalctl -u <unit>) with optional time window filtering. |
| GPU telemetry | Surface nvidia-smi output in either default (summary) or -q (detailed) modes with optional GPU index selection. |
| Overview summary | Aggregate all service statuses with the latest nvidia-smi result to produce an "ok"/"degraded" snapshot. |
Future enhancements (e.g., additional units) must be added explicitly to the allowlist to avoid broadening the attack surface.
All responses are JSON and delivered over HTTPS or a Unix Domain Socket based on standard ServerConfig parameters.
By default the Ansible role configures the service to listen on 0.0.0.0:8080 inside the guest over HTTP. TLS variables are left blank so operators can supply their own certificates later if needed. The host bridge script forwards TCP/8080 so the status API is reachable from the host network without extra tunneling.
GET /health- Returns
{"status": "ok"}when the service is responsive.
- Returns
GET /services- Lists the static allowlist: service id, systemd unit name, description.
GET /services/{service_id}/status- Summarizes
LoadState,ActiveState,SubState,MainPID, and recent exit code harvested fromsystemctl show.
- Summarizes
GET /services/{service_id}/logs?lines=200&since_minutes=60- Streams log lines from
journalctl -u <unit>. linesdefaults to 200 and is clamped to [1, 1000].since_minutes(optional) truncates the log window to the last N minutes (1–1440). When omitted, only the latestlinesare returned.
- Streams log lines from
GET /gpu/nvidia-smi?detail=false&gpu=all- Executes
nvidia-smi. detail=trueswaps the command tonvidia-smi -q.gpucan beall(default) or an integer GPU index; only a single index is accepted to keep the interface deterministic.- Output is returned as
{ "stdout": "...", "stdout_lines": ["line1", ...], "stderr": "...", "exit_code": <int> }, making it easier for clients to render the text banner without reprocessing newline escapes.
- Executes
GET /overview- Collects the status for every allowlisted service plus a default
nvidia-smiinvocation. - Returns
{ "status": "ok" | "degraded", "services": [...], "gpu": {...}, "timestamp": "ISO-8601" }. - The
statusisokonly when every service is loaded/active andnvidia-smiexits successfully; otherwise it degrades.
- Collects the status for every allowlisted service plus a default
All other paths return 404.
-
Read-only execution
- Only
systemctl show,journalctl -u, andnvidia-smicommands are ever issued. Parameterization is handled server-side through validated inputs (service ids, bounded integers, boolean flags). subprocesscalls are made withshell=False, preventing shell interpolation or arbitrary redirection.- Each command has a strict timeout (default 10 seconds) and the stdout/stderr is size-limited before returning to the caller.
- Only
-
Allowlist enforcement
- Service ids are resolved against a hard-coded dictionary mapping to systemd unit names (
admission-controller.service,system-manager.service,attestation-service.service,k3s.service, etc.). Requests for unknown ids fail with HTTP 404. - GPU command options are derived from boolean and integer query parameters; textual arguments are never concatenated into the command line.
- Service ids are resolved against a hard-coded dictionary mapping to systemd unit names (
-
Principle of least privilege
- The systemd unit runs as a dedicated
chutesuser (non-privileged account) with membership in thesystemd-journalandvideogroups. It does not require root and is fully confined via a drop-in (ProtectSystem=strict,NoNewPrivileges=true, etc.).
- The systemd unit runs as a dedicated
- Application directories live under
/opt/sek8swith read-only permissions for service users. Device sandboxing is tightened viaDevicePolicy=closedwhile explicitly allowing the NVIDIA control/uvm nodes plus/dev/nvidia[0-9]*and/dev/nvidia-caps/nvidia-cap*sonvidia-smican talk to every GPU without exposing unrelated devices.
- Transport security
- The service reuses the existing
ServerConfigfoundation: TLS can be enabled by providing certificate/key paths, but by default the bridged deployment runs plain HTTP on the isolated host-only network. UDS deployments inherit filesystem ACLs. Future authentication layers (shared secret, mTLS) can be added when consumers require it.
- Operational safeguards
- Log and command outputs are truncated (configurable, default 16 KiB) to minimize potential sensitive data exposure.
- Errors returned to clients omit raw stderr to avoid leaking host paths or kernel details; instead a structured error payload describes the failure mode (timeout, exit code, etc.).
- Determine the final authentication story (e.g., reuse validator signature headers similar to the attestation proxy or rely on mTLS). The initial implementation focuses on the read-only execution layer; transport-level protections can be layered in once the consuming component is chosen.
- Extend the allowlist if additional services (OPA, attestation proxy) need coverage.
- Consider Prometheus metrics (command success/failure counts) if observability gaps appear.