diff --git a/docs/02-gcp-environment-and-cvm-setup-guide.md b/docs/02-gcp-environment-and-cvm-setup-guide.md index 1cfa9dd..77e1533 100644 --- a/docs/02-gcp-environment-and-cvm-setup-guide.md +++ b/docs/02-gcp-environment-and-cvm-setup-guide.md @@ -2,63 +2,224 @@ ### 1. Purpose and Audience -This guide is intended for infrastructure and SRE teams responsible for preparing and maintaining the GCP environment that hosts the dstack-based KMS. It focuses on: +This guide is intended for infrastructure and SRE teams responsible for preparing and maintaining the GCP environment that hosts the dstack‑based KMS. + +It focuses on: - GCP project and IAM setup - Network and firewall configuration -- Confidential VM (TDX/SGX) configuration -- Prerequisites for deploying the KMS image +- Confidential VM (TDX) configuration +- Prerequisites for deploying the KMS image managed by `dstack-cloud` ### 2. Prerequisites -This section should list: +Before deploying the KMS, the following prerequisites should be met. 
+ +#### 2.1 GCP organization and billing + +- A GCP organization or standalone project with billing enabled +- Permission to create and manage: + - Projects (if creating a dedicated project for KMS) + - VPC networks, subnets, and firewall rules + - Compute Engine instances and disks + +#### 2.2 Tools and access + +- Local tools: + - `gcloud` CLI configured with access to the target project + - `dstack-cloud` CLI (for provisioning the KMS CVM and managing the project) + - Optionally Terraform or other IaC tools, if infrastructure is managed as code +- Access: + - Operator accounts with sufficient IAM roles (see Section 3) + - Access to the container registry that will host the KMS image + +#### 2.3 Operator permissions + +Operators who run `dstack-cloud` and manage KMS infrastructure typically require: -- Required GCP organization and billing setup -- Required CLI tools (e.g., `gcloud`, Terraform if used) -- Required roles and permissions for operators +- `compute.instances.*` (for instance creation, deletion, and metadata) +- `compute.disks.*` (for attached persistent disks) +- `compute.networks.*` and `compute.firewalls.*` (for network and firewall rules) +- `iam.serviceAccounts.actAs` for service accounts used by the KMS CVM (if applicable) + +In most environments, this is provided through curated roles (for example, a “KMS operator” custom role or a combination of built‑in roles). ### 3. GCP Project and IAM -Describe how to: +This section describes how to organize the GCP project(s) and IAM roles for the KMS. + +#### 3.1 Project layout + +Options include: + +- **Single project per environment**: one project for each of dev, staging, pre‑production, and production. +- **Shared project with namespaces**: a single project with separate VPCs/subnets and naming conventions for different environments. + +For security and blast‑radius reasons, a separate project per environment is recommended for production deployments. 
+ +#### 3.2 Service accounts + +Define service accounts for: -- Create or select the GCP project(s) used for KMS -- Define service accounts and roles for: - - KMS instances - - CI/CD pipelines - - Monitoring and logging -- Apply least privilege principles for IAM configuration +- **KMS CVM instances** + - Used by the confidential VM hosting the KMS runtime. + - Requires access to: + - Logging/monitoring APIs + - Container registry (if using Workload Identity or image pull permissions beyond what `dstack-cloud` sets up) +- **CI/CD pipelines** (optional) + - Used to build and publish KMS images and to run supporting scripts. +- **Monitoring and logging** (optional) + - Used by external observability systems if they need to pull metrics/logs from GCP. + +Service account keys should be avoided where possible in favor of Workload Identity and short‑lived credentials. + +#### 3.3 IAM configuration + +Apply least‑privilege principles: + +- Grant operators and service accounts only the roles needed for: + - Creating and managing KMS CVMs + - Managing networks and firewall rules relevant to KMS + - Accessing the container registry used for KMS images +- Avoid granting project‑wide owner roles to individuals for day‑to‑day operations. + +Document which teams or roles are allowed to: + +- Deploy or upgrade KMS instances +- Change firewall rules affecting KMS +- Access KMS logs and metrics ### 4. Network and Firewall -Describe: +Proper network design and firewall rules are critical for isolating the KMS and controlling access to it. 
+ +#### 4.1 VPC and subnets + +Design the network to separate: + +- **KMS subnet**: where the KMS CVM(s) run +- **Management subnet**: jump hosts, bastion, or VPN endpoints used by operators +- **Application subnets**: Nitro host instances or other workloads that will call into the KMS + +Decide whether KMS instances: + +- Should have external IPs (often unnecessary and avoided in production) +- Will reach out to public RPC endpoints or use private connectivity to dedicated RPC providers + +#### 4.2 Firewall rules + +Define firewall rules for: + +- **Ingress to KMS** + - Allow HTTPS access on the KMS port (for example, `12001`) only from: + - Nitro hosts that will terminate RA‑TLS connections, and/or + - Authorized management networks (for diagnostics) + - Optionally allow access to the `auth-api` HTTP port (for example, `18000`) for internal health checks and debugging. -- VPC and subnet layout for KMS, management, and supporting services -- Firewall rules for: - - Ingress to the KMS API (if any) - - Egress to RPC providers, monitoring endpoints, and other dependencies -- Any network segmentation or isolation requirements +- **Egress from KMS** + - Allow outbound connections to: + - Ethereum RPC endpoints (direct‑RPC mode) + - Time sources and other dstack dependencies + - Monitoring and logging endpoints (for example, Datadog agents or exporters) + - Restrict outbound traffic to only what is needed. + +Ensure that: + +- SSH access to the KMS CVM is tightly controlled (or disabled where operationally feasible). +- Any management interfaces are reachable only from secure admin networks or VPNs. ### 5. Confidential VM Configuration -Explain how to: +The KMS runs in a GCP Confidential VM (CVM) with TDX support, providing hardware‑backed memory encryption and attestation. + +#### 5.1 Machine types + +Select machine types and regions that support Confidential VM with TDX. At the time of writing: + +- Not all regions or machine families support TDX. 
+- The GCP console and `gcloud` CLI indicate which machine types support TDX. + +Choose instance types that: + +- Support TDX Confidential VM +- Provide sufficient CPU and memory for the KMS workload and `dstack-cloud` runtime + +#### 5.2 Enabling Confidential VM (TDX) + +When defining the KMS instance (either manually or via `dstack-cloud`), ensure that: + +- Confidential VM is enabled +- TDX is selected as the underlying technology (where applicable) + +In many cases, `dstack-cloud` abstracts these details and sets the correct flags on the instance based on its configuration. Operators should still verify that: + +- The instance is reported as a Confidential VM in GCP +- The expected TEE (TDX) is used + +#### 5.3 Attestation and isolation + +Attestation and isolation are primarily handled by: -- Select appropriate machine types that support TDX/SGX confidential computing -- Configure images and boot parameters required by dstack-kms -- Validate that confidential computing and remote attestation are enabled +- The dstack OS image and runtime, which expose attestation evidence to `dstack-kms` +- The KMS logic, which validates attestation and enforces policy based on it + +From an infrastructure perspective, verification involves: + +- Confirming that the KMS instance boots with Confidential VM enabled +- Running KMS diagnostics (for example, `GetMeta` or equivalent) to ensure that attestation information is available and consistent with expectations ### 6. Base Image and KMS Image Preparation -This section should connect to the KMS image build process and explain: +This section connects the environment setup with the image build and deployment process described in the KMS image documentation. + +#### 6.1 Base OS image + +- Obtain a compatible dstack OS image (for example, `dstack-cloud-0.6.0`) that supports GCP TDX. +- Configure `dstack-cloud` to reference this image when creating KMS projects. 
+- Ensure that the OS image is stored in, or accessible from, the target region(s). + +Operators typically do not build the OS image themselves; instead, they consume images published by the dstack OS release process. + +#### 6.2 KMS container image + +- Build or obtain the KMS container image as described in the KMS image guide. +- Publish the image to a registry accessible from the KMS project: + - GCP Artifact Registry or GCR, or + - Another private registry reachable over the configured network -- Which base OS image is used -- How the dstack-kms image is derived from the base image -- How images are stored (e.g., in an internal image registry or custom image catalog) +Ensure that: + +- Image tags follow the agreed naming convention (for example, include environment and platform) +- The image reference used in `KMS_IMAGE` matches what is available in the registry ### 7. Verification and Smoke Tests -Finally, define simple steps to verify that: +After preparing the environment and deploying a KMS instance, run basic checks to confirm that the CVM and network are correctly configured. + +#### 7.1 Verify CVM and network + +- Confirm in the GCP console or via `gcloud` that: + - The KMS instance is running as a Confidential VM + - The instance has the expected machine type, zone, and labels +- Verify that: + - Required firewall rules are in place + - Only expected networks/hosts can reach the KMS HTTPS port + +#### 7.2 Verify KMS health + +- From an authorized host or Nitro test workload: + - Call the KMS health or metadata endpoint + - Verify that the KMS reports: + - The expected chain/network configuration + - The expected `DstackKms` contract address +- Check logs to ensure there are no unexpected errors during startup. 
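+The checks above can be sketched from an authorized host. The instance name, zone, and hostname are placeholders, and the exact KMS endpoint paths depend on the deployed KMS version:
+
+```shell
+# Confirm the instance runs as a Confidential VM and inspect its configuration
+gcloud compute instances describe kms-prod-1 \
+    --zone=us-central1-a \
+    --format="yaml(confidentialInstanceConfig, machineType, status)"
+
+# Probe the KMS HTTPS port from an authorized network
+# (-k skips CA verification here; pin the KMS root CA in real checks)
+curl -sk -o /dev/null -w "%{http_code}\n" https://kms.internal.example:12001/
+
+# Inspect the TLS certificate chain presented by the KMS
+openssl s_client -connect kms.internal.example:12001 -showcerts </dev/null
+```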
+ +#### 7.3 Verify connectivity to dependencies -- A new CVM instance can be created successfully -- Attestation and isolation settings are correct -- Basic connectivity (to RPC, monitoring, and CI/CD) is functioning +- Confirm that the KMS can reach: + - Ethereum RPC endpoints (or that the light client is syncing) + - Monitoring/logging endpoints +- For Nitro integration, run a simple RA‑TLS test using the reference Nitro application to confirm: + - Connectivity between Nitro hosts and the KMS + - That basic key requests work in the test environment diff --git a/docs/04-nitro-integration-and-ra-tls-guide.md b/docs/04-nitro-integration-and-ra-tls-guide.md index 8255450..d30b6fd 100644 --- a/docs/04-nitro-integration-and-ra-tls-guide.md +++ b/docs/04-nitro-integration-and-ra-tls-guide.md @@ -8,52 +8,164 @@ This guide complements the reference Nitro application and CI/CD template provid ### 2. Overview of RA-TLS in This Project -Provide a concise overview of: +At a high level: -- How RA-TLS is used to establish mutual TLS (mTLS) between the enclave and KMS -- How attestation evidence and measurements are embedded into the TLS handshake -- How the KMS validates attestation and decides whether to release keys +- The Nitro enclave establishes a mutual TLS (mTLS) connection to the KMS. +- The enclave’s attestation evidence and measurements are embedded into the TLS handshake. +- The KMS validates: + - The TLS chain up to the pinned KMS root CA (from `root_ca.pem` in the enclave image) + - The attestation quote and measurements against the on-chain `DstackKms` registry + - Any additional governance constraints (for example, whether the measurement is enabled) +- Only if these checks pass does the KMS release keys to the enclave. + +The RA-TLS flow binds: + +- The enclave code and configuration (represented by measurements and `OS_IMAGE_HASH`) +- The network connection (TLS) +- The authorization policy in `DstackKms` ### 3. 
dstack-agent-nitro -Describe: +The `dstack-agent-nitro` library (or agent) runs inside the enclave and is responsible for: + +- Establishing the RA-TLS connection to the KMS +- Presenting attestation evidence to the KMS +- Managing local key handles and secure use of keys provided by the KMS + +Conceptually, an enclave application using the agent will: + +1. Initialize the agent with: + - KMS URL + - Application ID (`APP_ID`) + - Root CA certificate for KMS TLS (`root_ca.pem`) +2. Ask the agent to open a RA-TLS session with the KMS. +3. Use the agent’s API to request keys and perform cryptographic operations. + +Pseudo-code (illustrative only): + +```pseudo +// Initialize agent +agent = DstackNitroAgent.new( + kms_url = "https://kms.example.com:12001", + app_id = "0xapp...", + root_ca_path = "/app/root_ca.pem", +) + +// Establish RA-TLS session (performs attestation and TLS handshake) +session = agent.connect() -- The role of `dstack-agent-nitro` in enclave applications -- How it is initialized and configured -- How it manages keys, certificates, and attestation evidence +// Request a named key +key = session.getKey("my-signing-key") -Include a minimal usage example or pseudocode for an enclave application integrating with the agent. +// Use key for an operation (e.g., sign or decrypt) +signature = session.sign(key, message) +``` + +Actual APIs may differ; the goal of this section is to show the pattern: initialize → connect (RA-TLS) → use keys via the agent. ### 4. getKey(name) API -Document: +The `getKey(name)` API is the primary way Nitro enclave applications obtain keys from the KMS. 
-- The semantics of the `getKey(name)` API -- Input parameters, return values, and error conditions -- How measurement-based authorization is enforced (e.g., which measurements are allowed and where they are configured) +Semantics: -This section should also outline typical error scenarios (e.g., measurement mismatch, lack of governance approval, attestation failures) and their expected responses. +- Input: + - `name`: logical name of the key, scoped by `APP_ID` and governance policy +- Behavior: + - Validates the RA-TLS session and associated attestation evidence + - Checks on-chain state in `DstackKms` to ensure: + - The enclave’s measurement (or `OS_IMAGE_HASH`) is authorized for this key + - The key is not disabled or restricted by governance + - Returns a key handle or key material (depending on implementation) -### 5. End-to-End Flow +Typical error scenarios: + +- **Measurement mismatch** + - The enclave’s measurement is not registered or is disabled in `DstackKms`. + - Expected outcome: `getKey(name)` fails with an authorization error; logs indicate unauthorized measurement. + +- **Lack of governance approval** + - The key or measurement was not approved via governance (for example, not yet passed through timelock or executed). + - Expected outcome: `getKey(name)` fails, and KMS logs show a governance policy violation. -Explain, step by step, the end-to-end flow for: +- **Attestation failures** + - RA-TLS attestation quote invalid, expired, or inconsistent with expected hardware. + - Expected outcome: the RA-TLS handshake fails or `getKey(name)` is rejected. -- Bootstrapping an enclave application -- Establishing RA-TLS with the KMS -- Requesting a key with `getKey(name)` -- Handling errors and retries +In all cases, enclave applications should distinguish between: + +- Retries for transient issues (for example, network glitches) +- Hard failures due to authorization or policy (for example, measurement not authorized) + +### 5. 
End-to-End Flow -Sequence diagrams are recommended to illustrate the interaction between enclave, agent, KMS, and any on-chain components involved in authorization. +At a high level, the end-to-end flow for RA-TLS and key retrieval is: + +1. **Build and register the enclave image** + - Build the EIF using the Nitro template CI. + - Compute `OS_IMAGE_HASH`. + - Register the measurement via governance (Safe + timelock) in `DstackKms`. +2. **Deploy the KMS** + - Ensure KMS is running and connected to the target chain. + - Verify KMS is configured with the correct `DstackKms` address. +3. **Deploy the enclave application** + - Configure `KMS_URL`, `APP_ID`, and `root_ca.pem`. + - Launch the enclave. +4. **RA-TLS handshake** + - Enclave agent connects to KMS using TLS, presenting attestation evidence. + - KMS verifies attestation and measurements. +5. **Key request** + - Enclave calls `getKey(name)` through the agent. + - KMS evaluates policy in `DstackKms` and governance state. + - If authorized, KMS returns the key (or key handle) to the enclave. +6. **Key usage and logging** + - Enclave uses the key for the intended cryptographic operations. + - KMS and enclave logs capture the request and any relevant metrics. + +Sequence diagrams can be added to this section as needed during implementation. ### 6. Demo Scenario Define a simple demo scenario that: -- Uses a reference Nitro Enclave application +- Uses the `dstack-nitro-enclave-app-template` as the Nitro application - Establishes RA-TLS with the KMS - Successfully calls `getKey(name)` and uses the key (for example, to sign or decrypt data) -The demo should be reproducible on a developer or test environment and serve as the basis for the "Verified RA-TLS demo" deliverable. +Step-by-step outline: + +1. **Prepare KMS** + - Deploy a KMS instance in a test environment, as described in the KMS deployment guide. + - Obtain the KMS root CA certificate and ensure governance contracts are deployed. + +2.
**Prepare Nitro app** + - Fork or copy the `dstack-nitro-enclave-app-template` repository. + - Replace `app/root_ca.pem` with the KMS root CA. + - Set `KMS_URL` and `APP_ID` to point to the test KMS. + +3. **Build and release EIF** + - Use the provided GitHub Actions workflow to build the enclave image and publish a release containing: + - EIF file + - Measurements (PCRs) and `OS_IMAGE_HASH` + - Sigstore attestation bundle + +4. **Register measurement** + - For development/test: + - Optionally, use the direct `kms:add-image` task as shown in the Nitro template README. + - For production-like flows: + - Use `OS_IMAGE_HASH` as input to a governance proposal. + - Execute the proposal via Safe + timelock to register the image in `DstackKms`. + +5. **Run the demo** + - Launch the enclave using the built EIF. + - Trigger the example logic (for example, `dstack-util get-keys`) to: + - Establish RA-TLS with the KMS + - Request a key with `getKey(name)` + - Observe logs on both the enclave and KMS side to confirm: + - Attestation and measurement validation succeeded + - Key retrieval behaved as expected + +The demo should be reproducible on a developer or test environment, and it serves as the basis for the "Verified RA-TLS demo" deliverable. As a concrete reference, the `dstack-nitro-enclave-app-template` repository provides: diff --git a/docs/08-monitoring-and-alerting-guide.md b/docs/08-monitoring-and-alerting-guide.md index 84da115..b501e0f 100644 --- a/docs/08-monitoring-and-alerting-guide.md +++ b/docs/08-monitoring-and-alerting-guide.md @@ -71,17 +71,30 @@ At a minimum: ### 5. Alerting and Incident Management -Describe: +Define alert rules with clear conditions, thresholds, and severities.
Example categories: -- Alert rules (conditions, thresholds, and severities) -- Integration with Incident.io (how alerts become incidents, routing, and escalation) -- On-call schedules and response expectations +- **KMS availability and latency** + - Alert if KMS error rate exceeds a threshold (for example, >1% of requests failing with 5xx over 5 minutes). + - Alert if p95 `getKey(name)` latency exceeds an agreed SLO (for example, >500 ms) for a sustained period. -Include alerts related to governance and multisig, such as: +- **RA-TLS and authorization** + - Alert on sustained RA-TLS failures (for example, `kms_ra_tls_failures_total` increasing rapidly). + - Alert on repeated authorization denials for the same enclave (which may indicate misconfiguration). -- Critical governance actions stuck in the timelock queue beyond an expected window -- Unexpected or frequent changes to timelock parameters (e.g., cooldown shortened below agreed thresholds) -- Repeated failures of governance transactions or unusually high governance activity that may indicate misconfiguration or abuse +- **Nitro enclave health** + - Alert if the number of running enclave instances drops below the expected count. + - Alert on excessive enclave restarts within a short window. + +- **Governance and multisig** + - Alert if critical governance actions are stuck in the timelock queue beyond an expected window. + - Alert on unexpected or frequent changes to timelock parameters (e.g., cooldown shortened below agreed thresholds). + - Alert on repeated failures of governance transactions or unusually high governance activity that may indicate misconfiguration or abuse. + +Integration with Incident.io should: + +- Create incidents from high-severity alerts (for example, KMS outage, governance misconfiguration) +- Route incidents to the appropriate on-call rotation (platform, security, or governance) +- Provide runbook links for quick reference during triage ### 6. 
Operational Procedures diff --git a/docs/09-runbook-operations-upgrade-troubleshooting.md b/docs/09-runbook-operations-upgrade-troubleshooting.md index d84763c..e8fb1a6 100644 --- a/docs/09-runbook-operations-upgrade-troubleshooting.md +++ b/docs/09-runbook-operations-upgrade-troubleshooting.md @@ -6,11 +6,11 @@ This runbook provides step-by-step procedures for operating the KMS system, incl ### 2. Day-to-day Operations -Describe routine tasks such as: +Routine tasks include: -- Provisioning new environments (staging, pre-production, production) -- Rotating keys or updating configuration -- Verifying daily/weekly health of the system +- Provisioning new environments (staging, pre-production, production) using the procedures in the deployment guide +- Rotating keys or updating configuration via governance-controlled changes and KMS redeployments +- Verifying daily/weekly health of the system using dashboards, logs, and spot checks ### 3. Deployment Procedures @@ -34,23 +34,92 @@ Define clear rollback plans and criteria for aborting an upgrade. ### 5. Troubleshooting Common Issues -Provide structured troubleshooting flows for issues such as: - -- RA-TLS failures or attestation errors -- Enclave workloads failing to start or connect -- On-chain authorization failures -- Monitoring alerts indicating degraded performance or availability -- Governance and multisig issues, such as: - - Governance transactions stuck in the multisig or timelock queue - - Mistaken transactions queued but not yet executed - - Failed governance executions due to parameter or state errors - -Each flow should include: +This section provides example troubleshooting flows for common issues. Each flow outlines: - How to detect the issue - Where to look (logs, dashboards, metrics) - Possible root causes and remediation steps +#### 5.1 RA-TLS failures or attestation errors + +- **Symptoms** + - Enclave applications log RA-TLS handshake failures. 
+ - KMS logs show attestation validation errors or unauthorized measurement errors. + - Metrics such as `kms_ra_tls_failures_total` increase. + +- **Checks** + 1. Verify that the enclave is using the correct `KMS_URL`, `APP_ID`, and `root_ca.pem`. + 2. Confirm that the `OS_IMAGE_HASH` for the running enclave image is correctly computed and matches what is expected. + 3. Check on-chain `DstackKms` state to ensure the measurement is registered and enabled. + 4. Verify that the KMS instance is using the correct `KMS_CONTRACT_ADDR` and connected to the intended chain. + +- **Root causes and remediation** + - Measurement not registered or disabled: + - Register or re-enable via governance, wait for timelock, then retry. + - Mismatched root CA or KMS URL: + - Update enclave configuration, rebuild and redeploy the enclave image, then retry. + - Chain or RPC issues: + - Investigate and restore RPC connectivity, then retry. + +#### 5.2 Enclave workloads failing to start or connect + +- **Symptoms** + - Nitro hosts report failure to start the enclave. + - Enclave instances start but fail to reach the KMS (connection timeouts). + +- **Checks** + 1. Inspect Nitro host logs and the enclave template’s logs for startup errors. + 2. Verify that the KMS HTTPS port is reachable from the Nitro hosts (firewall, routing). + 3. Confirm that the KMS instance is running and healthy. + +- **Root causes and remediation** + - Misconfigured Nitro host or EIF: + - Correct configuration, rebuild EIF if needed, and restart the enclave. + - Network/firewall misconfiguration: + - Update firewall rules to allow required traffic between Nitro hosts and KMS, then test connectivity. + +#### 5.3 On-chain authorization failures + +- **Symptoms** + - KMS logs indicate that authorization checks failed for `getKey(name)` despite RA-TLS succeeding. + - Enclave receives authorization errors rather than attestation errors. + +- **Checks** + 1. 
Inspect the `DstackKms` contract state to verify: + - The enclave measurement is mapped to the expected key or application. + - The key is not disabled. + 2. Confirm that the correct `DstackKms` contract address is configured in the KMS. + +- **Root causes and remediation** + - Missing or incorrect on-chain configuration: + - Create and execute a governance proposal to correct the state. + - Using the wrong contract or network: + - Update KMS configuration to point to the correct `DstackKms` address and network, then redeploy. + +#### 5.4 Governance and multisig issues + +- **Transactions stuck in timelock or multisig queue** + - **Symptoms** + - Governance dashboards show pending transactions that are not executed. + - Changes expected by operators (for example, new measurement registration) do not take effect. + - **Checks** + 1. Inspect the Safe UI and timelock queue to see pending transactions and their earliest execution times. + 2. Verify that the required number of signatures has been collected. + - **Remediation** + - Collect missing signatures. + - After cooldown, execute the transaction. + - If the transaction is mistaken, follow the cancellation/skip procedures defined in the governance manual and submit a corrected transaction. + +- **Failed governance executions** + - **Symptoms** + - On-chain governance transactions revert or run out of gas. + - **Checks** + 1. Review transaction failure reasons and logs. + 2. Validate parameters and target addresses in the proposed transaction. + - **Remediation** + - Correct parameters and resubmit through the normal governance flow. + - Adjust gas limits as necessary. + ### 6. 
Emergency Procedures Define emergency playbooks for: diff --git a/docs/10-code-walkthrough-and-kt-materials.md b/docs/10-code-walkthrough-and-kt-materials.md index 431f09a..072d5f5 100644 --- a/docs/10-code-walkthrough-and-kt-materials.md +++ b/docs/10-code-walkthrough-and-kt-materials.md @@ -8,22 +8,47 @@ This document serves as supporting material for code walkthrough sessions and kn Provide a high-level overview of the main repositories and components, for example: -- dstack-kms (KMS service) -- dstack-agent-nitro (Nitro agent/library) -- Smart contracts (DstackApp, DstackKms, governance) -- CI/CD configuration repositories - -For each repository, briefly describe its responsibility and structure. +- **dstack-kms** (KMS service) + - Implements the core KMS logic, RA-TLS handling, and integration with `DstackKms` on-chain state. + - Typical layout: service entrypoints, RA-TLS module, on-chain client, configuration and logging modules. +- **dstack-agent-nitro** (Nitro agent/library) + - Runs inside Nitro enclaves, establishes RA-TLS to the KMS, and exposes a simple API to enclave applications. + - Typical layout: RA-TLS client, attestation utilities, key usage helpers. +- **Smart contracts** (DstackApp, DstackKms, governance) + - Solidity or other on-chain code defining KMS policies, application integration, and governance (Safe + timelock). + - Typical layout: core contracts, upgrade proxies (if used), deployment scripts, tests. +- **CI/CD configuration repositories** + - Pipeline definitions for: + - KMS image builds + - Nitro EIF builds and releases + - Governance deployment scripts + +For each repository, briefly describe: + +- Directory structure +- How to build and test locally +- How it is tied into the overall deployment (for example, which images or contracts it produces) ### 3. 
Core Flows

Highlight and link to the code implementing key flows, such as:

-- Key request and release path (including RA-TLS handling)
-- Governance state reading and enforcement in KMS
-- Telemetry and logging integration
+- **Key request and release path**
+  - Entrypoint in the KMS service that handles `getKey(name)`.
+  - RA-TLS verification logic (where attestation evidence is parsed and validated).
+  - Policy enforcement based on on-chain state from `DstackKms`.
+- **Governance state reading and enforcement**
+  - Modules that read from `DstackKms` and related contracts (for example, via an on-chain client or caching layer).
+  - How the KMS reacts to changes in governance state (for example, updated measurements).
+- **Telemetry and logging integration**
+  - Where metrics are emitted (for example, request counts, latency, failures).
+  - Where structured logs are written and how they are correlated across components.
+
+For each flow, describe:
-For each flow, describe the entry points, main modules, and typical extension points.
+
+- The main entrypoints (functions, services)
+- Key modules involved in the flow
+- Typical extension points (for example, how to add a new key type or telemetry field)

### 4. Development Practices

@@ -45,8 +70,20 @@ Provide guidance on:

Include a proposed agenda for live KT sessions, for example:

-- Architecture and threat model (summary)
-- Walkthrough of key repositories and flows
-- Live demo of deployment and RA-TLS key retrieval
-- Q&A and discussion of future extensions
+1. **Architecture and threat model (30–45 min)**
+   - Use `01-architecture-and-threat-model.md` as the primary reference.
+   - Focus on components, trust boundaries, and where governance and TEEs fit in.
+2. **Repositories and core flows (45–60 min)**
+   - Walk through the main repositories listed in this document.
+   - Trace the key request path from enclave to KMS to chain and back.
+3. **Deployment and governance (45–60 min)**
+   - Demonstrate KMS deployment using `dstack-cloud` (from `02` and `03`).
+   - Show how Safe + timelock governance changes are applied and reflected in KMS behavior (`05` and `06`).
+4. **Nitro and RA-TLS demo (30–45 min)**
+   - Use the Nitro template and the RA-TLS demo steps from `04`.
+   - Show `OS_IMAGE_HASH` registration via governance and successful `getKey(name)` calls.
+5. **Monitoring, runbooks, and Q&A (30–45 min)**
+   - Review dashboards and alerts defined in `08`.
+   - Walk through one or two troubleshooting scenarios from `09`.
+   - Open Q&A and discussion of future extensions (for example, new clouds or TEE types).

diff --git a/docs/11-e2e-test-and-preprod-validation-report.md b/docs/11-e2e-test-and-preprod-validation-report.md
index 00de4d6..814b97f 100644
--- a/docs/11-e2e-test-and-preprod-validation-report.md
+++ b/docs/11-e2e-test-and-preprod-validation-report.md
@@ -48,6 +48,13 @@ Summarize:

- Which environment and versions (KMS image, enclave image, contracts) were used
- Overall result (e.g., pass rate, major issues found)

+For critical scenarios (for example, E2E-001 and GOV-001), provide enough detail that the tests can be repeated independently, including:
+
+- Environment name (staging, pre-production, production candidate)
+- KMS image tag and OS image version
+- Enclave image version and `OS_IMAGE_HASH`
+- Governance contract and Safe addresses used
+
### 5. Detailed Results and Findings

Document:

@@ -63,6 +70,12 @@ Describe:

- Additional checks performed specifically in pre-production (e.g., realistic traffic patterns, integration with customer systems)
- Sign-off criteria and who approved them

+Example checks may include:
+
+- Load tests at expected peak traffic levels
+- End-to-end flows involving real or representative Nitro workloads
+- Governance operations performed with production-like signer sets and timelock settings
+
### 7. Known Limitations and Risks

List:

diff --git a/docs/13-implementation-guide.md b/docs/13-implementation-guide.md
new file mode 100644
index 0000000..161d817
--- /dev/null
+++ b/docs/13-implementation-guide.md
@@ -0,0 +1,168 @@
+## Implementation Guide (From POC to Production)
+
+### 1. Purpose
+
+This guide provides an end‑to‑end view of how to deploy and operate the dstack‑based KMS solution, from a minimal proof‑of‑concept (POC) to a production‑grade deployment.
+
+It is meant to be the first document new teams read, with deep links into the more detailed design and operations documents.
+
+### 2. Recommended Reading Order
+
+For a first implementation, we recommend:
+
+1. This document (`13-implementation-guide.md`)
+2. Architecture and threat model (`01-architecture-and-threat-model.md`)
+3. GCP environment and CVM setup (`02-gcp-environment-and-cvm-setup-guide.md`)
+4. KMS image and deployment scripts (`03-kms-image-and-deployment-scripts.md`)
+5. Governance and on-chain integration design (`05-governance-and-onchain-integration-design.md`)
+6. Governance deployment and operations (`06-governance-deployment-and-operations-manual.md`)
+7. Nitro integration and RA-TLS (`04-nitro-integration-and-ra-tls-guide.md`)
+8. Nitro CI/CD and governance integration (`07-nitro-cicd-and-governance-integration.md`)
+9. Monitoring and runbook (`08-monitoring-and-alerting-guide.md`, `09-runbook-operations-upgrade-troubleshooting.md`)
+
+### 3. POC Path (Single Region, Direct RPC)
+
+The goal of the POC is to demonstrate:
+
+- A KMS instance running on GCP Confidential VM (TDX)
+- A Nitro enclave app performing RA-TLS to the KMS
+- Key retrieval via `getKey(name)` gated by governance state
+
+#### 3.1 Prepare GCP environment
+
+- Follow `02-gcp-environment-and-cvm-setup-guide.md` to:
+  - Create or select a GCP project
+  - Configure VPC, subnets, and basic firewall rules
+  - Ensure Confidential VM (TDX) support is available
+
+For a POC, a single project and a single region are sufficient.
+
+#### 3.2 Deploy minimal governance stack
+
+- Follow `05-governance-and-onchain-integration-design.md` and `06-governance-deployment-and-operations-manual.md` to:
+  - Deploy a governance multisig (Safe) on the chosen network
+  - Optionally deploy a timelock module (for POC, a shorter delay may be used)
+  - Deploy the `DstackKms` and `DstackApp` contracts
+  - Transfer ownership to the governance multisig
+
+In early POCs, timelock configuration can be simplified, but the overall pattern (multisig owning the contracts) should be preserved.
+
+#### 3.3 Build and deploy the KMS
+
+- Build or select a KMS image as described in `03-kms-image-and-deployment-scripts.md`:
+  - For a POC, direct RPC mode is usually sufficient.
+- Use `dstack-cloud` to:
+  - Create a KMS project directory
+  - Configure `prelaunch.sh` with:
+    - `KMS_IMAGE`
+    - `KMS_HTTPS_PORT` and `AUTH_HTTP_PORT`
+    - `ETH_RPC_URL` pointing to the testnet RPC
+    - `KMS_CONTRACT_ADDR` pointing to the `DstackKms` contract
+  - Deploy the KMS CVM and complete the Bootstrap process.
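The `prelaunch.sh` configuration above can be sanity-checked before the CVM is deployed. The sketch below is a minimal, hypothetical Python check that parses a `.env` file and reports missing required keys; the key names come from this guide, while `parse_env`, `missing_keys`, and the sample layout are illustrative assumptions rather than part of the `dstack-cloud` tooling.

```python
# Minimal sketch: sanity-check the .env produced by prelaunch.sh before
# deploying the KMS CVM. The required key names come from this guide;
# the helper functions and file layout are illustrative assumptions.
REQUIRED_KEYS = [
    "KMS_IMAGE",
    "KMS_HTTPS_PORT",
    "AUTH_HTTP_PORT",
    "ETH_RPC_URL",
    "KMS_CONTRACT_ADDR",
]

def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, ignoring blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def missing_keys(env: dict) -> list:
    """Return required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    sample = """
    KMS_IMAGE=registry.example.com/dstack-kms:v1.0.0
    KMS_HTTPS_PORT=12001
    AUTH_HTTP_PORT=18000
    ETH_RPC_URL=https://base-sepolia.example.com
    KMS_CONTRACT_ADDR=0x0000000000000000000000000000000000000000
    """
    print(missing_keys(parse_env(sample)))  # → []
```

Running such a check in the deploy pipeline catches a missing `ETH_RPC_URL` or contract address before Bootstrap, rather than after the CVM is already up.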
+
+#### 3.4 Deploy a reference Nitro enclave app
+
+- Use the `dstack-nitro-enclave-app-template` repository as described in:
+  - `04-nitro-integration-and-ra-tls-guide.md`
+  - `07-nitro-cicd-and-governance-integration.md`
+- For the POC:
+  - Replace the KMS root CA certificate (`app/root_ca.pem`) with the KMS’s root CA
+  - Set `KMS_URL` and `APP_ID` to point to the POC KMS
+  - Build and release an EIF using the provided CI workflow
+
+#### 3.5 Register the enclave measurement
+
+- Read the `OS_IMAGE_HASH` from the Nitro app release.
+- For a POC:
+  - You may use the development path described in the Nitro template README (direct `kms:add-image` call) or
+  - Preferably, create a governance transaction via the multisig to register the measurement.
+
+#### 3.6 Run the RA-TLS demo
+
+- Launch the Nitro enclave.
+- Use the reference app to:
+  - Establish RA-TLS to the KMS
+  - Request a key via `getKey(name)`
+- Confirm that:
+  - Requests from the authorized measurement succeed
+  - Requests from an unauthorized measurement fail
+
+Use `11-e2e-test-and-preprod-validation-report.md` as a checklist for POC‑level tests (E2E-001/010, GOV-001 etc.).
+
+### 4. Production Path (Hardened Deployment)
+
+The production rollout path builds on the POC but with stricter controls and additional components.
+
+#### 4.1 Harden governance
+
+- Use `05` and `06` to:
+  - Configure the production Safe with:
+    - More owners across independent teams or organizations
+    - A higher signature threshold
+  - Enable and configure the timelock module with:
+    - Production‑appropriate cooldown delays (for example, 24–72 hours)
+    - Reasonable expiration times for queued transactions
+  - Document the governance roles and processes (who can propose, sign, and execute).
+
+All production changes to `DstackKms`, `DstackApp`, and related contracts should go through the Safe + timelock flow.
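The hardening guidance above can be turned into a pre-flight check on a proposed governance configuration. The sketch below is illustrative only: the bounds encoded in `check_governance_config` (minimum owners, strict-majority threshold, 24–72 hour cooldown) are assumptions drawn from the ranges mentioned in this section, not normative values, and the function itself is hypothetical.

```python
# Illustrative policy check for a production governance configuration.
# The concrete bounds (>= 3 owners, strict-majority threshold, 24-72 h
# cooldown) are assumptions based on the ranges discussed in this guide.
DAY = 24 * 60 * 60  # seconds

def check_governance_config(owners: int, threshold: int, cooldown_s: int) -> list:
    """Return a list of human-readable policy violations (empty if OK)."""
    problems = []
    if owners < 3:
        problems.append("Safe should have at least 3 owners")
    if threshold <= owners // 2:
        problems.append("signature threshold should be a strict majority")
    if not (1 * DAY <= cooldown_s <= 3 * DAY):
        problems.append("timelock cooldown should be between 24 and 72 hours")
    return problems

if __name__ == "__main__":
    # A plausible production setup: 5 owners, 3-of-5, 48 h cooldown.
    print(check_governance_config(owners=5, threshold=3, cooldown_s=2 * DAY))  # → []
    # A POC-style setup that would fail the production policy.
    print(check_governance_config(owners=2, threshold=1, cooldown_s=3600))
```

A check like this can run in CI against the configuration files that describe the production Safe and timelock, so that a weakened setup is flagged before any deployment step.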
+
+#### 4.2 Harden the KMS deployment
+
+- Follow `02` and `03` to:
+  - Place KMS instances in dedicated subnets with tight firewall rules
+  - Remove unnecessary external IPs and restrict SSH access
+  - Use a private container registry for KMS images
+  - Version KMS images and maintain a rollback strategy
+- Consider running:
+  - Light‑client mode with `helios` to reduce RPC dependencies
+  - Multiple KMS instances (for example, using Onboard) for redundancy and key replication
+
+#### 4.3 Harden Nitro CI/CD and registration
+
+- Use `07` and the Nitro template to:
+  - Ensure Nitro CI builds and releases are reproducible and attested (Sigstore)
+  - Treat `OS_IMAGE_HASH` as an input to governance proposals rather than calling KMS registration contracts directly from CI
+  - Require governance approval (multisig + timelock) for all new or updated enclave measurements in production
+
+#### 4.4 Monitoring, alerting, and runbooks
+
+- Implement the monitoring and alerting setup described in:
+  - `08-monitoring-and-alerting-guide.md`
+  - `09-runbook-operations-upgrade-troubleshooting.md`
+- At a minimum:
+  - Monitor KMS availability, latency, and error rates
+  - Monitor RA‑TLS and attestation failures
+  - Monitor governance transactions, timelock queues, and failures
+  - Define on‑call rotation and incident response procedures
+
+#### 4.5 Pre-production validation and go-live
+
+- Use `11-e2e-test-and-preprod-validation-report.md` to:
+  - Run a full E2E test suite in a pre‑production environment (mirroring production as closely as possible)
+  - Validate governance flows (Safe + timelock), Nitro integration, and monitoring
+  - Capture results and sign‑offs from both vendor and customer teams
+
+After successful pre‑production validation, repeat the same deployment and governance steps in the production environment, with appropriate modifications (for example, production RPC endpoints and Safe signer sets).
+
+### 5. Summary Checklists
+
+#### 5.1 POC checklist
+
+- [ ] GCP project, network, and CVM support verified (`02`)
+- [ ] Governance Safe and basic KMS contracts deployed (`05`, `06`)
+- [ ] Single‑region KMS deployed in direct‑RPC mode (`03`)
+- [ ] Reference Nitro app built and released (`04`, Nitro template)
+- [ ] Enclave measurement registered (dev or governance path)
+- [ ] RA‑TLS demo completed (`getKey(name)` success/failure cases)
+- [ ] Basic monitoring and logs accessible
+
+#### 5.2 Production checklist
+
+- [ ] Production governance Safe set up with appropriate owners and thresholds (`05`, `06`)
+- [ ] Timelock configured with agreed delays and expiration (`05`, `06`)
+- [ ] KMS deployed with hardened network, registry, and OS image practices (`02`, `03`)
+- [ ] Nitro CI/CD integrated with governance (no direct EOA registrations) (`07`)
+- [ ] Monitoring, alerting, and runbooks implemented and tested (`08`, `09`)
+- [ ] E2E test suite executed in pre‑production and results documented (`11`)
+- [ ] Production go‑live executed following approved change management processes
+

diff --git a/docs/14-configuration-reference.md b/docs/14-configuration-reference.md
new file mode 100644
index 0000000..ded5ea5
--- /dev/null
+++ b/docs/14-configuration-reference.md
@@ -0,0 +1,81 @@
+## Configuration Reference
+
+This document summarizes key configuration parameters used across the KMS, Nitro applications, governance contracts, and supporting infrastructure.
+
+It is intended as a quick reference for operators and integrators.
+
+### 1. KMS Runtime Configuration (.env)
+
+These parameters are typically written by `prelaunch.sh` into the KMS project `.env` file and consumed by `docker-compose`.
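For illustration, a populated `.env` for direct RPC mode might look like the fragment below. Every value is a placeholder built from the examples in this document — not a real endpoint, image, or contract address — and the actual file is generated by `prelaunch.sh`.

```shell
# Hypothetical .env for a POC KMS in direct RPC mode; all values are
# illustrative placeholders, not real endpoints or addresses.
KMS_IMAGE=registry.example.com/dstack-kms:v1.0.0
KMS_HTTPS_PORT=12001
AUTH_HTTP_PORT=18000
ETH_RPC_URL=https://base-sepolia.example.com
KMS_CONTRACT_ADDR=0x0000000000000000000000000000000000000000
```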
+
+| Name | Component | Description | Example | Sensitive | Source |
+|--------------------|-----------|------------------------------------------------------------|----------------------------------------------|-----------|-----------------|
+| `KMS_IMAGE` | KMS | Container image used for the KMS runtime | `registry.example.com/dstack-kms:v1.0.0` | No | Config / CI |
+| `KMS_HTTPS_PORT` | KMS | External HTTPS port for KMS API | `12001` | No | Config |
+| `AUTH_HTTP_PORT` | KMS | Internal HTTP port for `auth-api` | `18000` | No | Config |
+| `ETH_RPC_URL` | KMS | Ethereum RPC endpoint (direct RPC mode) | `https://base-sepolia.example.com` | Yes | Secret / Config |
+| `KMS_CONTRACT_ADDR`| KMS | On-chain `DstackKms` contract address | `0x1234...abcd` | No | Config |
+| `HELIOS_IMAGE` | KMS | Optional Helios light-client container image | `registry.example.com/helios:v0.5.0` | No | Config |
+| `DSTACK_REPO` | KMS | dstack or dstack-cloud repository reference | `https://github.com/Dstack-TEE/dstack-cloud` | No | Config / CI |
+| `DSTACK_REF` | KMS | Commit or tag of dstack/dstack-cloud | `14963a2ccb0ec7be...` | No | Config / CI |
+
+Additional environment variables may be introduced as the implementation evolves; they should be documented alongside the deployment scripts.
+
+### 2. Nitro Enclave Application Configuration
+
+These parameters are used when building and configuring Nitro enclave applications that integrate with the KMS.
+
+| Name / File | Component | Description | Example | Sensitive | Source |
+|---------------------|------------|---------------------------------------------------------|------------------------------------------|-----------|------------|
+| `KMS_URL` | Nitro app | URL of the KMS HTTPS endpoint | `https://kms.example.com:12001` | No | Config |
+| `APP_ID` | Nitro app | Application identifier used for key scoping | `0xapp...id` | No | Config |
+| `app/root_ca.pem` | Nitro app | Root CA certificate for KMS TLS pinning | PEM file | Yes | Secret |
+| `OS_IMAGE_HASH` | Nitro app | Combined PCR hash computed from EIF measurements | `0xabc1...def2` | No | CI output |
+
+Changing `KMS_URL`, `APP_ID`, or `app/root_ca.pem` affects the enclave measurements and therefore the `OS_IMAGE_HASH`. These changes must be coordinated with governance to update the set of authorized measurements.
+
+### 3. Governance and On-chain Addresses
+
+These values are typically stored in configuration management systems and used by both on-chain deployment scripts and off-chain components.
+
+| Name | Component | Description | Example | Sensitive | Source |
+|---------------------------|------------------|--------------------------------------------------|----------------|-----------|-------------------|
+| `GOV_SAFE_ADDRESS` | Governance | Address of the governance multisig (Safe) | `0xSafe...001` | No | On-chain / Config |
+| `GOV_TIMELOCK_ADDRESS` | Governance | Address of the timelock module attached to Safe | `0xTime...002` | No | On-chain / Config |
+| `DSTACK_KMS_ADDRESS` | Governance/KMS | `DstackKms` contract proxy address | `0xKms...003` | No | On-chain / Config |
+| `DSTACK_APP_ADDRESS` | Governance/App | `DstackApp` contract address | `0xApp...004` | No | On-chain / Config |
+
+Governance deployment and changes to these addresses must follow the processes described in the governance documents.
+
+### 4. Monitoring and Metrics (Examples)
+
+The exact metric names depend on the implementation, but the following categories should be covered:
+
+#### 4.1 KMS metrics
+
+- Request counts and error rates:
+  - `kms_requests_total{status="ok|error"}`
+  - `kms_ra_tls_failures_total`
+- Latency:
+  - `kms_request_latency_seconds{quantile="0.5|0.95|0.99"}`
+
+#### 4.2 Nitro and RA-TLS metrics
+
+- Enclave lifecycle:
+  - `nitro_enclave_instances_running`
+  - `nitro_enclave_restarts_total`
+- RA-TLS status:
+  - `nitro_ra_tls_success_total`
+  - `nitro_ra_tls_failure_total`
+
+#### 4.3 Governance metrics
+
+- Timelock queue:
+  - `governance_timelock_pending_total`
+  - `governance_timelock_oldest_age_seconds`
+- Transaction outcomes:
+  - `governance_tx_success_total`
+  - `governance_tx_failure_total`
+
+These metric names are examples and should be adapted to the actual monitoring implementation. The monitoring guide (`08-monitoring-and-alerting-guide.md`) provides more context on how to use and alert on these metrics.
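As a sketch of how the example request counters might drive a simple alert condition: the 5% threshold and the function names below are illustrative assumptions, and real alert rules belong in the monitoring stack described in `08-monitoring-and-alerting-guide.md`, not in application code.

```python
# Illustrative alert evaluation over the example KMS request counters.
# The 5% error-rate threshold is an arbitrary placeholder chosen for
# demonstration; tune it against real traffic and SLOs.
def error_rate(ok_total: int, error_total: int) -> float:
    """Fraction of failed requests over a scrape window."""
    total = ok_total + error_total
    return 0.0 if total == 0 else error_total / total

def should_alert(ok_total: int, error_total: int, threshold: float = 0.05) -> bool:
    """Fire only when the error rate is strictly above the threshold."""
    return error_rate(ok_total, error_total) > threshold

if __name__ == "__main__":
    # Exactly at the 5% threshold: no alert (strict comparison).
    print(should_alert(ok_total=950, error_total=50))   # → False
    # 10% error rate: alert.
    print(should_alert(ok_total=900, error_total=100))  # → True
```

In practice the same condition would be expressed over windowed rates of `kms_requests_total{status="ok"}` and `kms_requests_total{status="error"}` in the monitoring system, with the strict-versus-inclusive threshold choice made explicit in the alert rule.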