feat: Add Kubernetes deployment support with preserved E2E test coverage #139
RonTuretzky wants to merge 2 commits into
Conversation
This PR refactors the Docker Compose setup to Kubernetes while preserving all existing E2E test coverage.

## Changes

### Kubernetes Manifests (k8s/)
- Base manifests using Kustomize for namespace, RBAC, ConfigMaps, PVC
- Ethereum deployment and service
- EigenLayer setup as a Kubernetes Job (init-style pattern)
- Cerberus signer deployment
- AVS nodes as a StatefulSet (3 replicas with stable identities)
- Router deployment with ingress support

### Overlays
- CI overlay: NodePort services, local image pull policy
- Local overlay: NodePort services for development

### CI/CD
- New workflow: `k8s-integration-test.yml`
- Uses Kind (Kubernetes in Docker) for ephemeral clusters
- Preserves all test scenarios:
  - Counter increment test (all branches)
  - Fast aggregation test (dev branch)
  - Ingress endpoint test (dev branch)

### Documentation
- Full technical spec at `docs/k8s-refactor-spec.md`
- Architecture diagrams
- Migration path

Closes #138
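For orientation, the base `kustomization.yaml` tying these manifests together would look roughly like the sketch below. The file paths match the review table further down; the exact field layout is an assumption, so see `k8s/base/kustomization.yaml` in the PR for the real file:

```yaml
# Sketch of k8s/base/kustomization.yaml -- illustrative, not the committed file
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - namespace.yaml
  - rbac.yaml
  - configmap.yaml
  - pvc.yaml
  - ethereum/deployment.yaml
  - ethereum/service.yaml
  - eigenlayer/job.yaml
  - signer/deployment.yaml
  - signer/service.yaml
  - nodes/statefulset.yaml
  - nodes/service.yaml
  - router/deployment.yaml
  - router/service.yaml
```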
Pull request overview
This PR successfully refactors the deployment infrastructure from Docker Compose to Kubernetes while preserving all existing E2E test coverage. The implementation uses Kustomize for configuration management with separate base manifests and overlays for CI and local environments.
Key Changes:
- Adds comprehensive Kubernetes manifests using Kustomize structure (base + overlays) for all services (ethereum, eigenlayer, signer, nodes, router)
- Implements new `k8s-integration-test.yml` CI workflow using Kind for E2E testing, with all existing test scenarios preserved
- Provides detailed technical specification documentation covering architecture, deployment flows, and migration strategy
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| `.github/workflows/k8s-integration-test.yml` | New CI workflow for K8s-based E2E testing using Kind, implementing counter increment, fast aggregation, and ingress tests |
| `k8s/base/namespace.yaml` | Defines commonware-avs namespace with standard Kubernetes labels |
| `k8s/base/configmap.yaml` | Contains AVS configuration (quorum settings, operator addresses) and environment variables including test account keys |
| `k8s/base/pvc.yaml` | Persistent volume claim for sharing operator keys and deployment artifacts across pods |
| `k8s/base/rbac.yaml` | Service account and RBAC permissions for init containers to check Job completion status |
| `k8s/base/kustomization.yaml` | Base Kustomize configuration aggregating all K8s resources |
| `k8s/base/ethereum/deployment.yaml` | Ethereum node deployment with readiness/liveness probes |
| `k8s/base/ethereum/service.yaml` | ClusterIP service exposing Ethereum RPC endpoint |
| `k8s/base/eigenlayer/job.yaml` | Job for EigenLayer contract deployment and BLS key generation |
| `k8s/base/signer/deployment.yaml` | Cerberus signer deployment with init container waiting for eigenlayer-setup completion |
| `k8s/base/signer/service.yaml` | Service exposing signer gRPC and metrics ports |
| `k8s/base/nodes/statefulset.yaml` | StatefulSet for 3 AVS nodes with dynamic port assignment and stable network identities |
| `k8s/base/nodes/service.yaml` | Headless service for StatefulSet pod discovery |
| `k8s/base/router/deployment.yaml` | Router deployment with init container waiting for all nodes to be ready |
| `k8s/base/router/service.yaml` | Service exposing router app and ingress endpoints |
| `k8s/overlays/ci/kustomization.yaml` | CI-specific overlay with image name overrides and patches |
| `k8s/overlays/ci/patches/image-pull-policy.yaml` | Sets imagePullPolicy to Never/IfNotPresent for CI environment |
| `k8s/overlays/ci/patches/nodeport-services.yaml` | Converts services to NodePort for external access in Kind |
| `k8s/overlays/ci/patches/pvc-storage-class.yaml` | Ensures PVC uses 'standard' storage class available in Kind |
| `k8s/overlays/local/kustomization.yaml` | Local development overlay configuration |
| `k8s/overlays/local/patches/nodeport-services.yaml` | NodePort services for local cluster access |
| `k8s/kind-config.yaml` | Kind cluster configuration with port mappings for ethereum, router app, and ingress |
| `docs/k8s-refactor-spec.md` | Comprehensive technical specification covering architecture, flows, manifests, and migration strategy |
```yaml
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard
```
The base PVC specifies storageClassName: standard, which may not exist in all Kubernetes clusters. The "standard" storage class is not guaranteed to exist and varies by cluster provider. Consider either: (1) removing the storageClassName to use the cluster's default storage class, or (2) documenting the assumption that a "standard" storage class must exist. The CI overlay already patches this, which is good practice.
```diff
-  storageClassName: standard
```
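A minimal sketch of the suggested base PVC with `storageClassName` omitted (so the cluster default applies), alongside the CI patch that pins it for Kind; the PVC name and access mode here are assumptions:

```yaml
# Sketch: k8s/base/pvc.yaml without storageClassName (cluster default is used)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nodes-data          # assumed name, matching the volume used by the pods
  namespace: commonware-avs
spec:
  accessModes:
    - ReadWriteOnce         # assumed access mode
  resources:
    requests:
      storage: 1Gi
---
# Sketch: k8s/overlays/ci/patches/pvc-storage-class.yaml re-adds it for Kind
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nodes-data
spec:
  storageClassName: standard
```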
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: avs-service-account
  namespace: commonware-avs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-reader
  namespace: commonware-avs
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: job-reader-binding
  namespace: commonware-avs
subjects:
  - kind: ServiceAccount
    name: avs-service-account
    namespace: commonware-avs
roleRef:
  kind: Role
  name: job-reader
  apiGroup: rbac.authorization.k8s.io
```
[nitpick] The service account avs-service-account is assigned to the signer deployment (line 62 in signer/deployment.yaml) and nodes StatefulSet (line 98 in nodes/statefulset.yaml), giving them permissions to read Jobs. However, only the signer and nodes need this permission for their init containers to check job completion. The router and ethereum deployments don't specify a serviceAccountName, so they'll use the default service account. This is correct, but consider documenting why only specific workloads need this RBAC permission.
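For reference, wiring a workload to this service account is a single field in the pod template; a sketch of the relevant excerpt (the surrounding fields are assumptions):

```yaml
# Sketch: pod template excerpt for the signer Deployment / nodes StatefulSet
spec:
  template:
    spec:
      # Grants the job-reader Role through the RoleBinding above, so the
      # init container can read Job status; other workloads omit this field
      # and fall back to the default service account.
      serviceAccountName: avs-service-account
```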
```yaml
spec:
  initContainers:
    - name: wait-for-eigenlayer
      image: bitnami/kubectl:latest
```
The documentation shows bitnami/kubectl:latest but the actual implementation uses bitnami/kubectl:1.28 (line 20 in signer/deployment.yaml and line 22 in nodes/statefulset.yaml). The documentation should be updated to match the implementation, which correctly pins a specific version instead of using latest.
```diff
-      image: bitnami/kubectl:latest
+      image: bitnami/kubectl:1.28
```
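For context, the wait-for-Job pattern discussed here can be expressed with `kubectl wait`; a minimal sketch, assuming the setup Job is named `eigenlayer-setup` and a 600s timeout (both illustrative):

```yaml
initContainers:
  - name: wait-for-eigenlayer
    image: bitnami/kubectl:1.28
    command:
      - kubectl
      - wait
      - --for=condition=complete
      - job/eigenlayer-setup
      - --namespace=commonware-avs
      - --timeout=600s
```

Note that this only needs the `get`/`watch` permissions on Jobs granted by the `job-reader` Role above.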
```
└── local/
    ├── kustomization.yaml
    └── patches/
        └── node-ports.yaml
```
The documentation refers to node-ports.yaml but the actual file is named nodeport-services.yaml (in k8s/overlays/local/patches/). Update the documentation to match the actual filename.
```diff
-        └── node-ports.yaml
+        └── nodeport-services.yaml
```
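For illustration, a patch of this kind is typically a small strategic-merge overlay; a sketch, with the service name and port numbers assumed rather than taken from the PR:

```yaml
# Sketch: k8s/overlays/local/patches/nodeport-services.yaml
apiVersion: v1
kind: Service
metadata:
  name: ethereum            # assumed service name
spec:
  type: NodePort
  ports:
    - name: rpc
      port: 8545
      targetPort: 8545
      nodePort: 30545       # assumed NodePort value
```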
```
├── ci/
│   ├── kustomization.yaml
│   └── patches/
│       └── image-pull-policy.yaml
```
The documentation shows only image-pull-policy.yaml under CI patches, but the actual CI overlay includes three patch files: image-pull-policy.yaml, nodeport-services.yaml, and pvc-storage-class.yaml (see k8s/overlays/ci/kustomization.yaml lines 10-12). The documentation should list all three patches.
```diff
-│       └── image-pull-policy.yaml
+│       ├── image-pull-policy.yaml
+│       ├── nodeport-services.yaml
+│       └── pvc-storage-class.yaml
```
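For reference, the CI overlay's kustomization wires in all three patches roughly as follows (a sketch; whether the real file uses `patches` or `patchesStrategicMerge` is an assumption):

```yaml
# Sketch of k8s/overlays/ci/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

patches:
  - path: patches/image-pull-policy.yaml
  - path: patches/nodeport-services.yaml
  - path: patches/pvc-storage-class.yaml
```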
```yaml
  name: commonware-avs
  labels:
    app.kubernetes.io/name: commonware-avs
    app.kubernetes.io/part-of: commonware-restaking
```
The namespace defines a label app.kubernetes.io/part-of: commonware-restaking, but the kustomization.yaml commonLabels (line 22 in k8s/base/kustomization.yaml) uses app.kubernetes.io/part-of: commonware-avs. This inconsistency could cause confusion. Consider aligning these labels to use the same value.
```diff
-    app.kubernetes.io/part-of: commonware-restaking
+    app.kubernetes.io/part-of: commonware-avs
```
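After this change both definitions would agree; a sketch of the aligned `commonLabels` entry (the rest of the kustomization is omitted):

```yaml
# Sketch: k8s/base/kustomization.yaml excerpt, matching the namespace label
commonLabels:
  app.kubernetes.io/part-of: commonware-avs
```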
```bash
  kubectl logs deployment/router -n commonware-avs --tail 50
  exit 1
fi
```
The success message states "10 seconds" but the test actually waits for 60 seconds (line 219: sleep 60). This should be updated to say "~60 seconds" to match the actual wait time.
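A hedged sketch of what the corrected workflow step could look like (the step name and message text are illustrative, not the committed workflow):

```yaml
- name: Verify counter increments
  run: |
    # Give the nodes time to produce increments before checking
    sleep 60
    echo "Counter incremented successfully after ~60 seconds"
```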
```yaml
  - containerPort: 30000
    hostPort: 4000
    protocol: TCP
```
[nitpick] The hostPort mapping for the router app port maps containerPort 30000 to hostPort 4000, but the workflow accesses the router on port 3000 (not 4000). This appears to be intentional since the router service uses NodePort 30000 for the app port internally, but it creates confusion. The workflow doesn't actually use port 4000 anywhere - it only uses port 8545 (ethereum) and 8080 (ingress). Consider removing this unused port mapping or documenting why it's included if it's for future use or local development.
```diff
-  - containerPort: 30000
-    hostPort: 4000
-    protocol: TCP
```
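If the unused mapping is removed, the Kind config would keep only the two ports the workflow exercises; a sketch (the containerPort values, i.e. the NodePorts inside the cluster, are assumptions):

```yaml
# Sketch of k8s/kind-config.yaml after removing the unused router app mapping
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30545   # assumed ethereum RPC NodePort
        hostPort: 8545
        protocol: TCP
      - containerPort: 30080   # assumed router ingress NodePort
        hostPort: 8080
        protocol: TCP
```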
| "address": "router" | ||
| } | ||
| router_orchestrator.json: | | ||
| { |
A private key is hardcoded in the ConfigMap. While this appears to be for testing/development purposes, it should be documented or have a clear comment indicating this is a test-only key and should never be used in production. Consider adding a comment above this field explaining its test-only nature.
```diff
   {
+    "__comment": "This privateKey is for testing/development only. NEVER use this key in production.",
```
| - "3000" | ||
| volumeMounts: | ||
| - name: nodes-data | ||
| mountPath: /app/.nodes |
The router pod mounts the entire nodes-data PVC at /app/.nodes, which also stores the BLS operator private keys generated by the eigenlayer-setup job and used by the AVS nodes. This gives the internet-facing router process unnecessary read access to highly sensitive key material, so a compromise of the router (e.g., via an RCE or path traversal bug) would allow an attacker to exfiltrate operator keys and impersonate AVS nodes. To reduce blast radius, avoid mounting the full nodes-data volume into the router and instead expose only the minimal artifact it needs (e.g., a separate volume or subPath for avs_deploy.json) or refactor so keys reside in a dedicated Secret/volume that is not accessible to the router.
```diff
-          mountPath: /app/.nodes
+          mountPath: /app/.nodes/avs_deploy.json
+          subPath: avs_deploy.json
```
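In context, the suggested `subPath` mount pairs with the volume definition like this (a sketch; the claim name is an assumption, and the PVC itself is unchanged):

```yaml
# Sketch: router pod template excerpt with the narrowed mount
containers:
  - name: router
    volumeMounts:
      - name: nodes-data
        mountPath: /app/.nodes/avs_deploy.json
        subPath: avs_deploy.json   # expose only the deploy artifact, not operator keys
volumes:
  - name: nodes-data
    persistentVolumeClaim:
      claimName: nodes-data        # assumed claim name
```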
Fixes based on Copilot and manual review:

Security improvements:
- Move router private key from ConfigMap to Secret
- Router now only mounts avs_deploy.json instead of the entire .nodes directory (prevents access to operator keys if the router is compromised)

Port handling fix:
- All AVS nodes now listen on port 3001 (each pod has its own IP in K8s)
- Simplifies headless service configuration
- Fixes readiness/liveness probe port mismatch

Configuration fixes:
- Remove storageClassName from base PVC (use cluster default)
- Fix namespace label inconsistency (commonware-avs)
- Add comments to kind port mappings explaining usage
- Update operator addresses in config to use port 3001

Documentation updates:
- Add rbac.yaml and secret.yaml to directory layout
- Add all CI overlay patches to directory layout
- Fix bitnami/kubectl version to 1.28 (not latest)
- Update examples to reflect security improvements
Summary
This PR refactors the Docker Compose setup to Kubernetes while preserving all existing E2E test coverage.
- New CI workflow `k8s-integration-test.yml` using Kind

Test plan
- `k8s-integration-test.yml` workflow passes
- `kubectl apply -k k8s/overlays/local/`

Closes #138
cc @dijarllozana