feat: Add Kubernetes deployment support with preserved E2E test coverage #139

Open

RonTuretzky wants to merge 2 commits into dev from RonTuretzky/k8s-refactor

feat: Add Kubernetes deployment support with preserved E2E test coverage#139
RonTuretzky wants to merge 2 commits into
devfrom
RonTuretzky/k8s-refactor

Conversation

@RonTuretzky
Contributor

Summary

This PR refactors the Docker Compose setup to Kubernetes while preserving all existing E2E test coverage.

  • Add Kubernetes manifests using Kustomize (base + overlays for CI and local)
  • Add new CI workflow k8s-integration-test.yml using Kind
  • Preserve all existing test scenarios: counter increment, fast aggregation, ingress
  • Add full technical spec documentation

Test plan

  • Verify k8s-integration-test.yml workflow passes
  • Verify counter increment test passes
  • Verify fast aggregation test passes (dev branch)
  • Verify ingress endpoint test passes (dev branch)
  • Test local development with kubectl apply -k k8s/overlays/local/

Closes #138

cc @dijarllozana

This PR refactors the Docker Compose setup to Kubernetes while preserving
all existing E2E test coverage:

## Changes

### Kubernetes Manifests (k8s/)
- Base manifests using Kustomize for namespace, RBAC, ConfigMaps, PVC
- Ethereum deployment and service
- EigenLayer setup as a Kubernetes Job (init-style pattern)
- Cerberus signer deployment
- AVS nodes as a StatefulSet (3 replicas with stable identities)
- Router deployment with ingress support
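
The base layout above implies a top-level `kustomization.yaml` that aggregates the per-service manifests. A minimal sketch (resource paths inferred from the changed-files list in this PR; the actual file may differ):

```yaml
# k8s/base/kustomization.yaml (illustrative sketch, not the actual file)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: commonware-avs

resources:
  - namespace.yaml
  - rbac.yaml
  - configmap.yaml
  - pvc.yaml
  - ethereum/deployment.yaml
  - ethereum/service.yaml
  - eigenlayer/job.yaml
  - signer/deployment.yaml
  - signer/service.yaml
  - nodes/statefulset.yaml
  - nodes/service.yaml
  - router/deployment.yaml
  - router/service.yaml
```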

### Overlays
- CI overlay: NodePort services, local image pull policy
- Local overlay: NodePort services for development
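
Each overlay is a thin Kustomize layer over the base. A sketch of how the local overlay and its NodePort patch might fit together (the service name and patch contents are assumptions for illustration, not the actual files):

```yaml
# k8s/overlays/local/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

patches:
  - path: patches/nodeport-services.yaml
---
# k8s/overlays/local/patches/nodeport-services.yaml (illustrative sketch)
# Strategic-merge patch switching the router service to NodePort
apiVersion: v1
kind: Service
metadata:
  name: router
spec:
  type: NodePort
```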

### CI/CD
- New workflow: k8s-integration-test.yml
- Uses Kind (Kubernetes in Docker) for ephemeral clusters
- Preserves all test scenarios:
  - Counter increment test (all branches)
  - Fast aggregation test (dev branch)
  - Ingress endpoint test (dev branch)
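
The Kind-based workflow described above could take roughly this shape (job, step, and resource names here are assumptions for illustration; the real workflow is .github/workflows/k8s-integration-test.yml):

```yaml
# Illustrative sketch of a Kind-based E2E job, not the actual workflow
jobs:
  k8s-e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Create Kind cluster
        uses: helm/kind-action@v1
        with:
          config: k8s/kind-config.yaml
      - name: Deploy stack
        run: kubectl apply -k k8s/overlays/ci/
      - name: Wait for AVS nodes
        run: kubectl rollout status statefulset/nodes -n commonware-avs --timeout=300s
```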

### Documentation
- Full technical spec at docs/k8s-refactor-spec.md
- Architecture diagrams
- Migration path

Closes #138

Copilot AI left a comment


Pull request overview

This PR successfully refactors the deployment infrastructure from Docker Compose to Kubernetes while preserving all existing E2E test coverage. The implementation uses Kustomize for configuration management with separate base manifests and overlays for CI and local environments.

Key Changes:

  • Adds comprehensive Kubernetes manifests using Kustomize structure (base + overlays) for all services (ethereum, eigenlayer, signer, nodes, router)
  • Implements new k8s-integration-test.yml CI workflow using Kind for E2E testing with all existing test scenarios preserved
  • Provides detailed technical specification documentation covering architecture, deployment flows, and migration strategy

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 13 comments.

Summary per file:

| File | Description |
| --- | --- |
| .github/workflows/k8s-integration-test.yml | New CI workflow for K8s-based E2E testing using Kind, implementing counter increment, fast aggregation, and ingress tests |
| k8s/base/namespace.yaml | Defines commonware-avs namespace with standard Kubernetes labels |
| k8s/base/configmap.yaml | Contains AVS configuration (quorum settings, operator addresses) and environment variables including test account keys |
| k8s/base/pvc.yaml | Persistent volume claim for sharing operator keys and deployment artifacts across pods |
| k8s/base/rbac.yaml | Service account and RBAC permissions for init containers to check Job completion status |
| k8s/base/kustomization.yaml | Base Kustomize configuration aggregating all K8s resources |
| k8s/base/ethereum/deployment.yaml | Ethereum node deployment with readiness/liveness probes |
| k8s/base/ethereum/service.yaml | ClusterIP service exposing Ethereum RPC endpoint |
| k8s/base/eigenlayer/job.yaml | Job for EigenLayer contract deployment and BLS key generation |
| k8s/base/signer/deployment.yaml | Cerberus signer deployment with init container waiting for eigenlayer-setup completion |
| k8s/base/signer/service.yaml | Service exposing signer gRPC and metrics ports |
| k8s/base/nodes/statefulset.yaml | StatefulSet for 3 AVS nodes with dynamic port assignment and stable network identities |
| k8s/base/nodes/service.yaml | Headless service for StatefulSet pod discovery |
| k8s/base/router/deployment.yaml | Router deployment with init container waiting for all nodes to be ready |
| k8s/base/router/service.yaml | Service exposing router app and ingress endpoints |
| k8s/overlays/ci/kustomization.yaml | CI-specific overlay with image name overrides and patches |
| k8s/overlays/ci/patches/image-pull-policy.yaml | Sets imagePullPolicy to Never/IfNotPresent for CI environment |
| k8s/overlays/ci/patches/nodeport-services.yaml | Converts services to NodePort for external access in Kind |
| k8s/overlays/ci/patches/pvc-storage-class.yaml | Ensures PVC uses 'standard' storage class available in Kind |
| k8s/overlays/local/kustomization.yaml | Local development overlay configuration |
| k8s/overlays/local/patches/nodeport-services.yaml | NodePort services for local cluster access |
| k8s/kind-config.yaml | Kind cluster configuration with port mappings for ethereum, router app, and ingress |
| docs/k8s-refactor-spec.md | Comprehensive technical specification covering architecture, flows, manifests, and migration strategy |


Comment thread: k8s/base/pvc.yaml (Outdated)

```yaml
resources:
  requests:
    storage: 1Gi
storageClassName: standard
```
Copilot AI Dec 9, 2025

The base PVC specifies storageClassName: standard, which may not exist in all Kubernetes clusters. The "standard" storage class is not guaranteed to exist and varies by cluster provider. Consider either: (1) removing the storageClassName to use the cluster's default storage class, or (2) documenting the assumption that a "standard" storage class must exist. The CI overlay already patches this, which is good practice.
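
Since the CI overlay pins the storage class for Kind, that patch could be as small as the following sketch (illustrative only; the actual file is k8s/overlays/ci/patches/pvc-storage-class.yaml, and the PVC name nodes-data is taken from the review discussion below):

```yaml
# Illustrative patch pinning the PVC to Kind's built-in "standard" class
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nodes-data
spec:
  storageClassName: standard
```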

Suggested change:

```diff
-storageClassName: standard
```
Comment thread: k8s/base/rbac.yaml (lines +1 to +29)

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: avs-service-account
  namespace: commonware-avs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-reader
  namespace: commonware-avs
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: job-reader-binding
  namespace: commonware-avs
subjects:
  - kind: ServiceAccount
    name: avs-service-account
    namespace: commonware-avs
roleRef:
  kind: Role
  name: job-reader
  apiGroup: rbac.authorization.k8s.io
```
Copilot AI Dec 9, 2025

[nitpick] The service account avs-service-account is assigned to the signer deployment (line 62 in signer/deployment.yaml) and nodes StatefulSet (line 98 in nodes/statefulset.yaml), giving them permissions to read Jobs. However, only the signer and nodes need this permission for their init containers to check job completion. The router and ethereum deployments don't specify a serviceAccountName, so they'll use the default service account. This is correct, but consider documenting why only specific workloads need this RBAC permission.
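
For context, an init container exercising this Role would look roughly like the following sketch (the container name, Job name, and kubectl image tag are taken from elsewhere in this review; the exact command is an assumption):

```yaml
initContainers:
  - name: wait-for-eigenlayer
    image: bitnami/kubectl:1.28
    command:
      - /bin/sh
      - -c
      # Needs get/list/watch on jobs, exactly what the job-reader Role grants
      - kubectl wait --for=condition=complete job/eigenlayer-setup -n commonware-avs --timeout=600s
```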

Comment thread: docs/k8s-refactor-spec.md (Outdated)

```yaml
spec:
  initContainers:
    - name: wait-for-eigenlayer
      image: bitnami/kubectl:latest
```
Copilot AI Dec 9, 2025

The documentation shows bitnami/kubectl:latest but the actual implementation uses bitnami/kubectl:1.28 (line 20 in signer/deployment.yaml and line 22 in nodes/statefulset.yaml). The documentation should be updated to match the implementation, which correctly pins a specific version instead of using latest.

Suggested change:

```diff
-image: bitnami/kubectl:latest
+image: bitnami/kubectl:1.28
```
Comment thread: docs/k8s-refactor-spec.md (Outdated)

```
└── local/
    ├── kustomization.yaml
    └── patches/
        └── node-ports.yaml
```
Copilot AI Dec 9, 2025

The documentation refers to node-ports.yaml but the actual file is named nodeport-services.yaml (in k8s/overlays/local/patches/). Update the documentation to match the actual filename.

Suggested change:

```diff
-        └── node-ports.yaml
+        └── nodeport-services.yaml
```
Comment thread: docs/k8s-refactor-spec.md (Outdated)

```
├── ci/
│   ├── kustomization.yaml
│   └── patches/
│       └── image-pull-policy.yaml
```
Copilot AI Dec 9, 2025

The documentation shows only image-pull-policy.yaml under CI patches, but the actual CI overlay includes three patch files: image-pull-policy.yaml, nodeport-services.yaml, and pvc-storage-class.yaml (see k8s/overlays/ci/kustomization.yaml lines 10-12). The documentation should list all three patches.

Suggested change:

```diff
-│       └── image-pull-policy.yaml
+│       ├── image-pull-policy.yaml
+│       ├── nodeport-services.yaml
+│       └── pvc-storage-class.yaml
```
Comment thread: k8s/base/namespace.yaml (Outdated)

```yaml
name: commonware-avs
labels:
  app.kubernetes.io/name: commonware-avs
  app.kubernetes.io/part-of: commonware-restaking
```
Copilot AI Dec 9, 2025

The namespace defines a label app.kubernetes.io/part-of: commonware-restaking, but the kustomization.yaml commonLabels (line 22 in k8s/base/kustomization.yaml) uses app.kubernetes.io/part-of: commonware-avs. This inconsistency could cause confusion. Consider aligning these labels to use the same value.

Suggested change:

```diff
-app.kubernetes.io/part-of: commonware-restaking
+app.kubernetes.io/part-of: commonware-avs
```
Comment thread: .github/workflows/k8s-integration-test.yml

```shell
    kubectl logs deployment/router -n commonware-avs --tail 50
    exit 1
fi
```
Copilot AI Dec 9, 2025

The success message states "10 seconds" but the test actually waits for 60 seconds (line 219: sleep 60). This should be updated to say "~60 seconds" to match the actual wait time.

Comment thread: k8s/kind-config.yaml (lines +10 to +12)

```yaml
  - containerPort: 30000
    hostPort: 4000
    protocol: TCP
```
Copilot AI Dec 9, 2025

[nitpick] The hostPort mapping for the router app port maps containerPort 30000 to hostPort 4000, but the workflow accesses the router on port 3000 (not 4000). This appears to be intentional since the router service uses NodePort 30000 for the app port internally, but it creates confusion. The workflow doesn't actually use port 4000 anywhere - it only uses port 8545 (ethereum) and 8080 (ingress). Consider removing this unused port mapping or documenting why it's included if it's for future use or local development.

Suggested change:

```diff
-  - containerPort: 30000
-    hostPort: 4000
-    protocol: TCP
```
Comment thread: k8s/base/configmap.yaml (Outdated)

```yaml
      "address": "router"
    }
  router_orchestrator.json: |
    {
```
Copilot AI Dec 9, 2025

A private key is hardcoded in the ConfigMap. While this appears to be for testing/development purposes, it should be documented or have a clear comment indicating this is a test-only key and should never be used in production. Consider adding a comment above this field explaining its test-only nature.

Suggested change:

```diff
 {
+  "__comment": "This privateKey is for testing/development only. NEVER use this key in production.",
```
Comment thread: docs/k8s-refactor-spec.md (Outdated)

```yaml
  - "3000"
volumeMounts:
  - name: nodes-data
    mountPath: /app/.nodes
```
Copilot AI Dec 9, 2025

The router pod mounts the entire nodes-data PVC at /app/.nodes, which also stores the BLS operator private keys generated by the eigenlayer-setup job and used by the AVS nodes. This gives the internet-facing router process unnecessary read access to highly sensitive key material, so a compromise of the router (e.g., via an RCE or path traversal bug) would allow an attacker to exfiltrate operator keys and impersonate AVS nodes. To reduce blast radius, avoid mounting the full nodes-data volume into the router and instead expose only the minimal artifact it needs (e.g., a separate volume or subPath for avs_deploy.json) or refactor so keys reside in a dedicated Secret/volume that is not accessible to the router.

Suggested change:

```diff
-mountPath: /app/.nodes
+mountPath: /app/.nodes/avs_deploy.json
+subPath: avs_deploy.json
```
Fixes based on Copilot and manual review:

Security improvements:
- Move router private key from ConfigMap to Secret
- Router now only mounts avs_deploy.json instead of entire .nodes directory
  (prevents access to operator keys if router is compromised)

Port handling fix:
- All AVS nodes now listen on port 3001 (each pod has its own IP in K8s)
- Simplifies headless service configuration
- Fixes readiness/liveness probe port mismatch

Configuration fixes:
- Remove storageClassName from base PVC (use cluster default)
- Fix namespace label inconsistency (commonware-avs)
- Add comments to kind port mappings explaining usage
- Update operator addresses in config to use port 3001

Documentation updates:
- Add rbac.yaml and secret.yaml to directory layout
- Add all CI overlay patches to directory layout
- Fix bitnami/kubectl version to 1.28 (not latest)
- Update examples to reflect security improvements
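
The "all nodes listen on port 3001" simplification pairs naturally with a headless service, which could be sketched as follows (the service name, selector labels, and port name are assumptions for illustration):

```yaml
# Illustrative headless service for the AVS node StatefulSet: clusterIP None
# gives each pod a stable DNS record, and every pod listens on the same 3001
apiVersion: v1
kind: Service
metadata:
  name: nodes
  namespace: commonware-avs
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/name: nodes
  ports:
    - name: p2p
      port: 3001
      targetPort: 3001
```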


Development

Successfully merging this pull request may close these issues.

Refactor Docker Compose to Kubernetes

2 participants