Feat/spire infrastructure deployment#251
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request introduces SPIRE integration to provide internal workload identities across the AgentCube components. It adds the necessary Kubernetes manifests for the SPIRE Server, Agent, and Controller Manager, along with spiffe-helper sidecars for the router and workload manager. Key feedback points out a critical data persistence issue where emptyDir is used for the SPIRE Server's data, which would lead to CA loss on pod restarts. Additionally, there are concerns regarding the lack of namespace scoping in sandbox SPIFFE IDs and the insecure default for kubelet verification in the agent configuration.
There was a problem hiding this comment.
Pull request overview
This PR adds an optional SPIRE-based workload identity stack to the AgentCube Helm chart (under manifests/charts/base), intended to support the project’s internal zero-trust mTLS direction described in auth-proposal.md. The deployment is controlled via a new spire.enabled value and introduces SPIRE server/agent/controller-manager resources plus spiffe-helper sidecars for Router and WorkloadManager.
Changes:
- Add a new
spire:section in Helm values (trust domain, images, resources, helper cert paths). - Inject
spiffe-helpersidecars + socket/cert volumes into Router and WorkloadManager when SPIRE is enabled. - Add SPIRE server/agent/controller-manager manifests, RBAC, webhook service/config, and ClusterSPIFFEID registrations.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| manifests/charts/base/values.yaml | Adds spire configuration; changes router SA/RBAC defaults. |
| manifests/charts/base/templates/agentcube-router.yaml | Adds optional spiffe-helper sidecar + volumes for Router. |
| manifests/charts/base/templates/workloadmanager.yaml | Adds optional spiffe-helper sidecar + volumes for WorkloadManager. |
| manifests/charts/base/templates/rbac-router.yaml | Creates Router ServiceAccount when SPIRE is enabled; keeps RBAC conditional. |
| manifests/charts/base/templates/spire/spire-server.yaml | New SPIRE server StatefulSet + config + service; controller-manager sidecar included. |
| manifests/charts/base/templates/spire/spire-agent.yaml | New SPIRE agent DaemonSet + config and node attestation token projection. |
| manifests/charts/base/templates/spire/spire-controller-manager.yaml | New controller-manager config (ConfigMap). |
| manifests/charts/base/templates/spire/validating-webhook.yaml | New webhook service + ValidatingWebhookConfiguration. |
| manifests/charts/base/templates/spire/spiffe-helper-config.yaml | New shared spiffe-helper configuration ConfigMap. |
| manifests/charts/base/templates/spire/rbac.yaml | New SPIRE server/agent ServiceAccounts + cluster-scoped RBAC. |
| manifests/charts/base/templates/spire/cluster-spiffe-ids.yaml | New ClusterSPIFFEID resources for Router/WorkloadManager/sandboxes. |
manifests/charts/base/values.yaml
Outdated
| serviceAccountName: "agentcube-router" | ||
| rbac: | ||
| create: false | ||
| create: true |
There was a problem hiding this comment.
The PR description says SPIRE is fully gated behind spire.enabled for backward compatibility, but this chart now defaults router.serviceAccountName to agentcube-router and router.rbac.create to true, which changes the default resources/permissions even when spire.enabled=false. Consider reverting these defaults (or gating them behind spire.enabled) to preserve existing install behavior, especially since enabling router.rbac.create creates secret CRUD RBAC by default.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #251 +/- ##
==========================================
+ Coverage 35.60% 43.32% +7.71%
==========================================
Files 29 30 +1
Lines 2533 2613 +80
==========================================
+ Hits 902 1132 +230
+ Misses 1505 1358 -147
+ Partials 126 123 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@hzxuzhonghu @acsoto @YaoZengzeng PTAL! |
…les, ClusterRoleBindings) for SPIRE components gated by spire.enabled Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…omponents gated by spire.enabled Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…onents gated by spire.enabled Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…RDs for SPIRE components gated by spire.enabled Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…pire-server instead of separate deployment Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…station RBAC Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…ase-unique global resources Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
8ea4c84 to
4f88c6a
Compare
| {{- if .Values.spire.enabled }} | ||
| # Router registration | ||
| apiVersion: spire.spiffe.io/v1alpha1 | ||
| kind: ClusterSPIFFEID | ||
| metadata: |
There was a problem hiding this comment.
This chart creates ClusterSPIFFEID resources, but it doesn’t appear to install the required spire-controller-manager CRDs (no spire.spiffe.io CRDs under manifests/charts/base/crds/). With spire.enabled=true, Helm/Kubernetes will reject these manifests unless the CRDs are pre-installed, causing installation/upgrade to fail. Consider shipping the CRDs (or a dedicated dependency/subchart) and documenting the install order/requirements clearly if they must be applied separately.
| certDir: "/run/spire/certs" | ||
| certFileName: "svid.pem" | ||
| keyFileName: "svid_key.pem" | ||
| bundleFileName: "svid_bundle.pem" |
| kind: Role | ||
| name: {{ .Values.router.serviceAccountName | default "agentcube-router" }} | ||
| {{- end }} | ||
| {{- if or .Values.router.rbac.create .Values.spire.enabled }} |
There was a problem hiding this comment.
seems we can remove .Values.router.rbac.create
There was a problem hiding this comment.
with .Values.spire.enabled, what sepcial permission needed?
There was a problem hiding this comment.
seems we can remove .Values.router.rbac.create
Removed router.rbac.create the router always needs its SA and RBAC permissions, so these are now created unconditionally. Also cleaned up the unused toggle from values.yaml
There was a problem hiding this comment.
with .Values.spire.enabled, what sepcial permission needed?
SPIRE doesn't require any special permissions on the router's SA. It only needs the SA to exist so the k8s_psat attestor can match the pod's identity by SA name against the ClusterSPIFFEID CRD. All SPIRE-specific RBAC is handled separately in spire/rbac.yaml
| rules: | ||
| - apiGroups: [""] | ||
| resources: ["secrets"] | ||
| verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] |
There was a problem hiding this comment.
I can see router always create/get secret now
There was a problem hiding this comment.
removed other permission then ["get", "create"]. You correctly spotted that the router's JWTManager only calls Secrets().Create() on first boot and Secrets().Get() on subsequent restarts to load the existing identity keypair.
… defaults and added comments for better understanding Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
What type of PR is this?
/kind feature
Which issue(s) this PR fixes:
Fixes a Part of #243
What this PR does / why we need it:
Description
This PR implements the complete foundational SPIRE infrastructure from scratch, setting up the secure workload identity framework required for our zero-trust mTLS architecture outlined in the auth-proposal.md. Everything is cleanly bundled into our Helm deployments and gated behind a
spire.enabledflag to ensure 100% backward compatibility for environments that do not require strict internal mTLS.Core Changes
SPIRE Core Infrastructure
Deployed the
spire-server(StatefulSet) andspire-agent(DaemonSet) along with their respective configuration ConfigMaps, NodeAttestation schemas, and UNIX socket hostPaths.Dynamic Identity Registration
Deployed the
spire-controller-manageras a sidecar to the server and introduced declarativeClusterSPIFFEIDCRD bindings. This allows SPIRE to dynamically map and issue identities to theagentcube-router,workloadmanager, and ephemeralpicod sandboxesnatively via Kubernetes lifecycle events.Certificate Delivery (
spiffe-helper)Injected lightweight
spiffe-helpersidecars and emptyDir volumes into the Router and WorkloadManager deployments. These natively sync SVIDs (svid.pem,svid_bundle.pem,svid_key.pem) to disk so our components can natively execute mTLS rotations.Robust & Hardened RBAC
Handcrafted granular
ServiceAccounts,ClusterRoles, andRoleBindingsfor the SPIRE components, ensuring the spire-agent has secure nodes/proxykubeletattestation powers, while maintaining tight scope limits across the board.Fully Configurable Architecture
Centralized control over the
cluster.localtrust domain and image versions directly insidevalues.yaml.Verification
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
yes