Feat/spire infrastructure deployment#251

Open
mahil-2040 wants to merge 9 commits into volcano-sh:main from mahil-2040:feat/spire-infrastructure-deployment
@mahil-2040
Contributor

What type of PR is this?

/kind feature

Which issue(s) this PR fixes:
Fixes part of #243

What this PR does / why we need it:

Description

This PR implements the complete foundational SPIRE infrastructure from scratch, setting up the secure workload identity framework required for our zero-trust mTLS architecture outlined in the auth-proposal.md. Everything is cleanly bundled into our Helm deployments and gated behind a spire.enabled flag to ensure 100% backward compatibility for environments that do not require strict internal mTLS.


Core Changes

SPIRE Core Infrastructure

Deployed the spire-server (StatefulSet) and spire-agent (DaemonSet) along with their respective configuration ConfigMaps, NodeAttestation schemas, and UNIX socket hostPaths.


Dynamic Identity Registration

Deployed the spire-controller-manager as a sidecar to the server and introduced declarative ClusterSPIFFEID CRD bindings. This allows SPIRE to dynamically map and issue identities to the agentcube-router, workloadmanager, and ephemeral picod sandboxes natively via Kubernetes lifecycle events.
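As a rough illustration of the declarative binding described above, a rendered ClusterSPIFFEID for the Router might look like the sketch below. The resource name, the `app: agentcube-router` pod label, and the SPIFFE ID path layout are assumptions made for illustration, not the values used in this PR. The `{{ ... }}` expressions are evaluated by spire-controller-manager, so inside a Helm template they would need to be escaped.

```yaml
# Rendered manifest, not Helm template source. Names and labels are illustrative.
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: agentcube-router            # hypothetical resource name
spec:
  # Evaluated by spire-controller-manager for every matching pod.
  # Including the namespace keeps identities distinct across namespaces.
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  podSelector:
    matchLabels:
      app: agentcube-router         # assumed pod label
```

A matching pod is registered when it is created and deregistered when it goes away, which is what makes this approach suitable for short-lived sandboxes.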


Certificate Delivery (spiffe-helper)

Injected lightweight spiffe-helper sidecars and emptyDir volumes into the Router and WorkloadManager deployments. The sidecars write the SVID files (svid.pem, svid_bundle.pem, svid_key.pem) to the shared volume and refresh them on rotation, so our components can pick up rotated certificates for mTLS.
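For orientation, a spiffe-helper configuration that produces these three files could be shipped in a ConfigMap along these lines. This is a minimal sketch: the ConfigMap name and the agent socket path are assumptions, while the certificate directory and file names are the ones that appear in this PR.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spiffe-helper-config        # hypothetical name
data:
  helper.conf: |
    # Workload API socket exposed by the spire-agent (path is an assumption)
    agent_address = "/run/spire/sockets/agent.sock"
    # Directory on the shared emptyDir volume where the files are written
    cert_dir = "/run/spire/certs"
    svid_file_name = "svid.pem"
    svid_key_file_name = "svid_key.pem"
    svid_bundle_file_name = "svid_bundle.pem"
```

The sidecar and the main container would both mount the same emptyDir at `cert_dir`, so the application only ever reads plain files.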


Robust & Hardened RBAC

Handcrafted granular ServiceAccounts, ClusterRoles, and RoleBindings for the SPIRE components, ensuring the spire-agent has secure nodes/proxy kubelet attestation powers, while maintaining tight scope limits across the board.


Fully Configurable Architecture

Centralized control over the cluster.local trust domain and image versions directly inside values.yaml.


Verification

  1. All Component Pods Boot Successfully (Healthy & Ready)
  2. Dynamic SPIFFE Identity Registration
  3. Successful Certificate Delivery to Shared Volumes
[Screenshots: 2026-04-02 234119, 2026-04-02 234131]

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
yes

Introduced optional SPIRE deployment infrastructure for internal workload identity and mTLS setup. Cluster operators can now provision a complete SPIRE server, agent, and dynamic controller manager suite by setting `spire.enabled=true` in their Helm values. This remains disabled by default to ensure 100% backward-compatible deployments.
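A minimal values override for opting in might look like the following sketch. Only `spire.enabled` is named in this PR description. The file name is hypothetical, and the commented-out `trustDomain` key is an assumed name for the trust domain setting.

```yaml
# my-values.yaml (hypothetical override file)
spire:
  enabled: true                 # flag introduced by this PR; off by default
  # trustDomain: cluster.local  # assumed key name; check values.yaml for the real one
```

The same effect can be had from the command line by passing `--set spire.enabled=true` to `helm install` or `helm upgrade`.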

Copilot AI review requested due to automatic review settings April 2, 2026 18:49
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces SPIRE integration to provide internal workload identities across the AgentCube components. It adds the necessary Kubernetes manifests for the SPIRE Server, Agent, and Controller Manager, along with spiffe-helper sidecars for the router and workload manager. Key feedback points out a critical data persistence issue where emptyDir is used for the SPIRE Server's data, which would lead to CA loss on pod restarts. Additionally, there are concerns regarding the lack of namespace scoping in sandbox SPIFFE IDs and the insecure default for kubelet verification in the agent configuration.
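One common way to address the persistence concern is to back the server's data directory with a PersistentVolumeClaim through the StatefulSet's `volumeClaimTemplates`. The sketch below is illustrative only: the image tag, labels, mount path, and storage size are assumptions, not values taken from this PR.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spire-server
spec:
  serviceName: spire-server
  replicas: 1
  selector:
    matchLabels:
      app: spire-server               # assumed label
  template:
    metadata:
      labels:
        app: spire-server
    spec:
      containers:
        - name: spire-server
          image: ghcr.io/spiffe/spire-server:1.9.0   # tag is an assumption
          volumeMounts:
            - name: spire-data
              mountPath: /run/spire/data             # assumed data_dir
  # Each replica gets its own PVC, so CA keys and the datastore survive pod restarts.
  volumeClaimTemplates:
    - metadata:
        name: spire-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi                             # assumed size
```

Unlike an emptyDir, the claim created from the template is retained when the pod is rescheduled.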

Contributor

Copilot AI left a comment

Pull request overview

This PR adds an optional SPIRE-based workload identity stack to the AgentCube Helm chart (under manifests/charts/base), intended to support the project’s internal zero-trust mTLS direction described in auth-proposal.md. The deployment is controlled via a new spire.enabled value and introduces SPIRE server/agent/controller-manager resources plus spiffe-helper sidecars for Router and WorkloadManager.

Changes:

  • Add a new spire: section in Helm values (trust domain, images, resources, helper cert paths).
  • Inject spiffe-helper sidecars + socket/cert volumes into Router and WorkloadManager when SPIRE is enabled.
  • Add SPIRE server/agent/controller-manager manifests, RBAC, webhook service/config, and ClusterSPIFFEID registrations.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| manifests/charts/base/values.yaml | Adds spire configuration; changes router SA/RBAC defaults. |
| manifests/charts/base/templates/agentcube-router.yaml | Adds optional spiffe-helper sidecar + volumes for Router. |
| manifests/charts/base/templates/workloadmanager.yaml | Adds optional spiffe-helper sidecar + volumes for WorkloadManager. |
| manifests/charts/base/templates/rbac-router.yaml | Creates Router ServiceAccount when SPIRE is enabled; keeps RBAC conditional. |
| manifests/charts/base/templates/spire/spire-server.yaml | New SPIRE server StatefulSet + config + service; controller-manager sidecar included. |
| manifests/charts/base/templates/spire/spire-agent.yaml | New SPIRE agent DaemonSet + config and node attestation token projection. |
| manifests/charts/base/templates/spire/spire-controller-manager.yaml | New controller-manager config (ConfigMap). |
| manifests/charts/base/templates/spire/validating-webhook.yaml | New webhook service + ValidatingWebhookConfiguration. |
| manifests/charts/base/templates/spire/spiffe-helper-config.yaml | New shared spiffe-helper configuration ConfigMap. |
| manifests/charts/base/templates/spire/rbac.yaml | New SPIRE server/agent ServiceAccounts + cluster-scoped RBAC. |
| manifests/charts/base/templates/spire/cluster-spiffe-ids.yaml | New ClusterSPIFFEID resources for Router/WorkloadManager/sandboxes. |

Comment on lines +34 to +36
   serviceAccountName: "agentcube-router"
   rbac:
-    create: false
+    create: true

Copilot AI Apr 2, 2026

The PR description says SPIRE is fully gated behind spire.enabled for backward compatibility, but this chart now defaults router.serviceAccountName to agentcube-router and router.rbac.create to true, which changes the default resources/permissions even when spire.enabled=false. Consider reverting these defaults (or gating them behind spire.enabled) to preserve existing install behavior, especially since enabling router.rbac.create creates secret CRUD RBAC by default.
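One way to follow this suggestion would be to leave the defaults untouched and gate the new ServiceAccount in the template. The sketch below assumes both `router.rbac.create` and `spire.enabled` default to `false` in values.yaml, in which case nothing is rendered for existing installs.

```yaml
# Sketch: render the Router ServiceAccount only when RBAC or SPIRE is requested.
{{- if or .Values.router.rbac.create .Values.spire.enabled }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ .Values.router.serviceAccountName | default "agentcube-router" }}
  namespace: {{ .Release.Namespace }}
{{- end }}
```

With this shape, the Role and RoleBinding that grant access to secrets can stay behind `router.rbac.create` alone.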

@codecov-commenter

codecov-commenter commented Apr 2, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.32%. Comparing base (845b798) to head (a00006d).
⚠️ Report is 155 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #251      +/-   ##
==========================================
+ Coverage   35.60%   43.32%   +7.71%     
==========================================
  Files          29       30       +1     
  Lines        2533     2613      +80     
==========================================
+ Hits          902     1132     +230     
+ Misses       1505     1358     -147     
+ Partials      126      123       -3     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 43.32% <ø> (+7.71%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

@mahil-2040
Contributor Author

@hzxuzhonghu @acsoto @YaoZengzeng PTAL!

…les, ClusterRoleBindings) for SPIRE components gated by spire.enabled

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…omponents gated by spire.enabled

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…onents gated by spire.enabled

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…RDs for SPIRE components gated by spire.enabled

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…pire-server instead of separate deployment

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…station RBAC

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…ase-unique global resources

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
@mahil-2040 mahil-2040 force-pushed the feat/spire-infrastructure-deployment branch from 8ea4c84 to 4f88c6a on April 5, 2026 07:13
@hzxuzhonghu hzxuzhonghu requested a review from Copilot April 8, 2026 03:42
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Comment on lines +1 to +5
{{- if .Values.spire.enabled }}
# Router registration
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:

Copilot AI Apr 8, 2026

This chart creates ClusterSPIFFEID resources, but it doesn’t appear to install the required spire-controller-manager CRDs (no spire.spiffe.io CRDs under manifests/charts/base/crds/). With spire.enabled=true, Helm/Kubernetes will reject these manifests unless the CRDs are pre-installed, causing installation/upgrade to fail. Consider shipping the CRDs (or a dedicated dependency/subchart) and documenting the install order/requirements clearly if they must be applied separately.

Comment on lines +77 to +80
certDir: "/run/spire/certs"
certFileName: "svid.pem"
keyFileName: "svid_key.pem"
bundleFileName: "svid_bundle.pem"
Member

Better to add some comments here.

kind: Role
name: {{ .Values.router.serviceAccountName | default "agentcube-router" }}
{{- end }}
{{- if or .Values.router.rbac.create .Values.spire.enabled }}
Member

seems we can remove .Values.router.rbac.create

Member

With .Values.spire.enabled, what special permissions are needed?

Contributor Author

> seems we can remove .Values.router.rbac.create

Removed router.rbac.create: the router always needs its SA and RBAC permissions, so these are now created unconditionally. Also cleaned up the unused toggle from values.yaml.

Contributor Author

> With .Values.spire.enabled, what special permissions are needed?

SPIRE doesn't require any special permissions on the router's SA. It only needs the SA to exist so the k8s_psat attestor can match the pod's identity by SA name against the ClusterSPIFFEID CRD. All SPIRE-specific RBAC is handled separately in spire/rbac.yaml

rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
Member

As far as I can see, the router only ever creates and gets the secret now.

Contributor Author

Removed all permissions other than ["get", "create"]. You correctly spotted that the router's JWTManager only calls Secrets().Create() on first boot and Secrets().Get() on subsequent restarts to load the existing identity keypair.
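For reference, a Role narrowed to those two verbs would look roughly like the sketch below. The Role name is an assumption. The rule itself mirrors the snippet under review with the extra verbs dropped.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agentcube-router            # hypothetical name
  namespace: {{ .Release.Namespace }}
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    # Create on first boot, get on later restarts; nothing else is granted.
    verbs: ["get", "create"]
```

Note that Kubernetes cannot restrict `create` by `resourceNames`, so scoping the rule to one named secret would only be effective for the `get` verb.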

… defaults and added comments for better understanding

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
@mahil-2040 mahil-2040 requested a review from hzxuzhonghu April 11, 2026 05:38