-
Notifications
You must be signed in to change notification settings - Fork 308
Fix Issue 2023: Validate VolumeClaimTemplate overrides contain spec and provide helpful error messages when they don't
#2024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ba2ce80 to
e4c7b88
Compare
|
I tested this and it works: Not sure why the desired size is: 0, but I'm guessing a regression in our helm chart. The logging improvements on this PR made this a lot easier to figure out though. Still troubleshooting the actual issue. |
This is the function that gets the desired capacity, and all it does is grab it from the rabbitmq cluster spec: https://github.com/rabbitmq/cluster-operator/blob/main/controllers/reconcile_persistence.go#L15-L18 I'm not opposed to adding more logging at higher verbosity level, if it's helpful. However, replacing |
|
@Zerpet: Thanks! I completely agree, and I saw the docs for |
d7d0c01 to
6d00634
Compare
|
@Zerpet I believe that I've tested my changes end-to-end, with only the CRD applied, only the operator update, and with both applied, and it seems to work as intended. I also created an ephemeral cluster to make sure that I didn't cause a regression there. Everything looks good to me! I tested my latest changes to the CRDs (without fixing our helm boo boo): I deployed the upstream CRDs to test the new error logging and installed the broken helm chart and got the following: Relevant logging improvement: {"error":"PVC template 'persistence' has spec.resources.requests.storage=0 (or missing). If using override.statefulSet.spec.volumeClaimTemplates, you must provide the COMPLETE template including spec.resources.requests.storage. Overrides replace the entire volumeClaimTemplate, not merge with it"}Full log: {
"level":"error",
"ts":"2025-12-10T16:09:10Z",
"msg":"Failed to determine PVC capacity: PVC template 'persistence' has spec.resources.requests.storage=0 (or missing). If using override.statefulSet.spec.volumeClaimTemplates, you must provide the COMPLETE template including spec.resources.requests.storage. Overrides replace the entire volumeClaimTemplate, not merge with it",
"controller":"rabbitmqcluster",
"controllerGroup":"rabbitmq.com",
"controllerKind":"RabbitmqCluster",
"RabbitmqCluster":{
"name":"rabbitmq",
"namespace":"rabbitmq"
},
"namespace":"rabbitmq",
"name":"rabbitmq",
"reconcileID":"297e7188-a3e5-4459-964e-48d59ab98353",
"error":"PVC template 'persistence' has spec.resources.requests.storage=0 (or missing). If using override.statefulSet.spec.volumeClaimTemplates, you must provide the COMPLETE template including spec.resources.requests.storage. Overrides replace the entire volumeClaimTemplate, not merge with it",
"stacktrace":"github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).reconcilePVC\n\t/workspace/controllers/reconcile_persistence.go:20\ngithub.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).Reconcile\n\t/workspace/controllers/rabbitmqcluster_controller.go:228\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:461\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:421\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:296"
}I tested creating an ephemeral RabbitMQ cluster with both the updated Cluster definition: apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
annotations:
meta.helm.sh/release-name: rabbitmq
meta.helm.sh/release-namespace: rabbitmq
creationTimestamp: "2025-12-10T16:30:59Z"
finalizers:
- deletion.finalizers.rabbitmqclusters.rabbitmq.com
generation: 2
labels:
app.kubernetes.io/managed-by: Helm
name: rabbitmq
namespace: rabbitmq
resourceVersion: "541966441"
uid: cc427fe4-21a7-493e-91db-f30bd79904e7
spec:
delayStartSeconds: 30
image: rabbitmq:4.1.3-management
imagePullSecrets:
- name: docker-hub
override: {}
persistence:
storage: "0"
storageClassName: managed-csi-persist
rabbitmq:
additionalConfig: |
cluster_partition_handling = pause_minority
collect_statistics_interval = 10000
disk_free_limit.relative = 1.0
queue_master_locator = min-masters
vm_memory_high_watermark_paging_ratio = 0.99
additionalPlugins:
- rabbitmq_shovel
- rabbitmq_shovel_management
replicas: 5
resources:
limits:
cpu: "1"
memory: 2Gi
requests:
cpu: 500m
memory: 2Gi
secretBackend:
externalSecret:
name: ""
service:
type: ClusterIP
terminationGracePeriodSeconds: 604800
tls: {}
status:
binding:
name: rabbitmq-default-user
conditions:
- lastTransitionTime: "2025-12-10T16:31:49Z"
reason: AllPodsAreReady
status: "True"
type: AllReplicasReady
- lastTransitionTime: "2025-12-10T16:31:15Z"
reason: AtLeastOneEndpointAvailable
status: "True"
type: ClusterAvailable
- lastTransitionTime: "2025-12-10T16:31:00Z"
reason: NoWarnings
status: "True"
type: NoWarnings
- lastTransitionTime: "2025-12-10T16:31:50Z"
message: Finish reconciling
reason: Success
status: "True"
type: ReconcileSuccess
defaultUser:
secretReference:
keys:
password: password
username: username
name: rabbitmq-default-user
namespace: rabbitmq
serviceReference:
name: rabbitmq
namespace: rabbitmq
observedGeneration: 2Operator logs: {"level":"info","ts":"2025-12-10T16:31:00Z","msg":"Start reconciling","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-nodes of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-erlang-cookie of Type *v1.Secret","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-default-user of Type *v1.Secret","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-plugins-conf of Type *v1.ConfigMap","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-server-conf of Type *v1.ConfigMap","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-server of Type *v1.ServiceAccount","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-peer-discovery of Type *v1.Role","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-server of Type *v1.RoleBinding","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-server of Type *v1.StatefulSet","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"successfully annotated","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"Start reconciling","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"e39922ab-05fa-4840-a585-aba4a75efaa6"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-nodes of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"e39922ab-05fa-4840-a585-aba4a75efaa6"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"e39922ab-05fa-4840-a585-aba4a75efaa6"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-server of Type *v1.StatefulSet","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"e39922ab-05fa-4840-a585-aba4a75efaa6"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"Start reconciling","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"abef9202-249f-4ea7-a505-a7b968839fc1"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-nodes of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"abef9202-249f-4ea7-a505-a7b968839fc1"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"abef9202-249f-4ea7-a505-a7b968839fc1"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-server of Type *v1.StatefulSet","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"abef9202-249f-4ea7-a505-a7b968839fc1"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"Start reconciling","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"2c6ffa8c-e973-431e-b083-09a395af3ea3"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-nodes of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"2c6ffa8c-e973-431e-b083-09a395af3ea3"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"2c6ffa8c-e973-431e-b083-09a395af3ea3"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-server of Type *v1.StatefulSet","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"2c6ffa8c-e973-431e-b083-09a395af3ea3"}Pods running: |
|
I'm still trying to get the system tests to run correctly. Can we try running them in CI? 🤔 |
|
@Zerpet I updated my PR to address the root causes of my issues, and I think these changes provide a good UX for operator users. Root Cause Analysis
Symptoms:
Fix applied:
|
spec and provide helpful error messages when they don't
Prevents silent reconciliation failures when override.statefulSet.spec.volumeClaimTemplates is provided with incomplete configuration (e.g., only metadata.annotations without spec.resources.requests.storage). Changes: - Add CRD validation requiring PVC spec field (rejected at admission time) - Detect and error on storage=0 with hints about override behavior - Show actual values in shrink errors: "(existing: 20Gi, desired: 5Gi)" Fixes rabbitmq#2023
6d00634 to
1a2bb15
Compare
|
Ok, I removed the test changes, requested a CLA and modified the commit to match that email address and tried to match the projects conventions. Please let me know if there's anything else I can/need to do to shepard this PR. Thanks for all your help! |
|
CLA signed |
|
I confirm that we have received a signed CLA from @jmealo. Thank you for contributing! |
|
Thank you for investigating and contributing! I'll try to review this change tomorrow. FYI there's a expected unit test failure, since the error message has changed. |
This PR is intended to fix #2023.
Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed
Summary Of Changes
This PR adds validation and improved error messages to prevent silent reconciliation failures when users provide incomplete
volumeClaimTemplatesin theoverride.statefulSet.specsection.Problem
When users add only
metadata.annotationstovolumeClaimTemplates(e.g., for PVC autoresizer annotations) without including the requiredspec.resources.requests.storagefield, the override replaces the entire templaterather than merging. This results in:
storage: 0(missing field defaults to zero)Changes
CRD Schema Validation (
api/v1beta1/rabbitmqcluster_types.go)+kubebuilder:validation:Requiredmarker toPersistentVolumeClaim.Specfieldspec.override.statefulSet.spec.volumeClaimTemplates[0].spec: Required valueImproved Error Messages (
controllers/reconcile_persistence.go)storage=0or missing storage fieldspec.resources.requests.storagemust be providedEnhanced Shrink Error (
internal/scaling/scaling.go)"shrinking not supported (existing: 20Gi, desired: 5Gi)"Regenerated CRD (
config/crd/bases/rabbitmq.com_rabbitmqclusters.yaml)What This Does NOT Change
.Cmp())storage: "0") continues to work correctlyAdditional Context
Root Cause: The issue typically manifests when users (or LLM tools) treat the override as a merge instead of a replace (and leave out the
specportion, causingsizeto default to0)Since
override.statefulSet.spec.volumeClaimTemplatesis a full replacement (not a strategic merge), the above results in a PVC template with no storage capacity, which defaults to0.After This PR:
Manual Testing Scenarios
These all passed for me on a fresh Azure AKS cluster that did not have RabbitMQ or the operator deployed.
✅ Test 1: Invalid override is rejected
Expected: Rejected with "spec: Required value"
✅ Test 2: Valid ephemeral cluster still works
Expected: Cluster created successfully, no PVCs
✅ Test 3: Valid override with complete spec
Expected: Cluster created successfully with 20Gi PVCs
Test Results