@jmealo jmealo commented Dec 9, 2025

This PR is intended to fix #2023.

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

This PR adds validation and improved error messages to prevent silent reconciliation failures when users provide incomplete volumeClaimTemplates in the override.statefulSet.spec section.

Problem

When users add only metadata.annotations to volumeClaimTemplates (e.g., for PVC autoresizer annotations) without including the required spec.resources.requests.storage field, the override replaces the entire template rather than merging. This results in:

  • PVC templates with storage: 0 (missing field defaults to zero)
  • Silent reconciliation loop failures every ~15 minutes
  • No clear indication to users about what's wrong
  • Cluster appears to be running but changes don't apply

Changes

  1. CRD Schema Validation (api/v1beta1/rabbitmqcluster_types.go)

    • Added +kubebuilder:validation:Required marker to PersistentVolumeClaim.Spec field
    • Kubernetes API server now rejects incomplete PVC templates at admission time
    • Users get immediate feedback: spec.override.statefulSet.spec.volumeClaimTemplates[0].spec: Required value
  2. Improved Error Messages (controllers/reconcile_persistence.go)

    • Detects when PVC template has storage=0 or missing storage field
    • Returns helpful error explaining that overrides replace entire templates (don't merge)
    • Clarifies that complete spec.resources.requests.storage must be provided
  3. Enhanced Shrink Error (internal/scaling/scaling.go)

    • Shows actual capacity values in error message: "shrinking not supported (existing: 20Gi, desired: 5Gi)"
    • Makes troubleshooting capacity issues more intuitive
  4. Regenerated CRD (config/crd/bases/rabbitmq.com_rabbitmqclusters.yaml)

    • Updated with new validation requirements

What This Does NOT Change

  • No changes to comparison logic (still uses .Cmp())
  • No changes to test files
  • No behavioral changes to valid configurations
  • Ephemeral storage (storage: "0") continues to work correctly

Additional Context

Root Cause: The issue typically manifests when users (or LLM tools) treat the override as a merge rather than a replacement and leave out the spec portion, so the storage size defaults to 0:

override:
  statefulSet:
    spec:
      volumeClaimTemplates:
        - metadata:
            annotations:
              resize.topolvm.io/storage_limit: 100Gi
          # Missing spec.resources.requests.storage!

Since override.statefulSet.spec.volumeClaimTemplates is a full replacement (not a strategic merge), the above results in a PVC template with no storage capacity, which defaults to 0.

After This PR:

  • Invalid configs are rejected immediately by the API server
  • If somehow they get through, reconciliation fails fast with a clear error
  • Ephemeral storage continues to work (tested)
  • Works as intended when deployed with latest operator and CRDs
  • Works as intended when only latest CRDs are deployed (without operator update)
  • Works as intended (minus immediate feedback) when only latest operator is deployed (without latest CRDs)

Manual Testing Scenarios

These all passed for me on a fresh Azure AKS cluster that did not have RabbitMQ or the operator deployed.

✅ Test 1: Invalid override is rejected

kubectl apply -f - <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: test-invalid
spec:
  replicas: 1
  override:
    statefulSet:
      spec:
        volumeClaimTemplates:
          - metadata:
              name: persistence
              annotations:
                resize.topolvm.io/storage_limit: 100Gi
EOF
The RabbitmqCluster "test-invalid" is invalid: spec.override.statefulSet.spec.volumeClaimTemplates[0].spec: Required value

Expected: Rejected with "spec: Required value"

✅ Test 2: Valid ephemeral cluster still works

kubectl apply -f - <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: test-ephemeral
spec:
  replicas: 1
  persistence:
    storage: "0"
EOF
rabbitmqcluster.rabbitmq.com/test-ephemeral created

Expected: Cluster created successfully, no PVCs

✅ Test 3: Valid override with complete spec

kubectl apply -f - <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: test-valid
spec:
  replicas: 1
  override:
    statefulSet:
      spec:
        volumeClaimTemplates:
          - metadata:
              name: persistence
              annotations:
                resize.topolvm.io/storage_limit: 100Gi
            spec:
              accessModes: [ReadWriteOnce]
              resources:
                requests:
                  storage: 20Gi
EOF
rabbitmqcluster.rabbitmq.com/test-valid created

Expected: Cluster created successfully with 20Gi PVCs

Test Results

  • ✅ All unit tests pass (347 specs)
  • ✅ All integration tests pass (59 specs)
  • ✅ Ephemeral storage clusters work correctly
  • ✅ Invalid configs are rejected at admission time
  • ✅ Improved error messages appear in operator logs when applicable

@jmealo jmealo force-pushed the feat/fix-issue-2023 branch 3 times, most recently from ba2ce80 to e4c7b88 Compare December 9, 2025 22:15
Author

jmealo commented Dec 9, 2025

I tested this and it works:

Failed to scale PVCs: shrinking persistent volumes is not supported (existing: 20Gi, desired: 0)

Not sure why the desired size is: 0, but I'm guessing a regression in our helm chart. The logging improvements on this PR made this a lot easier to figure out though. Still troubleshooting the actual issue.

Member

Zerpet commented Dec 10, 2025

> I tested this and it works:
>
> Failed to scale PVCs: shrinking persistent volumes is not supported (existing: 20Gi, desired: 0)
>
> Not sure why the desired size is: 0, but I'm guessing a regression in our helm chart. The logging improvements on this PR made this a lot easier to figure out though. Still troubleshooting the actual issue.

This is the function that gets the desired capacity, and all it does is grab it from the rabbitmq cluster spec:

https://github.com/rabbitmq/cluster-operator/blob/main/controllers/reconcile_persistence.go#L15-L18

I'm not opposed to adding more logging at a higher verbosity level, if it's helpful. However, replacing Cmp() with code that does exactly what Cmp() already does is very unlikely to be accepted as a contribution:

https://github.com/kubernetes/apimachinery/blob/b72d93d174332f952a8d431419fece5e6f044bcb/pkg/api/resource/quantity.go#L640-L645
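
To illustrate the point about Cmp(): it is a three-way comparison on the value of a quantity (returning -1, 0, or 1), so the unit a size is written in does not matter. The Go sketch below shows that contract with a deliberately tiny, hypothetical parser (`parseBytes`) and comparator (`compare`); it is not the apimachinery implementation and only understands plain integers and the Ki/Mi/Gi/Ti suffixes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// binarySuffixes lists the only suffixes this sketch understands.
var binarySuffixes = []struct {
	suffix string
	factor int64
}{
	{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}, {"Ti", 1 << 40},
}

// parseBytes (hypothetical) converts strings such as "20Gi" or "0" to bytes.
func parseBytes(s string) (int64, error) {
	for _, bs := range binarySuffixes {
		if strings.HasSuffix(s, bs.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(s, bs.suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * bs.factor, nil
		}
	}
	return strconv.ParseInt(s, 10, 64)
}

// compare follows the same contract as Cmp: -1 if a < b, 0 if equal, 1 if a > b.
func compare(a, b int64) int {
	switch {
	case a < b:
		return -1
	case a > b:
		return 1
	default:
		return 0
	}
}

func main() {
	existing, _ := parseBytes("20Gi")
	sameSize, _ := parseBytes("20480Mi")
	desired, _ := parseBytes("0")

	fmt.Println(compare(existing, sameSize)) // same value, different unit
	fmt.Println(compare(desired, existing))  // desired is smaller, so this would be a shrink
}
```

Seen this way, the "desired: 0" in the error is a real value coming from the spec, not an artifact of how the comparison is done.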

Author

jmealo commented Dec 10, 2025

@Zerpet: Thanks! I completely agree, and I saw that the docs for Cmp() say it already does this. To say I was surprised by the desired size of 0 is an understatement. I'm digging into this today and checking my other clusters for the issue. So far nothing in the helm output looks suspect. I think adding non-verbose logging for a desired size of 0 would be worth doing to save others the trouble (depending on what the root cause is). I've time-boxed an hour to try to figure it out 🤞. Thank you for the quick review! 🙏

@jmealo jmealo force-pushed the feat/fix-issue-2023 branch from d7d0c01 to 6d00634 Compare December 10, 2025 15:47
Author

jmealo commented Dec 10, 2025

@Zerpet I believe that I've tested my changes end-to-end, with only the CRD applied, only the operator update, and with both applied, and it seems to work as intended. I also created an ephemeral cluster to make sure that I didn't cause a regression there. Everything looks good to me!

I tested my latest changes to the CRDs (without fixing our helm boo boo):

Release "rabbitmq" does not exist. Installing it now.
Error: release rabbitmq failed, and has been uninstalled due to atomic being set: 1 error occurred:
        * RabbitmqCluster.rabbitmq.com "rabbitmq" is invalid: spec.override.statefulSet.spec.volumeClaimTemplates[0].spec: Required value


make: *** [install] Error 1

I deployed the upstream CRDs to test the new error logging and installed the broken helm chart and got the following:

Relevant logging improvement:

{"error":"PVC template 'persistence' has spec.resources.requests.storage=0 (or missing). If using override.statefulSet.spec.volumeClaimTemplates, you must provide the COMPLETE template including spec.resources.requests.storage. Overrides replace the entire volumeClaimTemplate, not merge with it"}

Full log:

{
   "level":"error",
   "ts":"2025-12-10T16:09:10Z",
   "msg":"Failed to determine PVC capacity: PVC template 'persistence' has spec.resources.requests.storage=0 (or missing). If using override.statefulSet.spec.volumeClaimTemplates, you must provide the COMPLETE template including spec.resources.requests.storage. Overrides replace the entire volumeClaimTemplate, not merge with it",
   "controller":"rabbitmqcluster",
   "controllerGroup":"rabbitmq.com",
   "controllerKind":"RabbitmqCluster",
   "RabbitmqCluster":{
      "name":"rabbitmq",
      "namespace":"rabbitmq"
   },
   "namespace":"rabbitmq",
   "name":"rabbitmq",
   "reconcileID":"297e7188-a3e5-4459-964e-48d59ab98353",
   "error":"PVC template 'persistence' has spec.resources.requests.storage=0 (or missing). If using override.statefulSet.spec.volumeClaimTemplates, you must provide the COMPLETE template including spec.resources.requests.storage. Overrides replace the entire volumeClaimTemplate, not merge with it",
   "stacktrace":"github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).reconcilePVC\n\t/workspace/controllers/reconcile_persistence.go:20\ngithub.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).Reconcile\n\t/workspace/controllers/rabbitmqcluster_controller.go:228\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:461\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:421\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:296"
}

I tested creating an ephemeral RabbitMQ cluster with both the updated CRD and operator code running and it succeeded:

Cluster definition:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  annotations:
    meta.helm.sh/release-name: rabbitmq
    meta.helm.sh/release-namespace: rabbitmq
  creationTimestamp: "2025-12-10T16:30:59Z"
  finalizers:
  - deletion.finalizers.rabbitmqclusters.rabbitmq.com
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
  name: rabbitmq
  namespace: rabbitmq
  resourceVersion: "541966441"
  uid: cc427fe4-21a7-493e-91db-f30bd79904e7
spec:
  delayStartSeconds: 30
  image: rabbitmq:4.1.3-management
  imagePullSecrets:
  - name: docker-hub
  override: {}
  persistence:
    storage: "0"
    storageClassName: managed-csi-persist
  rabbitmq:
    additionalConfig: |
      cluster_partition_handling = pause_minority
      collect_statistics_interval = 10000
      disk_free_limit.relative = 1.0
      queue_master_locator = min-masters
      vm_memory_high_watermark_paging_ratio = 0.99
    additionalPlugins:
    - rabbitmq_shovel
    - rabbitmq_shovel_management
  replicas: 5
  resources:
    limits:
      cpu: "1"
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 2Gi
  secretBackend:
    externalSecret:
      name: ""
  service:
    type: ClusterIP
  terminationGracePeriodSeconds: 604800
  tls: {}
status:
  binding:
    name: rabbitmq-default-user
  conditions:
  - lastTransitionTime: "2025-12-10T16:31:49Z"
    reason: AllPodsAreReady
    status: "True"
    type: AllReplicasReady
  - lastTransitionTime: "2025-12-10T16:31:15Z"
    reason: AtLeastOneEndpointAvailable
    status: "True"
    type: ClusterAvailable
  - lastTransitionTime: "2025-12-10T16:31:00Z"
    reason: NoWarnings
    status: "True"
    type: NoWarnings
  - lastTransitionTime: "2025-12-10T16:31:50Z"
    message: Finish reconciling
    reason: Success
    status: "True"
    type: ReconcileSuccess
  defaultUser:
    secretReference:
      keys:
        password: password
        username: username
      name: rabbitmq-default-user
      namespace: rabbitmq
    serviceReference:
      name: rabbitmq
      namespace: rabbitmq
  observedGeneration: 2

Operator logs:

{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"Start reconciling","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-nodes of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-erlang-cookie of Type *v1.Secret","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-default-user of Type *v1.Secret","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-plugins-conf of Type *v1.ConfigMap","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-server-conf of Type *v1.ConfigMap","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-server of Type *v1.ServiceAccount","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-peer-discovery of Type *v1.Role","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-server of Type *v1.RoleBinding","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"created resource rabbitmq-server of Type *v1.StatefulSet","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"successfully annotated","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"7ca250fa-4927-4cb8-a63b-581b0ca22d9a"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"Start reconciling","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"e39922ab-05fa-4840-a585-aba4a75efaa6"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-nodes of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"e39922ab-05fa-4840-a585-aba4a75efaa6"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"e39922ab-05fa-4840-a585-aba4a75efaa6"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-server of Type *v1.StatefulSet","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"e39922ab-05fa-4840-a585-aba4a75efaa6"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"Start reconciling","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"abef9202-249f-4ea7-a505-a7b968839fc1"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-nodes of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"abef9202-249f-4ea7-a505-a7b968839fc1"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"abef9202-249f-4ea7-a505-a7b968839fc1"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-server of Type *v1.StatefulSet","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"abef9202-249f-4ea7-a505-a7b968839fc1"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"Start reconciling","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"2c6ffa8c-e973-431e-b083-09a395af3ea3"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-nodes of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"2c6ffa8c-e973-431e-b083-09a395af3ea3"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq of Type *v1.Service","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"2c6ffa8c-e973-431e-b083-09a395af3ea3"}
{"level":"info","ts":"2025-12-10T16:31:00Z","msg":"updated resource rabbitmq-server of Type *v1.StatefulSet","controller":"rabbitmqcluster","controllerGroup":"rabbitmq.com","controllerKind":"RabbitmqCluster","RabbitmqCluster":{"name":"rabbitmq","namespace":"rabbitmq"},"namespace":"rabbitmq","name":"rabbitmq","reconcileID":"2c6ffa8c-e973-431e-b083-09a395af3ea3"}

Pods running:

kubectl get pod -n rabbitmq
NAME                READY   STATUS    RESTARTS   AGE
rabbitmq-server-0   1/1     Running   0          2m42s
rabbitmq-server-1   1/1     Running   0          2m42s
rabbitmq-server-2   1/1     Running   0          2m42s
rabbitmq-server-3   1/1     Running   0          2m42s
rabbitmq-server-4   1/1     Running   0          2m42s

@jmealo jmealo marked this pull request as ready for review December 10, 2025 16:53
Author

jmealo commented Dec 10, 2025

I'm still trying to get the system tests to run correctly. Can we try running them in CI? 🤔

Author

jmealo commented Dec 10, 2025

@Zerpet I updated my PR to address the root causes of my issues, and I think these changes provide a good UX for operator users.

Root Cause Analysis

  • Someone (likely LLM-assisted 🙃 ) added pvc-autoresizer annotations but only included metadata instead of the complete volumeClaimTemplates structure
  • When you use override.statefulSet.spec.volumeClaimTemplates, it replaces the entire template (not merges), so you need the full spec
  • The missing spec.resources.requests.storage is interpreted as 0 by the operator
  • It's not intuitive to troubleshoot operationally, as everything in Kubernetes looks correct, and only upon inspecting the helm output would you see the issue.

Symptoms:

  • The operator reconciliation loop is continuously failing (every ~15 minutes based on those logs)
  • Any changes to the RabbitmqCluster CR won't be applied (operator can't reconcile)
  • Scaling (adding/removing nodes) would likely fail or behave unexpectedly
  • The PVC autoresizer might get confused when it tries to resize
  • Helm upgrades might appear successful but changes won't take effect

Fix applied:

  • Update Operator CRDs to validate that the spec is present when you define an override
  • Add helpful logging if an invalid spec is deployed with an older version of the CRDs

@jmealo jmealo changed the title from "Fix Issue 2023: Use byte comparison for PVC resize decision making" to "Fix Issue 2023: Validate VolumeClaimTemplate overrides contain spec and provide helpful error messages when they don't" Dec 10, 2025

Prevents silent reconciliation failures when override.statefulSet.spec.volumeClaimTemplates
is provided with incomplete configuration (e.g., only metadata.annotations without
spec.resources.requests.storage).

Changes:
- Add CRD validation requiring PVC spec field (rejected at admission time)
- Detect and error on storage=0 with hints about override behavior
- Show actual values in shrink errors: "(existing: 20Gi, desired: 5Gi)"

Fixes rabbitmq#2023
@jmealo jmealo force-pushed the feat/fix-issue-2023 branch from 6d00634 to 1a2bb15 Compare December 10, 2025 17:24
Author

jmealo commented Dec 10, 2025

OK, I removed the test changes, requested a CLA, modified the commit to match that email address, and tried to match the project's conventions. Please let me know if there's anything else I can or need to do to shepherd this PR.

Thanks for all your help!

Author

jmealo commented Dec 10, 2025

CLA signed

@michaelklishin
Contributor

I confirm that we have received a signed CLA from @jmealo. Thank you for contributing!

@Zerpet Zerpet self-requested a review December 11, 2025 16:46
Member

Zerpet commented Dec 11, 2025

Thank you for investigating and contributing! I'll try to review this change tomorrow. FYI, there's an expected unit test failure, since the error message has changed.

Development

Successfully merging this pull request may close these issues.

UX: DX: VolumeClaimTemplate overrides without a spec cause permanent reconciliation failures

3 participants