feat: SECURESIGN-3184: ensure ServiceMonitors are only created when supported in the cluster #1440
base: main
Conversation
Reviewer's Guide

This PR centralizes and streamlines ServiceMonitor support by introducing a singleton registry, a CRD watcher, and a generic monitoring action, then refactoring all individual controller monitoring implementations to leverage these new abstractions, and wiring the CRD watcher into the manager with appropriate RBAC updates.

Sequence diagram for ServiceMonitor CRD availability and reconciliation

```mermaid
sequenceDiagram
participant Controller as "Controller (e.g. Trillian)"
participant Registry as "ServiceMonitorRegistry"
participant CRDWatcher as "CRDWatcherReconciler"
participant K8s as "Kubernetes API"
participant ServiceMonitor as "ServiceMonitor resource"
Controller->>Registry: Register ServiceMonitor spec
CRDWatcher->>K8s: Watch ServiceMonitor CRD
K8s-->>CRDWatcher: Notify CRD change (add/remove)
CRDWatcher->>Registry: Set API availability
Registry->>Controller: Emit event (API available/unavailable)
alt API available
Registry->>ServiceMonitor: Reconcile ServiceMonitor resource
Registry->>Controller: Emit event (reconciled)
else API unavailable
Registry->>Controller: Emit event (API unavailable)
end
```

Class diagram for ServiceMonitorRegistry and related types

```mermaid
classDiagram
class ServiceMonitorRegistry {
- mutex: sync.RWMutex
- specs: map[types.NamespacedName]*ServiceMonitorSpec
- notifiedOwners: map[ownerKey]*ServiceMonitorSpec
- client: client.Client
- recorder: record.EventRecorder
- logger: logr.Logger
- apiAvailable: bool
+ Register(ctx, spec, owner)
+ GetAll()
+ SetAPIAvailable(available)
+ IsAPIAvailable()
+ EmitEventToOwners(ctx, eventType, reason, message)
+ ReconcileAll(ctx)
+ ReconcileOne(ctx, spec)
}
class ServiceMonitorSpec {
+ OwnerKey: types.NamespacedName
+ OwnerGVK: schema.GroupVersionKind
+ Namespace: string
+ Name: string
+ EnsureFuncs: []func(*unstructured.Unstructured) error
}
class CRDWatcherReconciler {
+ Client: client.Client
+ Scheme: *runtime.Scheme
+ Registry: *ServiceMonitorRegistry
+ Reconcile(ctx, req)
+ SetupWithManager(mgr)
}
class MonitoringConfig {
+ ComponentName: string
+ DeploymentName: string
+ MonitoringRoleName: string
+ MetricsPortName: string
+ IsMonitoringEnabled: func(T) bool
+ CustomEndpointBuilder: func(instance T) []func(*unstructured.Unstructured) error
}
class genericMonitoringAction {
+ config: MonitoringConfig
+ Name()
+ CanHandle(ctx, instance)
+ Handle(ctx, instance)
}
ServiceMonitorRegistry "1" o-- "*" ServiceMonitorSpec
CRDWatcherReconciler "1" --> "1" ServiceMonitorRegistry
genericMonitoringAction "1" --> "1" MonitoringConfig
```

Flow diagram for controller monitoring registration and reconciliation

```mermaid
flowchart TD
A["Controller (e.g. Trillian)"] --> B["Create MonitoringConfig"]
B --> C["NewMonitoringAction"]
C --> D["Register ServiceMonitorSpec with Registry"]
D --> E["Registry checks API availability"]
E -->|API available| F["Reconcile ServiceMonitor"]
E -->|API unavailable| G["Emit event: API unavailable"]
F --> H["Emit event: ServiceMonitor reconciled"]
```
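To make the diagrams concrete, here is a minimal, self-contained sketch of how a component controller might describe its monitoring needs to the generic action. Field names follow the class diagram above; the `TrillianLogServer` stand-in type, the port and role values, and the simplified `NewMonitoringAction` signature are illustrative assumptions rather than code from this PR.

```golang
package main

import "fmt"

// TrillianLogServer stands in for a real operand CR (assumption for illustration only).
type TrillianLogServer struct {
	Namespace, Name   string
	MonitoringEnabled bool
}

// MonitoringConfig mirrors the fields shown in the class diagram, specialised to one type.
type MonitoringConfig struct {
	ComponentName       string
	DeploymentName      string
	MonitoringRoleName  string
	MetricsPortName     string
	IsMonitoringEnabled func(TrillianLogServer) bool
}

// NewMonitoringAction would normally return an action wired into the controller's action
// chain; this sketch just reports what would be registered with the registry.
func NewMonitoringAction(cfg MonitoringConfig) func(TrillianLogServer) {
	return func(instance TrillianLogServer) {
		if !cfg.IsMonitoringEnabled(instance) {
			return // monitoring disabled on this instance, nothing to register
		}
		fmt.Printf("register ServiceMonitor %s/%s for component %s (metrics port %q)\n",
			instance.Namespace, cfg.DeploymentName, cfg.ComponentName, cfg.MetricsPortName)
	}
}

func main() {
	cfg := MonitoringConfig{
		ComponentName:       "trillian-logserver",
		DeploymentName:      "trillian-logserver",
		MonitoringRoleName:  "prometheus-k8s-trillian",
		MetricsPortName:     "metrics",
		IsMonitoringEnabled: func(i TrillianLogServer) bool { return i.MonitoringEnabled },
	}
	handle := NewMonitoringAction(cfg)
	handle(TrillianLogServer{Namespace: "trillian-system", Name: "securesign-sample", MonitoringEnabled: true})
}
```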
PR Compliance Guide 🔍
Hey there - I've reviewed your changes - here's some feedback:
- The kubernetes.IsServiceMonitorAvailable function is never used—either integrate it into the APIService watcher or remove it to avoid dead code.
- Specs in the ServiceMonitorRegistry are only cleaned up when reconciliation discovers a missing owner; consider adding a watch or finalizer on the custom resource deletion to proactively remove stale specs and orphaned ServiceMonitors.
- In genericMonitoringAction.Handle you always call registry.ReconcileOne even if the ServiceMonitor API isn’t available—short‐circuiting reconciliation when registry.IsAPIAvailable() is false would reduce unnecessary error logs and no-op calls.
PR Code Suggestions ✨ (latest suggestions up to 000e511; previous suggestions up to commit 233c1da)
Force-pushed from ebedc9e to 000e511
Note: following the Qodo review I switched the code from monitoring the APIService to monitoring the CRD. This removes the race between the APIService being created and the CRD being created.
@sourcery-ai review
Hey there - I've reviewed your changes - here's some feedback:
- The ServiceMonitorRegistry is implemented as a package‐level singleton, which can make tests and interdependent reconcilers harder to manage—consider injecting or scoping the registry per‐manager instead of using a global instance.
- In genericMonitoringAction.Handle you always call registry.ReconcileOne (which errors when the CRD is absent) and just log the error—consider checking registry.IsAPIAvailable first or surfacing a retryable error so the action doesn’t silently skip reconciliation (a sketch of this guard follows after this list).
- Currently owner cleanup only happens when a Get fails in getOwnerObject; you may want to actively watch owner deletions or add a finalizer to proactively deregister specs rather than waiting for lazy garbage‐collection.
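A possible shape for that short-circuit, sketched as a fragment of genericMonitoringAction.Handle; how the action reaches the registry, the `instanceSpec` variable, and the logger are placeholders rather than names taken from the PR:

```golang
// Inside Handle: skip reconciliation entirely while the ServiceMonitor API is absent.
if !i.registry.IsAPIAvailable() {
	// The CRD watcher re-triggers ReconcileAll once the ServiceMonitor CRD appears,
	// so there is no need to call ReconcileOne here and log its guaranteed failure.
	return i.Continue()
}
if err := i.registry.ReconcileOne(ctx, instanceSpec); err != nil {
	logger.Error(err, "failed to reconcile ServiceMonitor")
}
return i.Continue()
```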
## Individual Comments
### Comment 1
<location> `internal/controller/monitoring/registry.go:123` </location>
<code_context>
+}
+
+// cleanupOwner removes all specs for an owner and clears it from notifiedOwners
+func (r *ServiceMonitorRegistry) cleanupOwner(spec *ServiceMonitorSpec) {
+ r.mutex.Lock()
+ defer r.mutex.Unlock()
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Cleanup logic may leave orphaned specs if multiple ServiceMonitors share the same owner.
Currently, removing a spec only deletes that spec and the owner from notifiedOwners, potentially leaving other specs for the same owner. Please update the logic to remove all specs associated with the owner, or document that only one spec per owner is allowed.
Suggested implementation:
```golang
// cleanupOwner removes all specs for an owner and clears it from notifiedOwners
func (r *ServiceMonitorRegistry) cleanupOwner(spec *ServiceMonitorSpec) {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	ownerKey := ownerKey{NamespacedName: spec.OwnerKey, GVK: spec.OwnerGVK}
	// Remove all specs associated with this owner
	for key, s := range r.specs {
		if s.OwnerKey == spec.OwnerKey && s.OwnerGVK == spec.OwnerGVK {
			delete(r.specs, key)
		}
	}
	// Remove owner from notifiedOwners
	delete(r.notifiedOwners, ownerKey)
}
```
</issue_to_address>
### Comment 2
<location> `internal/controller/monitoring/registry.go:165` </location>
<code_context>
+}
+
+// EmitEventToOwners emits an event to all unique owner objects that have registered ServiceMonitors
+func (r *ServiceMonitorRegistry) EmitEventToOwners(ctx context.Context, eventType, reason, message string) {
+ r.mutex.RLock()
+ specsToNotify := make([]*ServiceMonitorSpec, 0, len(r.notifiedOwners))
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Emitting events to owners may fail silently if owner objects are deleted.
Consider whether errors from deleted owner objects should trigger additional cleanup or notification to maintain registry consistency with cluster state.
Suggested implementation:
```golang
	for _, spec := range specsToNotify {
		err := r.emitEventToOwner(ctx, spec, eventType, reason, message)
		if err != nil {
			// Check if error is due to owner object being deleted (NotFound)
			if errors.IsNotFound(err) {
				// Remove owner from registry to maintain consistency
				r.mutex.Lock()
				delete(r.notifiedOwners, spec.OwnerKey())
				r.mutex.Unlock()
				// Optionally log the cleanup
				klog.Infof("Removed deleted owner %s from ServiceMonitorRegistry", spec.OwnerKey())
			} else {
				// Optionally log other errors
				klog.Errorf("Failed to emit event to owner %s: %v", spec.OwnerKey(), err)
			}
		}
	}
}
```
1. Ensure that `emitEventToOwner` returns an error (update its signature and implementation if necessary).
2. Import `"k8s.io/apimachinery/pkg/api/errors"` and `"k8s.io/klog/v2"` if not already present.
3. Implement or verify the existence of `OwnerKey()` on `ServiceMonitorSpec` for unique identification.
</issue_to_address>
### Comment 3
<location> `internal/controller/monitoring/registry.go:229` </location>
<code_context>
+}
+
+// ReconcileOne creates or updates a single ServiceMonitor
+func (r *ServiceMonitorRegistry) ReconcileOne(ctx context.Context, spec *ServiceMonitorSpec) error {
+ if !r.IsAPIAvailable() {
+ return fmt.Errorf("ServiceMonitor API not available")
</code_context>
<issue_to_address>
**issue (bug_risk):** ReconcileOne does not check for nil EnsureFuncs, which could cause a panic.
Add a check to ensure spec.EnsureFuncs is not nil before invoking CreateOrUpdate to prevent a potential panic.
</issue_to_address>
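For context on why the guard matters: in Go, iterating a nil slice is harmless, but invoking a nil function element (or handing nil mutation functions to a create-or-update helper) panics at runtime, so validating `EnsureFuncs` up front is cheap insurance. A tiny self-contained illustration, not the PR's code:

```golang
package main

import "fmt"

// apply runs a list of ensure-style functions, guarding against nil entries instead of
// letting a nil function call panic.
func apply(fns []func() error) error {
	for _, fn := range fns {
		if fn == nil {
			return fmt.Errorf("nil ensure function") // guard instead of panicking
		}
		if err := fn(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(apply(nil))                                         // <nil>: a nil slice is a no-op
	fmt.Println(apply([]func() error{nil}))                         // error instead of a panic
	fmt.Println(apply([]func() error{func() error { return nil }})) // <nil>
}
```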
osmman left a comment:
This PR introduces unnecessary complexity with stateful registries and CRD watchers. The implementation doesn't follow Kubernetes operator best practices and creates situations where monitoring.enabled: true succeeds even when monitoring isn't actually enabled (not reflecting the actual state of the system).
Critical issues:
1. Configuration mismatch
When monitoring.enabled: true but ServiceMonitor CRD doesn't exist, reconciliation returns success. The action always returns i.Continue() regardless of whether ServiceMonitor was created.
This means:
- Status shows `Ready: true`
- Spec declares `monitoring.enabled: true`
- But monitoring is NOT actually enabled ❌
Expected behavior:
- If `monitoring.enabled: true` and the ServiceMonitor CRD is not available, reconciliation should fail with an error
- The Status condition should indicate the problem (e.g., `reason: ServiceMonitorCRDNotAvailable`)
- An event should be emitted explaining the issue
- Reconciliation should retry until the user either:
  - Installs the ServiceMonitor CRD, or
  - Sets `monitoring.enabled: false`
2. Stateful registry pattern
The implementation uses stateful in-memory registries (ServiceMonitorRegistry). Operators should be stateless and rely on the Kubernetes API for current state.
Using the singleton pattern also makes the code harder to test.
3. Permissions and CRD reconciler
The implementation requires cluster-wide permissions to watch all CRDs, which increases the operator's privilege footprint.
Simply attempt to create the ServiceMonitor. If the API is not available, the Kubernetes API server will return an error. Handle that error appropriately and update the Status.
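A rough sketch of that create-and-handle-the-error approach, assuming the operator builds the ServiceMonitor as an unstructured object (as the EnsureFuncs signature in this PR suggests); the function name, the sentinel error, and the omitted spec body are illustrative:

```golang
package monitoring

import (
	"context"
	"errors"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var serviceMonitorGVK = schema.GroupVersionKind{
	Group:   "monitoring.coreos.com",
	Version: "v1",
	Kind:    "ServiceMonitor",
}

// ErrServiceMonitorCRDMissing lets the caller map the failure to a Ready=False condition
// (e.g. reason ServiceMonitorCRDNotAvailable), a Warning event, and an operator log entry.
var ErrServiceMonitorCRDMissing = errors.New(
	"ServiceMonitor CRD is not installed. To resolve: (1) Install Prometheus Operator, or (2) Set spec.monitoring.enabled=false")

func createServiceMonitor(ctx context.Context, c client.Client, namespace, name string) error {
	sm := &unstructured.Unstructured{}
	sm.SetGroupVersionKind(serviceMonitorGVK)
	sm.SetNamespace(namespace)
	sm.SetName(name)
	// (endpoints/selector spec omitted for brevity)

	err := c.Create(ctx, sm)
	switch {
	case err == nil || apierrors.IsAlreadyExists(err):
		return nil
	case meta.IsNoMatchError(err) || apierrors.IsNotFound(err):
		// The API server (or the client's REST mapper) reports the kind/resource as
		// unknown: the CRD is absent, so surface a dedicated error for the status condition.
		return ErrServiceMonitorCRDMissing
	default:
		return err
	}
}
```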
Proposed solution:
1. Improve error handling
Improve error handling and messages for actions so they identify a missing ServiceMonitor CRD and produce a well-informed error message in the CR status, a Warning event, and the operator log.
2. Implement monitoring cleanup functionality
The current implementation has disabled the transition of the monitoring.enabled field false → true via validation rules. That limitation will make it harder for users to fix broken instances (unfortunately the default value is true), so we need to implement cleanup logic to delete ServiceMonitor resources when monitoring is disabled (see the sketch after this list).
3. Improve API documentation
I think the current problem is a lack of clear documentation for the monitoring.enabled field. We should document that, when enabled, it creates ServiceMonitor resources to expose metrics for the Prometheus Operator to scrape, and explain the requirements that enabling it implies.
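Reusing the GVK and imports from the earlier sketch, the cleanup described in point 2 might look roughly like this (function name assumed):

```golang
// deleteServiceMonitor removes a previously created ServiceMonitor when
// spec.monitoring.enabled flips to false; "already gone" and "CRD missing" are not errors.
func deleteServiceMonitor(ctx context.Context, c client.Client, namespace, name string) error {
	sm := &unstructured.Unstructured{}
	sm.SetGroupVersionKind(serviceMonitorGVK)
	sm.SetNamespace(namespace)
	sm.SetName(name)

	err := c.Delete(ctx, sm)
	if err == nil || apierrors.IsNotFound(err) || meta.IsNoMatchError(err) {
		return nil
	}
	return err
}
```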
Expected user experience after changes

```shell
# Scenario: User creates instance without Prometheus Operator
$ kubectl apply -f fulcio.yaml
$ kubectl get fulcio my-fulcio
NAME        READY   REASON
my-fulcio   False   ServiceMonitorCRDNotAvailable

# Clear error message in status
$ kubectl get fulcio my-fulcio -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
ServiceMonitor CRD is not installed. To resolve: (1) Install Prometheus Operator, or (2) Set spec.monitoring.enabled=false

# User can easily fix by disabling monitoring
$ kubectl patch fulcio my-fulcio --type=merge -p '{"spec":{"monitoring":{"enabled":false}}}'
# Reconciliation succeeds, ServiceMonitor deleted

# Or by installing Prometheus Operator
$ kubectl apply -f prometheus-operator.yaml
# Next reconciliation succeeds, ServiceMonitor created
```
Move to internal/action/monitoring/ to be consistent with other shared actions in the codebase. Add unit tests to verify the generic behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In response to the AI feedback
1 - I don't believe reconciliation should fail, since it's more important for TAS to be functional than to wait for the monitoring. Given this, I believe returning i.Continue is the right response, along with events reporting the current status. The events on the owned resources will inform the user, and the code will create the ServiceMonitor if/when Prometheus is installed.
2 - Agreed about the singleton pattern; this can be changed to ease testing. The internal data is driven by the watches, and is intended to reduce API usage.
3 - Creating a resource is not a replacement for watching CRDs. Creating a resource would require polling/retries, whereas the watch reacts dynamically to the creation of the CRD (a rough sketch of this follows below).
In summary, for the AI feedback I'll handle the singleton pattern and push it up again.
In response to the personal feedback, I can move to internal/action/monitoring and add the unit tests.
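A rough sketch of the watch-based approach defended in point 3: reconcile the ServiceMonitor CRD itself and flip the registry's availability flag as it comes and goes. The type and method names follow the class diagram in the Reviewer's Guide; the predicate wiring and the Reconcile body here are assumptions, not code copied from the PR.

```golang
package monitoring

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

const serviceMonitorCRDName = "servicemonitors.monitoring.coreos.com"

// CRDWatcherReconciler mirrors the struct from the class diagram; ServiceMonitorRegistry
// is the registry type introduced by this PR.
type CRDWatcherReconciler struct {
	Client   client.Client
	Scheme   *runtime.Scheme
	Registry *ServiceMonitorRegistry
}

func (r *CRDWatcherReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	crd := &apiextensionsv1.CustomResourceDefinition{}
	err := r.Client.Get(ctx, req.NamespacedName, crd)
	available := err == nil // a NotFound here means the ServiceMonitor API is gone
	r.Registry.SetAPIAvailable(available)
	if available {
		r.Registry.ReconcileAll(ctx) // apply any specs registered while the CRD was absent
	}
	return ctrl.Result{}, client.IgnoreNotFound(err)
}

func (r *CRDWatcherReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Only react to the ServiceMonitor CRD, not every CRD in the cluster.
	onlyServiceMonitorCRD := predicate.NewPredicateFuncs(func(obj client.Object) bool {
		return obj.GetName() == serviceMonitorCRDName
	})
	return ctrl.NewControllerManagedBy(mgr).
		For(&apiextensionsv1.CustomResourceDefinition{}, builder.WithPredicates(onlyServiceMonitorCRD)).
		Complete(r)
}
```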
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in response to the AI feedback for point 1: having monitoring work without the ServiceMonitor is desirable and should be a requirement. The metrics ports should still be exposed so they can be consumed by Prometheus alternatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@osmman I've changed the pattern so the registry is now injected from main, and added tests covering the new functionality. I also moved the code into the package you suggested. Let me know what you think
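For reference, the injection described here might look roughly like the following fragment in main.go; `NewServiceMonitorRegistry`, the recorder name, and `setupLog` are assumptions for illustration, not identifiers taken from the change.

```golang
// In main.go, after the manager is created: build one registry and share it between the
// CRD watcher and the per-component controllers (constructor and names are assumptions).
registry := monitoring.NewServiceMonitorRegistry(
	mgr.GetClient(),
	mgr.GetEventRecorderFor("service-monitor-registry"),
)
if err := (&monitoring.CRDWatcherReconciler{
	Client:   mgr.GetClient(),
	Scheme:   mgr.GetScheme(),
	Registry: registry,
}).SetupWithManager(mgr); err != nil {
	setupLog.Error(err, "unable to create controller", "controller", "CRDWatcher")
	os.Exit(1)
}
```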
Force-pushed from 000e511 to ecccf74
Force-pushed from ecccf74 to 6f05e7d
Commit: feat: SECURESIGN-3184: ensure ServiceMonitors are only created when supported in the cluster (Signed-off-by: Kevin Conner <[email protected]>)
Force-pushed from 6f05e7d to 4510d9e
feat: SECURESIGN-3184: ensure ServiceMonitors are only created when supported in the cluster

This PR adds a feature to automatically track ServiceMonitor support within a cluster, create status events against the owners of those resources, and reconcile the resources when updated and the API is supported.

Summary by Sourcery

Add a centralized ServiceMonitor framework by introducing a registry, CRD watcher, and generic monitoring action, refactor existing controllers to leverage the new framework, and update RBAC and main.go to support ServiceMonitor API availability.