Conversation

@knrc
Contributor

@knrc knrc commented Nov 10, 2025

…upported in the cluster

This PR adds a feature to automatically track ServiceMonitor support within a cluster, create status events against the owners of the affected resources, and reconcile those resources when they are updated and the API is supported.

Summary by Sourcery

Add a centralized ServiceMonitor framework by introducing a registry, CRD watcher, and generic monitoring action, refactor existing controllers to leverage the new framework, and update RBAC and main.go to support detecting ServiceMonitor API availability.

New Features:

  • Introduce a ServiceMonitorRegistry to centrally manage ServiceMonitor specifications and emit status events
  • Implement a CRDWatcher that detects ServiceMonitor CRD availability and triggers reconciliation when the API is supported
  • Provide a generic MonitoringAction to unify ServiceMonitor creation logic across all controllers

Enhancements:

  • Refactor Trillian, Rekor, CTlog, Fulcio, and TSA controllers to use the new generic monitoring framework
  • Update main.go to set up the CRD watcher and add RBAC permissions for watching CustomResourceDefinitions

Build:

  • Expand RBAC role to allow get/list/watch on apiextensions.k8s.io/customresourcedefinitions

@sourcery-ai

sourcery-ai bot commented Nov 10, 2025

Reviewer's Guide

This PR centralizes and streamlines ServiceMonitor support by introducing a singleton registry, a CRD watcher, and a generic monitoring action. All individual controller monitoring implementations are refactored to use these new abstractions, and the CRD watcher is wired into the manager with the appropriate RBAC updates.

Sequence diagram for ServiceMonitor CRD availability and reconciliation

sequenceDiagram
  participant Controller as "Controller (e.g. Trillian)"
  participant Registry as "ServiceMonitorRegistry"
  participant CRDWatcher as "CRDWatcherReconciler"
  participant K8s as "Kubernetes API"
  participant ServiceMonitor as "ServiceMonitor resource"

  Controller->>Registry: Register ServiceMonitor spec
  CRDWatcher->>K8s: Watch ServiceMonitor CRD
  K8s-->>CRDWatcher: Notify CRD change (add/remove)
  CRDWatcher->>Registry: Set API availability
  Registry->>Controller: Emit event (API available/unavailable)
  alt API available
    Registry->>ServiceMonitor: Reconcile ServiceMonitor resource
    Registry->>Controller: Emit event (reconciled)
  else API unavailable
    Registry->>Controller: Emit event (API unavailable)
  end

Class diagram for ServiceMonitorRegistry and related types

classDiagram
  class ServiceMonitorRegistry {
    - mutex: sync.RWMutex
    - specs: map[types.NamespacedName]*ServiceMonitorSpec
    - notifiedOwners: map[ownerKey]*ServiceMonitorSpec
    - client: client.Client
    - recorder: record.EventRecorder
    - logger: logr.Logger
    - apiAvailable: bool
    + Register(ctx, spec, owner)
    + GetAll()
    + SetAPIAvailable(available)
    + IsAPIAvailable()
    + EmitEventToOwners(ctx, eventType, reason, message)
    + ReconcileAll(ctx)
    + ReconcileOne(ctx, spec)
  }
  class ServiceMonitorSpec {
    + OwnerKey: types.NamespacedName
    + OwnerGVK: schema.GroupVersionKind
    + Namespace: string
    + Name: string
    + EnsureFuncs: []func(*unstructured.Unstructured) error
  }
  class CRDWatcherReconciler {
    + Client: client.Client
    + Scheme: *runtime.Scheme
    + Registry: *ServiceMonitorRegistry
    + Reconcile(ctx, req)
    + SetupWithManager(mgr)
  }
  class MonitoringConfig {
    + ComponentName: string
    + DeploymentName: string
    + MonitoringRoleName: string
    + MetricsPortName: string
    + IsMonitoringEnabled: func(T) bool
    + CustomEndpointBuilder: func(instance T) []func(*unstructured.Unstructured) error
  }
  class genericMonitoringAction {
    + config: MonitoringConfig
    + Name()
    + CanHandle(ctx, instance)
    + Handle(ctx, instance)
  }
  ServiceMonitorRegistry "1" o-- "*" ServiceMonitorSpec
  CRDWatcherReconciler "1" --> "1" ServiceMonitorRegistry
  genericMonitoringAction "1" --> "1" MonitoringConfig
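
For orientation, here is a minimal Go sketch of the registry types implied by the diagram above. Field and method names follow the diagram; the imports, signatures, and locking details are assumptions, so the actual code in internal/controller/monitoring/registry.go may differ.

// Sketch of the registry types implied by the class diagram above.
// Names follow the diagram; exact signatures are assumptions.
package monitoring

import (
	"sync"

	"github.com/go-logr/logr"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/tools/record"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ServiceMonitorSpec captures everything needed to (re)create one ServiceMonitor.
type ServiceMonitorSpec struct {
	OwnerKey    types.NamespacedName
	OwnerGVK    schema.GroupVersionKind
	Namespace   string
	Name        string
	EnsureFuncs []func(*unstructured.Unstructured) error
}

// ServiceMonitorRegistry tracks registered specs and the current CRD availability.
type ServiceMonitorRegistry struct {
	mutex        sync.RWMutex
	specs        map[types.NamespacedName]*ServiceMonitorSpec
	client       client.Client
	recorder     record.EventRecorder
	logger       logr.Logger
	apiAvailable bool
}

// IsAPIAvailable reports whether the ServiceMonitor CRD has been observed.
func (r *ServiceMonitorRegistry) IsAPIAvailable() bool {
	r.mutex.RLock()
	defer r.mutex.RUnlock()
	return r.apiAvailable
}

// SetAPIAvailable records an availability change reported by the CRD watcher.
func (r *ServiceMonitorRegistry) SetAPIAvailable(available bool) {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	r.apiAvailable = available
}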

Flow diagram for controller monitoring registration and reconciliation

flowchart TD
  A["Controller (e.g. Trillian)"] --> B["Create MonitoringConfig"]
  B --> C["NewMonitoringAction"]
  C --> D["Register ServiceMonitorSpec with Registry"]
  D --> E["Registry checks API availability"]
  E -->|API available| F["Reconcile ServiceMonitor"]
  E -->|API unavailable| G["Emit event: API unavailable"]
  F --> H["Emit event: ServiceMonitor reconciled"]
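
To make the flow above concrete, the following is a self-contained sketch of the generic-action pattern. Type and field names follow the diagrams; the real code in internal/controller/monitoring/generic_action.go has a richer action interface and talks to the registry, so treat this purely as an illustration.

// Standalone sketch of the generic monitoring action pattern (illustrative only).
package main

import "fmt"

// namedObject is a stand-in for the operator's CR instance types.
type namedObject interface {
	GetNamespace() string
	GetName() string
}

// MonitoringConfig carries the per-component settings a controller supplies.
type MonitoringConfig[T namedObject] struct {
	ComponentName       string
	DeploymentName      string
	MonitoringRoleName  string
	MetricsPortName     string
	IsMonitoringEnabled func(T) bool
}

// genericMonitoringAction applies one MonitoringConfig to any instance type.
type genericMonitoringAction[T namedObject] struct {
	config MonitoringConfig[T]
}

func NewMonitoringAction[T namedObject](cfg MonitoringConfig[T]) *genericMonitoringAction[T] {
	return &genericMonitoringAction[T]{config: cfg}
}

func (a *genericMonitoringAction[T]) Name() string {
	return a.config.ComponentName + "-monitoring"
}

func (a *genericMonitoringAction[T]) CanHandle(instance T) bool {
	return a.config.IsMonitoringEnabled(instance)
}

func (a *genericMonitoringAction[T]) Handle(instance T) {
	// The real action registers a ServiceMonitorSpec with the registry here
	// and reconciles it once the CRD is available.
	fmt.Printf("would register ServiceMonitor %s for %s/%s\n",
		a.config.ComponentName, instance.GetNamespace(), instance.GetName())
}

// fakeInstance demonstrates the wiring a controller would do.
type fakeInstance struct {
	namespace, name string
	monitoring      bool
}

func (f fakeInstance) GetNamespace() string { return f.namespace }
func (f fakeInstance) GetName() string      { return f.name }

func main() {
	act := NewMonitoringAction(MonitoringConfig[fakeInstance]{
		ComponentName:       "trillian-logserver",
		MetricsPortName:     "metrics",
		IsMonitoringEnabled: func(i fakeInstance) bool { return i.monitoring },
	})
	inst := fakeInstance{namespace: "default", name: "example", monitoring: true}
	if act.CanHandle(inst) {
		act.Handle(inst)
	}
}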

File-Level Changes

Change | Details | Files
Introduce centralized ServiceMonitor registry, CRD watcher, and generic monitoring action
  • Added registry.go with singleton ServiceMonitorRegistry, registration, event emission, and reconciliation logic
  • Implemented crd_watcher.go to track ServiceMonitor CRD availability, emit status events, and trigger registry reconciliation
  • Created generic_action.go to encapsulate controller-agnostic ServiceMonitor creation and registration
internal/controller/monitoring/registry.go
internal/controller/monitoring/crd_watcher.go
internal/controller/monitoring/generic_action.go
Refactor per-controller monitoring into generic action
  • Replaced inline CreateOrUpdate logic with monitoring.NewMonitoringAction calls
  • Passed component, deployment, role, and custom endpoint builders into generic action config
internal/controller/trillian/actions/logserver/monitoring.go
internal/controller/trillian/actions/logsigner/monitoring.go
internal/controller/rekor/actions/server/monitoring.go
internal/controller/rekor/actions/monitor/monitoring.go
internal/controller/ctlog/actions/monitoring.go
internal/controller/fulcio/actions/monitoring.go
internal/controller/tsa/actions/monitoring.go
Wire CRD watcher into controller manager
  • Added apiextensionsv1 scheme registration
  • Instantiated and set up CRDWatcher with manager event recorder and logger
cmd/main.go
Enhance RBAC to allow CRD watching
  • Granted get, list, watch permissions on customresourcedefinitions for ServiceMonitor CRD
config/rbac/role.yaml


@qodo-code-review

qodo-code-review bot commented Nov 10, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance 🟢
No security concerns identified: no security vulnerabilities were detected by AI analysis. Human verification is advised for critical code.
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance 🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed


Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed


Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Action Logging: The new code reconciles ServiceMonitor resources and changes the API availability state, but it does not clearly produce structured audit logs linking these actions to a user ID or equivalent actor; verification is needed that such critical actions are captured by an existing audit trail.

Referred Code
// ReconcileAll creates or updates all registered ServiceMonitors
func (r *ServiceMonitorRegistry) ReconcileAll(ctx context.Context) error {
	specs := r.GetAll()

	r.logger.Info("Reconciling all ServiceMonitors", "count", len(specs))

	var errs []error
	for _, spec := range specs {
		if err := r.ReconcileOne(ctx, spec); err != nil {
			errs = append(errs, err)
		}
	}

	if len(errs) > 0 {
		return fmt.Errorf("failed to reconcile %d ServiceMonitors: %v", len(errs), errs)
	}

	return nil
}

// getOwnerObject retrieves the owner object for a ServiceMonitor spec


 ... (clipped 49 lines)


Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Error Handling: The reconcile path logs errors and continues without surfacing actionable context to owners beyond generic messages, and it ignores the aggregate error from ReconcileAll, which may mask partial failures.

Referred Code
if available {
	logger.Info("ServiceMonitor API is available, reconciling all registered ServiceMonitors")
	if err := r.Registry.ReconcileAll(ctx); err != nil {
		logger.Error(err, "Failed to reconcile ServiceMonitors")
	}
}

return reconcile.Result{}, nil


Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • The kubernetes.IsServiceMonitorAvailable function is never used—either integrate it into the APIService watcher or remove it to avoid dead code.
  • Specs in the ServiceMonitorRegistry are only cleaned up when reconciliation discovers a missing owner; consider adding a watch or finalizer on the custom resource deletion to proactively remove stale specs and orphaned ServiceMonitors.
  • In genericMonitoringAction.Handle you always call registry.ReconcileOne even if the ServiceMonitor API isn’t available—short‐circuiting reconciliation when registry.IsAPIAvailable() is false would reduce unnecessary error logs and no-op calls.
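
A minimal sketch of the short circuit suggested in the last point above. The surrounding action plumbing (the result type, Continue(), Logger, and the buildSpec helper) is assumed from the review context rather than taken from the actual diff:

// Hypothetical short circuit in genericMonitoringAction.Handle: always register
// the spec, but skip reconciliation while the ServiceMonitor CRD is absent.
func (i *genericMonitoringAction[T]) Handle(ctx context.Context, instance T) *action.Result {
	spec := i.buildSpec(instance) // assumed helper that assembles the ServiceMonitorSpec
	i.registry.Register(ctx, spec, instance)

	if !i.registry.IsAPIAvailable() {
		// The CRD watcher will reconcile this spec once the API appears.
		return i.Continue()
	}

	if err := i.registry.ReconcileOne(ctx, spec); err != nil {
		i.Logger.Error(err, "failed to reconcile ServiceMonitor", "name", spec.Name)
	}
	return i.Continue()
}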
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The kubernetes.IsServiceMonitorAvailable function is never used—either integrate it into the APIService watcher or remove it to avoid dead code.
- Specs in the ServiceMonitorRegistry are only cleaned up when reconciliation discovers a missing owner; consider adding a watch or finalizer on the custom resource deletion to proactively remove stale specs and orphaned ServiceMonitors.
- In genericMonitoringAction.Handle you always call registry.ReconcileOne even if the ServiceMonitor API isn’t available—short‐circuiting reconciliation when registry.IsAPIAvailable() is false would reduce unnecessary error logs and no-op calls.


@qodo-code-review

qodo-code-review bot commented Nov 10, 2025

PR Code Suggestions ✨

Latest suggestions up to 000e511

Category | Suggestion | Impact
Possible issue
Prevent nil condition panic

In CanHandle, add a nil check for the condition variable c before accessing its
Reason field to prevent a potential panic.

internal/controller/monitoring/generic_action.go [45-48]

 func (i *genericMonitoringAction[T]) CanHandle(ctx context.Context, instance T) bool {
 	c := meta.FindStatusCondition(instance.GetConditions(), constants.Ready)
+	if c == nil {
+		return false
+	}
 	return (c.Reason == constants.Creating || c.Reason == constants.Ready) && i.config.IsMonitoringEnabled(instance)
 }
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a potential nil pointer dereference in CanHandle which would cause the controller to panic if the Ready condition is not yet present on a resource.

Medium
General
Treat owner not-found as non-error

In getOwnerObject, when an owner is not found, return nil, nil after performing
cleanup to avoid generating misleading error logs.

internal/controller/monitoring/registry.go [203-216]

 func (r *ServiceMonitorRegistry) getOwnerObject(ctx context.Context, spec *ServiceMonitorSpec) (client.Object, error) {
 	owner := &unstructured.Unstructured{}
 	owner.SetGroupVersionKind(spec.OwnerGVK)
 
 	if err := r.client.Get(ctx, spec.OwnerKey, owner); err != nil {
 		if client.IgnoreNotFound(err) == nil {
 			r.logger.Info("Owner not found, cleaning up registry", "owner", spec.OwnerKey)
 			r.cleanupOwner(spec)
+			// Not found is expected during deletions; return nil without error
+			return nil, nil
 		}
 		return nil, fmt.Errorf("failed to get owner object: %w", err)
 	}
 
 	return owner, nil
 }
Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that returning an error when an owner is not found (a normal part of resource deletion) leads to misleading error logs, and proposes a good improvement to handle this case gracefully.

Medium

Previous suggestions

✅ Suggestions up to commit 233c1da
Category | Suggestion | Impact
High-level
The detection mechanism is incorrect

The current ServiceMonitor detection mechanism is flawed. It incorrectly watches
for an APIService instead of the more reliable CustomResourceDefinition (CRD),
which will cause detection to fail in standard Prometheus Operator
installations.

Examples:

internal/controller/monitoring/apiservice_watcher.go [20]
	ServiceMonitorAPIServiceName = "v1.monitoring.coreos.com"
cmd/main.go [223-236]
	apiWatcher, err := monitoring.NewAPIServiceWatcher(
		mgr.GetClient(),
		mgr.GetScheme(),
		mgr.GetEventRecorderFor("servicemonitor-watcher"),
		setupLog,
	)
	if err != nil {
		setupLog.Error(err, "unable to create APIService watcher")
		os.Exit(1)
	}

 ... (clipped 4 lines)

Solution Walkthrough:

Before:

// cmd/main.go
// Setup APIService watcher for ServiceMonitor availability
apiWatcher, err := monitoring.NewAPIServiceWatcher(...)
if err := apiWatcher.SetupWithManager(mgr); err != nil {
    // ...
}

// internal/controller/monitoring/apiservice_watcher.go
const ServiceMonitorAPIServiceName = "v1.monitoring.coreos.com"

func (r *APIServiceWatcherReconciler) Reconcile(ctx, req) (ctrl.Result, error) {
    if req.Name != ServiceMonitorAPIServiceName {
        return reconcile.Result{}, nil
    }
    apiService := &apiregistrationv1.APIService{}
    err := r.Get(ctx, req.NamespacedName, apiService)
    if client.IgnoreNotFound(err) == nil {
        // Assumes API is unavailable because APIService is not found
        r.Registry.SetAPIAvailable(false)
    }
    // ...
}

After:

// A possible implementation using the discovery client

// in main.go or a periodic checker
ticker := time.NewTicker(1 * time.Minute)
go func() {
    for range ticker.C {
        // Use discovery client to check for the resource
        gvr := schema.GroupVersionResource{
            Group:    "monitoring.coreos.com",
            Version:  "v1",
            Resource: "servicemonitors",
        }
        available, err := isResourceAvailable(discoveryClient, gvr)
        if err == nil {
            registry.SetAPIAvailable(available)
            if available {
                registry.ReconcileAll(context.Background())
            }
        }
    }
}()
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a critical design flaw where the PR checks for an APIService instead of a CustomResourceDefinition to detect ServiceMonitor support, which will cause the feature to fail in most standard configurations.

High
Possible issue
Fix potential deadlock in event emission
Suggestion Impact:The commit implements the proposed pattern: it removes the deferred unlock, copies r.notifiedOwners into a local slice, unlocks, and then iterates over the slice to emit events, preventing a potential deadlock.

code diff:

 	r.mutex.RLock()
-	defer r.mutex.RUnlock()
-
+	specsToNotify := make([]*ServiceMonitorSpec, 0, len(r.notifiedOwners))
 	for _, spec := range r.notifiedOwners {
+		specsToNotify = append(specsToNotify, spec)
+	}
+	r.mutex.RUnlock()
+
+	for _, spec := range specsToNotify {
 		r.emitEventToOwner(ctx, spec, eventType, reason, message)
 	}

Refactor the EmitEventToOwners function to prevent a potential deadlock by first
collecting specs into a local slice under a read lock, then releasing the lock
before iterating and emitting events.

internal/controller/monitoring/registry.go [164-172]

 // EmitEventToOwners emits an event to all unique owner objects that have registered ServiceMonitors
 func (r *ServiceMonitorRegistry) EmitEventToOwners(ctx context.Context, eventType, reason, message string) {
 	r.mutex.RLock()
-	defer r.mutex.RUnlock()
+	specsToNotify := make([]*ServiceMonitorSpec, 0, len(r.notifiedOwners))
+	for _, spec := range r.notifiedOwners {
+		specsToNotify = append(specsToNotify, spec)
+	}
+	r.mutex.RUnlock()
 
-	for _, spec := range r.notifiedOwners {
+	for _, spec := range specsToNotify {
 		r.emitEventToOwner(ctx, spec, eventType, reason, message)
 	}
 }


Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a critical potential deadlock in the EmitEventToOwners function and proposes a standard and effective solution to prevent it.

High
General
Simplify API availability detection logic

Simplify the Reconcile function's logic by consolidating the APIService
availability check. Treat a "not found" error as the API being unavailable to
create a more unified control flow.

internal/controller/monitoring/apiservice_watcher.go [42-92]

 // Reconcile processes APIService changes for the ServiceMonitor API
 func (r *APIServiceWatcherReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
 	if req.Name != ServiceMonitorAPIServiceName {
 		return reconcile.Result{}, nil
 	}
 
 	logger := log.FromContext(ctx)
 	logger.Info("Detected ServiceMonitor APIService change")
 
 	apiService := &apiregistrationv1.APIService{}
 	err := r.Get(ctx, req.NamespacedName, apiService)
 
+	available := false
 	if err != nil {
-		if client.IgnoreNotFound(err) == nil {
-			logger.Info("ServiceMonitor APIService not found - Prometheus Operator not installed")
-			previouslyAvailable := r.Registry.IsAPIAvailable()
-			r.Registry.SetAPIAvailable(false)
-
-			if previouslyAvailable {
-				logger.Info("ServiceMonitor API is no longer available")
-				r.Registry.EmitEventToOwners(ctx, "Warning", "ServiceMonitorAPIUnavailable", "ServiceMonitor API is no longer available")
-			}
-			return reconcile.Result{}, nil
+		if client.IgnoreNotFound(err) != nil {
+			logger.Error(err, "Failed to get APIService")
+			return reconcile.Result{}, err
 		}
-		logger.Error(err, "Failed to get APIService")
-		return reconcile.Result{}, err
+		logger.Info("ServiceMonitor APIService not found - Prometheus Operator not installed")
+	} else {
+		available = isAPIServiceAvailable(apiService)
 	}
 
-	available := isAPIServiceAvailable(apiService)
 	previouslyAvailable := r.Registry.IsAPIAvailable()
 	r.Registry.SetAPIAvailable(available)
 
 	if available != previouslyAvailable {
 		if available {
 			logger.Info("ServiceMonitor API is now available")
 			r.Registry.EmitEventToOwners(ctx, "Normal", "ServiceMonitorAPIAvailable", "ServiceMonitor API is now available")
 		} else {
 			logger.Info("ServiceMonitor API is no longer available")
 			r.Registry.EmitEventToOwners(ctx, "Warning", "ServiceMonitorAPIUnavailable", "ServiceMonitor API is no longer available")
 		}
 	}
 
 	if available {
 		logger.Info("ServiceMonitor API is available, reconciling all registered ServiceMonitors")
 		if err := r.Registry.ReconcileAll(ctx); err != nil {
 			logger.Error(err, "Failed to reconcile ServiceMonitors")
 		}
 	}
 
 	return reconcile.Result{}, nil
 }
Suggestion importance[1-10]: 5


Why: The suggestion proposes a valid refactoring that simplifies the control flow in the Reconcile function, making the code slightly cleaner and more readable by consolidating availability checks.

Low

@knrc knrc force-pushed the securesign-3184 branch 2 times, most recently from ebedc9e to 000e511 on November 10, 2025 at 16:33
@knrc
Contributor Author

knrc commented Nov 10, 2025

Note: following the qodo review I switched the code from monitoring the APIService to monitoring the CRD. This removes the race between the APIService being created and the CRD being created.

@knrc
Contributor Author

knrc commented Nov 10, 2025

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • The ServiceMonitorRegistry is implemented as a package‐level singleton, which can make tests and interdependent reconcilers harder to manage—consider injecting or scoping the registry per‐manager instead of using a global instance.
  • In genericMonitoringAction.Handle you always call registry.ReconcileOne (which errors when the CRD is absent) and just log the error—consider checking registry.IsAPIAvailable first or surfacing a retryable error so the action doesn’t silently skip reconciliation.
  • Currently owner cleanup only happens when a Get fails in getOwnerObject; you may want to actively watch owner deletions or add a finalizer to proactively deregister specs rather than waiting for lazy garbage‐collection.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The ServiceMonitorRegistry is implemented as a package‐level singleton, which can make tests and interdependent reconcilers harder to manage—consider injecting or scoping the registry per‐manager instead of using a global instance.
- In genericMonitoringAction.Handle you always call registry.ReconcileOne (which errors when the CRD is absent) and just log the error—consider checking registry.IsAPIAvailable first or surfacing a retryable error so the action doesn’t silently skip reconciliation.
- Currently owner cleanup only happens when a Get fails in getOwnerObject; you may want to actively watch owner deletions or add a finalizer to proactively deregister specs rather than waiting for lazy garbage‐collection.

## Individual Comments

### Comment 1
<location> `internal/controller/monitoring/registry.go:123` </location>
<code_context>
+}
+
+// cleanupOwner removes all specs for an owner and clears it from notifiedOwners
+func (r *ServiceMonitorRegistry) cleanupOwner(spec *ServiceMonitorSpec) {
+	r.mutex.Lock()
+	defer r.mutex.Unlock()
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Cleanup logic may leave orphaned specs if multiple ServiceMonitors share the same owner.

Currently, removing a spec only deletes that spec and the owner from notifiedOwners, potentially leaving other specs for the same owner. Please update the logic to remove all specs associated with the owner, or document that only one spec per owner is allowed.

Suggested implementation:

```golang
 // cleanupOwner removes all specs for an owner and clears it from notifiedOwners

```

```golang
func (r *ServiceMonitorRegistry) cleanupOwner(spec *ServiceMonitorSpec) {
	r.mutex.Lock()
	defer r.mutex.Unlock()

	ownerKey := ownerKey{NamespacedName: spec.OwnerKey, GVK: spec.OwnerGVK}

	// Remove all specs associated with this owner
	for key, s := range r.specs {
		if s.OwnerKey == spec.OwnerKey && s.OwnerGVK == spec.OwnerGVK {
			delete(r.specs, key)
		}
	}

	// Remove owner from notifiedOwners
	delete(r.notifiedOwners, ownerKey)
}

```
</issue_to_address>

### Comment 2
<location> `internal/controller/monitoring/registry.go:165` </location>
<code_context>
+}
+
+// EmitEventToOwners emits an event to all unique owner objects that have registered ServiceMonitors
+func (r *ServiceMonitorRegistry) EmitEventToOwners(ctx context.Context, eventType, reason, message string) {
+	r.mutex.RLock()
+	specsToNotify := make([]*ServiceMonitorSpec, 0, len(r.notifiedOwners))
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Emitting events to owners may fail silently if owner objects are deleted.

Consider whether errors from deleted owner objects should trigger additional cleanup or notification to maintain registry consistency with cluster state.

Suggested implementation:

```golang
	for _, spec := range specsToNotify {
		err := r.emitEventToOwner(ctx, spec, eventType, reason, message)
		if err != nil {
			// Check if error is due to owner object being deleted (NotFound)
			if errors.IsNotFound(err) {
				// Remove owner from registry to maintain consistency
				r.mutex.Lock()
				delete(r.notifiedOwners, spec.OwnerKey())
				r.mutex.Unlock()
				// Optionally log the cleanup
				klog.Infof("Removed deleted owner %s from ServiceMonitorRegistry", spec.OwnerKey())
			} else {
				// Optionally log other errors
				klog.Errorf("Failed to emit event to owner %s: %v", spec.OwnerKey(), err)
			}
		}
	}
}

```

1. Ensure that `emitEventToOwner` returns an error (update its signature and implementation if necessary).
2. Import `"k8s.io/apimachinery/pkg/api/errors"` and `"k8s.io/klog/v2"` if not already present.
3. Implement or verify the existence of `OwnerKey()` on `ServiceMonitorSpec` for unique identification.
</issue_to_address>

### Comment 3
<location> `internal/controller/monitoring/registry.go:229` </location>
<code_context>
+}
+
+// ReconcileOne creates or updates a single ServiceMonitor
+func (r *ServiceMonitorRegistry) ReconcileOne(ctx context.Context, spec *ServiceMonitorSpec) error {
+	if !r.IsAPIAvailable() {
+		return fmt.Errorf("ServiceMonitor API not available")
</code_context>

<issue_to_address>
**issue (bug_risk):** ReconcileOne does not check for nil EnsureFuncs, which could cause a panic.

Add a check to ensure spec.EnsureFuncs is not nil before invoking CreateOrUpdate to prevent a potential panic.
</issue_to_address>


@knrc knrc requested a review from osmman November 10, 2025 17:10
Collaborator

@osmman osmman left a comment


This PR introduces unnecessary complexity with stateful registries and CRD watchers. The implementation doesn't follow Kubernetes operator best practices and creates situations where monitoring.enabled: true succeeds even when monitoring isn't actually enabled, so the status does not reflect the actual state of the system.

Critical issues:

1. Configuration mismatch
When monitoring.enabled: true but the ServiceMonitor CRD doesn't exist, reconciliation returns success. The action always returns i.Continue() regardless of whether the ServiceMonitor was created.

This means:

  • Status shows Ready: true
  • Spec declares monitoring.enabled: true
  • But monitoring is NOT actually enabled ❌

Expected behavior:

  • If monitoring.enabled: true and ServiceMonitor CRD is not available, reconciliation should fail with an error
  • The Status condition should indicate the problem (e.g., reason: ServiceMonitorCRDNotAvailable)
  • An event should be emitted explaining the issue
  • Reconciliation should retry until the user either:
    • Installs the ServiceMonitor CRD, or
    • Sets monitoring.enabled: false

2. Stateful registry pattern
The implementation uses stateful in-memory registries (ServiceMonitorRegistry). Operators should be stateless and rely on the Kubernetes API for current state.

Using the singleton pattern also makes the code harder to test.

3. Permissions and CRD reconciler
The implementation requires cluster-wide permissions to watch all CRDs, which increases the operator's privilege footprint.

Simply attempt to create the ServiceMonitor. If the API is not available, the Kubernetes API server will return an error. Handle that error appropriately and update the Status.
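
A rough sketch of that alternative, assuming a controller-runtime client. With no ServiceMonitor CRD installed, the create typically fails with a "no matches for kind" error that meta.IsNoMatchError can detect; the status and event handling around it is only indicated in comments.

package monitoring

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureServiceMonitor attempts the create and classifies a missing-CRD failure.
func ensureServiceMonitor(ctx context.Context, c client.Client, sm *unstructured.Unstructured) error {
	err := c.Create(ctx, sm)
	if meta.IsNoMatchError(err) {
		// No REST mapping for monitoring.coreos.com/v1 ServiceMonitor: the CRD
		// is not installed. The caller would set a Ready=False condition with
		// reason ServiceMonitorCRDNotAvailable and emit a Warning event here.
		return err
	}
	// Ignore "already exists"; a full implementation would use CreateOrUpdate.
	return client.IgnoreAlreadyExists(err)
}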

Proposed solution:

1. improve error handling
Improve error handling and messages for actions to identify missing ServiceMonitor CRD and produce well informed error message in CRD status, Warning event and Operator log.

2. Implement monitoring cleanup functionality
The current implementation disables the false → true transition of the monitoring.enabled field via validation rules. That limitation will make it harder for users to fix broken instances (unfortunately the default value is true), so we need to implement cleanup logic that deletes ServiceMonitor resources when monitoring is disabled.

3. Improve API documentation
I think part of the current problem is the lack of clear documentation for the monitoring.enabled field. We should document that, when enabled, the operator creates ServiceMonitor resources to expose metrics for the Prometheus Operator to scrape, and explain the requirements that apply when it is enabled.

Expected user experience after changes

# Scenario: User creates instance without Prometheus Operator
$ kubectl apply -f fulcio.yaml

$ kubectl get fulcio my-fulcio
NAME        READY   REASON
my-fulcio   False   ServiceMonitorCRDNotAvailable

# Clear error message in status
$ kubectl get fulcio my-fulcio -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
ServiceMonitor CRD is not installed. To resolve: (1) Install Prometheus Operator, or (2) Set spec.monitoring.enabled=false

# User can easily fix by disabling monitoring
$ kubectl patch fulcio my-fulcio --type=merge -p '{"spec":{"monitoring":{"enabled":false}}}'
# Reconciliation succeeds, ServiceMonitor deleted

# Or by installing Prometheus Operator
$ kubectl apply -f prometheus-operator.yaml
# Next reconciliation succeeds, ServiceMonitor created

Collaborator


Move to internal/action/monitoring/ to be consistent with other shared actions in the codebase. Add unit tests to verify the generic behavior.

Contributor Author


In response to the AI feedback:
1 - I don't believe reconciliation should fail, since it's more important for TAS to be functional than to wait for the monitoring. Given this, I believe returning i.Continue is the right response, along with events reporting the current status. The events on the owned resources will keep the user informed, and the code will create the ServiceMonitor if/when Prometheus is installed.
2 - Agreed about the singleton pattern; this can be changed to ease testing. The internal data is driven by the watches and is intended to reduce API usage.
3 - Creating a resource is not a replacement for watching CRDs. Creating a resource would require polling/retries, whereas the watch reacts dynamically to the creation of the CRD.

In summary, for the AI feedback I'll address the singleton pattern and push the branch up again.

In response to the personal feedback, I can move the code to internal/action/monitoring and add the unit tests.

Contributor Author


Also, in response to the AI feedback on point 1:

Having monitoring work without the ServiceMonitor is desirable and should be a requirement. The metrics ports should still be exposed so they can be consumed by Prometheus alternatives.

Contributor Author


@osmman I've changed the pattern so the registry is now injected from main, and I've added tests covering the new functionality. I also moved the code into the package you suggested. Let me know what you think.
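
For reference, the wiring in cmd/main.go presumably now looks roughly like the excerpt below. The constructor and variable names are assumptions; only the Client/Scheme/Registry fields and SetupWithManager come from the earlier class diagram.

// Hypothetical cmd/main.go excerpt: build the registry once and inject it.
registry := monitoring.NewServiceMonitorRegistry(
	mgr.GetClient(),
	mgr.GetEventRecorderFor("servicemonitor-registry"),
	setupLog,
)

crdWatcher := &monitoring.CRDWatcherReconciler{
	Client:   mgr.GetClient(),
	Scheme:   mgr.GetScheme(),
	Registry: registry,
}
if err := crdWatcher.SetupWithManager(mgr); err != nil {
	setupLog.Error(err, "unable to set up CRD watcher")
	os.Exit(1)
}
// The same registry instance is then passed to the controllers that create
// monitoring actions, instead of them reaching for a package-level singleton.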

@knrc knrc requested a review from osmman November 13, 2025 15:33