feat: SECURESIGN-3184: ensure ServiceMonitors are only created when supported in the cluster #1440
base: main
Conversation
Reviewer's Guide

This PR centralizes and streamlines ServiceMonitor support by introducing a singleton registry, a CRD watcher, and a generic monitoring action, then refactoring all individual controller monitoring implementations to leverage these new abstractions, and wiring the CRD watcher into the manager with appropriate RBAC updates.

Sequence diagram for ServiceMonitor CRD availability and reconciliation

```mermaid
sequenceDiagram
participant Controller as "Controller (e.g. Trillian)"
participant Registry as "ServiceMonitorRegistry"
participant CRDWatcher as "CRDWatcherReconciler"
participant K8s as "Kubernetes API"
participant ServiceMonitor as "ServiceMonitor resource"
Controller->>Registry: Register ServiceMonitor spec
CRDWatcher->>K8s: Watch ServiceMonitor CRD
K8s-->>CRDWatcher: Notify CRD change (add/remove)
CRDWatcher->>Registry: Set API availability
Registry->>Controller: Emit event (API available/unavailable)
alt API available
Registry->>ServiceMonitor: Reconcile ServiceMonitor resource
Registry->>Controller: Emit event (reconciled)
else API unavailable
Registry->>Controller: Emit event (API unavailable)
end
```

Class diagram for ServiceMonitorRegistry and related types

```mermaid
classDiagram
class ServiceMonitorRegistry {
- mutex: sync.RWMutex
- specs: map[types.NamespacedName]*ServiceMonitorSpec
- notifiedOwners: map[ownerKey]*ServiceMonitorSpec
- client: client.Client
- recorder: record.EventRecorder
- logger: logr.Logger
- apiAvailable: bool
+ Register(ctx, spec, owner)
+ GetAll()
+ SetAPIAvailable(available)
+ IsAPIAvailable()
+ EmitEventToOwners(ctx, eventType, reason, message)
+ ReconcileAll(ctx)
+ ReconcileOne(ctx, spec)
}
class ServiceMonitorSpec {
+ OwnerKey: types.NamespacedName
+ OwnerGVK: schema.GroupVersionKind
+ Namespace: string
+ Name: string
+ EnsureFuncs: []func(*unstructured.Unstructured) error
}
class CRDWatcherReconciler {
+ Client: client.Client
+ Scheme: *runtime.Scheme
+ Registry: *ServiceMonitorRegistry
+ Reconcile(ctx, req)
+ SetupWithManager(mgr)
}
class MonitoringConfig {
+ ComponentName: string
+ DeploymentName: string
+ MonitoringRoleName: string
+ MetricsPortName: string
+ IsMonitoringEnabled: func(T) bool
+ CustomEndpointBuilder: func(instance T) []func(*unstructured.Unstructured) error
}
class genericMonitoringAction {
+ config: MonitoringConfig
+ Name()
+ CanHandle(ctx, instance)
+ Handle(ctx, instance)
}
ServiceMonitorRegistry "1" o-- "*" ServiceMonitorSpec
CRDWatcherReconciler "1" --> "1" ServiceMonitorRegistry
genericMonitoringAction "1" --> "1" MonitoringConfig
```

Flow diagram for controller monitoring registration and reconciliation

```mermaid
flowchart TD
A["Controller (e.g. Trillian)"] --> B["Create MonitoringConfig"]
B --> C["NewMonitoringAction"]
C --> D["Register ServiceMonitorSpec with Registry"]
D --> E["Registry checks API availability"]
E -->|API available| F["Reconcile ServiceMonitor"]
E -->|API unavailable| G["Emit event: API unavailable"]
F --> H["Emit event: ServiceMonitor reconciled"]
```
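To make the diagrams concrete, here is a minimal, self-contained sketch of how a component controller might describe its monitoring needs to the generic action. Field names follow the class diagram above; the `TrillianLogServer` stand-in type, the port and role values, and the simplified `NewMonitoringAction` signature are illustrative assumptions rather than code from this PR.

```golang
package main

import "fmt"

// TrillianLogServer stands in for a real operand CR (assumption for illustration only).
type TrillianLogServer struct {
	Namespace, Name   string
	MonitoringEnabled bool
}

// MonitoringConfig mirrors the fields shown in the class diagram, specialised to one type.
type MonitoringConfig struct {
	ComponentName       string
	DeploymentName      string
	MonitoringRoleName  string
	MetricsPortName     string
	IsMonitoringEnabled func(TrillianLogServer) bool
}

// NewMonitoringAction would normally return an action wired into the controller's action
// chain; this sketch just reports what would be registered with the registry.
func NewMonitoringAction(cfg MonitoringConfig) func(TrillianLogServer) {
	return func(instance TrillianLogServer) {
		if !cfg.IsMonitoringEnabled(instance) {
			return // monitoring disabled on this instance, nothing to register
		}
		fmt.Printf("register ServiceMonitor %s/%s for component %s (metrics port %q)\n",
			instance.Namespace, cfg.DeploymentName, cfg.ComponentName, cfg.MetricsPortName)
	}
}

func main() {
	cfg := MonitoringConfig{
		ComponentName:       "trillian-logserver",
		DeploymentName:      "trillian-logserver",
		MonitoringRoleName:  "prometheus-k8s-trillian",
		MetricsPortName:     "metrics",
		IsMonitoringEnabled: func(i TrillianLogServer) bool { return i.MonitoringEnabled },
	}
	handle := NewMonitoringAction(cfg)
	handle(TrillianLogServer{Namespace: "trillian-system", Name: "securesign-sample", MonitoringEnabled: true})
}
```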
PR Compliance Guide 🔍
Hey there - I've reviewed your changes - here's some feedback:
- The kubernetes.IsServiceMonitorAvailable function is never used—either integrate it into the APIService watcher or remove it to avoid dead code.
- Specs in the ServiceMonitorRegistry are only cleaned up when reconciliation discovers a missing owner; consider adding a watch or finalizer on the custom resource deletion to proactively remove stale specs and orphaned ServiceMonitors.
- In genericMonitoringAction.Handle you always call registry.ReconcileOne even if the ServiceMonitor API isn’t available—short‐circuiting reconciliation when registry.IsAPIAvailable() is false would reduce unnecessary error logs and no-op calls.
PR Code Suggestions ✨ (latest suggestions up to 000e511; previous suggestions up to commit 233c1da)
Force-pushed from ebedc9e to 000e511
Note: following the Qodo review I switched the code from monitoring the APIService to monitoring the CRD. This removes the race between the APIService being created and the CRD being created.
@sourcery-ai review
Hey there - I've reviewed your changes - here's some feedback:
- The ServiceMonitorRegistry is implemented as a package‐level singleton, which can make tests and interdependent reconcilers harder to manage—consider injecting or scoping the registry per‐manager instead of using a global instance.
- In genericMonitoringAction.Handle you always call registry.ReconcileOne (which errors when the CRD is absent) and just log the error—consider checking registry.IsAPIAvailable first or surfacing a retryable error so the action doesn’t silently skip reconciliation (a sketch of this guard follows after this list).
- Currently owner cleanup only happens when a Get fails in getOwnerObject; you may want to actively watch owner deletions or add a finalizer to proactively deregister specs rather than waiting for lazy garbage‐collection.
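A possible shape for that short-circuit, sketched as a fragment of genericMonitoringAction.Handle; how the action reaches the registry, the `instanceSpec` variable, and the logger are placeholders rather than names taken from the PR:

```golang
// Inside Handle: skip reconciliation entirely while the ServiceMonitor API is absent.
if !i.registry.IsAPIAvailable() {
	// The CRD watcher re-triggers ReconcileAll once the ServiceMonitor CRD appears,
	// so there is no need to call ReconcileOne here and log its guaranteed failure.
	return i.Continue()
}
if err := i.registry.ReconcileOne(ctx, instanceSpec); err != nil {
	logger.Error(err, "failed to reconcile ServiceMonitor")
}
return i.Continue()
```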
## Individual Comments
### Comment 1
<location> `internal/controller/monitoring/registry.go:123` </location>
<code_context>
+}
+
+// cleanupOwner removes all specs for an owner and clears it from notifiedOwners
+func (r *ServiceMonitorRegistry) cleanupOwner(spec *ServiceMonitorSpec) {
+ r.mutex.Lock()
+ defer r.mutex.Unlock()
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Cleanup logic may leave orphaned specs if multiple ServiceMonitors share the same owner.
Currently, removing a spec only deletes that spec and the owner from notifiedOwners, potentially leaving other specs for the same owner. Please update the logic to remove all specs associated with the owner, or document that only one spec per owner is allowed.
Suggested implementation:
```golang
// cleanupOwner removes all specs for an owner and clears it from notifiedOwners
func (r *ServiceMonitorRegistry) cleanupOwner(spec *ServiceMonitorSpec) {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	ownerKey := ownerKey{NamespacedName: spec.OwnerKey, GVK: spec.OwnerGVK}
	// Remove all specs associated with this owner
	for key, s := range r.specs {
		if s.OwnerKey == spec.OwnerKey && s.OwnerGVK == spec.OwnerGVK {
			delete(r.specs, key)
		}
	}
	// Remove owner from notifiedOwners
	delete(r.notifiedOwners, ownerKey)
}
```
</issue_to_address>
### Comment 2
<location> `internal/controller/monitoring/registry.go:165` </location>
<code_context>
+}
+
+// EmitEventToOwners emits an event to all unique owner objects that have registered ServiceMonitors
+func (r *ServiceMonitorRegistry) EmitEventToOwners(ctx context.Context, eventType, reason, message string) {
+ r.mutex.RLock()
+ specsToNotify := make([]*ServiceMonitorSpec, 0, len(r.notifiedOwners))
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Emitting events to owners may fail silently if owner objects are deleted.
Consider whether errors from deleted owner objects should trigger additional cleanup or notification to maintain registry consistency with cluster state.
Suggested implementation:
```golang
	for _, spec := range specsToNotify {
		err := r.emitEventToOwner(ctx, spec, eventType, reason, message)
		if err != nil {
			// Check if error is due to owner object being deleted (NotFound)
			if errors.IsNotFound(err) {
				// Remove owner from registry to maintain consistency
				r.mutex.Lock()
				delete(r.notifiedOwners, spec.OwnerKey())
				r.mutex.Unlock()
				// Optionally log the cleanup
				klog.Infof("Removed deleted owner %s from ServiceMonitorRegistry", spec.OwnerKey())
			} else {
				// Optionally log other errors
				klog.Errorf("Failed to emit event to owner %s: %v", spec.OwnerKey(), err)
			}
		}
	}
}
```
1. Ensure that `emitEventToOwner` returns an error (update its signature and implementation if necessary).
2. Import `"k8s.io/apimachinery/pkg/api/errors"` and `"k8s.io/klog/v2"` if not already present.
3. Implement or verify the existence of `OwnerKey()` on `ServiceMonitorSpec` for unique identification.
</issue_to_address>
### Comment 3
<location> `internal/controller/monitoring/registry.go:229` </location>
<code_context>
+}
+
+// ReconcileOne creates or updates a single ServiceMonitor
+func (r *ServiceMonitorRegistry) ReconcileOne(ctx context.Context, spec *ServiceMonitorSpec) error {
+ if !r.IsAPIAvailable() {
+ return fmt.Errorf("ServiceMonitor API not available")
</code_context>
<issue_to_address>
**issue (bug_risk):** ReconcileOne does not check for nil EnsureFuncs, which could cause a panic.
Add a check to ensure spec.EnsureFuncs is not nil before invoking CreateOrUpdate to prevent a potential panic.
</issue_to_address>
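For context on why the guard matters: in Go, iterating a nil slice is harmless, but invoking a nil function element (or handing nil mutation functions to a create-or-update helper) panics at runtime, so validating `EnsureFuncs` up front is cheap insurance. A tiny self-contained illustration, not the PR's code:

```golang
package main

import "fmt"

// apply runs a list of ensure-style functions, guarding against nil entries instead of
// letting a nil function call panic.
func apply(fns []func() error) error {
	for _, fn := range fns {
		if fn == nil {
			return fmt.Errorf("nil ensure function") // guard instead of panicking
		}
		if err := fn(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(apply(nil))                                         // <nil>: a nil slice is a no-op
	fmt.Println(apply([]func() error{nil}))                         // error instead of a panic
	fmt.Println(apply([]func() error{func() error { return nil }})) // <nil>
}
```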
osmman left a comment:
This PR introduces unnecessary complexity with stateful registries and CRD watchers. The implementation doesn't follow Kubernetes operator best practices and creates situations where monitoring.enabled: true succeeds even when monitoring isn't actually enabled (not reflecting the actual state of the system).
Critical issues:
1. Configuration mismatch
When monitoring.enabled: true but ServiceMonitor CRD doesn't exist, reconciliation returns success. The action always returns i.Continue() regardless of whether ServiceMonitor was created.
This means:
- Status shows `Ready: true`
- Spec declares `monitoring.enabled: true`
- But monitoring is NOT actually enabled ❌
Expected behavior:
- If `monitoring.enabled: true` and the ServiceMonitor CRD is not available, reconciliation should fail with an error
- The Status condition should indicate the problem (e.g., `reason: ServiceMonitorCRDNotAvailable`)
- An event should be emitted explaining the issue
- Reconciliation should retry until the user either:
  - Installs the ServiceMonitor CRD, or
  - Sets `monitoring.enabled: false`
2. Stateful registry pattern
The implementation uses stateful in-memory registries (ServiceMonitorRegistry). Operators should be stateless and rely on the Kubernetes API for current state.
Using the singleton pattern also makes the code harder to test.
3. Permissions and CRD reconciler
The implementation requires cluster-wide permissions to watch all CRDs, which increases the operator's privilege footprint.
Simply attempt to create the ServiceMonitor. If the API is not available, the Kubernetes API server will return an error. Handle that error appropriately and update the Status.
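A rough sketch of that create-and-handle-the-error approach, assuming the operator builds the ServiceMonitor as an unstructured object (as the EnsureFuncs signature in this PR suggests); the function name, the sentinel error, and the omitted spec body are illustrative:

```golang
package monitoring

import (
	"context"
	"errors"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var serviceMonitorGVK = schema.GroupVersionKind{
	Group:   "monitoring.coreos.com",
	Version: "v1",
	Kind:    "ServiceMonitor",
}

// ErrServiceMonitorCRDMissing lets the caller map the failure to a Ready=False condition
// (e.g. reason ServiceMonitorCRDNotAvailable), a Warning event, and an operator log entry.
var ErrServiceMonitorCRDMissing = errors.New(
	"ServiceMonitor CRD is not installed. To resolve: (1) Install Prometheus Operator, or (2) Set spec.monitoring.enabled=false")

func createServiceMonitor(ctx context.Context, c client.Client, namespace, name string) error {
	sm := &unstructured.Unstructured{}
	sm.SetGroupVersionKind(serviceMonitorGVK)
	sm.SetNamespace(namespace)
	sm.SetName(name)
	// (endpoints/selector spec omitted for brevity)

	err := c.Create(ctx, sm)
	switch {
	case err == nil || apierrors.IsAlreadyExists(err):
		return nil
	case meta.IsNoMatchError(err) || apierrors.IsNotFound(err):
		// The API server (or the client's REST mapper) reports the kind/resource as
		// unknown: the CRD is absent, so surface a dedicated error for the status condition.
		return ErrServiceMonitorCRDMissing
	default:
		return err
	}
}
```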
Proposed solution:
1. Improve error handling
Improve error handling and messages for actions so they identify a missing ServiceMonitor CRD and produce a well-informed error message in the CR status, a Warning event, and the operator log.
2. Implement monitoring cleanup functionality
The current implementation has disabled the transition of the monitoring.enabled field false → true via validation rules. That limitation will make it harder for users to fix broken instances (unfortunately the default value is true), so we need to implement cleanup logic to delete ServiceMonitor resources when monitoring is disabled (see the sketch after this list).
3. Improve API documentation
I think the current problem is a lack of clear documentation for the monitoring.enabled field. We should document that, when enabled, it creates ServiceMonitor resources to expose metrics for the Prometheus Operator to scrape, and explain the requirements that enabling it implies.
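Reusing the GVK and imports from the earlier sketch, the cleanup described in point 2 might look roughly like this (function name assumed):

```golang
// deleteServiceMonitor removes a previously created ServiceMonitor when
// spec.monitoring.enabled flips to false; "already gone" and "CRD missing" are not errors.
func deleteServiceMonitor(ctx context.Context, c client.Client, namespace, name string) error {
	sm := &unstructured.Unstructured{}
	sm.SetGroupVersionKind(serviceMonitorGVK)
	sm.SetNamespace(namespace)
	sm.SetName(name)

	err := c.Delete(ctx, sm)
	if err == nil || apierrors.IsNotFound(err) || meta.IsNoMatchError(err) {
		return nil
	}
	return err
}
```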
Expected user experience after changes

```shell
# Scenario: User creates instance without Prometheus Operator
$ kubectl apply -f fulcio.yaml
$ kubectl get fulcio my-fulcio
NAME        READY   REASON
my-fulcio   False   ServiceMonitorCRDNotAvailable

# Clear error message in status
$ kubectl get fulcio my-fulcio -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
ServiceMonitor CRD is not installed. To resolve: (1) Install Prometheus Operator, or (2) Set spec.monitoring.enabled=false

# User can easily fix by disabling monitoring
$ kubectl patch fulcio my-fulcio --type=merge -p '{"spec":{"monitoring":{"enabled":false}}}'
# Reconciliation succeeds, ServiceMonitor deleted

# Or by installing Prometheus Operator
$ kubectl apply -f prometheus-operator.yaml
# Next reconciliation succeeds, ServiceMonitor created
```
Move to internal/action/monitoring/ to be consistent with other shared actions in the codebase. Add unit tests to verify the generic behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In response to the AI feedback
1 - I don't believe reconciliation should fail, since it's more important for TAS to be functional than to wait for the monitoring. Given this, I believe returning i.Continue is the right response, along with events reporting the current status. The events on the owned resources will inform the user, and the code will create the ServiceMonitor if/when Prometheus is installed.
2 - Agreed about the singleton pattern; this can be changed to ease testing. The internal data is driven by the watches, and is intended to reduce API usage.
3 - Creating a resource is not a replacement for watching CRDs. Creating a resource would require polling/retries, whereas the watch reacts dynamically to the creation of the CRD (a rough sketch of this follows below).
In summary, for the AI feedback I'll handle the singleton pattern and push it up again.
In response to the personal feedback, I can move to internal/action/monitoring and add the unit tests.
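A rough sketch of the watch-based approach defended in point 3: reconcile the ServiceMonitor CRD itself and flip the registry's availability flag as it comes and goes. The type and method names follow the class diagram in the Reviewer's Guide; the predicate wiring and the Reconcile body here are assumptions, not code copied from the PR.

```golang
package monitoring

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

const serviceMonitorCRDName = "servicemonitors.monitoring.coreos.com"

// CRDWatcherReconciler mirrors the struct from the class diagram; ServiceMonitorRegistry
// is the registry type introduced by this PR.
type CRDWatcherReconciler struct {
	Client   client.Client
	Scheme   *runtime.Scheme
	Registry *ServiceMonitorRegistry
}

func (r *CRDWatcherReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	crd := &apiextensionsv1.CustomResourceDefinition{}
	err := r.Client.Get(ctx, req.NamespacedName, crd)
	available := err == nil // a NotFound here means the ServiceMonitor API is gone
	r.Registry.SetAPIAvailable(available)
	if available {
		r.Registry.ReconcileAll(ctx) // apply any specs registered while the CRD was absent
	}
	return ctrl.Result{}, client.IgnoreNotFound(err)
}

func (r *CRDWatcherReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Only react to the ServiceMonitor CRD, not every CRD in the cluster.
	onlyServiceMonitorCRD := predicate.NewPredicateFuncs(func(obj client.Object) bool {
		return obj.GetName() == serviceMonitorCRDName
	})
	return ctrl.NewControllerManagedBy(mgr).
		For(&apiextensionsv1.CustomResourceDefinition{}, builder.WithPredicates(onlyServiceMonitorCRD)).
		Complete(r)
}
```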
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in response to the AI feedback for point 1: having monitoring work without the ServiceMonitor is desirable and should be a requirement. The metrics ports should still be exposed so they can be consumed by Prometheus alternatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@osmman I've changed the pattern so the registry is now injected from main, and added tests covering the new functionality. I also moved the code into the package you suggested. Let me know what you think
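For reference, the injection described here might look roughly like the following fragment in main.go; `NewServiceMonitorRegistry`, the recorder name, and `setupLog` are assumptions for illustration, not identifiers taken from the change.

```golang
// In main.go, after the manager is created: build one registry and share it between the
// CRD watcher and the per-component controllers (constructor and names are assumptions).
registry := monitoring.NewServiceMonitorRegistry(
	mgr.GetClient(),
	mgr.GetEventRecorderFor("service-monitor-registry"),
)
if err := (&monitoring.CRDWatcherReconciler{
	Client:   mgr.GetClient(),
	Scheme:   mgr.GetScheme(),
	Registry: registry,
}).SetupWithManager(mgr); err != nil {
	setupLog.Error(err, "unable to create controller", "controller", "CRDWatcher")
	os.Exit(1)
}
```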
Force-pushed from 000e511 to ecccf74
Force-pushed from ecccf74 to 6f05e7d
Commit: feat: SECURESIGN-3184: ensure ServiceMonitors are only created when supported in the cluster (Signed-off-by: Kevin Conner <[email protected]>)
Force-pushed from 6f05e7d to 4510d9e
feat: SECURESIGN-3184: ensure ServiceMonitors are only created when supported in the cluster

This PR adds a feature to automatically track ServiceMonitor support within a cluster, create status events against the owners of those resources, and reconcile the resources when updated and the API is supported.

Summary by Sourcery

Add a centralized ServiceMonitor framework by introducing a registry, CRD watcher, and generic monitoring action, refactor existing controllers to leverage the new framework, and update RBAC and main.go to support ServiceMonitor API availability.