[SREP-2432] Add executor module #618

bergmannf · 2025-11-11T15:42:38Z

What type of PR is this?

feature

What this PR does / Why we need it?

This PR sets the groundwork to split the investigation and execution of resulting action into two separate modules.
Instead of having investigations do everything, they should just report correct actions based on findings which can then be performed at a later stage.
At this later stage it could also be decided if all or only some should be done / be done in parallel.

This is a very WIP PR and as an example it's moving the CHGM investigation to the new architecture.

Special notes for your reviewer

Test Coverage

Guidelines for CAD investigations

New investgations should be accompanied by unit tests and/or step-by-step manual tests in the investigation README.
Actioning investigations should be locally tested in staging, and E2E testing is desired. See README for more info on investigation graduation process.

Test coverage checks

Added tests
Created jira card to add unit test
This PR may not need unit tests

Pre-checks (if applicable)

Ran unit tests locally
Validated the changes in a cluster
Included documentation changes with PR

openshift-ci · 2025-11-11T15:43:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bergmannf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [bergmannf]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This commit introduces a new executor module that decouples investigation logic from external system actions. Key changes: **New packages:** - pkg/investigations/types: Shared Action and ExecutionContext types - pkg/executor: Executor implementation with action builders **Architecture:** - Action interface defines external system updates (service logs, limited support, PagerDuty notes, incident management) - ExecutionContext provides strongly-typed resources for action execution - Executor executes actions with retry logic and error handling - Builder pattern for constructing actions and results **Features:** - Type-safe action builders (no interface{} types) - Sequential and concurrent action execution - Automatic retry for transient failures - Validation before execution - Support for dry-run mode **Backwards compatibility:** - InvestigationResult extended with Actions field - Legacy fields (ServiceLogSent, etc.) preserved - No breaking changes to existing investigations Related documentation: - docs/architecture/builder-pattern.md: Builder pattern guide

This commit migrates the Cluster Has Gone Missing (CHGM) investigation to use the new executor action pattern while maintaining proper error handling for infrastructure failures. **Key changes:** 1. **Action-based remediation:** - Replaced direct OCM/PagerDuty calls with declarative actions - Service logs, limited support, notes, and incident management now use executor action builders - Investigation returns actions to be executed by the executor 2. **Smart error handling:** - Added isInfrastructureError() helper to distinguish between: * Infrastructure failures (AWS/OCM API errors) → return error for retry * Investigation findings (data too old, inconclusive) → return actions - Resource building errors continue to return errors (no change) 3. **Removed dependencies:** - Deleted postStoppedInfraLimitedSupport() helper function - Removed utils.WithRetries() usage - Removed unused pagerduty package import 4. **Investigation outcomes now use actions:** - Stopped instances → Limited support + silence - Egress blocked (deadman's snitch) → Limited support + silence - Egress blocked (other URLs) → Service log + escalate - No issues found → Note + escalate - Investigation incomplete → Note + escalate **Benefits:** - Separation of investigation logic from external system updates - Automatic retry for transient failures via executor - Type-safe action builders (no interface{} types) - Declarative results easier to test and understand - Consistent error handling across investigations **Breaking change:** Tests need to be updated to verify actions instead of mocking direct OCM/PagerDuty calls. This will be addressed in a follow-up commit. Related: Phase 1 of executor module implementation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

This commit updates all 28 test cases in the CHGM investigation to verify the new executor action pattern instead of mocking direct OCM/PagerDuty calls. **Test infrastructure updates:** 1. **Added helper functions:** - hasActionType() - Generic action type checker - hasLimitedSupportAction() - Check for limited support actions - hasSilenceAction() - Check for silence incident actions - hasEscalateAction() - Check for escalate incident actions - hasServiceLogAction() - Check for service log actions - hasNoteAction() - Check for PagerDuty note actions 2. **Removed old mock expectations:** - Removed all .EXPECT().PostLimitedSupportReason() calls - Removed all .EXPECT().SilenceIncidentWithNote() calls - Removed all .EXPECT().EscalateIncidentWithNote() calls - Removed all .EXPECT().PostServiceLog() calls 3. **Updated test expectations:** - Tests expecting limited support now verify LS + Silence + Note actions - Tests expecting escalation now verify Escalate + Note actions - Infrastructure error tests verify error is returned (no actions) - Investigation finding tests verify actions are returned (no error) 4. **Fixed isInfrastructureError() helper:** - Added CloudTrail parsing errors to investigation findings: * "could not extractUserDetails" * "failed to parse CloudTrail event version" * "cannot parse a nil input" - These are data quality issues, not transient infrastructure failures - Should be reported via actions, not retried **Test results:** ✅ All 28 tests passing - 24 tests verify action-based outcomes - 2 tests verify infrastructure error returns - 2 tests verify investigation finding escalations **Benefits:** - Tests now verify behavior (what actions are taken) not implementation (which mock methods are called) - More resilient to refactoring - don't break when internal calls change - Easier to understand - test assertions match user-visible outcomes - Consistent with executor pattern - actions are the contract 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

codecov-commenter · 2025-11-12T09:44:03Z

Codecov Report

❌ Patch coverage is 16.57658% with 463 lines in your changes missing coverage. Please review.
✅ Project coverage is 32.01%. Comparing base (6cdfc0b) to head (61e430b).
⚠️ Report is 9 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/executor/executor.go	0.00%	140 Missing ⚠️
pkg/executor/action_builders.go	36.13%	75 Missing and 1 partial ⚠️
pkg/executor/actions.go	9.75%	74 Missing ⚠️
pkg/executor/result_builder.go	0.00%	39 Missing ⚠️
...rrorbudgetburn/clustermonitoringerrorbudgetburn.go	0.00%	34 Missing ⚠️
cadctl/cmd/investigate/investigate.go	0.00%	33 Missing ⚠️
pkg/executor/reporter.go	0.00%	30 Missing ⚠️
pkg/executor/errors.go	0.00%	19 Missing ⚠️
pkg/investigations/chgm/chgm.go	69.49%	17 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #618      +/-   ##
==========================================
- Coverage   35.15%   32.01%   -3.14%     
==========================================
  Files          42       49       +7     
  Lines        2728     3345     +617     
==========================================
+ Hits          959     1071     +112     
- Misses       1700     2196     +496     
- Partials       69       78       +9

Files with missing lines	Coverage Δ
pkg/investigations/investigation/investigation.go	`15.83% <ø> (-4.68%)`	⬇️
pkg/investigations/chgm/chgm.go	`64.88% <69.49%> (+3.17%)`	⬆️
pkg/executor/errors.go	`0.00% <0.00%> (ø)`
pkg/executor/reporter.go	`0.00% <0.00%> (ø)`
cadctl/cmd/investigate/investigate.go	`0.00% <0.00%> (ø)`
...rrorbudgetburn/clustermonitoringerrorbudgetburn.go	`12.71% <0.00%> (-2.00%)`	⬇️
pkg/executor/result_builder.go	`0.00% <0.00%> (ø)`
pkg/executor/actions.go	`9.75% <9.75%> (ø)`
pkg/executor/action_builders.go	`36.13% <36.13%> (ø)`
pkg/executor/executor.go	`0.00% <0.00%> (ø)`

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Extends the main investigate command to automatically execute actions returned by investigations using the executor framework. This completes the migration from direct external system calls to the action pattern. Changes: - Add executeActions() helper that creates an executor and runs all actions from an InvestigationResult - Execute CCAM actions after CCAM investigation runs - Execute main investigation actions after the alert investigation runs - Configure executor with sensible defaults: 3 retries, concurrent execution for independent actions, continue on error - Log action execution success/failure for observability This enables investigations like CHGM to return actions instead of directly calling OCM/PagerDuty APIs, improving testability and separation of concerns. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

RaphaelBut · 2025-11-18T10:33:25Z

pkg/investigations/investigation/investigation.go

+
+	// EXISTING: Legacy fields (deprecated, maintained for backwards compatibility)
 	LimitedSupportSet  InvestigationStep
 	ServiceLogPrepared InvestigationStep


thought: do we keep something equivalent to servicelog prepared?

I think we could formalize the "informing" state of an investigation pretty easily after this change.
If the investigation is marked as informing, the actions are not executed but only posted to pd?
And then we wont need the servicelog perpared anymore

All of these fields would be dropped once we migrate all investigations to the new architecture.
Metrics increases should then either be handled in the executor or in another module that consumes the same data structure as the executor.

I've however moved updating the metrics to the executor now which is way cleaner.

RaphaelBut · 2025-11-18T10:44:14Z

pkg/investigations/chgm/chgm.go

+		}
+	}
+
+	// Default: assume infrastructure error to be safe (retry)


thought: if we assume infra errors unless findingPatterns matches,I think we can remove the infraPatterns
or vice versa

on the other hand, we will need to make sure to abort the retries within a reasonable time frame to not delay the alert too much. A list of retryable errors could help with controlling this time, but comes at a higher maintenance/dev cost.

Ah that retry comment was Claude hallucinating it seems - it could be retried, but we have absolutely no retry logic for investigations at all yet.

…metrics emission This commit migrates the clustermonitoringerrorbudgetburn investigation to use the new executor action pattern and improves the architecture by moving metrics emission to the executor itself. Changes: - Executor now emits metrics after successful action execution (pkg/executor/executor.go) - Added emitMetricsForAction() to emit servicelog_sent and limitedsupport_set metrics - Metrics are only incremented when actions actually succeed, ensuring accuracy - Migrated clustermonitoringerrorbudgetburn investigation (pkg/investigations/clustermonitoringerrorbudgetburn/) - Removed direct PagerDuty and OCM client calls - Returns executor actions instead of executing operations directly - All scenarios now use action builders (ServiceLog, LimitedSupport, Note, Silence, Escalate) - Cleaned up CHGM investigation (pkg/investigations/chgm/) - Removed redundant InvestigationStep field assignments - Actions are now the single source of truth - Updated CHGM tests (pkg/investigations/chgm/chgm_test.go) - Removed obsolete InvestigationStep field checks - Tests now only verify actions are properly constructed Architecture improvement: Investigations focus on logic and return actions, while the executor handles execution, retry logic, and metrics emission. This ensures metrics accurately reflect what actually happened. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

This commit adds comprehensive documentation explaining how investigations should be implemented using the executor pattern. Key topics covered: - Executor pattern architecture and benefits - Available action types (ServiceLog, LimitedSupport, PagerDuty actions) - Action execution order and parallelization - Common investigation patterns with code examples - Error handling best practices - Automatic metrics emission - Migration guide from old pattern to new - Testing guidelines - Complete reference with DO/DON'T examples The documentation emphasizes that investigations should NOT: - Call PagerDuty or OCM clients directly - Manually track metrics via InvestigationStep fields - Handle action execution failures Instead, investigations should: - Return actions via result.Actions - Let the executor handle execution, retries, and metrics - Focus on investigation logic and analysis This provides a single source of truth for developers implementing or migrating investigations to the executor pattern. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

openshift-ci · 2025-11-21T11:49:02Z

@bergmannf: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 11, 2025

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 11, 2025

openshift-ci bot requested review from bng0y and petrkotas November 11, 2025 15:43

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 11, 2025

bergmannf and others added 3 commits November 12, 2025 10:19

bergmannf force-pushed the feature/reporter-module branch from c9297c0 to ac92395 Compare November 12, 2025 09:32

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 12, 2025

bergmannf force-pushed the feature/reporter-module branch from ac92395 to 6f9cfff Compare November 12, 2025 11:42

Move shared types out of investigation folder.

230c5cc

bergmannf changed the title ~~[WIP][SREP-2432] Add executor module~~ [SREP-2432] Add executor module Nov 18, 2025

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2025

RaphaelBut reviewed Nov 18, 2025

View reviewed changes

bergmannf and others added 3 commits November 19, 2025 11:54

Update architecture docs.

0aeb553

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SREP-2432] Add executor module #618

[SREP-2432] Add executor module #618

Uh oh!

bergmannf commented Nov 11, 2025

Uh oh!

openshift-ci bot commented Nov 11, 2025

Uh oh!

codecov-commenter commented Nov 12, 2025 •

edited

Loading

Uh oh!

RaphaelBut Nov 18, 2025

Uh oh!

bergmannf Nov 18, 2025

Uh oh!

bergmannf Nov 21, 2025

Uh oh!

RaphaelBut Nov 18, 2025

Uh oh!

bergmannf Nov 21, 2025

Uh oh!

openshift-ci bot commented Nov 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[SREP-2432] Add executor module #618

Are you sure you want to change the base?

[SREP-2432] Add executor module #618

Uh oh!

Conversation

bergmannf commented Nov 11, 2025

What type of PR is this?

What this PR does / Why we need it?

Special notes for your reviewer

Test Coverage

Guidelines for CAD investigations

Test coverage checks

Pre-checks (if applicable)

Uh oh!

openshift-ci bot commented Nov 11, 2025

Uh oh!

codecov-commenter commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

RaphaelBut Nov 18, 2025

Choose a reason for hiding this comment

Uh oh!

bergmannf Nov 18, 2025

Choose a reason for hiding this comment

Uh oh!

bergmannf Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

RaphaelBut Nov 18, 2025

Choose a reason for hiding this comment

Uh oh!

bergmannf Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Nov 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Nov 12, 2025 •

edited

Loading