Skip to content

Conversation

@bergmannf
Copy link
Contributor

What type of PR is this?

feature

What this PR does / Why we need it?

This PR sets the groundwork to split the investigation and execution of resulting action into two separate modules.
Instead of having investigations do everything, they should just report correct actions based on findings which can then be performed at a later stage.
At this later stage it could also be decided if all or only some should be done / be done in parallel.

This is a very WIP PR and as an example it's moving the CHGM investigation to the new architecture.

Special notes for your reviewer

Test Coverage

Guidelines for CAD investigations

  • New investgations should be accompanied by unit tests and/or step-by-step manual tests in the investigation README.
  • Actioning investigations should be locally tested in staging, and E2E testing is desired. See README for more info on investigation graduation process.

Test coverage checks

  • Added tests
  • Created jira card to add unit test
  • This PR may not need unit tests

Pre-checks (if applicable)

  • Ran unit tests locally
  • Validated the changes in a cluster
  • Included documentation changes with PR

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 11, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 11, 2025
@openshift-ci openshift-ci bot requested review from bng0y and petrkotas November 11, 2025 15:43
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bergmannf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 11, 2025
bergmannf and others added 3 commits November 12, 2025 10:19
This commit introduces a new executor module that decouples investigation
logic from external system actions. Key changes:

**New packages:**
- pkg/investigations/types: Shared Action and ExecutionContext types
- pkg/executor: Executor implementation with action builders

**Architecture:**
- Action interface defines external system updates (service logs, limited
  support, PagerDuty notes, incident management)
- ExecutionContext provides strongly-typed resources for action execution
- Executor executes actions with retry logic and error handling
- Builder pattern for constructing actions and results

**Features:**
- Type-safe action builders (no interface{} types)
- Sequential and concurrent action execution
- Automatic retry for transient failures
- Validation before execution
- Support for dry-run mode

**Backwards compatibility:**
- InvestigationResult extended with Actions field
- Legacy fields (ServiceLogSent, etc.) preserved
- No breaking changes to existing investigations

Related documentation:
- docs/architecture/builder-pattern.md: Builder pattern guide
This commit migrates the Cluster Has Gone Missing (CHGM) investigation to
use the new executor action pattern while maintaining proper error handling
for infrastructure failures.

**Key changes:**

1. **Action-based remediation:**
   - Replaced direct OCM/PagerDuty calls with declarative actions
   - Service logs, limited support, notes, and incident management now use
     executor action builders
   - Investigation returns actions to be executed by the executor

2. **Smart error handling:**
   - Added isInfrastructureError() helper to distinguish between:
     * Infrastructure failures (AWS/OCM API errors) → return error for retry
     * Investigation findings (data too old, inconclusive) → return actions
   - Resource building errors continue to return errors (no change)

3. **Removed dependencies:**
   - Deleted postStoppedInfraLimitedSupport() helper function
   - Removed utils.WithRetries() usage
   - Removed unused pagerduty package import

4. **Investigation outcomes now use actions:**
   - Stopped instances → Limited support + silence
   - Egress blocked (deadman's snitch) → Limited support + silence
   - Egress blocked (other URLs) → Service log + escalate
   - No issues found → Note + escalate
   - Investigation incomplete → Note + escalate

**Benefits:**
- Separation of investigation logic from external system updates
- Automatic retry for transient failures via executor
- Type-safe action builders (no interface{} types)
- Declarative results easier to test and understand
- Consistent error handling across investigations

**Breaking change:**
Tests need to be updated to verify actions instead of mocking direct
OCM/PagerDuty calls. This will be addressed in a follow-up commit.

Related: Phase 1 of executor module implementation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This commit updates all 28 test cases in the CHGM investigation to verify
the new executor action pattern instead of mocking direct OCM/PagerDuty calls.

**Test infrastructure updates:**

1. **Added helper functions:**
   - hasActionType() - Generic action type checker
   - hasLimitedSupportAction() - Check for limited support actions
   - hasSilenceAction() - Check for silence incident actions
   - hasEscalateAction() - Check for escalate incident actions
   - hasServiceLogAction() - Check for service log actions
   - hasNoteAction() - Check for PagerDuty note actions

2. **Removed old mock expectations:**
   - Removed all .EXPECT().PostLimitedSupportReason() calls
   - Removed all .EXPECT().SilenceIncidentWithNote() calls
   - Removed all .EXPECT().EscalateIncidentWithNote() calls
   - Removed all .EXPECT().PostServiceLog() calls

3. **Updated test expectations:**
   - Tests expecting limited support now verify LS + Silence + Note actions
   - Tests expecting escalation now verify Escalate + Note actions
   - Infrastructure error tests verify error is returned (no actions)
   - Investigation finding tests verify actions are returned (no error)

4. **Fixed isInfrastructureError() helper:**
   - Added CloudTrail parsing errors to investigation findings:
     * "could not extractUserDetails"
     * "failed to parse CloudTrail event version"
     * "cannot parse a nil input"
   - These are data quality issues, not transient infrastructure failures
   - Should be reported via actions, not retried

**Test results:**
✅ All 28 tests passing
- 24 tests verify action-based outcomes
- 2 tests verify infrastructure error returns
- 2 tests verify investigation finding escalations

**Benefits:**
- Tests now verify behavior (what actions are taken) not implementation
  (which mock methods are called)
- More resilient to refactoring - don't break when internal calls change
- Easier to understand - test assertions match user-visible outcomes
- Consistent with executor pattern - actions are the contract

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@bergmannf bergmannf force-pushed the feature/reporter-module branch from c9297c0 to ac92395 Compare November 12, 2025 09:32
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 12, 2025
@codecov-commenter
Copy link

codecov-commenter commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 16.57658% with 463 lines in your changes missing coverage. Please review.
✅ Project coverage is 32.01%. Comparing base (6cdfc0b) to head (61e430b).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
pkg/executor/executor.go 0.00% 140 Missing ⚠️
pkg/executor/action_builders.go 36.13% 75 Missing and 1 partial ⚠️
pkg/executor/actions.go 9.75% 74 Missing ⚠️
pkg/executor/result_builder.go 0.00% 39 Missing ⚠️
...rrorbudgetburn/clustermonitoringerrorbudgetburn.go 0.00% 34 Missing ⚠️
cadctl/cmd/investigate/investigate.go 0.00% 33 Missing ⚠️
pkg/executor/reporter.go 0.00% 30 Missing ⚠️
pkg/executor/errors.go 0.00% 19 Missing ⚠️
pkg/investigations/chgm/chgm.go 69.49% 17 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #618      +/-   ##
==========================================
- Coverage   35.15%   32.01%   -3.14%     
==========================================
  Files          42       49       +7     
  Lines        2728     3345     +617     
==========================================
+ Hits          959     1071     +112     
- Misses       1700     2196     +496     
- Partials       69       78       +9     
Files with missing lines Coverage Δ
pkg/investigations/investigation/investigation.go 15.83% <ø> (-4.68%) ⬇️
pkg/investigations/chgm/chgm.go 64.88% <69.49%> (+3.17%) ⬆️
pkg/executor/errors.go 0.00% <0.00%> (ø)
pkg/executor/reporter.go 0.00% <0.00%> (ø)
cadctl/cmd/investigate/investigate.go 0.00% <0.00%> (ø)
...rrorbudgetburn/clustermonitoringerrorbudgetburn.go 12.71% <0.00%> (-2.00%) ⬇️
pkg/executor/result_builder.go 0.00% <0.00%> (ø)
pkg/executor/actions.go 9.75% <9.75%> (ø)
pkg/executor/action_builders.go 36.13% <36.13%> (ø)
pkg/executor/executor.go 0.00% <0.00%> (ø)

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Extends the main investigate command to automatically execute actions
returned by investigations using the executor framework. This completes
the migration from direct external system calls to the action pattern.

Changes:
- Add executeActions() helper that creates an executor and runs all
  actions from an InvestigationResult
- Execute CCAM actions after CCAM investigation runs
- Execute main investigation actions after the alert investigation runs
- Configure executor with sensible defaults: 3 retries, concurrent
  execution for independent actions, continue on error
- Log action execution success/failure for observability

This enables investigations like CHGM to return actions instead of
directly calling OCM/PagerDuty APIs, improving testability and
separation of concerns.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@bergmannf bergmannf force-pushed the feature/reporter-module branch from ac92395 to 6f9cfff Compare November 12, 2025 11:42
@bergmannf bergmannf changed the title [WIP][SREP-2432] Add executor module [SREP-2432] Add executor module Nov 18, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2025

// EXISTING: Legacy fields (deprecated, maintained for backwards compatibility)
LimitedSupportSet InvestigationStep
ServiceLogPrepared InvestigationStep
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: do we keep something equivalent to servicelog prepared?

I think we could formalize the "informing" state of an investigation pretty easily after this change.
If the investigation is marked as informing, the actions are not executed but only posted to pd?
And then we wont need the servicelog perpared anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these fields would be dropped once we migrate all investigations to the new architecture.
Metrics increases should then either be handled in the executor or in another module that consumes the same data structure as the executor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've however moved updating the metrics to the executor now which is way cleaner.

}
}

// Default: assume infrastructure error to be safe (retry)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: if we assume infra errors unless findingPatterns matches,I think we can remove the infraPatterns
or vice versa

on the other hand, we will need to make sure to abort the retries within a reasonable time frame to not delay the alert too much. A list of retryable errors could help with controlling this time, but comes at a higher maintenance/dev cost.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that retry comment was Claude hallucinating it seems - it could be retried, but we have absolutely no retry logic for investigations at all yet.

bergmannf and others added 3 commits November 19, 2025 11:54
…metrics emission

This commit migrates the clustermonitoringerrorbudgetburn investigation to use
the new executor action pattern and improves the architecture by moving metrics
emission to the executor itself.

Changes:
- Executor now emits metrics after successful action execution (pkg/executor/executor.go)
  - Added emitMetricsForAction() to emit servicelog_sent and limitedsupport_set metrics
  - Metrics are only incremented when actions actually succeed, ensuring accuracy

- Migrated clustermonitoringerrorbudgetburn investigation (pkg/investigations/clustermonitoringerrorbudgetburn/)
  - Removed direct PagerDuty and OCM client calls
  - Returns executor actions instead of executing operations directly
  - All scenarios now use action builders (ServiceLog, LimitedSupport, Note, Silence, Escalate)

- Cleaned up CHGM investigation (pkg/investigations/chgm/)
  - Removed redundant InvestigationStep field assignments
  - Actions are now the single source of truth

- Updated CHGM tests (pkg/investigations/chgm/chgm_test.go)
  - Removed obsolete InvestigationStep field checks
  - Tests now only verify actions are properly constructed

Architecture improvement: Investigations focus on logic and return actions,
while the executor handles execution, retry logic, and metrics emission.
This ensures metrics accurately reflect what actually happened.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This commit adds comprehensive documentation explaining how investigations
should be implemented using the executor pattern.

Key topics covered:
- Executor pattern architecture and benefits
- Available action types (ServiceLog, LimitedSupport, PagerDuty actions)
- Action execution order and parallelization
- Common investigation patterns with code examples
- Error handling best practices
- Automatic metrics emission
- Migration guide from old pattern to new
- Testing guidelines
- Complete reference with DO/DON'T examples

The documentation emphasizes that investigations should NOT:
- Call PagerDuty or OCM clients directly
- Manually track metrics via InvestigationStep fields
- Handle action execution failures

Instead, investigations should:
- Return actions via result.Actions
- Let the executor handle execution, retries, and metrics
- Focus on investigation logic and analysis

This provides a single source of truth for developers implementing
or migrating investigations to the executor pattern.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 21, 2025

@bergmannf: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants