-
Notifications
You must be signed in to change notification settings - Fork 57
[SREP-2432] Add executor module #618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bergmannf The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This commit introduces a new executor module that decouples investigation
logic from external system actions. Key changes:
**New packages:**
- pkg/investigations/types: Shared Action and ExecutionContext types
- pkg/executor: Executor implementation with action builders
**Architecture:**
- Action interface defines external system updates (service logs, limited
support, PagerDuty notes, incident management)
- ExecutionContext provides strongly-typed resources for action execution
- Executor executes actions with retry logic and error handling
- Builder pattern for constructing actions and results
**Features:**
- Type-safe action builders (no interface{} types)
- Sequential and concurrent action execution
- Automatic retry for transient failures
- Validation before execution
- Support for dry-run mode
**Backwards compatibility:**
- InvestigationResult extended with Actions field
- Legacy fields (ServiceLogSent, etc.) preserved
- No breaking changes to existing investigations
Related documentation:
- docs/architecture/builder-pattern.md: Builder pattern guide
This commit migrates the Cluster Has Gone Missing (CHGM) investigation to
use the new executor action pattern while maintaining proper error handling
for infrastructure failures.
**Key changes:**
1. **Action-based remediation:**
- Replaced direct OCM/PagerDuty calls with declarative actions
- Service logs, limited support, notes, and incident management now use
executor action builders
- Investigation returns actions to be executed by the executor
2. **Smart error handling:**
- Added isInfrastructureError() helper to distinguish between:
* Infrastructure failures (AWS/OCM API errors) → return error for retry
* Investigation findings (data too old, inconclusive) → return actions
- Resource building errors continue to return errors (no change)
3. **Removed dependencies:**
- Deleted postStoppedInfraLimitedSupport() helper function
- Removed utils.WithRetries() usage
- Removed unused pagerduty package import
4. **Investigation outcomes now use actions:**
- Stopped instances → Limited support + silence
- Egress blocked (deadman's snitch) → Limited support + silence
- Egress blocked (other URLs) → Service log + escalate
- No issues found → Note + escalate
- Investigation incomplete → Note + escalate
**Benefits:**
- Separation of investigation logic from external system updates
- Automatic retry for transient failures via executor
- Type-safe action builders (no interface{} types)
- Declarative results easier to test and understand
- Consistent error handling across investigations
**Breaking change:**
Tests need to be updated to verify actions instead of mocking direct
OCM/PagerDuty calls. This will be addressed in a follow-up commit.
Related: Phase 1 of executor module implementation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
This commit updates all 28 test cases in the CHGM investigation to verify
the new executor action pattern instead of mocking direct OCM/PagerDuty calls.
**Test infrastructure updates:**
1. **Added helper functions:**
- hasActionType() - Generic action type checker
- hasLimitedSupportAction() - Check for limited support actions
- hasSilenceAction() - Check for silence incident actions
- hasEscalateAction() - Check for escalate incident actions
- hasServiceLogAction() - Check for service log actions
- hasNoteAction() - Check for PagerDuty note actions
2. **Removed old mock expectations:**
- Removed all .EXPECT().PostLimitedSupportReason() calls
- Removed all .EXPECT().SilenceIncidentWithNote() calls
- Removed all .EXPECT().EscalateIncidentWithNote() calls
- Removed all .EXPECT().PostServiceLog() calls
3. **Updated test expectations:**
- Tests expecting limited support now verify LS + Silence + Note actions
- Tests expecting escalation now verify Escalate + Note actions
- Infrastructure error tests verify error is returned (no actions)
- Investigation finding tests verify actions are returned (no error)
4. **Fixed isInfrastructureError() helper:**
- Added CloudTrail parsing errors to investigation findings:
* "could not extractUserDetails"
* "failed to parse CloudTrail event version"
* "cannot parse a nil input"
- These are data quality issues, not transient infrastructure failures
- Should be reported via actions, not retried
**Test results:**
✅ All 28 tests passing
- 24 tests verify action-based outcomes
- 2 tests verify infrastructure error returns
- 2 tests verify investigation finding escalations
**Benefits:**
- Tests now verify behavior (what actions are taken) not implementation
(which mock methods are called)
- More resilient to refactoring - don't break when internal calls change
- Easier to understand - test assertions match user-visible outcomes
- Consistent with executor pattern - actions are the contract
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
c9297c0 to
ac92395
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #618 +/- ##
==========================================
- Coverage 35.15% 32.01% -3.14%
==========================================
Files 42 49 +7
Lines 2728 3345 +617
==========================================
+ Hits 959 1071 +112
- Misses 1700 2196 +496
- Partials 69 78 +9
🚀 New features to boost your workflow:
|
Extends the main investigate command to automatically execute actions returned by investigations using the executor framework. This completes the migration from direct external system calls to the action pattern. Changes: - Add executeActions() helper that creates an executor and runs all actions from an InvestigationResult - Execute CCAM actions after CCAM investigation runs - Execute main investigation actions after the alert investigation runs - Configure executor with sensible defaults: 3 retries, concurrent execution for independent actions, continue on error - Log action execution success/failure for observability This enables investigations like CHGM to return actions instead of directly calling OCM/PagerDuty APIs, improving testability and separation of concerns. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
ac92395 to
6f9cfff
Compare
|
|
||
| // EXISTING: Legacy fields (deprecated, maintained for backwards compatibility) | ||
| LimitedSupportSet InvestigationStep | ||
| ServiceLogPrepared InvestigationStep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: do we keep something equivalent to servicelog prepared?
I think we could formalize the "informing" state of an investigation pretty easily after this change.
If the investigation is marked as informing, the actions are not executed but only posted to pd?
And then we wont need the servicelog perpared anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these fields would be dropped once we migrate all investigations to the new architecture.
Metrics increases should then either be handled in the executor or in another module that consumes the same data structure as the executor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've however moved updating the metrics to the executor now which is way cleaner.
| } | ||
| } | ||
|
|
||
| // Default: assume infrastructure error to be safe (retry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: if we assume infra errors unless findingPatterns matches,I think we can remove the infraPatterns
or vice versa
on the other hand, we will need to make sure to abort the retries within a reasonable time frame to not delay the alert too much. A list of retryable errors could help with controlling this time, but comes at a higher maintenance/dev cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that retry comment was Claude hallucinating it seems - it could be retried, but we have absolutely no retry logic for investigations at all yet.
…metrics emission This commit migrates the clustermonitoringerrorbudgetburn investigation to use the new executor action pattern and improves the architecture by moving metrics emission to the executor itself. Changes: - Executor now emits metrics after successful action execution (pkg/executor/executor.go) - Added emitMetricsForAction() to emit servicelog_sent and limitedsupport_set metrics - Metrics are only incremented when actions actually succeed, ensuring accuracy - Migrated clustermonitoringerrorbudgetburn investigation (pkg/investigations/clustermonitoringerrorbudgetburn/) - Removed direct PagerDuty and OCM client calls - Returns executor actions instead of executing operations directly - All scenarios now use action builders (ServiceLog, LimitedSupport, Note, Silence, Escalate) - Cleaned up CHGM investigation (pkg/investigations/chgm/) - Removed redundant InvestigationStep field assignments - Actions are now the single source of truth - Updated CHGM tests (pkg/investigations/chgm/chgm_test.go) - Removed obsolete InvestigationStep field checks - Tests now only verify actions are properly constructed Architecture improvement: Investigations focus on logic and return actions, while the executor handles execution, retry logic, and metrics emission. This ensures metrics accurately reflect what actually happened. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
This commit adds comprehensive documentation explaining how investigations should be implemented using the executor pattern. Key topics covered: - Executor pattern architecture and benefits - Available action types (ServiceLog, LimitedSupport, PagerDuty actions) - Action execution order and parallelization - Common investigation patterns with code examples - Error handling best practices - Automatic metrics emission - Migration guide from old pattern to new - Testing guidelines - Complete reference with DO/DON'T examples The documentation emphasizes that investigations should NOT: - Call PagerDuty or OCM clients directly - Manually track metrics via InvestigationStep fields - Handle action execution failures Instead, investigations should: - Return actions via result.Actions - Let the executor handle execution, retries, and metrics - Focus on investigation logic and analysis This provides a single source of truth for developers implementing or migrating investigations to the executor pattern. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
|
@bergmannf: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What type of PR is this?
feature
What this PR does / Why we need it?
This PR sets the groundwork to split the investigation and execution of resulting action into two separate modules.
Instead of having investigations do everything, they should just report correct actions based on findings which can then be performed at a later stage.
At this later stage it could also be decided if all or only some should be done / be done in parallel.
This is a very WIP PR and as an example it's moving the CHGM investigation to the new architecture.
Special notes for your reviewer
Test Coverage
Guidelines for CAD investigations
Test coverage checks
Pre-checks (if applicable)