EronWright (Contributor) commented Oct 23, 2025

Overview

This PR implements the foundational infrastructure for drift detection as outlined in #1037. Drift detection allows users to monitor when cloud resources diverge from their desired state defined in Stack CRs, with optional automatic remediation.

Changes

API/CRD Enhancements

  • ✅ Added DriftDetectionSpec to configure drift detection schedules with cron expressions
  • ✅ Added DriftDetectionStatus to track last drift check timestamp
  • ✅ Added DriftDetected condition to Stack status with Changes and NoChanges reasons
  • ✅ Added previewOnly field to Update CRD for non-destructive refresh operations
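
For orientation, the new spec types might look roughly like the sketch below. This is an assumption based on the YAML example in this PR, not the code in the diff: the type name `DriftSchedule`, the field names, and the JSON tags are hypothetical, and the helper `encodeSpec` exists only to make the sketch runnable.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DriftSchedule is a hypothetical shape for one entry under
// spec.driftDetection.schedules (field names assumed from the YAML example).
type DriftSchedule struct {
	// Cron is a five-field cron expression, e.g. "*/15 * * * *".
	Cron string `json:"cron"`
	// AutoRemediate requests an automatic `up` when drift is found.
	AutoRemediate bool `json:"autoRemediate,omitempty"`
}

// DriftDetectionSpec groups the schedules (assumed layout).
type DriftDetectionSpec struct {
	Schedules []DriftSchedule `json:"schedules,omitempty"`
}

// encodeSpec renders the spec as JSON so the field layout can be inspected.
func encodeSpec(s DriftDetectionSpec) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	spec := DriftDetectionSpec{Schedules: []DriftSchedule{{Cron: "*/15 * * * *"}}}
	out, err := encodeSpec(spec)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

With `omitempty` on `autoRemediate`, a schedule that leaves the field unset serializes without it, which matches the usual convention for optional CRD fields.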

Protocol Buffer Updates

  • ✅ Added preview_only field to RefreshRequest message
  • ✅ Regenerated proto code for agent gRPC interface

Agent Implementation

  • ✅ Updated Refresh() method to handle preview_only flag
  • ⚠️ Currently uses RunProgram(false) as a workaround
  • 📝 TODO: Should be updated to use Stack.PreviewRefresh() instead

Note on Automation API: The Pulumi Automation API already provides Stack.PreviewRefresh(), which offers exactly the functionality needed for non-destructive drift detection. The current implementation uses optrefresh.RunProgram(false) as a workaround and should be updated to call PreviewRefresh(), which:

  • Returns a PreviewResult with ChangeSummary instead of UpdateSummary
  • Doesn't modify the state file
  • Provides the same semantics as pulumi refresh --preview-only
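
The intended dispatch on the preview_only flag could be sketched as follows. The `refresher` interface and `fakeStack` are hypothetical stand-ins so the example runs without the Pulumi SDK; the real Automation API methods take a context and options and return richer result types.

```go
package main

import "fmt"

// refresher is a hypothetical stand-in for the two Automation API calls
// discussed in the note; it is not the real SDK interface.
type refresher interface {
	Refresh() (map[string]int, error)        // applies refreshed state
	PreviewRefresh() (map[string]int, error) // reports changes only
}

// runRefresh picks the non-destructive path when previewOnly is set.
func runRefresh(r refresher, previewOnly bool) (map[string]int, error) {
	if previewOnly {
		return r.PreviewRefresh()
	}
	return r.Refresh()
}

// fakeStack records which path was taken so the sketch is runnable.
type fakeStack struct{ called string }

func (f *fakeStack) Refresh() (map[string]int, error) {
	f.called = "refresh"
	return map[string]int{"same": 3}, nil
}

func (f *fakeStack) PreviewRefresh() (map[string]int, error) {
	f.called = "preview"
	return map[string]int{"same": 2, "update": 1}, nil
}

func main() {
	s := &fakeStack{}
	changes, err := runRefresh(s, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.called, changes["update"])
}
```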

Controller Logic

  • ✅ Created newDriftDetection() helper to generate drift detection Update CRs
  • ✅ Enhanced markStackSucceeded() to detect and handle drift detection results
  • ✅ Parses refresh output to determine if drift occurred
  • ✅ Sets appropriate DriftDetected condition based on results
  • ✅ Emits StackDriftDetected Kubernetes events
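
One way to derive the DriftDetected condition from refresh results is sketched below. It assumes the controller has per-operation resource counts available (as in a Pulumi change summary, where "same" means unchanged); the `condition` struct, the "True"/"False" status mapping, and the messages are hypothetical, while the Changes and NoChanges reasons come from this PR.

```go
package main

import "fmt"

// Reasons named in this PR for the DriftDetected condition.
const (
	reasonChanges   = "Changes"
	reasonNoChanges = "NoChanges"
)

// condition is a simplified, hypothetical stand-in for metav1.Condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// driftCondition counts every resource whose operation is not "same"
// and reports drift when that count is non-zero.
func driftCondition(changes map[string]int) condition {
	drifted := 0
	for op, n := range changes {
		if op != "same" {
			drifted += n
		}
	}
	if drifted > 0 {
		return condition{
			Type:    "DriftDetected",
			Status:  "True",
			Reason:  reasonChanges,
			Message: fmt.Sprintf("%d resource(s) differ from the desired state", drifted),
		}
	}
	return condition{
		Type:    "DriftDetected",
		Status:  "False",
		Reason:  reasonNoChanges,
		Message: "no drift detected",
	}
}

func main() {
	c := driftCondition(map[string]int{"same": 4, "update": 1})
	fmt.Println(c.Status, c.Reason, c.Message)
}
```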

Code Generation

  • ✅ Updated CRD manifests in deploy/crds/ and deploy/helm/
  • ✅ Regenerated deepcopy methods and apply configurations
  • ✅ Updated API documentation in docs/

Example Usage

Once scheduling is implemented, users will be able to configure drift detection like this:

apiVersion: pulumi.com/v1
kind: Stack
metadata:
  name: my-stack
spec:
  stack: org/project/stack
  projectRepo: https://github.com/example/repo
  driftDetection:
    schedules:
      - cron: "*/15 * * * *"  # Check every 15 minutes
        autoRemediate: false

Current Limitations

This PR provides foundational infrastructure. The following items are not yet implemented:

  • ⏸️ Update to use PreviewRefresh - Should replace RunProgram(false) workaround with proper Stack.PreviewRefresh() call
  • ⏸️ Cron-based scheduling logic - Drift detection Updates must currently be manually triggered
  • ⏸️ Auto-remediation workflow - When drift is detected and autoRemediate: true, should automatically create an up Update
  • ⏸️ Integration tests - Need comprehensive test coverage for drift detection scenarios
  • ⏸️ Example Stack CRs - Need examples demonstrating drift detection usage
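
The auto-remediation item above reduces to a small decision, sketched here under assumptions: `updateSpec` is a hypothetical, trimmed-down stand-in for the Update CR spec, and `remediationFor` is not a function in this PR.

```go
package main

import "fmt"

// updateSpec is a hypothetical, trimmed-down stand-in for the Update CR spec.
type updateSpec struct {
	StackName string
	Type      string // "up" for a remediation run
}

// remediationFor returns the Update to create after a drift check,
// or nil when no remediation should run.
func remediationFor(stackName string, driftDetected, autoRemediate bool) *updateSpec {
	if !driftDetected || !autoRemediate {
		return nil
	}
	return &updateSpec{StackName: stackName, Type: "up"}
}

func main() {
	if u := remediationFor("org/project/stack", true, true); u != nil {
		fmt.Printf("would create %q update for %s\n", u.Type, u.StackName)
	}
}
```

Keeping the decision in a pure function like this would make it straightforward to cover in the unit tests listed under Testing.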

Testing

  • ✅ Operator builds successfully
  • ✅ Agent builds successfully
  • ✅ Code generation completes without errors
  • ⏸️ Manual testing needed
  • ⏸️ Unit tests needed
  • ⏸️ E2E tests needed

Next Steps

Before this PR is ready for review:

  1. Update agent to use Stack.PreviewRefresh() instead of RunProgram(false)
  2. Implement cron-based scheduling in StackReconciler
  3. Implement auto-remediation workflow
  4. Add unit tests for drift detection logic
  5. Add integration/e2e tests
  6. Create example Stack CRs
  7. Update changelog
  8. Manual testing with real stacks

Related


Note: This is a draft PR to demonstrate the implementation approach. Feedback welcome on the API design and implementation direction before completing the remaining work.

🤖 Generated with Claude Code

This commit implements the foundational infrastructure for drift detection
as outlined in issue #1037. Drift detection allows users to monitor when
cloud resources diverge from their desired state defined in Stack CRs.

## Changes

### API/CRD Changes:
- Add DriftDetectionSpec to configure drift detection schedules
- Add DriftDetectionStatus to track last drift check time
- Add DriftDetected condition to Stack status
- Add previewOnly field to Update CRD for non-destructive refresh

### Protocol Buffer Changes:
- Add preview_only field to RefreshRequest message
- Regenerate proto code

### Agent Changes:
- Update Refresh() to support preview-only mode
- Use RunProgram(false) option when preview_only is requested

### Controller Changes:
- Add newDriftDetection() helper to create drift detection Updates
- Update markStackSucceeded() to handle drift detection results
- Parse refresh results to set DriftDetected condition
- Emit StackDriftDetected events when drift is found
- Update UpdateReconciler to pass preview_only flag to agent

### Code Generation:
- Update CRD manifests with new fields
- Regenerate deepcopy methods
- Update API documentation

## Limitations & Future Work

This is foundational infrastructure. Future work needed:
- [ ] Add cron-based scheduling logic (currently requires manual trigger)
- [ ] Implement auto-remediation workflow
- [ ] Wait for upstream Pulumi Automation API support for preview-only refresh
- [ ] Add comprehensive tests for drift detection scenarios
- [ ] Add example Stack CRs demonstrating drift detection usage

Related: #1037

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 0% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.24%. Comparing base (f2a3513) to head (03a57ff).
⚠️ Report is 47 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...tor/internal/controller/pulumi/stack_controller.go | 0.00% | 51 Missing and 1 partial ⚠️ |
| agent/pkg/server/server.go | 0.00% | 1 Missing and 1 partial ⚠️ |
| operator/api/pulumi/v1/events.go | 0.00% | 2 Missing ⚠️ |
| ...ator/internal/controller/auto/update_controller.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1041      +/-   ##
==========================================
+ Coverage   53.03%   53.24%   +0.20%     
==========================================
  Files          34       34              
  Lines        4646     4089     -557     
==========================================
- Hits         2464     2177     -287     
+ Misses       1987     1702     -285     
- Partials      195      210      +15     

EronWright commented Oct 23, 2025

Update on Automation API Support:

Good news! The Pulumi Automation API already has Stack.PreviewRefresh(), which provides exactly what we need for non-destructive drift detection:

// Non-destructive refresh: reports drift without modifying state.
// opts is a []optrefresh.Option assembled by the caller.
previewResult, err := stack.PreviewRefresh(ctx, opts...)

Key differences:

  • PreviewRefresh() returns a PreviewResult (with ChangeSummary) instead of a RefreshResult (with UpdateSummary)
  • The agent needs to handle both result types when converting to the protobuf response
  • PreviewRefresh() is semantically clearer and matches the CLI's --preview-only flag

This can be done in a follow-up commit before marking the PR ready for review.
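
A rough sketch of handling the two result types when building the agent response is shown below. The structs are simplified, hypothetical mirrors of the Automation API results (a change-summary map on the preview side, an optional resource-changes map on the refresh side), and `refreshResponse` stands in for the generated protobuf message, whose actual fields are not shown in this conversation.

```go
package main

import "fmt"

// previewResult and refreshResult are simplified, hypothetical mirrors of
// the two Automation API result types.
type previewResult struct {
	ChangeSummary map[string]int
}

type refreshResult struct {
	ResourceChanges *map[string]int // may be nil when no summary is available
}

// refreshResponse stands in for the protobuf response (assumed fields).
type refreshResponse struct {
	Changes     map[string]int32
	PreviewOnly bool
}

// toCounts copies per-operation counts into the narrower wire type.
func toCounts(in map[string]int) map[string]int32 {
	out := make(map[string]int32, len(in))
	for op, n := range in {
		out[op] = int32(n)
	}
	return out
}

// fromPreview converts a preview result and marks the response as preview-only.
func fromPreview(p previewResult) refreshResponse {
	return refreshResponse{Changes: toCounts(p.ChangeSummary), PreviewOnly: true}
}

// fromRefresh converts a regular refresh result, tolerating a missing summary.
func fromRefresh(r refreshResult) refreshResponse {
	if r.ResourceChanges == nil {
		return refreshResponse{Changes: map[string]int32{}}
	}
	return refreshResponse{Changes: toCounts(*r.ResourceChanges)}
}

func main() {
	resp := fromPreview(previewResult{ChangeSummary: map[string]int{"same": 2, "update": 1}})
	fmt.Println(resp.PreviewOnly, resp.Changes["update"])
}
```

Converting both paths into one response shape would let the controller's drift check stay independent of which Automation API call the agent made.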
