Fix incident auto-resolve workflows not triggering #5382
Conversation
…d flexible handling

- Add comprehensive DMARC report detection using multiple indicators (sender, subject, content-type)
- Add email type classification (DMARC, SPF, bounce, alert)
- Add configurable skip options for DMARC/SPF reports via UI
- Handle emails without body content gracefully with fallback messages
- Improve error handling and logging for better debugging
- Add email_type metadata to all alerts for better tracking

Fixes parsing errors for DMARC reports that have no body content.
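For illustration, a minimal sketch of what multi-indicator DMARC detection along these lines could look like. The helper name and the indicator lists below are assumptions for the sketch, not the Mailgun provider's actual code:

```python
# Hypothetical sketch of multi-indicator DMARC report detection;
# names and indicator lists are assumptions, not the actual provider code.
DMARC_SENDER_HINTS = ("dmarc", "noreply-dmarc", "dmarcreport")
DMARC_SUBJECT_HINTS = ("report domain:", "dmarc aggregate report")
DMARC_CONTENT_TYPES = ("application/zip", "application/gzip", "text/xml")


def _is_dmarc_report(sender: str, subject: str, content_type: str) -> bool:
    """Classify an inbound email as a DMARC report if any indicator matches."""
    sender = (sender or "").lower()
    subject = (subject or "").lower()
    content_type = (content_type or "").lower()
    return (
        any(hint in sender for hint in DMARC_SENDER_HINTS)
        or any(hint in subject for hint in DMARC_SUBJECT_HINTS)
        or any(ct in content_type for ct in DMARC_CONTENT_TYPES)
    )
```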
Backend:
- Add get_error_alerts_to_reprocess() helper function to db.py
- Add dismiss_error_alert_by_id() helper function to db.py
- Add POST /alerts/event/error/reprocess API endpoint
- Support reprocessing a single alert or all error alerts
- Auto-dismiss successfully reprocessed alerts
- Return detailed results with success/failure counts

Frontend:
- Add reprocessErrorAlerts() function to useAlerts hook
- Add reprocess buttons to AlertErrorEventModal UI
- Add handleReprocessSelected() and handleReprocessAll() handlers
- Add loading states and toast notifications
- Disable buttons during operations to prevent race conditions

This allows users to reprocess failed alert events after code fixes (e.g., DMARC detection improvements). Successfully reprocessed alerts are automatically dismissed from the error alerts list.
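A rough sketch, under stated assumptions, of the reprocess flow behind POST /alerts/event/error/reprocess. The helper names follow the commit message, but the data model, stubs, and wiring below are placeholders rather than Keep's actual db.py or API layer:

```python
# Sketch of the reprocess flow; helper names follow the commit message,
# signatures and stubs are placeholders, not the actual Keep codebase.
from dataclasses import dataclass, field


@dataclass
class ErrorAlert:
    id: str
    raw_event: dict = field(default_factory=dict)


def get_error_alerts_to_reprocess(alert_id: str | None = None) -> list[ErrorAlert]:
    """Placeholder for the db.py helper: one alert by id, or all error alerts."""
    return []


def dismiss_error_alert_by_id(alert_id: str) -> None:
    """Placeholder for the db.py helper that dismisses a reprocessed error alert."""


def handle_raw_event(raw_event: dict) -> None:
    """Placeholder for re-running normal alert ingestion on the stored payload."""


def reprocess_error_alerts(alert_id: str | None = None) -> dict:
    """Reprocess failed alert events; auto-dismiss the ones that succeed."""
    results = {"success": 0, "failed": 0}
    for alert in get_error_alerts_to_reprocess(alert_id):
        try:
            handle_raw_event(alert.raw_event)
            dismiss_error_alert_by_id(alert.id)
            results["success"] += 1
        except Exception:
            results["failed"] += 1
    return results
```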
Changed the skip_dmarc_reports and skip_spf_reports defaults from True to False. DMARC and SPF reports will now create alerts by default; users can still enable skipping via the UI configuration if desired. DMARC reports without a body will get the message "DMARC Report: {subject}" plus attachment info.
- Add _extract_severity_from_email() method for keyword-based severity detection
- Detect critical, high, warning, low, and info severity from subject/body
- Assign severity based on email type (DMARC=low, SPF/bounce=warning)
- Priority keyword matching: critical > high > warning > low > info

Examples:
- DMARC reports: low severity (informational)
- [SUCCESS] emails: low severity
- [ERROR]/[CRITICAL] emails: high/critical severity
- [WARNING] emails: warning severity

This provides better alert prioritization in the UI with appropriate visual indicators.
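A minimal sketch of the priority keyword matching this commit describes (note that the method was later removed again in a follow-up commit below). The keyword lists beyond the examples given are illustrative assumptions:

```python
# Illustrative sketch of keyword-based severity detection; keyword lists are
# assumptions beyond the examples in the commit message.
SEVERITY_KEYWORDS = [
    ("critical", ("critical", "[critical]", "emergency")),
    ("high", ("error", "[error]", "failure")),
    ("warning", ("warning", "[warning]", "degraded")),
    ("low", ("[success]", "dmarc", "recovered")),
]


def _extract_severity_from_email(subject: str, body: str) -> str:
    """Return the highest-priority severity whose keywords appear in the email."""
    text = f"{subject} {body}".lower()
    for severity, keywords in SEVERITY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return severity
    return "info"
```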
- Add _extract_status_from_email() method for keyword-based status detection
- Detect resolved, acknowledged, and firing status from subject/body
- Support status transitions via email (e.g., resolved notifications)

Status mapping:
- resolved: resolved, cleared, recovered, fixed, closed, ok now, back to normal
- acknowledged: acknowledged, ack, investigating, working on
- firing: default for new alerts

This allows email alerts to properly reflect their lifecycle status.
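A short sketch of the status mapping above (this method was also removed in a later commit below). The keyword lists come from the commit message; the implementation details are assumptions:

```python
# Illustrative sketch of the status mapping; implementation details are assumed.
RESOLVED_KEYWORDS = (
    "resolved", "cleared", "recovered", "fixed", "closed", "ok now", "back to normal",
)
ACKNOWLEDGED_KEYWORDS = ("acknowledged", "ack", "investigating", "working on")


def _extract_status_from_email(subject: str, body: str) -> str:
    """Map email wording to an alert lifecycle status, defaulting to firing."""
    text = f"{subject} {body}".lower()
    if any(keyword in text for keyword in RESOLVED_KEYWORDS):
        return "resolved"
    if any(keyword in text for keyword in ACKNOWLEDGED_KEYWORDS):
        return "acknowledged"
    return "firing"
```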
…sender

Changed the source field from the email sender address to the proper provider source format:
- Primary source: mailgun (for source facet filtering)
- Secondary source: email sender address (for detailed tracking)

This fixes the source counter in the alerts feed and allows proper filtering by source=mailgun.

Before: source = [[email protected]]
After: source = [mailgun, [email protected]]
Reverted the source field change. After review, the original behavior is correct:
- source = [email_sender] allows filtering by specific email senders
- This is the intended behavior for email-based providers
- Users can filter by source to see alerts from specific monitoring systems

The source counter showing individual email addresses is intentional. Users can use the email_type field to filter by provider (e.g., email_type=dmarc_report).
Add a database script to refresh severity and status for Mailgun alerts that were processed before the intelligent extraction logic was added.

Features:
- Updates severity using keyword-based detection
- Updates status using keyword-based detection
- Adds email_type classification if missing
- Dry-run mode by default (safe)
- Configurable time range (default: 30 days)
- Detailed reporting of changes
- Error handling for individual alerts

Usage:

# Dry run (see what would change)
python scripts/update_mailgun_alert_metadata.py --tenant-id keep

# Actually update
python scripts/update_mailgun_alert_metadata.py --tenant-id keep --apply

# Check last 7 days only
python scripts/update_mailgun_alert_metadata.py --tenant-id keep --days 7 --apply

This allows retroactive updates for alerts processed before the severity/status extraction improvements were added.
… values

Reverted back to the original hardcoded severity and status values:
- severity = info (hardcoded)
- status = firing (hardcoded)

Removed:
- _extract_severity_from_email() method
- _extract_status_from_email() method
- update_mailgun_alert_metadata.py script

This matches the original Mailgun provider behavior where all email alerts have the same severity/status regardless of content.
Update auto-generated documentation to include the new configuration fields:
- skip_dmarc_reports: Skip DMARC reports
- skip_spf_reports: Skip SPF reports
- handle_emails_without_body: Handle emails without body content

Generated using: python scripts/docs_render_provider_snippets.py
- Add workflow event trigger in resolve_incident_if_require()
- Set end_time when incident auto-resolves
- Update client via pusher on auto-resolve

Fixes issue where auto-resolved incidents didn't trigger workflows while manual resolution did.
Pull Request: Fix incident auto-resolve workflows not triggering
🐛 Problem
When incidents are automatically resolved (via the "resolve when all alerts resolve" setting), workflow triggers with type: incident and events: [updated] were not being executed. This only affected auto-resolution; manual resolution worked correctly.
🔍 Root Cause
The resolve_incident_if_require() method in keep/api/bl/incidents_bl.py (lines 432-472) was updating the incident status in the database but not calling send_workflow_event() to trigger workflows.
Compare with:
change_status() method (line 474+) - triggers workflows ✅
__postprocess_alerts_change() method (line 338) - triggers workflows ✅
resolve_incident_if_require() - missing workflow trigger ❌
✅ Solution
Added three missing operations to resolve_incident_if_require() after the incident status is updated:
Set end_time - Records when the incident was resolved (consistent with manual resolution)
Trigger workflows - Calls send_workflow_event(incident_dto, "updated") to execute incident workflows
Update clients - Calls update_client_on_incident_change() to notify UI via Pusher
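For context, a simplified sketch of the resulting flow, assuming the structure described above. The class, DTO handling, and helper signatures below are placeholders, not the actual contents of keep/api/bl/incidents_bl.py:

```python
# Simplified sketch of resolve_incident_if_require() after this PR;
# the class, data model, and helper signatures are placeholders.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Incident:
    id: str
    status: str = "firing"
    end_time: datetime | None = None


class IncidentBl:
    def resolve_incident_if_require(self, incident: Incident) -> None:
        # ... existing logic: verify all alerts are resolved and
        # persist the resolved status to the database ...
        incident.status = "resolved"

        # 1. Set end_time, consistent with manual resolution
        incident.end_time = datetime.now(tz=timezone.utc)

        # 2. Trigger workflows listening for incident "updated" events
        self.send_workflow_event(incident, "updated")

        # 3. Push the change to connected clients (UI) via Pusher
        self.update_client_on_incident_change(incident.id)

    def send_workflow_event(self, incident: Incident, action: str) -> None:
        """Placeholder for the workflow-manager call also made by change_status()."""

    def update_client_on_incident_change(self, incident_id: str) -> None:
        """Placeholder for the Pusher notification helper."""
```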
📝 Changes
File: keep/api/bl/incidents_bl.py
Lines modified: 460-474
Lines added: 13
🧪 Testing
Before fix:
Created incident from UptimeKuma alert
Enabled "Resolve incident when all alerts resolve"
Alert resolved → Incident auto-resolved → Workflows did not trigger ❌
After fix:
Same scenario → Incident auto-resolved → Workflows triggered successfully ✅
Verified logs show: "Incident auto-resolved, triggering workflows"
📊 Impact
Low risk: Only adds functionality, doesn't change existing behavior
Backward compatible: Existing workflows will now work as expected
Consistent: Makes auto-resolve behave identically to manual resolve
🔗 Related
This ensures parity between manual and automatic incident resolution, allowing users to reliably trigger notifications (Slack, PagerDuty, etc.) when incidents are resolved via automation.