Add Email Notification System for scrontab-Launched Rocoto Workflows #4458
- Add rocoto_monitor_notify.sh: wrapper script for rocotorun with failure/stall detection
- Add setup_scrontab_with_notifications.sh: helper to configure scrontab entries
- Add ROCOTO_EMAIL_NOTIFICATIONS.md: comprehensive documentation

Features:
- Email notifications for failed jobs (one per unique failure state)
- Stall detection for workflows with no progress >1 hour
- Spam prevention using lock files
- Integration with SLURM --mail-type=FAIL
- Automatic cleanup of old notification locks

Addresses requirement for scrontab error reporting via email
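The lock-file, cleanup, and stall-detection features listed in this commit message could work roughly as in the sketch below. This is an illustration only, not the contents of rocoto_monitor_notify.sh: the function names (`notify_once`, `cleanup_old_locks`, `is_stalled`), the lock directory layout, the failure-key format, and the 7-day default are all assumptions. Only the one-hour stall threshold comes from the feature list.

```shell
#!/usr/bin/env bash
# Sketch only: names, lock layout, and defaults are hypothetical and are not
# taken from rocoto_monitor_notify.sh.

# Return 0 and create a lock the first time a failure key is seen;
# return 1 if a lock for that key already exists (email already sent).
notify_once() {
    local lock_dir="$1" key="$2"
    mkdir -p "${lock_dir}"
    # Replace characters that are awkward in file names with underscores.
    local lock_file="${lock_dir}/${key//[^A-Za-z0-9_.-]/_}.lock"
    # With noclobber set, the redirect fails if the file already exists,
    # so checking for and creating the lock is a single atomic step.
    if (set -o noclobber; : > "${lock_file}") 2>/dev/null; then
        return 0
    fi
    return 1
}

# Delete lock files whose modification time is more than max_age_days old.
cleanup_old_locks() {
    local lock_dir="$1" max_age_days="${2:-7}"
    [[ -d "${lock_dir}" ]] || return 0
    find "${lock_dir}" -type f -name '*.lock' -mtime +"${max_age_days}" -delete
}

# Return 0 if progress_file exists and was last modified more than
# max_minutes ago; return 1 otherwise.
is_stalled() {
    local progress_file="$1" max_minutes="${2:-60}"
    [[ -e "${progress_file}" ]] || return 1
    [[ -n "$(find "${progress_file}" -maxdepth 0 -mmin +"${max_minutes}" -print)" ]]
}
```

A wrapper might call `notify_once "${lock_dir}" "${cycle} ${task} ${state} ${try}"` before sending each email, so that a repeated scrontab invocation that sees the same failure state stays silent, and call `is_stalled` on a file that changes whenever the workflow makes progress (which file that is would depend on the experiment setup).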
…rkflows" This reverts commit b81e3ea.
…NOAA/global-workflow into feature/scrontab
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
shellcheck
[shellcheck] reported by reviewdog 🐶
Prefer putting braces around variable references even when not strictly required. SC2250
[shellcheck] reported by reviewdog 🐶
Useless cat. Consider 'cmd < file | ..' or 'cmd file | ..' instead. SC2002
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
I repeated testing on both Ursa and Gaea and the email was set properly in both cases. Emails were sent on Gaea following a manually triggered failure. Merging.
…OAA-EMC#4458) Implements automated email notifications when jobs fail in scrontab-launched Rocoto experiments
…ails in v17 (#4574)
This is a cherry-pick of PRs #4545 and #4458 into dev/gfs.v17. This
- fixes a bug that prevents rerunning atmospheric analyses (`gdas_anal`, `gfs_anal`, or `enkfgdas_eobs`)
- moves to an improved and NCO-approved method of linking instead of moving the gsidiag files at the end of the jobs
- enables scrontab emailing on Gaea C6
Description
Implements automated email notifications when jobs fail in scrontab-launched Rocoto experiments
Resolves #3486 (Notify users when scrontab-based experiments fail)
✓ Immediate Awareness: Users are notified promptly when jobs fail
✓ Reduced Alert Fatigue: Only NEW failures trigger emails
✓ Actionable Information: Includes job details and error log paths
✓ Platform Compatible: Gaea
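The "Actionable Information" item could be realized with a small formatter like the sketch below, which prints the job details followed by the error log path in the same layout as the sample notification in this description. The function name `format_failure_message` and its argument order are hypothetical and are not taken from the PR's scripts.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and argument order are hypothetical; the
# output layout imitates the sample notification in this description.

# Print a two-line failure summary: job details, then the error log path.
format_failure_message() {
    local timestamp="$1" workflow="$2" cycle="$3" task="$4" jobid="$5"
    local state="$6" seconds="$7" try="$8" max_tries="$9" log_path="${10}"
    printf '%s :: %s :: Cycle %s, Task %s, jobid=%s, in state %s, ran for %s seconds, try=%s (of %s)\n' \
        "${timestamp}" "${workflow}" "${cycle}" "${task}" "${jobid}" \
        "${state}" "${seconds}" "${try}" "${max_tries}"
    printf 'Error log: %s\n' "${log_path}"
}
```

The timestamp could come from `date -u '+%m/%d/%y %H:%M:%S UTC'`, and the output could be piped to a mailx-style command such as `mail -s "Rocoto job failure" "${recipient}"`, assuming one is available on the platform; `recipient` is a placeholder here.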
01/20/26 14:30:15 UTC :: experiment.xml :: Cycle 2021122100, Task gfs_atmos_prod, jobid=12345, in state DEAD, ran for 3600 seconds, try=1 (of 2)
Error log: /path/to/comroot/experiment/logs/2021122100/gfs_atmos_prod.log
Type of change
Change characteristics
How has this been tested?
Checklist