Add Email Notification System for scrontab-Launched Rocoto Workflows #4458

Merged
DavidHuber-NOAA merged 99 commits into NOAA-EMC:develop from AntonMFernando-NOAA:feature/scrontab
Jan 30, 2026

AntonMFernando-NOAA commented Jan 20, 2026

Description

✓ Immediate Awareness: Users are notified promptly when jobs fail
✓ Reduced Alert Fatigue: Only NEW failures trigger emails
✓ Actionable Information: Includes job details and error log paths
✓ Platform Compatible: Gaea

  • Email Formatting: Provides actionable information:
    01/20/26 14:30:15 UTC :: experiment.xml :: Cycle 2021122100, Task gfs_atmos_prod, jobid=12345, in state DEAD, ran for 3600 seconds, try=1 (of 2) Error log: /path/to/comroot/experiment/logs/2021122100/gfs_atmos_prod.log
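
A line in this format can be split back into its fields when triaging a notification. The following is a minimal sketch, assuming the field layout shown in the example above; `parse_failure_line` is a hypothetical helper name and is not part of this PR.

```bash
#!/usr/bin/env bash
# Sketch: extract cycle, task, job ID, and state from one notification line.
# parse_failure_line is a hypothetical helper, not a function from the PR.
parse_failure_line() {
    local line="${1}"
    # Pattern follows the example line in the PR description.
    local pattern='Cycle ([0-9]+), Task ([A-Za-z0-9_]+), jobid=([0-9]+), in state ([A-Z]+)'
    if [[ "${line}" =~ ${pattern} ]]; then
        printf '%s %s %s %s\n' \
            "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" \
            "${BASH_REMATCH[3]}" "${BASH_REMATCH[4]}"
    else
        # The line does not match the assumed layout.
        return 1
    fi
}

sample='01/20/26 14:30:15 UTC :: experiment.xml :: Cycle 2021122100, Task gfs_atmos_prod, jobid=12345, in state DEAD, ran for 3600 seconds, try=1 (of 2)'
parse_failure_line "${sample}"
```

The function returns a non-zero status for lines that do not match, so a caller can skip unrelated log output.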

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this change expected to change outputs (e.g. value changes to existing outputs, new files stored in COM, files removed from COM, filename changes, additions/subtractions to archives)? NO (If YES, please indicate to which system(s))
    • GFS
    • GEFS
    • SFS
    • GCAFS
  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO (If YES, please add a link to any PRs that are pending.)
    • EMC verif-global
    • GDAS
    • GFS-utils
    • GSI
    • GSI-monitor
    • GSI-utils
    • UFS-utils
    • UFS-weather-model
    • wxflow

How has this been tested?

  • ✓ Gaea (scrontab with special mail flags)
  • ✓ Hera (standard crontab)

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

AntonMFernando-NOAA and others added 26 commits January 7, 2026 20:29
- Add rocoto_monitor_notify.sh: wrapper script for rocotorun with failure/stall detection
- Add setup_scrontab_with_notifications.sh: helper to configure scrontab entries
- Add ROCOTO_EMAIL_NOTIFICATIONS.md: comprehensive documentation

Features:
- Email notifications for failed jobs (one per unique failure state)
- Stall detection for workflows with no progress >1 hour
- Spam prevention using lock files
- Integration with SLURM --mail-type=FAIL
- Automatic cleanup of old notification locks

Addresses requirement for scrontab error reporting via email
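
The lock-file behavior described above (one email per unique failure state) could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `rocoto_monitor_notify.sh`: `should_notify` and its two arguments are hypothetical names, and the lock file is assumed to hold the failure list from the last notification.

```bash
#!/usr/bin/env bash
# Sketch of lock-file de-duplication: notify only when the failure list changes.
# should_notify and its arguments are hypothetical names for illustration.
should_notify() {
    local failed_jobs="${1}"   # newline-separated failed tasks (may be empty)
    local lockfile="${2}"      # stores the failure list last reported

    if [[ -z "${failed_jobs}" ]]; then
        # No failures: clear the lock so a later failure is reported again.
        rm -f "${lockfile}"
        return 1
    fi
    if [[ -f "${lockfile}" && "$(< "${lockfile}")" == "${failed_jobs}" ]]; then
        # Same failure state as last time: stay quiet.
        return 1
    fi
    # New or changed failure state: record it and signal that mail is due.
    echo "${failed_jobs}" > "${lockfile}"
    return 0
}

# Usage: the first call reports, the identical second call is suppressed.
tmpdir="$(mktemp -d)"
lock="${tmpdir}/notify.lock"
should_notify "2021122100 gfs_atmos_prod" "${lock}" && echo "send email"
should_notify "2021122100 gfs_atmos_prod" "${lock}" || echo "already reported"
rm -rf "${tmpdir}"
```

Comparing the whole failure list, rather than tracking a single flag, means a second task failing while the first is still dead is treated as a new state.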

github-actions bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

shellcheck

[shellcheck] reported by reviewdog 🐶
Prefer putting braces around variable references even when not strictly required. SC2250


[shellcheck] reported by reviewdog 🐶
Prefer putting braces around variable references even when not strictly required. SC2250

if [[ -n "$EMAIL" ]] && command -v mail &> /dev/null; then


[shellcheck] reported by reviewdog 🐶
Useless cat. Consider 'cmd < file | ..' or 'cmd file | ..' instead. SC2002

cat "$MSGFILE" | mail -r "${FROM_EMAIL}" -v -s "[{pslot}] Workflow Job Failures Detected" "${EMAIL}" 2>&1


[shellcheck] reported by reviewdog 🐶
Prefer putting braces around variable references even when not strictly required. SC2250

cat "$MSGFILE" | mail -r "${FROM_EMAIL}" -v -s "[{pslot}] Workflow Job Failures Detected" "${EMAIL}" 2>&1


[shellcheck] reported by reviewdog 🐶
Prefer putting braces around variable references even when not strictly required. SC2250


[shellcheck] reported by reviewdog 🐶
Prefer putting braces around variable references even when not strictly required. SC2250

echo "$FAILED_JOBS" > "$LOCKFILE"


[shellcheck] reported by reviewdog 🐶
Prefer putting braces around variable references even when not strictly required. SC2250

[[ -f "$LOCKFILE" ]] && rm -f "$LOCKFILE"
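
For reference, the flagged lines could be rewritten as in the sketch below to satisfy SC2250 (braces around every variable reference) and SC2002 (input redirection instead of `cat file |`). The function wrappers `notify_failures` and `clear_lock` are hypothetical and exist only to make the fragment self-contained; the `mail` flags and the literal `[{pslot}]` subject prefix are kept exactly as they appear in the flagged line.

```bash
#!/usr/bin/env bash
# Sketch: the flagged lines with SC2250 and SC2002 addressed.
# notify_failures and clear_lock are hypothetical wrappers; EMAIL, FROM_EMAIL,
# MSGFILE, FAILED_JOBS, and LOCKFILE are the variables named in the findings.
notify_failures() {
    if [[ -n "${EMAIL:-}" ]] && command -v mail &> /dev/null; then
        # SC2002: feed the message file to mail by redirection, not via cat.
        mail -r "${FROM_EMAIL}" -v -s "[{pslot}] Workflow Job Failures Detected" "${EMAIL}" < "${MSGFILE}" 2>&1
    fi
    # SC2250: braces around both variable references.
    echo "${FAILED_JOBS}" > "${LOCKFILE}"
}

clear_lock() {
    [[ -f "${LOCKFILE}" ]] && rm -f "${LOCKFILE}"
    # A missing lock file is not an error.
    return 0
}
```

The redirection form also avoids spawning an extra process, although the behavior of the `mail` call is otherwise unchanged.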

DavidHuber-NOAA removed the CI-Gaeac6-Passed label ("(cm) Manual CI passed on Gaea C6") on Jan 27, 2026

DavidHuber-NOAA left a comment

Looks good!

@DavidHuber-NOAA

I repeated testing on both Ursa and Gaea and the email was set properly in both cases. Emails were sent on Gaea following a manually triggered failure. Merging.

DavidHuber-NOAA merged commit 4ad46d8 into NOAA-EMC:develop Jan 30, 2026
5 checks passed
AntonMFernando-NOAA deleted the feature/scrontab branch January 30, 2026 15:34
DavidHuber-NOAA pushed a commit to DavidHuber-NOAA/global-workflow that referenced this pull request Feb 20, 2026
…OAA-EMC#4458)

Implements automated email notifications when jobs fail in
scrontab-launched Rocoto experiments
DavidHuber-NOAA added a commit that referenced this pull request Feb 26, 2026
…ails in v17 (#4574)

This is a cherry-pick of PRs #4545 and #4458 into dev/gfs.v17. This

- fixes a bug that prevents rerunning atmospheric analyses (`gdas_anal`,
`gfs_anal`, or `enkfgdas_eobs`)
- moves to an improved and NCO-approved method of linking instead of
moving the gsidiag files at the end of the jobs
- enables scrontab emailing on Gaea C6

Development

Successfully merging this pull request may close these issues.

Notify users when scrontab-based experiments fail

3 participants