This script automates scaling Jenkins build agents on AWS EC2 based on build queue length, wait times, and ongoing builds.
It is managed by a Linux cron service. The cron job is scheduled to run every minute on the Jenkins Controller instance.
Update: Slack notifications are now limited to critical system failures and rate-limited to one message per hour per error type to prevent alert fatigue.
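The one-message-per-hour-per-error-type rule can be illustrated with a small rate limiter. This is a minimal Python sketch, not the script's actual code; the class name `SlackRateLimiter`, the error-type strings, and the in-memory dictionary are assumptions. Because the script is started fresh by cron every minute, a real implementation would need to keep the timestamps on disk.

```python
import time

RATE_LIMIT_SECONDS = 3600  # one message per hour per error type


class SlackRateLimiter:
    """Remembers when each error type last produced a notification."""

    def __init__(self, window_seconds=RATE_LIMIT_SECONDS, clock=time.time):
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_sent = {}         # error type -> timestamp of last message

    def should_send(self, error_type):
        """Return True and record the send if the window has elapsed."""
        now = self.clock()
        last = self.last_sent.get(error_type)
        if last is not None and now - last < self.window_seconds:
            return False            # still inside the one-hour window
        self.last_sent[error_type] = now
        return True
```

Each error type is tracked independently, so a burst of AWS API errors would not suppress a later Jenkins API failure message.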
- Monitors Jenkins build queue length
- Triggers scaling based on queue length, starting:
  - agent 1 when queue length > 0
  - agent 2 when queue length > 3
  - agent 3 when queue length > 6
  - agent 4 when queue length > 9
  - agent 5 when queue length > 12
- Monitors how long jobs have been waiting, starting:
  - agent 1 when there's a build in the queue
  - agent 2 when wait time > 2 mins
  - agent 3 when wait time > 5 mins
  - agent 4 when wait time > 8 mins
  - agent 5 when wait time > 11 mins
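The queue-length and wait-time thresholds above can be folded into a single target agent count. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, and taking the larger of the two counts is an assumption about how the two signals are combined, which the text does not spell out.

```python
# Thresholds copied from the lists above.
QUEUE_THRESHOLDS = [0, 3, 6, 9, 12]      # agent N starts when queue length exceeds entry N-1
WAIT_THRESHOLDS_MIN = [2, 5, 8, 11]      # agents 2..5 start when wait time exceeds these


def agents_for_queue(queue_length):
    """Number of agents justified by queue length alone."""
    return sum(1 for limit in QUEUE_THRESHOLDS if queue_length > limit)


def agents_for_wait(queue_length, longest_wait_minutes):
    """Number of agents justified by the longest wait time alone."""
    if queue_length <= 0:
        return 0                         # nothing is waiting
    # Agent 1 starts as soon as anything is queued.
    return 1 + sum(1 for limit in WAIT_THRESHOLDS_MIN if longest_wait_minutes > limit)


def desired_agents(queue_length, longest_wait_minutes):
    """Assumed combination rule: whichever signal asks for more agents wins."""
    return max(agents_for_queue(queue_length),
               agents_for_wait(queue_length, longest_wait_minutes))
```

For example, a single build that has waited six minutes would ask for three agents under this sketch, even though the queue length alone would ask for one.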
- Monitors 5 EC2 instances as Jenkins agents
- Tracks agent states:
  - EC2 instance status (running/stopped)
  - Jenkins connection status (connected/disconnected)
  - Build activity (idle/busy)
- Automatically enables agents if they're temporarily offline
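The three tracked states can be represented as one record per agent. This is a hedged Python sketch: `AgentState` and `build_state` are hypothetical names, the EC2 state string is assumed to come from the instance's `State.Name` value, and the `offline`, `temporarilyOffline`, and `idle` keys are assumed to come from the Jenkins `/computer/<name>/api/json` response. Treating `offline` as "disconnected" is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentState:
    name: str
    ec2_running: bool          # EC2 instance status (running/stopped)
    jenkins_connected: bool    # Jenkins connection status
    temporarily_offline: bool  # marked offline in Jenkins but can be re-enabled
    busy: bool                 # build activity (idle/busy)

    def needs_enable(self):
        """A running instance that is only temporarily offline should be enabled."""
        return self.ec2_running and self.temporarily_offline


def build_state(name, ec2_state, computer_json):
    """Combine an EC2 state string and a Jenkins computer record into one view."""
    return AgentState(
        name=name,
        ec2_running=(ec2_state == "running"),
        jenkins_connected=not computer_json.get("offline", True),
        temporarily_offline=bool(computer_json.get("temporarilyOffline", False)),
        busy=not computer_json.get("idle", True),
    )
```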
- Checks for running builds before shutdown
- Two shutdown modes:
  - Immediate: If no active builds
  - Graceful: Waits for builds to complete
- Force shutdown after MAX_GRACEFUL_SHUTDOWN_MINUTES (60 minutes)
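The choice between the shutdown modes reduces to a small decision function. This is a sketch in Python under assumptions: the function name `shutdown_action` and its return labels are hypothetical, and treating exactly 60 elapsed minutes as the point where the shutdown is forced is an assumption about the boundary.

```python
MAX_GRACEFUL_SHUTDOWN_MINUTES = 60  # value stated above


def shutdown_action(active_builds, minutes_since_requested):
    """Pick a shutdown mode for an agent that has been selected for scale-down."""
    if active_builds == 0:
        return "immediate"   # nothing is running, stop the instance now
    if minutes_since_requested >= MAX_GRACEFUL_SHUTDOWN_MINUTES:
        return "force"       # builds did not finish within the limit
    return "graceful"        # keep waiting for builds to complete
```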
- Implements COOLDOWN_MINUTES (3 minutes) between scaling actions
- Prevents rapid start/stop cycles
- Still allows shutdown checks during cooldown
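The cooldown rule can be sketched as a gate that blocks new scaling actions while leaving shutdown checks available on every run. The names `in_cooldown` and `allowed_steps` and the step labels are hypothetical, and timestamps are assumed to be Unix seconds recorded at the last scaling action.

```python
COOLDOWN_MINUTES = 3  # value stated above


def in_cooldown(last_scaling_ts, now_ts):
    """True if the last scaling action happened less than COOLDOWN_MINUTES ago."""
    if last_scaling_ts is None:
        return False         # no scaling action has been recorded yet
    return (now_ts - last_scaling_ts) < COOLDOWN_MINUTES * 60


def allowed_steps(last_scaling_ts, now_ts):
    """Shutdown checks always run; scaling runs only outside the cooldown."""
    steps = ["shutdown_checks"]
    if not in_cooldown(last_scaling_ts, now_ts):
        steps.append("scaling")
    return steps
```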
Condition:
- Queue length > 5 builds OR
- Wait time > 2 minutes
Action:
- Starts 3 agents if stopped
- Enables agents if temporarily offline
Condition:
- Queue length > 0 but ≤ 3 OR
- Wait time < 2 minutes
Action:
- Starts one agent (if stopped)
- Keeps second agent stopped unless needed
Condition:
- Queue length = 0 AND
- No ongoing builds
Action:
- Initiates shutdown process for idle running agents
- Checks for running builds before shutdown
Condition:
- Scale down triggered but builds running
Action:
- Marks agent as offline (no new builds) if there are no builds in the queue
- Creates pending shutdown record
- Monitors build completion
- Forces shutdown after 60 minutes if builds don't complete, and sends a Slack notification
Condition:
- Agent temporarily offline after startup
Action:
- Detects offline status
- Attempts to enable agent
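The recovery step above can be sketched as a function that builds the re-enable request only when the agent is actually marked temporarily offline. This is an assumption-heavy Python illustration: it assumes the Jenkins `toggleOffline` endpoint is used, the function name `enable_request` and the example URL are hypothetical, and authentication (API token, CSRF crumb) is omitted. Since the endpoint toggles the state, the `temporarilyOffline` check is what keeps the call from taking a healthy agent offline.

```python
from urllib.parse import quote
from urllib.request import Request


def enable_request(jenkins_url, agent_name, computer_json):
    """Return a POST request that re-enables the agent, or None if no action is needed."""
    if not computer_json.get("temporarilyOffline", False):
        return None          # agent is not temporarily offline, do not toggle
    url = f"{jenkins_url.rstrip('/')}/computer/{quote(agent_name)}/toggleOffline"
    # The request is only constructed here; sending it is left to the caller.
    return Request(url, data=b"", method="POST")
```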
Slack notifications are sent for:
- Critical system error conditions
- Forced shutdowns
- Detailed logs in /var/log/jenkins-autoscale.log, covering:
  - Queue length
  - Build status
  - Agent states
  - Error conditions
Note: Fluent Bit will later be configured to ship these logs to our Elasticsearch database.
- Jenkins API connection failures
- AWS API errors
- Agent connection issues
- Invalid JSON responses
- Failed scaling actions
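Guarding against invalid JSON responses can be illustrated with a defensive parser for the build queue. This is a minimal Python sketch, not the script's own logic: the function name `parse_queue_length` is hypothetical, and the response is assumed to follow the Jenkins `/queue/api/json` shape with a top-level `items` list.

```python
import json


def parse_queue_length(raw):
    """Return the number of queued items, or None if the response is unusable."""
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None          # not JSON at all, e.g. an HTML error page
    items = payload.get("items") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return None          # valid JSON but not the expected shape
    return len(items)
```

Returning `None` instead of `0` lets the caller distinguish "queue is empty" from "queue state is unknown", which matters when an empty queue would trigger a scale-down.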
- AWS CLI configured with the right permissions
- jq for JSON parsing
- curl for API requests
- Jenkins API access
- Slack webhook for notifications