Healthcheck Module
Health checks are essential for reliable ExaBGP deployments, ensuring routes are only announced when services are actually healthy. This guide covers implementing robust health check modules for various scenarios.
Recommended Reading
Vincent Bernat's blog post: High availability with ExaBGP is an excellent real-world guide to production health check patterns and is highly recommended reading alongside this documentation.
ACK Feature Note
- ExaBGP 4.x and 5.x: ACK is enabled by default in both versions.
- Health check scripts: Most examples in this guide are simple and don't read ACK responses for brevity.
For production deployments, you have three options:
Option 1 (Simpler): Disable ACK using an environment variable. Suitable for simple health checks. Shells reject dots in variable names passed to export, so use the underscore form:

export exabgp_api_ack=false

Option 2 (ExaBGP 5.x/main, Runtime Control): Control ACK behavior dynamically via API commands:
- disable-ack - Turn off ACK responses at runtime
- enable-ack - Turn on ACK responses at runtime
- silence-ack - Suppress ACK success messages (only show errors)

Option 3 (Recommended for reliability): Read ACK responses in your health check script.
All examples work on both 4.x and 5.x (you may want to disable ACK for simpler code).
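
Reading ACK responses (Option 3) can be sketched as follows. This is a minimal sketch, assuming ExaBGP answers each API command with a line reading `done` or `error` on the script's standard input. The helper name `send_command` and its stream parameters are hypothetical; they exist only so the function can be exercised without a running ExaBGP.

```python
import io
import sys

def send_command(command, out=sys.stdout, inp=sys.stdin):
    """Write one API command, then block until it is acknowledged.

    Returns True on 'done', False on 'error' or if the input stream closes.
    """
    out.write(command + "\n")
    out.flush()  # without the flush the command may sit in a buffer
    for line in inp:
        answer = line.strip()
        if answer == "done":
            return True
        if answer == "error":
            return False
        # Any other line is skipped in this sketch.
    return False  # input closed: the parent process has most likely exited

# Offline demonstration with in-memory streams instead of real pipes
fake_out = io.StringIO()
ok = send_command("announce route 100.64.1.1/32 next-hop self",
                  out=fake_out, inp=io.StringIO("done\n"))
```

The read blocks until an answer arrives, so a production script would probably want to bound the wait as well.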
- Overview
- Built-in Healthcheck Module (Recommended)
- Basic Health Check Pattern
- Health Check Types
- Dampening and Flap Prevention
- Advanced Patterns
- Production Health Check Module
- Integration Examples
- Monitoring and Logging
- Common Pitfalls
- See Also
A health check module continuously monitors service health and controls BGP route announcements based on service state.
Key Principles:
- Rise/Fall Dampening: Require multiple consecutive passes/fails before changing state
- Timeout Handling: Health checks must have timeouts (don't hang indefinitely)
- Logging: Log all state changes for troubleshooting
- Graceful Degradation: Handle partial failures intelligently
Basic Flow:
[Health Check] → [Dampening Logic] → [BGP Announcement/Withdrawal]
      ↓                 ↓                        ↓
Service State   Rise/Fall Counters      ExaBGP Route Control
ExaBGP includes a production-ready healthcheck module that you can use without writing custom scripts.
Basic usage:
# /etc/exabgp/exabgp.conf
neighbor 192.0.2.1 {
router-id 192.0.2.2;
local-address 192.0.2.2;
local-as 65001;
peer-as 65000;
}
process watch-haproxy {
run python -m exabgp healthcheck --cmd "curl -sf http://127.0.0.1/health" --label haproxy;
}
process watch-mysql {
run python -m exabgp healthcheck --cmd "mysql -u check -e 'SELECT 1'" --label mysql;
}

What this does:
- Runs the health check command periodically (default: every 5 seconds)
- Announces IP addresses labeled lo:haproxy* when the check passes
- Raises the MED of those routes when the check fails, or withdraws them if --withdraw-on-down is set
- Handles IP address setup/teardown automatically
The built-in healthcheck module accepts options via command-line arguments or configuration file.
exabgp healthcheck --help
python -m exabgp healthcheck --help

Create /etc/exabgp/healthcheck-haproxy.conf:
debug
name = haproxy
interval = 10
fast-interval = 1
command = curl -sf http://127.0.0.1/healthcheck

Use in ExaBGP config:
process watch-haproxy {
run python -m exabgp healthcheck --config /etc/exabgp/healthcheck-haproxy.conf;
}

The --cmd option sets the health check command to execute.
# HTTP check
--cmd "curl -sf http://127.0.0.1/health"
# TCP port check
--cmd "nc -z 127.0.0.1 3306"
# Custom script
--cmd "/usr/local/bin/check-service.sh"
# MySQL check
--cmd "mysql -u check -e 'SELECT 1'"
# Multi-step check
--cmd "sh -c 'curl -sf http://127.0.0.1/health && redis-cli ping'"

Command exit codes:
- 0: Service healthy
- Non-zero: Service unhealthy

| Option | Default | Description |
|---|---|---|
| --interval N, -i N | 5 | Wait N seconds between health checks |
| --fast-interval N, -f N | 1 | Interval in seconds while a state change is about to occur |
| --timeout N, -t N | 5 | Command execution timeout in seconds |
| --rise N | 3 | Consecutive passes before considering service UP |
| --fall N | 2 | Consecutive failures before considering service DOWN |
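
The timing options combine into a rough worst-case detection time. The sketch below is only an estimate under a simplified model that is an assumption, not something taken from the ExaBGP source: the service fails right after a passing check, the remaining failing checks are spaced by the fast interval, and every failing check runs for the full timeout. The function name `worst_case_detection` is hypothetical.

```python
def worst_case_detection(interval, fast_interval, fall, timeout):
    """Rough upper bound, in seconds, on the time from failure to DOWN.

    Assumed model (hypothetical):
    - up to `interval` seconds pass before the first failing check starts;
    - the remaining (fall - 1) checks are spaced `fast_interval` apart;
    - each of the `fall` failing checks runs for the full `timeout`.
    """
    return interval + (fall - 1) * fast_interval + fall * timeout

# Defaults listed in the table above
defaults = worst_case_detection(interval=5, fast_interval=1, fall=2, timeout=5)

# A tighter configuration
faster = worst_case_detection(interval=2, fast_interval=0.5, fall=2, timeout=2)

print(defaults, faster)
```

Under this model the timeout dominates the bound, so shortening the timeout may matter more than shortening the interval.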
Example: Faster detection
--interval 2 --fast-interval 0.5 --timeout 2 --rise 2 --fall 2

| Option | Description |
|---|---|
| --disable FILE | If FILE exists, service is considered disabled |
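
How a disable file can override the check result is sketched below. This is an illustration only: the function name `effective_state` and the state strings are hypothetical, and rise/fall dampening is left out to keep the example short.

```python
import os
import tempfile

def effective_state(check_passed, disable_file):
    """Return 'DISABLED', 'UP' or 'DOWN' for one check iteration.

    The disable file takes priority over the check result, so an operator
    can drain a healthy service simply by creating the file.
    """
    if disable_file and os.path.exists(disable_file):
        return "DISABLED"
    return "UP" if check_passed else "DOWN"

# Demonstration with a temporary file standing in for the drain marker
with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "service.disabled")
    before = effective_state(True, marker)   # file absent
    open(marker, "w").close()                # equivalent of `touch`
    after = effective_state(True, marker)    # file present
```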
Use case: Manual service drain
# In ExaBGP config
--disable /var/run/exabgp-haproxy.disabled
# To drain service:
touch /var/run/exabgp-haproxy.disabled
# To re-enable:
rm /var/run/exabgp-haproxy.disabled

| Option | Description |
|---|---|
| --ip IP | Advertise this IP address or network (CIDR notation) |
| --ip-ifname IP%IFNAME | Bind IP to specific interface (e.g., 192.168.1.1%eth0) |
| --label LABEL | Announce IPs with labels matching IFNAME:LABEL* |
| --label-exact-match | Match label exactly (not as prefix) |
| --start-ip N | Index of first IP in list (default: 0) |
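
The difference between prefix matching and --label-exact-match can be illustrated with a small sketch. The function `label_matches` is hypothetical and only mirrors the behavior described in the table; it is not the module's actual implementation.

```python
def label_matches(interface_label, label, exact=False):
    """Return True if a label such as 'lo:haproxy1' is selected.

    interface_label has the form 'IFNAME:LABEL'. By default LABEL only has
    to start with `label`; with exact=True it must be equal to it.
    """
    _, sep, suffix = interface_label.partition(":")
    if not sep:
        return False  # plain interface name without a label part
    return suffix == label if exact else suffix.startswith(label)

names = ["lo:haproxy1", "lo:haproxy2", "lo:mysql", "lo"]
prefix_hits = [n for n in names if label_matches(n, "haproxy")]
exact_hits = [n for n in names if label_matches(n, "haproxy", exact=True)]
```

With exact matching, none of the numbered labels are selected, so the addresses would need to carry the label `lo:haproxy` itself.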
Examples:
Option 1: Explicit IP:
--ip 100.64.1.1/32

Option 2: Label matching (recommended):
# Announce all IPs labeled lo:haproxy*
--label haproxy
# Matches:
# lo:haproxy1 (100.64.1.1/32)
# lo:haproxy2 (100.64.1.2/32)
# lo:haproxy3 (100.64.1.3/32)

Option 3: Bind to specific interface:
--ip-ifname 100.64.1.1%lo

| Option | Description |
|---|---|
| --no-ip-setup | Don't configure missing IP addresses on interfaces |
| --dynamic-ip-setup | Delete IPs when service DOWN/disabled, restore when UP |
| --sudo | Use sudo for IP address operations |

| Option | Description |
|---|---|
| --next-hop IP, -N IP | Self IP to use as BGP next-hop |
| --local-preference P | LOCAL_PREF value for announced routes |

| Option | Default | Description |
|---|---|---|
| --up-metric M | 100 | MED when service is UP |
| --down-metric M | 1000 | MED when service is DOWN |
| --disabled-metric M | 500 | MED when service is disabled |
| --increase M | 0 | Increment MED for each additional IP |
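
The way these metric options interact can be sketched as a small function. It assumes, as the table suggests, that the increment is added once per additional IP in list order; the name `med_for` and the state strings are hypothetical.

```python
def med_for(state, index, up=100, down=1000, disabled=500, increase=0):
    """MED for the index-th advertised IP (0-based) in the given state."""
    base = {"UP": up, "DOWN": down, "DISABLED": disabled}[state]
    return base + index * increase

# Three IPs announced while UP, with --increase 1
up_meds = [med_for("UP", i, increase=1) for i in range(3)]

# First IP while the service is DOWN, default metrics
down_med = med_for("DOWN", 0)
```

Because BGP prefers the lower MED among otherwise equal paths, a server in the DOWN state keeps announcing but loses to any healthy peer.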
Example: Metric-based failover
# Primary server: low MED when healthy
--up-metric 100 --down-metric 1000
# Backup server: higher MED
--up-metric 200 --down-metric 1100

| Option | Description |
|---|---|
| --community C | Announce with standard community |
| --extended-community EC | Announce with extended community |
| --large-community LC | Announce with large community |
| --disabled-community C | Community to use when disabled |
Example:
--community 65001:100 --community 65001:200

| Option | Description |
|---|---|
| --as-path ASPATH | AS-PATH for all states |
| --up-as-path ASPATH | AS-PATH when service UP |
| --down-as-path ASPATH | AS-PATH when service DOWN |
| --disabled-as-path ASPATH | AS-PATH when service disabled |
Example: Prepend when down
--up-as-path "65001" --down-as-path "65001 65001 65001"

| Option | Description |
|---|---|
| --withdraw-on-down | Withdraw route instead of increasing MED on failure |
| --deaggregate-networks | Deaggregate networks specified in --ip |

| Option | Description |
|---|---|
| --path-id PATHID | BGP ADD-PATH path ID |
| --neighbor NEIGHBOR | Advertise only to selected neighbors |
| --debounce | Announce only on state changes (not every iteration) |
Execute commands when service state changes:
| Option | Description |
|---|---|
| --execute CMD | Execute on any state change |
| --up-execute CMD | Execute when service becomes UP |
| --down-execute CMD | Execute when service becomes DOWN |
| --disabled-execute CMD | Execute when service disabled |
Examples:
Send alert when service goes down:
--down-execute "mail -s 'Service DOWN' [email protected]"

Update monitoring system:
--up-execute "/usr/local/bin/update-monitoring UP" \
--down-execute "/usr/local/bin/update-monitoring DOWN"

Slack notification:
--down-execute "curl -X POST https://hooks.slack.com/... -d '{\"text\":\"HAProxy DOWN\"}'"

# /etc/exabgp/exabgp.conf
neighbor 192.0.2.1 {
router-id 192.0.2.2;
local-address 192.0.2.2;
local-as 65001;
peer-as 65000;
}
process watch-web {
run python -m exabgp healthcheck \
--cmd "curl -sf http://127.0.0.1:80/health" \
--label web \
--interval 5 \
--rise 3 \
--fall 2;
}

IP setup on loopback:
ip addr add 100.64.1.1/32 dev lo label lo:web1
ip addr add 100.64.1.2/32 dev lo label lo:web2

process watch-mysql {
run python -m exabgp healthcheck \
--cmd "mysql -u healthcheck -e 'SELECT 1'" \
--ip 100.64.2.1/32 \
--up-metric 100 \
--down-metric 1000 \
--rise 3 \
--fall 2 \
--community 65001:100;
}

# Primary HAProxy
process watch-haproxy-primary {
run python -m exabgp healthcheck \
--cmd "curl -sf http://127.0.0.1:8080/health" \
--ip 100.64.10.1/32 \
--up-metric 100 \
--down-metric 1000 \
--community 65001:100;
}
# Backup HAProxy (higher MED)
process watch-haproxy-backup {
run python -m exabgp healthcheck \
--cmd "curl -sf http://127.0.0.2:8080/health" \
--ip 100.64.10.2/32 \
--up-metric 200 \
--down-metric 1100 \
--community 65001:200;
}

process watch-dns {
run python -m exabgp healthcheck \
--cmd "dig @127.0.0.1 example.com +short" \
--ip 8.8.8.8/32 \
--withdraw-on-down \
--rise 2 \
--fall 2;
}

process watch-api {
run python -m exabgp healthcheck \
--cmd "/usr/local/bin/check-api.sh" \
--label api \
--disable /var/run/exabgp-api.disabled \
--down-execute "logger 'API service DOWN - route withdrawn'" \
--up-execute "logger 'API service UP - route announced'" \
--debounce;
}

/etc/exabgp/healthcheck-haproxy.conf:
# Logging
debug
syslog-facility = local0
# Naming
name = haproxy-primary
# Health check
interval = 5
fast-interval = 1
timeout = 3
rise = 3
fall = 2
command = curl -sf http://127.0.0.1:8080/health
# Advertising
label = haproxy
up-metric = 100
down-metric = 1000
community = 65001:100
withdraw-on-down
# Execution hooks
down-execute = /usr/local/bin/alert-down.sh
up-execute = /usr/local/bin/alert-up.sh

ExaBGP config:
process watch-haproxy {
run python -m exabgp healthcheck --config /etc/exabgp/healthcheck-haproxy.conf;
}

| Feature | Built-in Healthcheck | Custom Script |
|---|---|---|
| Setup | Zero code required | Write Python/Bash script |
| Rise/Fall dampening | Built-in | Must implement manually |
| IP setup | Automatic | Must implement manually |
| Logging | Syslog support | Must implement manually |
| Metrics/MED | Built-in | Must implement manually |
| Execution hooks | Built-in | Must implement manually |
| Flexibility | Limited to options | Unlimited |
| Complexity | Simple | Custom logic possible |
Recommendation:
- Use built-in healthcheck for 90% of use cases (HTTP, TCP, command-based checks)
- Use custom script only when you need complex logic (multi-step checks, weighted decisions, etc.)
#!/usr/bin/env python3
"""
Basic health check module for ExaBGP
Announces route when service is healthy, withdraws when unhealthy
"""
import sys
import time
import subprocess
import logging

# Configuration
SERVICE_IP = "100.64.1.1/32"
CHECK_INTERVAL = 5   # seconds
RISE_THRESHOLD = 3   # consecutive passes before announcing
FALL_THRESHOLD = 2   # consecutive failures before withdrawing

# Setup logging (stderr only: stdout is reserved for ExaBGP commands)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    handlers=[
        logging.FileHandler('/var/log/exabgp-healthcheck.log'),
        logging.StreamHandler(sys.stderr)
    ]
)


def check_service_health():
    """
    Check if service is healthy
    Returns True if healthy, False otherwise
    """
    try:
        # Example: HTTP health check
        result = subprocess.run(
            ['curl', '-sf', 'http://localhost/health'],
            timeout=2,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        logging.warning("Health check timed out")
        return False
    except Exception as e:
        logging.error(f"Health check error: {e}")
        return False


def announce_route():
    """Announce BGP route"""
    # next-hop self: use the local address of the session, not the peer's
    print(f"announce route {SERVICE_IP} next-hop self")
    sys.stdout.flush()
    logging.info(f"Announced route {SERVICE_IP}")


def withdraw_route():
    """Withdraw BGP route"""
    print(f"withdraw route {SERVICE_IP}")
    sys.stdout.flush()
    logging.warning(f"Withdrew route {SERVICE_IP}")


def main():
    rise_count = 0
    fall_count = 0
    announced = False
    logging.info("Health check module started")

    while True:
        healthy = check_service_health()
        if healthy:
            rise_count += 1
            fall_count = 0
            if rise_count >= RISE_THRESHOLD and not announced:
                announce_route()
                announced = True
                rise_count = 0
        else:
            fall_count += 1
            rise_count = 0
            if fall_count >= FALL_THRESHOLD and announced:
                withdraw_route()
                announced = False
                fall_count = 0
        time.sleep(CHECK_INTERVAL)


if __name__ == '__main__':
    main()

# /etc/exabgp/healthcheck.conf
neighbor 192.0.2.1 {
router-id 192.0.2.2;
local-address 192.0.2.2;
local-as 65001;
peer-as 65000;
family {
ipv4 unicast;
}
api {
processes [ healthcheck ];
}
}
process healthcheck {
run /usr/local/bin/exabgp-healthcheck.py;
encoder text;
}

Use Case: Web servers, APIs, load balancers
import requests


def http_health_check(url, timeout=2):
    """
    Check HTTP endpoint
    Returns True if the status code is 200
    """
    try:
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False


# With content verification
def http_health_check_advanced(url, expected_text="OK", timeout=2):
    """Check HTTP endpoint with content verification"""
    try:
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200 and expected_text in response.text
    except requests.exceptions.RequestException:
        return False


# Example usage
healthy = http_health_check("http://localhost:8080/health")
healthy = http_health_check_advanced("https://localhost/status", expected_text='"status":"up"')

Use Case: Databases, message queues, generic TCP services
import socket


def tcp_port_check(host, port, timeout=2):
    """
    Check if TCP port is open and accepting connections
    """
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        result = sock.connect_ex((host, port))
        sock.close()
        return result == 0
    except Exception:
        return False


# Example usage
healthy = tcp_port_check("localhost", 3306)  # MySQL
healthy = tcp_port_check("localhost", 5432)  # PostgreSQL
healthy = tcp_port_check("localhost", 6379)  # Redis

Use Case: Network reachability, simple aliveness
import subprocess


def ping_check(host, count=1, timeout=2):
    """
    Ping host and return True if reachable
    """
    try:
        result = subprocess.run(
            ['ping', '-c', str(count), '-W', str(timeout), host],
            timeout=timeout + 1,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    except Exception:
        return False


# Example usage
healthy = ping_check("192.168.1.1")

Use Case: Custom check scripts, database queries, file checks
import subprocess


def command_check(command, timeout=5):
    """
    Execute command and return True if exit code is 0
    """
    try:
        result = subprocess.run(
            command,
            # A string is run through the shell, a list is executed directly
            shell=isinstance(command, str),
            timeout=timeout,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    except Exception:
        return False


# Examples
healthy = command_check("systemctl is-active nginx")
healthy = command_check(["mysql", "-e", "SELECT 1"])
healthy = command_check("test -f /var/run/myapp.pid")

Use Case: Multiple services must all be healthy
def multi_service_check():
    """
    Check multiple services - all must be healthy
    """
    checks = {
        'nginx': lambda: http_health_check("http://localhost:80"),
        'redis': lambda: tcp_port_check("localhost", 6379),
        'app': lambda: http_health_check("http://localhost:8080/health"),
    }
    results = {}
    for name, check_func in checks.items():
        results[name] = check_func()
        if not results[name]:
            logging.warning(f"Service {name} is unhealthy")
    all_healthy = all(results.values())
    logging.info(f"Health check results: {results}, all healthy: {all_healthy}")
    return all_healthy

Problem: Transient failures cause route flapping.
Solution: Require multiple consecutive passes/fails.
class HealthCheckDampener:
    """Dampening logic for health checks"""

    def __init__(self, rise_threshold=3, fall_threshold=2):
        self.rise_threshold = rise_threshold
        self.fall_threshold = fall_threshold
        self.rise_count = 0
        self.fall_count = 0
        self.state = 'down'  # Current state: 'up' or 'down'

    def update(self, healthy):
        """
        Update health state based on check result
        Returns True if state changed
        """
        previous_state = self.state
        if healthy:
            self.rise_count += 1
            self.fall_count = 0
            if self.rise_count >= self.rise_threshold:
                self.state = 'up'
                self.rise_count = 0
        else:
            self.fall_count += 1
            self.rise_count = 0
            if self.fall_count >= self.fall_threshold:
                self.state = 'down'
                self.fall_count = 0
        return self.state != previous_state

    def is_up(self):
        """Return True if state is 'up'"""
        return self.state == 'up'


# Usage
dampener = HealthCheckDampener(rise_threshold=3, fall_threshold=2)
while True:
    healthy = check_service_health()
    state_changed = dampener.update(healthy)
    if state_changed:
        if dampener.is_up():
            announce_route()
        else:
            withdraw_route()
    time.sleep(5)

Use different thresholds for bringing route up vs down:
RISE_THRESHOLD = 3 # Require 3 passes to announce (cautious)
FALL_THRESHOLD = 2 # Only 2 failures to withdraw (fast failover)

Rationale:
- Higher rise threshold: Avoid announcing prematurely after restart
- Lower fall threshold: Fail fast when service actually dies
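
The effect of asymmetric thresholds is easiest to see by replaying a sequence of check results. The sketch below is a standalone illustration; the function `simulate` and the pass/fail string notation are hypothetical and use the same counting rules as the dampening logic described in this section.

```python
def simulate(results, rise=3, fall=2):
    """Replay check results and return the announced state after each one."""
    state, passes, fails = "down", 0, 0
    history = []
    for ok in results:
        if ok:
            passes, fails = passes + 1, 0
            if passes >= rise:
                state = "up"
        else:
            fails, passes = fails + 1, 0
            if fails >= fall:
                state = "down"
        history.append(state)
    return history

# P = pass, F = fail: recovery, one transient failure, then a real outage
checks = [c == "P" for c in "PPPFPPFF"]
timeline = simulate(checks)
print(timeline)
# ['down', 'down', 'up', 'up', 'up', 'up', 'up', 'down']
```

The single failure in fourth position does not withdraw the route, while two failures in a row at the end do.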
Use Case: Different checks have different importance.
def weighted_health_check():
    """
    Weighted health checks - return True if the score reaches the threshold
    """
    checks = {
        'critical': {
            'app_health': {'weight': 10, 'check': lambda: http_health_check("http://localhost:8080/health")},
            'database': {'weight': 10, 'check': lambda: tcp_port_check("localhost", 5432)},
        },
        'important': {
            'cache': {'weight': 5, 'check': lambda: tcp_port_check("localhost", 6379)},
        },
        'optional': {
            'monitoring': {'weight': 1, 'check': lambda: tcp_port_check("localhost", 9090)},
        }
    }
    total_score = 0
    max_score = 0
    for category, items in checks.items():
        for name, config in items.items():
            max_score += config['weight']
            if config['check']():
                total_score += config['weight']
            else:
                logging.warning(f"Check {name} ({category}) failed")
    health_percentage = (total_score / max_score) * 100 if max_score > 0 else 0
    healthy = health_percentage >= 80  # Require 80% score
    logging.info(f"Health score: {total_score}/{max_score} ({health_percentage:.1f}%)")
    return healthy

Use Case: Service A depends on Service B.
def dependency_check():
    """
    Check dependencies in order - fail fast if dependency fails
    """
    # Check critical dependencies first
    if not tcp_port_check("localhost", 5432):  # Database
        logging.error("Database down - service cannot function")
        return False
    if not tcp_port_check("localhost", 6379):  # Cache
        logging.error("Cache down - service cannot function")
        return False
    # Only check app if dependencies are up
    if not http_health_check("http://localhost:8080/health"):
        logging.error("App health check failed")
        return False
    return True

Use Case: Announce with higher MED when degraded (not fully healthy).
def graceful_degradation_check():
    """
    Return health status with degradation level
    Returns: (status, med_value) where status is 'healthy', 'degraded' or 'down'
    """
    # Check critical services
    app_ok = http_health_check("http://localhost:8080/health")
    db_ok = tcp_port_check("localhost", 5432)
    # Check optional services
    cache_ok = tcp_port_check("localhost", 6379)
    if app_ok and db_ok and cache_ok:
        return ('healthy', 100)   # MED 100 - fully healthy
    elif app_ok and db_ok:
        return ('degraded', 150)  # MED 150 - degraded (no cache)
    else:
        return ('down', None)     # Completely down


# Usage
while True:
    status, med = graceful_degradation_check()
    if status == 'healthy':
        print(f"announce route {SERVICE_IP} next-hop self med {med}")
        sys.stdout.flush()
    elif status == 'degraded':
        print(f"announce route {SERVICE_IP} next-hop self med {med}")
        sys.stdout.flush()
        logging.warning("Service degraded - announcing with higher MED")
    elif status == 'down':
        print(f"withdraw route {SERVICE_IP}")
        sys.stdout.flush()
    time.sleep(10)

Complete production-ready health check module with all features:
#!/usr/bin/env python3
"""
Production Health Check Module for ExaBGP
Features:
- Multiple check types (HTTP, TCP, command)
- Rise/fall dampening
- Weighted checks
- Graceful degradation with MED
- Comprehensive logging
- Signal handling
"""
import sys
import time
import signal
import logging
import subprocess
import socket
from typing import Optional, Tuple

# Configuration
CONFIG = {
    'service_ip': '100.64.1.1/32',
    'check_interval': 5,
    'rise_threshold': 3,
    'fall_threshold': 2,
    'log_file': '/var/log/exabgp-healthcheck.log',
}

# Health checks configuration
CHECKS = {
    'app_http': {
        'type': 'http',
        'url': 'http://localhost:8080/health',
        'weight': 10,
        'timeout': 2,
    },
    'database': {
        'type': 'tcp',
        'host': 'localhost',
        'port': 5432,
        'weight': 10,
        'timeout': 2,
    },
    'cache': {
        'type': 'tcp',
        'host': 'localhost',
        'port': 6379,
        'weight': 5,
        'timeout': 2,
    },
}

# Setup logging (stderr only: stdout is reserved for ExaBGP commands)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    handlers=[
        logging.FileHandler(CONFIG['log_file']),
        logging.StreamHandler(sys.stderr)
    ]
)

# Global shutdown flag
shutdown_flag = False


def signal_handler(signum, frame):
    """Handle shutdown signals gracefully"""
    global shutdown_flag
    logging.info(f"Received signal {signum}, shutting down gracefully")
    shutdown_flag = True


signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)


def http_check(url: str, timeout: int = 2) -> bool:
    """HTTP health check"""
    try:
        import requests
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except Exception as e:
        logging.debug(f"HTTP check failed for {url}: {e}")
        return False


def tcp_check(host: str, port: int, timeout: int = 2) -> bool:
    """TCP port check"""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        result = sock.connect_ex((host, port))
        sock.close()
        return result == 0
    except Exception as e:
        logging.debug(f"TCP check failed for {host}:{port}: {e}")
        return False


def command_check(command: str, timeout: int = 5) -> bool:
    """Command execution check"""
    try:
        result = subprocess.run(
            command,
            shell=True,
            timeout=timeout,
            capture_output=True
        )
        return result.returncode == 0
    except Exception as e:
        logging.debug(f"Command check failed for '{command}': {e}")
        return False


def run_checks() -> Tuple[bool, Optional[int]]:
    """
    Run all configured health checks
    Returns: (healthy, med) where med is None when the route should be withdrawn
    """
    total_weight = sum(check['weight'] for check in CHECKS.values())
    current_weight = 0
    for name, config in CHECKS.items():
        check_type = config['type']
        passed = False
        if check_type == 'http':
            passed = http_check(config['url'], config.get('timeout', 2))
        elif check_type == 'tcp':
            passed = tcp_check(config['host'], config['port'], config.get('timeout', 2))
        elif check_type == 'command':
            passed = command_check(config['command'], config.get('timeout', 5))
        if passed:
            current_weight += config['weight']
        else:
            logging.warning(f"Check '{name}' failed")
    health_percentage = (current_weight / total_weight) * 100 if total_weight > 0 else 0

    # Determine health status and MED
    if health_percentage >= 90:
        # Fully healthy
        return (True, 100)
    elif health_percentage >= 70:
        # Degraded but functional
        return (True, 150)
    else:
        # Too degraded, withdraw
        return (False, None)


class HealthState:
    """Track health state with dampening"""

    def __init__(self, rise_threshold: int, fall_threshold: int):
        self.rise_threshold = rise_threshold
        self.fall_threshold = fall_threshold
        self.rise_count = 0
        self.fall_count = 0
        self.announced = False
        self.current_med = None

    def update(self, healthy: bool, med: Optional[int]) -> bool:
        """
        Update state based on check result
        Returns True if announcement state should change
        """
        if healthy:
            self.rise_count += 1
            self.fall_count = 0
            if self.rise_count >= self.rise_threshold or self.announced:
                # Announce or update MED
                should_update = not self.announced or self.current_med != med
                self.announced = True
                self.current_med = med
                self.rise_count = 0
                return should_update
        else:
            self.fall_count += 1
            self.rise_count = 0
            if self.fall_count >= self.fall_threshold and self.announced:
                # Withdraw
                self.announced = False
                self.current_med = None
                self.fall_count = 0
                return True
        return False


def announce_route(med: int):
    """Announce BGP route with MED"""
    cmd = f"announce route {CONFIG['service_ip']} next-hop self med {med}"
    print(cmd)
    sys.stdout.flush()
    logging.info(f"Announced route with MED {med}")


def withdraw_route():
    """Withdraw BGP route"""
    cmd = f"withdraw route {CONFIG['service_ip']} next-hop self"
    print(cmd)
    sys.stdout.flush()
    logging.warning("Withdrew route")


def main():
    """Main health check loop"""
    logging.info("Production health check module started")
    state = HealthState(CONFIG['rise_threshold'], CONFIG['fall_threshold'])

    while not shutdown_flag:
        healthy, med = run_checks()
        should_update = state.update(healthy, med)
        if should_update:
            if state.announced:
                announce_route(state.current_med)
            else:
                withdraw_route()
        time.sleep(CONFIG['check_interval'])

    # Graceful shutdown - withdraw route
    if state.announced:
        logging.info("Shutting down - withdrawing route")
        withdraw_route()
    logging.info("Health check module stopped")


if __name__ == '__main__':
    main()

Monitor HAProxy backend health:
import requests


def haproxy_backend_check(stats_url, backend_name, timeout=2):
    """Check if HAProxy backend has at least one UP server"""
    try:
        # Always bound the request: a hung stats page must not stall the loop
        response = requests.get(f"{stats_url};csv", timeout=timeout)
        lines = response.text.split('\n')
        for line in lines:
            if backend_name in line and ',UP,' in line:
                return True
        return False
    except requests.exceptions.RequestException:
        return False


# Usage
healthy = haproxy_backend_check("http://localhost:8404/stats", "webservers")

Check pod readiness:
import subprocess
import json


def kubernetes_pod_ready(namespace, app_label):
    """Check if at least one pod with app label is ready"""
    try:
        result = subprocess.run(
            ['kubectl', 'get', 'pods', '-n', namespace,
             '-l', f'app={app_label}', '-o', 'json'],
            timeout=5,
            capture_output=True
        )
        if result.returncode != 0:
            return False
        pods = json.loads(result.stdout)
        for pod in pods.get('items', []):
            conditions = pod.get('status', {}).get('conditions', [])
            for condition in conditions:
                if condition['type'] == 'Ready' and condition['status'] == 'True':
                    return True
        return False
    except Exception:
        # Covers timeouts, a missing kubectl binary and malformed JSON
        return False


# Usage
healthy = kubernetes_pod_ready("default", "myapp")

Export health check metrics for Prometheus:
from prometheus_client import Gauge, Counter, start_http_server
# Metrics
health_status = Gauge('exabgp_health_status', 'Current health status (1=up, 0=down)')
check_duration = Gauge('exabgp_check_duration_seconds', 'Health check duration')
state_changes = Counter('exabgp_state_changes_total', 'Total state changes', ['from_state', 'to_state'])
# Start metrics server
start_http_server(9100)
# Update metrics
health_status.set(1 if healthy else 0)
check_duration.set(duration)
state_changes.labels(from_state='down', to_state='up').inc()

Use structured logging for better analysis:
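import json
import logging


class JSONFormatter(logging.Formatter):
    """Render each log record as one JSON object per line"""

    # Names of the `extra` fields copied into the JSON output
    EXTRA_FIELDS = ('check', 'url')

    def format(self, record):
        log_obj = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'message': record.getMessage(),
        }
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                log_obj[field] = getattr(record, field)
        return json.dumps(log_obj)


handler = logging.FileHandler('/var/log/exabgp-healthcheck.json')
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
# The root logger defaults to WARNING, which would drop the info() call below
logging.getLogger().setLevel(logging.INFO)
logging.info("Health check passed", extra={'check': 'http', 'url': 'http://localhost:8080'})

- No timeout on checks: Always set timeouts (2-5 seconds typical)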
- No dampening: Causes route flapping on transient failures
- Blocking checks: Use subprocess.run with timeout, not os.system
- Forgot sys.stdout.flush(): Commands buffer and don't reach ExaBGP
- No logging: Impossible to troubleshoot when things break
- Checking too frequently: Every 5-10 seconds is usually sufficient
- Not handling shutdown gracefully: Routes not withdrawn on stop
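
Two of these pitfalls, the missing flush and the ungraceful stop, can be handled in one small helper. This is a minimal sketch: the function name `send` and its `stream` parameter are hypothetical and are there so the helper can be tried without ExaBGP attached.

```python
import io
import sys

def send(command, stream=None):
    """Write one command for ExaBGP and flush immediately.

    Returns False if the pipe is closed (the reader has exited), so the
    caller can leave its loop instead of crashing with a traceback.
    """
    stream = stream if stream is not None else sys.stdout
    try:
        stream.write(command + "\n")
        stream.flush()
        return True
    except (BrokenPipeError, ValueError):
        # BrokenPipeError: reader went away; ValueError: stream already closed
        return False

# Normal case: the command reaches the stream, newline included
buffer = io.StringIO()
sent = send("withdraw route 100.64.1.1/32", stream=buffer)

# Failure case: writing to a closed stream is reported, not raised
closed = io.StringIO()
closed.close()
failed = send("withdraw route 100.64.1.1/32", stream=closed)
```

A main loop can then stop as soon as `send` returns False rather than continuing to run checks whose results nobody reads.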
- Service High Availability - HA patterns
- Anycast Management - Anycast with health checks
- Production Best Practices - Production deployment
- Common Pitfalls - Common mistakes to avoid
Ghost written by Claude (Anthropic AI)