Healthcheck Module

Thomas Mangin edited this page Nov 15, 2025 · 8 revisions

Health Check Module

Health checks are essential for reliable ExaBGP deployments, ensuring routes are only announced when services are actually healthy. This guide covers implementing robust health check modules for various scenarios.

📚 Recommended Reading

Vincent Bernat's blog post "High availability with ExaBGP" is an excellent real-world guide to production health check patterns and is recommended reading alongside this documentation.

ℹ️ ACK Feature Note

  • ExaBGP 4.x and 5.x: ACK is enabled by default in both versions.
  • Health check scripts: For brevity, most examples in this guide do not read ACK responses.

For production deployments, you have three options:

  1. Option 1 (Simpler): Disable ACK using an environment variable; suitable for simple health checks

    # POSIX shells reject dots in variable names, so export the underscore form
    export exabgp_api_ack=false
  2. Option 2 (ExaBGP 5.x/main - Runtime Control): Control ACK behavior dynamically via API commands:

    • disable-ack - Turn off ACK responses at runtime
    • enable-ack - Turn on ACK responses at runtime
    • silence-ack - Suppress ACK success messages (only show errors)

    See ACK Runtime Control

  3. Option 3 (Recommended for reliability): Read ACK responses in your health check script

    See ACK Feature Guide

All examples work on both 4.x and 5.x (you may want to disable ACK for simpler code).
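
Option 3 can be sketched as a small helper that writes a command and then waits for ExaBGP's reply on standard input. This is a minimal sketch, not ExaBGP code: `send_command` is a hypothetical name, and it assumes ExaBGP answers each command with a single line, `done` on success or `error` on failure. Check the ACK Feature Guide for the exact responses in your version.

```python
import sys


def send_command(command, out=sys.stdout, inp=sys.stdin):
    """Send one API command and return True if ExaBGP acknowledged it.

    Assumption: each command is answered with one line, "done" or "error".
    """
    out.write(command + "\n")
    out.flush()                      # without this the command may sit in a buffer
    reply = inp.readline().strip()   # blocks until a reply line (or EOF) arrives
    return reply == "done"
```

Because `readline()` blocks, this helper only makes sense while ACK is enabled; with ACK disabled the script would wait forever for a reply that never comes.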

Overview

A health check module continuously monitors service health and controls BGP route announcements based on service state.

Key Principles:

  1. Rise/Fall Dampening: Require multiple consecutive passes/fails before changing state
  2. Timeout Handling: Health checks must have timeouts (don't hang indefinitely)
  3. Logging: Log all state changes for troubleshooting
  4. Graceful Degradation: Handle partial failures intelligently

Basic Flow:

[Health Check] → [Dampening Logic] → [BGP Announcement/Withdrawal]
     ↓                   ↓                      ↓
  Service State    Rise/Fall Counters    ExaBGP Route Control

Built-in Healthcheck Module

⭐ ExaBGP includes a production-ready healthcheck module that you can use without writing custom scripts.

Quick Start (Built-in Healthcheck)

Basic usage:

# /etc/exabgp/exabgp.conf
neighbor 192.0.2.1 {
    router-id 192.0.2.2;
    local-address 192.0.2.2;
    local-as 65001;
    peer-as 65000;
}

process watch-haproxy {
    run python -m exabgp healthcheck --cmd "curl -sf http://127.0.0.1/health" --label haproxy;
}

process watch-mysql {
    run python -m exabgp healthcheck --cmd "mysql -u check -e 'SELECT 1'" --label mysql;
}

What this does:

  1. Runs health check command periodically (default: every 5 seconds)
  2. Announces IP addresses labeled lo:haproxy* when check passes
  3. Re-announces them with a higher MED (--down-metric) when the check fails; add --withdraw-on-down to withdraw the routes instead
  4. Handles IP address setup/teardown automatically

Configuration Options

The built-in healthcheck module accepts options via command-line arguments or configuration file.

Command-line Usage

exabgp healthcheck --help
python -m exabgp healthcheck --help

Configuration File

Create /etc/exabgp/healthcheck-haproxy.conf:

debug
name = haproxy
interval = 10
fast-interval = 1
command = curl -sf http://127.0.0.1/healthcheck

Use in ExaBGP config:

process watch-haproxy {
    run python -m exabgp healthcheck --config /etc/exabgp/healthcheck-haproxy.conf;
}

Health Check Commands

--command, --cmd, -c CMD

Health check command to execute.

# HTTP check
--cmd "curl -sf http://127.0.0.1/health"

# TCP port check
--cmd "nc -z 127.0.0.1 3306"

# Custom script
--cmd "/usr/local/bin/check-service.sh"

# MySQL check
--cmd "mysql -u check -e 'SELECT 1'"

# Multi-step check
--cmd "sh -c 'curl -sf http://127.0.0.1/health && redis-cli ping'"

Command exit codes:

  • 0: Service healthy
  • Non-zero: Service unhealthy
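
Any executable that follows this exit-code convention can be passed to --cmd. As a minimal sketch, a Python check might test whether the service has recently touched a heartbeat file. The path `/var/run/myapp.heartbeat`, the 30-second limit, and the function names are illustrative assumptions, not part of ExaBGP.

```python
import os
import sys
import time


def heartbeat_fresh(path, max_age=30.0):
    """Return True if `path` exists and was modified within `max_age` seconds."""
    try:
        return (time.time() - os.path.getmtime(path)) <= max_age
    except OSError:          # missing file, permission problem, ...
        return False


def main(path="/var/run/myapp.heartbeat"):
    """Return the exit code the healthcheck expects: 0 healthy, 1 unhealthy."""
    return 0 if heartbeat_fresh(path) else 1


# When saved as an executable script, end the file with:
#     sys.exit(main())
```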

Timing Options

| Option | Default | Description |
| --- | --- | --- |
| --interval N, -i N | 5 | Wait N seconds between health checks |
| --fast-interval N, -f N | 1 | Interval when state change is about to occur |
| --timeout N, -t N | 5 | Command execution timeout |
| --rise N | 3 | Consecutive passes before considering service UP |
| --fall N | 2 | Consecutive failures before considering service DOWN |

Example: Faster detection

--interval 2 --fast-interval 0.5 --timeout 2 --rise 2 --fall 2

Disable File

| Option | Description |
| --- | --- |
| --disable FILE | If FILE exists, service is considered disabled |

Use case: Manual service drain

# In ExaBGP config
--disable /var/run/exabgp-haproxy.disabled

# To drain service:
touch /var/run/exabgp-haproxy.disabled

# To re-enable:
rm /var/run/exabgp-haproxy.disabled
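
A custom script can honor the same convention. The sketch below is an illustration rather than the built-in module's code; `effective_state` is a hypothetical helper that lets the disable file override the health check result.

```python
import os


def effective_state(healthy, disable_file):
    """Map a check result to 'disabled', 'up' or 'down'.

    The disable file wins over the health check, so an operator can drain
    a service that is still passing its checks.
    """
    if disable_file and os.path.exists(disable_file):
        return "disabled"
    return "up" if healthy else "down"
```

Checking the file on every iteration means `touch` and `rm` take effect at the next health check without restarting anything.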

Advertising Options

IP Address Selection

| Option | Description |
| --- | --- |
| --ip IP | Advertise this IP address or network (CIDR notation) |
| --ip-ifname IP%IFNAME | Bind IP to specific interface (e.g., 192.168.1.1%eth0) |
| --label LABEL | Announce IPs with labels matching IFNAME:LABEL* |
| --label-exact-match | Match label exactly (not as prefix) |
| --start-ip N | Index of first IP in list (default: 0) |

Examples:

Option 1: Explicit IP:

--ip 100.64.1.1/32

Option 2: Label matching (recommended):

# Announce all IPs labeled lo:haproxy*
--label haproxy

# Matches:
#   lo:haproxy1 (100.64.1.1/32)
#   lo:haproxy2 (100.64.1.2/32)
#   lo:haproxy3 (100.64.1.3/32)

Option 3: Bind to specific interface:

--ip-ifname 100.64.1.1%lo
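
The difference between prefix matching (--label) and exact matching (--label-exact-match) can be illustrated with a few lines of Python. This is only a sketch of the matching rule described in the table; `select_labels` is a hypothetical helper, and the built-in module discovers the labels from the system itself.

```python
def select_labels(labels, label, ifname="lo", exact=False):
    """Return the interface labels selected by `label`.

    Prefix mode mirrors the IFNAME:LABEL* pattern; exact mode requires
    the label to equal IFNAME:LABEL.
    """
    wanted = f"{ifname}:{label}"
    if exact:
        return [name for name in labels if name == wanted]
    return [name for name in labels if name.startswith(wanted)]


labels = ["lo:haproxy", "lo:haproxy1", "lo:haproxy2", "lo:web1"]
prefix_matches = select_labels(labels, "haproxy")             # three haproxy labels
exact_matches = select_labels(labels, "haproxy", exact=True)  # only "lo:haproxy"
```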

IP Address Management

| Option | Description |
| --- | --- |
| --no-ip-setup | Don't configure missing IP addresses on interfaces |
| --dynamic-ip-setup | Delete IPs when service DOWN/disabled, restore when UP |
| --sudo | Use sudo for IP address operations |

Next-hop and Preference

| Option | Description |
| --- | --- |
| --next-hop IP, -N IP | Self IP to use as BGP next-hop |
| --local-preference P | LOCAL_PREF value for announced routes |

Metrics (MED)

| Option | Default | Description |
| --- | --- | --- |
| --up-metric M | 100 | MED when service is UP |
| --down-metric M | 1000 | MED when service is DOWN |
| --disabled-metric M | 500 | MED when service is disabled |
| --increase M | 0 | Increment MED for each additional IP |

Example: Metric-based failover

# Primary server: low MED when healthy
--up-metric 100 --down-metric 1000

# Backup server: higher MED
--up-metric 200 --down-metric 1100
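
To see why these values produce failover, recall that a router prefers the route with the lowest MED when the earlier tie-breakers (such as local preference and AS-path length) are equal. The sketch below replays the primary/backup metrics from the example; `METRICS` and `preferred_server` are illustrative names only.

```python
# MED announced by each server in each state (values from the example above)
METRICS = {
    "primary": {"up": 100, "down": 1000},
    "backup": {"up": 200, "down": 1100},
}


def preferred_server(states):
    """Return the server whose announced MED is lowest.

    `states` maps server name -> "up" or "down".
    """
    return min(states, key=lambda server: METRICS[server][states[server]])
```

With both servers up the primary wins (100 < 200); when the primary's check fails its MED rises to 1000, so the backup (200) takes the traffic.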

Communities

| Option | Description |
| --- | --- |
| --community C | Announce with standard community |
| --extended-community EC | Announce with extended community |
| --large-community LC | Announce with large community |
| --disabled-community C | Community to use when disabled |

Example:

--community 65001:100 --community 65001:200

AS-PATH Manipulation

| Option | Description |
| --- | --- |
| --as-path ASPATH | AS-PATH for all states |
| --up-as-path ASPATH | AS-PATH when service UP |
| --down-as-path ASPATH | AS-PATH when service DOWN |
| --disabled-as-path ASPATH | AS-PATH when service disabled |

Example: Prepend when down

--up-as-path "65001" --down-as-path "65001 65001 65001"

Route Withdrawal

| Option | Description |
| --- | --- |
| --withdraw-on-down | Withdraw route instead of increasing MED on failure |
| --deaggregate-networks | Deaggregate networks specified in --ip |

Advanced Options

| Option | Description |
| --- | --- |
| --path-id PATHID | BGP ADD-PATH path ID |
| --neighbor NEIGHBOR | Advertise only to selected neighbors |
| --debounce | Announce only on state changes (not every iteration) |

State Change Execution

Execute commands when service state changes:

| Option | Description |
| --- | --- |
| --execute CMD | Execute on any state change |
| --up-execute CMD | Execute when service becomes UP |
| --down-execute CMD | Execute when service becomes DOWN |
| --disabled-execute CMD | Execute when service disabled |

Examples:

Send alert when service goes down:

--down-execute "mail -s 'Service DOWN' [email protected]"

Update monitoring system:

--up-execute "/usr/local/bin/update-monitoring UP" \
--down-execute "/usr/local/bin/update-monitoring DOWN"

Slack notification:

--down-execute "curl -X POST https://hooks.slack.com/... -d '{\"text\":\"HAProxy DOWN\"}'"
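
A custom script can offer similar hooks. The sketch below shows one cautious way to run them; it is an assumption about good practice, not a description of how the built-in module executes its hooks, and `run_hook` is a hypothetical name. Output is discarded because, in a process started by ExaBGP, anything written to standard output is interpreted as an API command.

```python
import logging
import subprocess


def run_hook(command, timeout=10):
    """Run a shell hook and return its exit code, or None if it did not finish.

    A slow or failing hook must never stall or crash the health check loop.
    """
    if not command:
        return None
    try:
        completed = subprocess.run(
            command,
            shell=True,
            timeout=timeout,
            stdout=subprocess.DEVNULL,   # keep hook output away from ExaBGP
            stderr=subprocess.DEVNULL,
        )
        return completed.returncode
    except subprocess.TimeoutExpired:
        logging.warning("Hook timed out: %s", command)
        return None
```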

Built-in Healthcheck Examples

Example 1: HTTP Health Check with Label

# /etc/exabgp/exabgp.conf
neighbor 192.0.2.1 {
    router-id 192.0.2.2;
    local-address 192.0.2.2;
    local-as 65001;
    peer-as 65000;
}

process watch-web {
    run python -m exabgp healthcheck \
        --cmd "curl -sf http://127.0.0.1:80/health" \
        --label web \
        --interval 5 \
        --rise 3 \
        --fall 2;
}

IP setup on loopback:

ip addr add 100.64.1.1/32 dev lo label lo:web1
ip addr add 100.64.1.2/32 dev lo label lo:web2

Example 2: MySQL Health Check with Metrics

process watch-mysql {
    run python -m exabgp healthcheck \
        --cmd "mysql -u healthcheck -e 'SELECT 1'" \
        --ip 100.64.2.1/32 \
        --up-metric 100 \
        --down-metric 1000 \
        --rise 3 \
        --fall 2 \
        --community 65001:100;
}

Example 3: Multiple Services with Different Metrics

# Primary HAProxy
process watch-haproxy-primary {
    run python -m exabgp healthcheck \
        --cmd "curl -sf http://127.0.0.1:8080/health" \
        --ip 100.64.10.1/32 \
        --up-metric 100 \
        --down-metric 1000 \
        --community 65001:100;
}

# Backup HAProxy (higher MED)
process watch-haproxy-backup {
    run python -m exabgp healthcheck \
        --cmd "curl -sf http://127.0.0.2:8080/health" \
        --ip 100.64.10.2/32 \
        --up-metric 200 \
        --down-metric 1100 \
        --community 65001:200;
}

Example 4: Withdraw on Down (Anycast)

process watch-dns {
    run python -m exabgp healthcheck \
        --cmd "dig @127.0.0.1 example.com +short" \
        --ip 100.64.53.1/32 \
        --withdraw-on-down \
        --rise 2 \
        --fall 2;
}

Example 5: With Disable File and Execution Hooks

process watch-api {
    run python -m exabgp healthcheck \
        --cmd "/usr/local/bin/check-api.sh" \
        --label api \
        --disable /var/run/exabgp-api.disabled \
        --down-execute "logger 'API service DOWN - route withdrawn'" \
        --up-execute "logger 'API service UP - route announced'" \
        --debounce;
}

Example 6: Configuration File

/etc/exabgp/healthcheck-haproxy.conf:

# Logging
debug
syslog-facility = local0

# Naming
name = haproxy-primary

# Health check
interval = 5
fast-interval = 1
timeout = 3
rise = 3
fall = 2
command = curl -sf http://127.0.0.1:8080/health

# Advertising
label = haproxy
up-metric = 100
down-metric = 1000
community = 65001:100
withdraw-on-down

# Execution hooks
down-execute = /usr/local/bin/alert-down.sh
up-execute = /usr/local/bin/alert-up.sh

ExaBGP config:

process watch-haproxy {
    run python -m exabgp healthcheck --config /etc/exabgp/healthcheck-haproxy.conf;
}

Built-in Healthcheck vs Custom Scripts

| Feature | Built-in Healthcheck | Custom Script |
| --- | --- | --- |
| Setup | Zero code required | Write Python/Bash script |
| Rise/Fall dampening | ✅ Built-in | Must implement manually |
| IP setup | ✅ Automatic | Must implement manually |
| Logging | ✅ Syslog support | Must implement manually |
| Metrics/MED | ✅ Built-in | Must implement manually |
| Execution hooks | ✅ Built-in | Must implement manually |
| Flexibility | Limited to options | Unlimited |
| Complexity | Simple | Custom logic possible |

Recommendation:

  • Use the built-in healthcheck for most use cases (HTTP, TCP, and other command-based checks)
  • Use a custom script only when you need logic the options cannot express (weighted decisions, several degradation levels, etc.)

Basic Health Check Pattern

Simple Health Check Script

#!/usr/bin/env python3
"""
Basic health check module for ExaBGP
Announces route when service is healthy, withdraws when unhealthy
"""

import sys
import time
import subprocess
import logging

# Configuration
SERVICE_IP = "100.64.1.1/32"
CHECK_INTERVAL = 5  # seconds
RISE_THRESHOLD = 3  # consecutive passes before announcing
FALL_THRESHOLD = 2  # consecutive failures before withdrawing

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    handlers=[
        logging.FileHandler('/var/log/exabgp-healthcheck.log'),
        logging.StreamHandler(sys.stderr)
    ]
)

def check_service_health():
    """
    Check if service is healthy
    Returns True if healthy, False otherwise
    """
    try:
        # Example: HTTP health check
        result = subprocess.run(
            ['curl', '-sf', 'http://localhost/health'],
            timeout=2,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        logging.warning("Health check timed out")
        return False
    except Exception as e:
        logging.error(f"Health check error: {e}")
        return False

def announce_route():
    """Announce BGP route"""
    # "next-hop self" uses the local address; 192.0.2.1 is the peer, not us
    print(f"announce route {SERVICE_IP} next-hop self")
    sys.stdout.flush()
    logging.info(f"Announced route {SERVICE_IP}")

def withdraw_route():
    """Withdraw BGP route"""
    print(f"withdraw route {SERVICE_IP}")
    sys.stdout.flush()
    logging.warning(f"Withdrew route {SERVICE_IP}")

def main():
    rise_count = 0
    fall_count = 0
    announced = False

    logging.info("Health check module started")

    while True:
        healthy = check_service_health()

        if healthy:
            rise_count += 1
            fall_count = 0

            if rise_count >= RISE_THRESHOLD and not announced:
                announce_route()
                announced = True
                rise_count = 0

        else:
            fall_count += 1
            rise_count = 0

            if fall_count >= FALL_THRESHOLD and announced:
                withdraw_route()
                announced = False
                fall_count = 0

        time.sleep(CHECK_INTERVAL)

if __name__ == '__main__':
    main()

ExaBGP Configuration

# /etc/exabgp/healthcheck.conf

neighbor 192.0.2.1 {
    router-id 192.0.2.2;
    local-address 192.0.2.2;
    local-as 65001;
    peer-as 65000;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}

process healthcheck {
    run /usr/local/bin/exabgp-healthcheck.py;
    encoder text;
}

Health Check Types

HTTP/HTTPS Health Checks

Use Case: Web servers, APIs, load balancers

import requests

def http_health_check(url, timeout=2):
    """
    Check HTTP endpoint
    Returns True if status code 200 and (optionally) response matches pattern
    """
    try:
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

# With content verification
def http_health_check_advanced(url, expected_text="OK", timeout=2):
    """Check HTTP endpoint with content verification"""
    try:
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200 and expected_text in response.text
    except requests.exceptions.RequestException:
        return False

# Example usage
healthy = http_health_check("http://localhost:8080/health")
healthy = http_health_check_advanced("https://localhost/status", expected_text='"status":"up"')
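
If installing `requests` on the router-facing host is undesirable, the same check can be written with the standard library. This is a minimal sketch under that assumption; `http_health_check_stdlib` is a hypothetical name, and note that `urlopen` follows redirects and raises for 4xx/5xx responses.

```python
import http.client
import urllib.error
import urllib.request


def http_health_check_stdlib(url, timeout=2):
    """Return True if GET `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # URLError: refused connection, DNS failure, HTTP error status
        # OSError: socket timeout; ValueError: malformed URL
        return False
```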

TCP Port Checks

Use Case: Databases, message queues, generic TCP services

import socket

def tcp_port_check(host, port, timeout=2):
    """
    Check if TCP port is open and accepting connections
    """
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        result = sock.connect_ex((host, port))
        sock.close()
        return result == 0
    except Exception:
        return False

# Example usage
healthy = tcp_port_check("localhost", 3306)  # MySQL
healthy = tcp_port_check("localhost", 5432)  # PostgreSQL
healthy = tcp_port_check("localhost", 6379)  # Redis
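
The function above creates an `AF_INET` socket, so it only tests IPv4. Where the service may listen on IPv6, a variant based on `socket.create_connection` can be used instead; it resolves the host name and tries each returned address in turn. The name `tcp_port_check_any_family` is illustrative.

```python
import socket


def tcp_port_check_any_family(host, port, timeout=2):
    """Return True if a TCP connection succeeds over IPv4 or IPv6."""
    try:
        # The context manager closes the socket even if an error occurs later
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name resolution failed
        return False
```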

ICMP Ping Checks

Use Case: Network reachability, simple aliveness

import subprocess

def ping_check(host, count=1, timeout=2):
    """
    Ping host and return True if reachable
    """
    try:
        result = subprocess.run(
            ['ping', '-c', str(count), '-W', str(timeout), host],
            timeout=timeout + 1,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    except Exception:
        return False

# Example usage
healthy = ping_check("192.168.1.1")

Command Execution Checks

Use Case: Custom check scripts, database queries, file checks

import subprocess

def command_check(command, timeout=5):
    """
    Execute command and return True if exit code is 0
    """
    try:
        result = subprocess.run(
            command,
            shell=isinstance(command, str),  # strings go through the shell, lists run directly
            timeout=timeout,
            capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    except Exception:
        return False

# Examples
healthy = command_check("systemctl is-active nginx")
healthy = command_check(["mysql", "-e", "SELECT 1"])
healthy = command_check("test -f /var/run/myapp.pid")

Multi-Service Checks

Use Case: Multiple services must all be healthy

def multi_service_check():
    """
    Check multiple services - all must be healthy
    """
    checks = {
        'nginx': lambda: http_health_check("http://localhost:80"),
        'redis': lambda: tcp_port_check("localhost", 6379),
        'app': lambda: http_health_check("http://localhost:8080/health"),
    }

    results = {}
    for name, check_func in checks.items():
        results[name] = check_func()
        if not results[name]:
            logging.warning(f"Service {name} is unhealthy")

    all_healthy = all(results.values())
    logging.info(f"Health check results: {results}, all healthy: {all_healthy}")

    return all_healthy
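
Running the checks one after another means the worst-case duration is the sum of all timeouts. As a sketch of one alternative, the checks can run concurrently so the total time is bounded by the slowest check. `run_checks_in_parallel` is a hypothetical helper; it accepts the same kind of name-to-callable dictionary used above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks_in_parallel(checks):
    """Run zero-argument check callables concurrently.

    Returns (all_healthy, results) where results maps name -> bool.
    A check that raises an exception is counted as failed.
    """
    def safe(check):
        try:
            return bool(check())
        except Exception:
            return False

    if not checks:
        return (False, {})  # nothing was verified, so do not claim healthy
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = {name: pool.submit(safe, check) for name, check in checks.items()}
        results = {name: future.result() for name, future in futures.items()}
    return (all(results.values()), results)
```

Each check still needs its own timeout; the thread pool does not cancel a check that hangs.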

Dampening and Flap Prevention

Rise/Fall Counters

Problem: Transient failures cause route flapping.

Solution: Require multiple consecutive passes/fails.

class HealthCheckDampener:
    """Dampening logic for health checks"""

    def __init__(self, rise_threshold=3, fall_threshold=2):
        self.rise_threshold = rise_threshold
        self.fall_threshold = fall_threshold
        self.rise_count = 0
        self.fall_count = 0
        self.state = 'down'  # Current state: 'up' or 'down'

    def update(self, healthy):
        """
        Update health state based on check result
        Returns True if state changed
        """
        previous_state = self.state

        if healthy:
            self.rise_count += 1
            self.fall_count = 0

            if self.rise_count >= self.rise_threshold:
                self.state = 'up'
                self.rise_count = 0

        else:
            self.fall_count += 1
            self.rise_count = 0

            if self.fall_count >= self.fall_threshold:
                self.state = 'down'
                self.fall_count = 0

        return self.state != previous_state

    def is_up(self):
        """Return True if state is 'up'"""
        return self.state == 'up'

# Usage
dampener = HealthCheckDampener(rise_threshold=3, fall_threshold=2)

while True:
    healthy = check_service_health()
    state_changed = dampener.update(healthy)

    if state_changed:
        if dampener.is_up():
            announce_route()
        else:
            withdraw_route()

    time.sleep(5)
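
To make the effect of dampening concrete, the following sketch replays a sequence of check results through the same rise/fall rules and records the state after each one. `dampened_states` is a hypothetical helper written for this illustration.

```python
def dampened_states(results, rise=3, fall=2, state="down"):
    """Return the state after each check result, applying rise/fall dampening."""
    rise_count = fall_count = 0
    states = []
    for healthy in results:
        if healthy:
            rise_count, fall_count = rise_count + 1, 0
            if rise_count >= rise:
                state, rise_count = "up", 0
        else:
            fall_count, rise_count = fall_count + 1, 0
            if fall_count >= fall:
                state, fall_count = "down", 0
        states.append(state)
    return states


# A single failure (3rd and 7th result) neither delays nor causes a state change
# beyond resetting the opposite counter.
trace = dampened_states(
    [True, True, False, True, True, True, False, True, False, False]
)
# trace == ['down', 'down', 'down', 'down', 'down', 'up', 'up', 'up', 'up', 'down']
```

The isolated failure at position 7 is absorbed, while the two consecutive failures at the end bring the state down.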

Hysteresis (Different Thresholds)

Use different thresholds for bringing route up vs down:

RISE_THRESHOLD = 3  # Require 3 passes to announce (cautious)
FALL_THRESHOLD = 2  # Only 2 failures to withdraw (fast failover)

Rationale:

  • Higher rise threshold: Avoid announcing prematurely after restart
  • Lower fall threshold: Fail fast when service actually dies

Advanced Patterns

Weighted Health Checks

Use Case: Different checks have different importance.

def weighted_health_check():
    """
    Weighted health checks - return True if score > threshold
    """
    checks = {
        'critical': {
            'app_health': {'weight': 10, 'check': lambda: http_health_check("http://localhost:8080/health")},
            'database': {'weight': 10, 'check': lambda: tcp_port_check("localhost", 5432)},
        },
        'important': {
            'cache': {'weight': 5, 'check': lambda: tcp_port_check("localhost", 6379)},
        },
        'optional': {
            'monitoring': {'weight': 1, 'check': lambda: tcp_port_check("localhost", 9090)},
        }
    }

    total_score = 0
    max_score = 0

    for category, items in checks.items():
        for name, config in items.items():
            max_score += config['weight']
            if config['check']():
                total_score += config['weight']
            else:
                logging.warning(f"Check {name} ({category}) failed")

    health_percentage = (total_score / max_score) * 100 if max_score > 0 else 0
    healthy = health_percentage >= 80  # Require 80% score

    logging.info(f"Health score: {total_score}/{max_score} ({health_percentage:.1f}%)")

    return healthy
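
It helps to work through the arithmetic for the weights above (10 + 10 + 5 + 1 = 26) to see which failures the 80% threshold tolerates. The sketch below recomputes the score for a given set of failed checks; `WEIGHTS` and `health_percentage` are illustrative names.

```python
# Weights taken from the example above
WEIGHTS = {"app_health": 10, "database": 10, "cache": 5, "monitoring": 1}


def health_percentage(failed, weights=WEIGHTS):
    """Return the health score in percent when the checks in `failed` fail."""
    total = sum(weights.values())
    passed = sum(w for name, w in weights.items() if name not in failed)
    return 100.0 * passed / total if total else 0.0


cache_only = health_percentage({"cache"})                      # 21/26, about 80.8%
cache_and_monitoring = health_percentage({"cache", "monitoring"})  # 20/26, about 76.9%
database_only = health_percentage({"database"})                # 16/26, about 61.5%
```

With these weights a cache failure alone stays above 80%, any critical failure drops below it, and a cache failure combined with the low-weight monitoring check also falls below the threshold. Weights and threshold therefore need to be chosen together.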

Dependency Checks

Use Case: Service A depends on Service B.

def dependency_check():
    """
    Check dependencies in order - fail fast if dependency fails
    """
    # Check critical dependencies first
    if not tcp_port_check("localhost", 5432):  # Database
        logging.error("Database down - service cannot function")
        return False

    if not tcp_port_check("localhost", 6379):  # Cache
        logging.error("Cache down - service cannot function")
        return False

    # Only check app if dependencies are up
    if not http_health_check("http://localhost:8080/health"):
        logging.error("App health check failed")
        return False

    return True

Graceful Degradation

Use Case: Announce with higher MED when degraded (not fully healthy).

def graceful_degradation_check():
    """
    Return health status with degradation level
    Returns: ('healthy', 'degraded', or 'down'), med_value
    """
    # Check critical services
    app_ok = http_health_check("http://localhost:8080/health")
    db_ok = tcp_port_check("localhost", 5432)

    # Check optional services
    cache_ok = tcp_port_check("localhost", 6379)

    if app_ok and db_ok and cache_ok:
        return ('healthy', 100)  # MED 100 - fully healthy

    elif app_ok and db_ok:
        return ('degraded', 150)  # MED 150 - degraded (no cache)

    else:
        return ('down', None)  # Completely down

# Usage
while True:
    status, med = graceful_degradation_check()

    if status == 'healthy':
        print(f"announce route {SERVICE_IP} next-hop self med {med}")
        sys.stdout.flush()

    elif status == 'degraded':
        print(f"announce route {SERVICE_IP} next-hop self med {med}")
        sys.stdout.flush()
        logging.warning("Service degraded - announcing with higher MED")

    elif status == 'down':
        print(f"withdraw route {SERVICE_IP}")
        sys.stdout.flush()

    time.sleep(10)
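
The loop above sends a command on every iteration. A variant that only talks to ExaBGP when the status or MED actually changes can be sketched as follows. The helper names are hypothetical, the prefix is the example `SERVICE_IP`, and `next-hop self` is used as in the production module later in this guide.

```python
def command_for(status, med, service_ip="100.64.1.1/32"):
    """Build the ExaBGP API command for a given status."""
    if status == "down":
        return f"withdraw route {service_ip}"
    return f"announce route {service_ip} next-hop self med {med}"


def commands_on_change(observations, service_ip="100.64.1.1/32"):
    """Yield a command only when (status, med) differs from the last one sent.

    Starts from ('down', None) so nothing is sent until the service is usable.
    """
    previous = ("down", None)
    for status, med in observations:
        if (status, med) != previous:
            previous = (status, med)
            yield command_for(status, med, service_ip)
```

Each yielded command would then be printed and flushed exactly as in the loop above.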

Production Health Check Module

Complete production-ready health check module with all features:

#!/usr/bin/env python3
"""
Production Health Check Module for ExaBGP
Features:
- Multiple check types (HTTP, TCP, command)
- Rise/fall dampening
- Weighted checks
- Graceful degradation with MED
- Comprehensive logging
- Signal handling
"""

import sys
import time
import signal
import logging
import subprocess
import socket
from typing import Dict, Callable, Tuple, Optional

# Configuration
CONFIG = {
    'service_ip': '100.64.1.1/32',
    'check_interval': 5,
    'rise_threshold': 3,
    'fall_threshold': 2,
    'log_file': '/var/log/exabgp-healthcheck.log',
}

# Health checks configuration
CHECKS = {
    'app_http': {
        'type': 'http',
        'url': 'http://localhost:8080/health',
        'weight': 10,
        'timeout': 2,
    },
    'database': {
        'type': 'tcp',
        'host': 'localhost',
        'port': 5432,
        'weight': 10,
        'timeout': 2,
    },
    'cache': {
        'type': 'tcp',
        'host': 'localhost',
        'port': 6379,
        'weight': 5,
        'timeout': 2,
    },
}

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    handlers=[
        logging.FileHandler(CONFIG['log_file']),
        logging.StreamHandler(sys.stderr)
    ]
)

# Global shutdown flag
shutdown_flag = False

def signal_handler(signum, frame):
    """Handle shutdown signals gracefully"""
    global shutdown_flag
    logging.info(f"Received signal {signum}, shutting down gracefully")
    shutdown_flag = True

signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)

def http_check(url: str, timeout: int = 2) -> bool:
    """HTTP health check"""
    try:
        import requests
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except Exception as e:
        logging.debug(f"HTTP check failed for {url}: {e}")
        return False

def tcp_check(host: str, port: int, timeout: int = 2) -> bool:
    """TCP port check"""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        result = sock.connect_ex((host, port))
        sock.close()
        return result == 0
    except Exception as e:
        logging.debug(f"TCP check failed for {host}:{port}: {e}")
        return False

def command_check(command: str, timeout: int = 5) -> bool:
    """Command execution check"""
    try:
        result = subprocess.run(
            command,
            shell=True,
            timeout=timeout,
            capture_output=True
        )
        return result.returncode == 0
    except Exception as e:
        logging.debug(f"Command check failed for '{command}': {e}")
        return False

def run_checks() -> Tuple[bool, Optional[int]]:
    """
    Run all configured health checks
    Returns: (healthy: bool, med: int, or None when unhealthy)
    """
    total_weight = sum(check['weight'] for check in CHECKS.values())
    current_weight = 0

    for name, config in CHECKS.items():
        check_type = config['type']
        passed = False

        if check_type == 'http':
            passed = http_check(config['url'], config.get('timeout', 2))
        elif check_type == 'tcp':
            passed = tcp_check(config['host'], config['port'], config.get('timeout', 2))
        elif check_type == 'command':
            passed = command_check(config['command'], config.get('timeout', 5))

        if passed:
            current_weight += config['weight']
        else:
            logging.warning(f"Check '{name}' failed")

    health_percentage = (current_weight / total_weight) * 100 if total_weight > 0 else 0

    # Determine health status and MED
    if health_percentage >= 90:
        # Fully healthy
        return (True, 100)
    elif health_percentage >= 70:
        # Degraded but functional
        return (True, 150)
    else:
        # Too degraded, withdraw
        return (False, None)

class HealthState:
    """Track health state with dampening"""

    def __init__(self, rise_threshold: int, fall_threshold: int):
        self.rise_threshold = rise_threshold
        self.fall_threshold = fall_threshold
        self.rise_count = 0
        self.fall_count = 0
        self.announced = False
        self.current_med = None

    def update(self, healthy: bool, med: Optional[int]) -> bool:
        """
        Update state based on check result
        Returns True if announcement state should change
        """
        if healthy:
            self.rise_count += 1
            self.fall_count = 0

            if self.rise_count >= self.rise_threshold or self.announced:
                # Announce or update MED
                should_update = not self.announced or self.current_med != med
                self.announced = True
                self.current_med = med
                self.rise_count = 0
                return should_update
        else:
            self.fall_count += 1
            self.rise_count = 0

            if self.fall_count >= self.fall_threshold and self.announced:
                # Withdraw
                self.announced = False
                self.current_med = None
                self.fall_count = 0
                return True

        return False

def announce_route(med: int):
    """Announce BGP route with MED"""
    cmd = f"announce route {CONFIG['service_ip']} next-hop self med {med}"
    print(cmd)
    sys.stdout.flush()
    logging.info(f"Announced route with MED {med}")

def withdraw_route():
    """Withdraw BGP route"""
    cmd = f"withdraw route {CONFIG['service_ip']} next-hop self"
    print(cmd)
    sys.stdout.flush()
    logging.warning("Withdrew route")

def main():
    """Main health check loop"""
    logging.info("Production health check module started")
    state = HealthState(CONFIG['rise_threshold'], CONFIG['fall_threshold'])

    while not shutdown_flag:
        healthy, med = run_checks()
        should_update = state.update(healthy, med)

        if should_update:
            if state.announced:
                announce_route(state.current_med)
            else:
                withdraw_route()

        time.sleep(CONFIG['check_interval'])

    # Graceful shutdown - withdraw route
    if state.announced:
        logging.info("Shutting down - withdrawing route")
        withdraw_route()

    logging.info("Health check module stopped")

if __name__ == '__main__':
    main()

Integration Examples

With HAProxy

Monitor HAProxy backend health:

import requests

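def haproxy_backend_check(stats_url, backend_name, timeout=2):
    """Check if HAProxy backend has at least one UP server"""
    try:
        # Bound the request time so a hung stats page cannot stall the loop
        response = requests.get(f"{stats_url};csv", timeout=timeout)
        lines = response.text.split('\n')

        for line in lines:
            # Match the proxy name as the first CSV field, not as a substring
            if line.startswith(f"{backend_name},") and ',UP,' in line:
                return True

        return False
    except requests.exceptions.RequestException:
        return False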

# Usage
healthy = haproxy_backend_check("http://localhost:8404/stats", "webservers")
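
Substring matching on raw CSV lines is fragile. A more careful sketch parses the stats output by column name: HAProxy's CSV header starts with `# pxname,svname,...` and includes a `status` column. `backend_has_up_server` is a hypothetical helper that takes the CSV text (for example `response.text`) and ignores the aggregate FRONTEND and BACKEND rows.

```python
import csv
import io


def backend_has_up_server(csv_text, backend_name):
    """Return True if `backend_name` has at least one server whose status starts with UP."""
    # The header line starts with "# "; strip it so the first column is "pxname"
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("# ")))
    for row in reader:
        if row.get("pxname") != backend_name:
            continue
        if row.get("svname") in ("FRONTEND", "BACKEND"):
            continue  # aggregate rows, not individual servers
        if (row.get("status") or "").startswith("UP"):
            return True
    return False
```

Using `startswith("UP")` also accepts transitional values such as `UP 1/3`, which HAProxy reports while a server is failing checks but has not yet been marked down.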

With Kubernetes

Check pod readiness:

import subprocess
import json

def kubernetes_pod_ready(namespace, app_label):
    """Check if at least one pod with app label is ready"""
    try:
        result = subprocess.run(
            ['kubectl', 'get', 'pods', '-n', namespace,
             '-l', f'app={app_label}', '-o', 'json'],
            timeout=5,
            capture_output=True
        )

        if result.returncode != 0:
            return False

        pods = json.loads(result.stdout)

        for pod in pods.get('items', []):
            conditions = pod.get('status', {}).get('conditions', [])
            for condition in conditions:
                if condition['type'] == 'Ready' and condition['status'] == 'True':
                    return True

        return False
    except Exception:  # kubectl missing, timeout, or malformed JSON
        return False

# Usage
healthy = kubernetes_pod_ready("default", "myapp")
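
For anycast services a single ready pod may not be enough capacity. As a sketch, the JSON parsing can be separated into a function that counts ready pods, so the caller can require a minimum. `count_ready_pods` is a hypothetical helper that takes the text printed by `kubectl get pods -o json`.

```python
import json


def count_ready_pods(pod_list_json):
    """Count pods whose 'Ready' condition has status 'True'."""
    pods = json.loads(pod_list_json)
    ready = 0
    for pod in pods.get("items", []):
        for condition in pod.get("status", {}).get("conditions", []):
            if condition.get("type") == "Ready" and condition.get("status") == "True":
                ready += 1
                break  # count each pod at most once
    return ready
```

A caller could then announce only when, for example, `count_ready_pods(result.stdout) >= 2`.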

Monitoring and Logging

Metrics Export

Export health check metrics for Prometheus:

from prometheus_client import Gauge, Counter, start_http_server

# Metrics
health_status = Gauge('exabgp_health_status', 'Current health status (1=up, 0=down)')
check_duration = Gauge('exabgp_check_duration_seconds', 'Health check duration')
state_changes = Counter('exabgp_state_changes_total', 'Total state changes', ['from_state', 'to_state'])

# Start metrics server
start_http_server(9100)

# Update metrics
health_status.set(1 if healthy else 0)
check_duration.set(duration)
state_changes.labels(from_state='down', to_state='up').inc()
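
The snippet above assumes that `healthy`, `duration`, and the state labels are already available. One way to produce them, sketched with hypothetical helper names and only the standard library, is to time each check and derive the transition from the previous state:

```python
import time


def timed_check(check):
    """Run `check()` and return (healthy, duration_seconds)."""
    start = time.monotonic()   # monotonic clock is immune to wall-clock jumps
    try:
        healthy = bool(check())
    except Exception:
        healthy = False
    return healthy, time.monotonic() - start


def transition(previous_state, healthy):
    """Return (from_state, to_state) if the state changed, else None."""
    current = "up" if healthy else "down"
    return (previous_state, current) if previous_state != current else None
```

The tuple returned by `transition` maps directly onto the `from_state` and `to_state` labels of the counter.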

Structured Logging

Use structured logging for better analysis:

import json
import logging

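class JSONFormatter(logging.Formatter):
    # Attribute names present on every LogRecord; anything else came from `extra`
    STANDARD_ATTRS = set(vars(logging.LogRecord('', 0, '', 0, '', (), None)))

    def format(self, record):
        log_obj = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'message': record.getMessage(),
        }
        # Copy fields passed via `extra=` (e.g. check, url) into the JSON object
        for key, value in vars(record).items():
            if key not in self.STANDARD_ATTRS and key not in ('message', 'asctime'):
                log_obj[key] = value
        return json.dumps(log_obj, default=str)

handler = logging.FileHandler('/var/log/exabgp-healthcheck.json')
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)  # the root logger defaults to WARNING

logging.info("Health check passed", extra={'check': 'http', 'url': 'http://localhost:8080'})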

Common Pitfalls

  1. No timeout on checks: Always set timeouts (2-5 seconds typical)
  2. No dampening: Causes route flapping on transient failures
  3. Blocking checks: Use subprocess.run with timeout, not os.system
  4. Forgot sys.stdout.flush(): Commands buffer and don't reach ExaBGP
  5. No logging: Impossible to troubleshoot when things break
  6. Checking too frequently: Every 5-10 seconds is usually sufficient
  7. Not handling shutdown gracefully: Routes not withdrawn on stop
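
Pitfall 7 can be addressed with a `try`/`finally` around the main loop. The sketch below is one possible structure, not the only one: the signal handler raises a custom exception, which interrupts even a long `time.sleep`, and the `finally` clause guarantees the withdrawal. `Shutdown`, `raise_shutdown`, and `run_with_cleanup` are hypothetical names.

```python
import signal
import time


class Shutdown(Exception):
    """Raised from the signal handler to leave the loop, even mid-sleep."""


def raise_shutdown(signum, frame):
    raise Shutdown()


def run_with_cleanup(check_once, withdraw, interval=5.0):
    """Call check_once() every `interval` seconds; always call withdraw() on exit."""
    try:
        while True:
            check_once()
            time.sleep(interval)
    except (Shutdown, KeyboardInterrupt):
        pass                      # normal shutdown request
    finally:
        withdraw()                # runs on shutdown and on unexpected errors


# In the real script, register the handler before starting the loop:
#     signal.signal(signal.SIGTERM, raise_shutdown)
```

Because the exception can arrive at any point, `check_once` should avoid leaving a half-written command on standard output; printing each command with a single call and flushing immediately keeps that window small.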

👻 Ghost written by Claude (Anthropic AI)