Load Balancing
Dynamic BGP-based traffic distribution without hardware load balancers
BGP-based traffic distribution - automatic failover and equal-cost multi-path routing
⚠️ Important Limitation: BGP provides equal distribution (ECMP) or primary/backup (MED). It does NOT provide weighted/proportional distribution. For weighted load balancing, use Layer 7 load balancers (HAProxy/NGINX).
- Overview
- Load Balancing Strategies
- Architecture Patterns
- MED-Based Distribution
- ECMP Load Balancing
- Proportional Load Distribution (NOT Possible with BGP Alone)
- Multi-Tier Load Balancing
- Health Check Integration
- Implementation Examples
- Best Practices
- Monitoring and Metrics
- Troubleshooting
Load balancing with ExaBGP eliminates the need for expensive hardware load balancers by using BGP to distribute traffic across backend servers.
Hardware load balancer approach:
Internet → [Hardware LB] → Backend Servers
               (SPOF)
               (Expensive)
               (Vendor lock-in)
Issues:
- Single point of failure
- Expensive hardware ($10K-$100K+)
- Vendor lock-in
- Limited scalability
- Manual configuration
BGP-based approach:
Internet → [Network (ECMP)] → Backend Servers
           (Distributed)      (ExaBGP announces routes)
           (No SPOF)          (Health-aware)
Benefits:
- No single point of failure
- Open source (zero licensing cost)
- Vendor-neutral
- Unlimited horizontal scaling
- Application-aware (real-time metrics)
- Dynamic (automatic failover)
All servers announce same route with equal cost:
Server 1 → announces 100.10.0.100/32 → receives 33% traffic
Server 2 → announces 100.10.0.100/32 → receives 33% traffic
Server 3 → announces 100.10.0.100/32 → receives 34% traffic
Use case: Identical servers with equal capacity
⚠️ Important: MED does NOT provide proportional distribution
MED (Multi-Exit Discriminator) affects BGP route selection but does NOT distribute traffic proportionally:
- Lower MED = preferred path
- If one route has lower MED, it receives ALL traffic (not "more" traffic)
- ECMP (equal distribution) only works when routes have equal cost AFTER considering MED
MED is for primary/backup, not weighted load balancing.
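To see why the lower-MED route takes everything, here is a minimal illustrative sketch of the selection step (server addresses and MED values are hypothetical; this models router behavior, it is not ExaBGP code):
```python
#!/usr/bin/env python3
"""Illustrative sketch of best-path selection on MED (hypothetical values)."""

# Two routes for the same prefix, learned from the same AS.
routes = [
    ("192.168.1.10", 100),  # primary server announces MED 100
    ("192.168.1.11", 200),  # backup server announces MED 200
]

# When all higher-priority attributes tie, the lowest MED wins outright:
best = min(routes, key=lambda r: r[1])
print(f"all traffic -> {best[0]} (MED {best[1]})")  # winner takes all

# Only when the primary route is withdrawn does the backup carry traffic.
remaining = [r for r in routes if r != best]
best = min(remaining, key=lambda r: r[1])
print(f"after failover -> {best[0]} (MED {best[1]})")
```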
MED for primary/backup failover:
Primary server → MED 100 → receives ALL traffic (preferred)
Backup server  → MED 200 → receives NO traffic (unless primary fails)
Use case: Active/standby configuration with automatic failover
One primary, one or more backups:
Server 1: MED 100 → Active (receives all traffic)
Server 2: MED 200 → Standby (receives no traffic unless Server 1 fails)
Server 3: MED 300 → Standby (receives no traffic unless Servers 1 & 2 fail)
Use case: Traditional active-passive HA with priority ordering
Different metrics per service IP:
Server 1:
- Service A (100.10.0.10) → MED 100 (primary)
- Service B (100.10.0.20) → MED 150 (backup)
Server 2:
- Service A (100.10.0.10) → MED 150 (backup)
- Service B (100.10.0.20) → MED 100 (primary)
Result: Service A primarily on Server 1, Service B primarily on Server 2
Use case: Load distribution across multiple services
Simple, flat architecture:
┌─────────────────────────────────────────────┐
│            Internet / Clients               │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
           ┌───────────────┐
           │  Edge Router  │ ← Receives routes from all servers
           │ (ECMP enabled)│
           └───────┬───────┘
                   │
       ┌───────────┼───────────┐
       ▼           ▼           ▼
  ┌─────────┐ ┌─────────┐ ┌─────────┐
  │ Server 1│ │ Server 2│ │ Server 3│
  │ ExaBGP  │ │ ExaBGP  │ │ ExaBGP  │
  │ 100.10. │ │ 100.10. │ │ 100.10. │
  │  0.100  │ │  0.100  │ │  0.100  │
  └─────────┘ └─────────┘ └─────────┘
Characteristics:
- Direct BGP peering to edge router
- ECMP distributes traffic equally
- Per-flow load balancing (same source → same server)
- Simple configuration
Configuration:
# Each server runs identical ExaBGP config
neighbor 192.168.1.1 {
router-id 192.168.1.10;
local-address 192.168.1.10;
local-as 65001;
peer-as 65000;
family {
ipv4 unicast;
}
api {
processes [ load-balancer ];
}
}
process load-balancer {
run /etc/exabgp/lb-health.py;
encoder text;
}
For large deployments:
Internet
   │
   ▼
┌───────────────┐
│ Edge Routers  │
│ (100s-1000s)  │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│    Route      │ ← BIRD/FRRouting route reflectors
│  Reflectors   │   select best paths
└───────┬───────┘
        │
  ┌─────┼─────┐
  ▼     ▼     ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Server 1│ │ Server 2│ │ Server 3│
│ ExaBGP  │ │ ExaBGP  │ │ ExaBGP  │
└─────────┘ └─────────┘ └─────────┘
Benefits:
- Scales to thousands of servers
- Centralized policy enforcement
- Reduced BGP session overhead
- Clean separation of concerns
Four-tier architecture (Vincent Bernat pattern):
Tier 0: DNS (Geographic distribution)
   │
   ▼
Tier 1: BGP + ExaBGP + ECMP (L3 distribution)
   │
   ▼
Tier 2: IPVS (L4 consistent hashing)
   │
   ▼
Tier 3: HAProxy (L7 application routing)
   │
   ▼
Backend Servers
Each tier serves different purpose:
- Tier 0 (DNS): Geographic load balancing
- Tier 1 (ExaBGP + ECMP): Network-level distribution
- Tier 2 (IPVS): Consistent hashing L4 (minimizes connection disruption)
- Tier 3 (HAProxy): Application-level routing (host headers, paths, etc.)
ExaBGP's role: Announce load balancer IPs to enable ECMP distribution
MED (Multi-Exit Discriminator) is a BGP attribute that influences path selection.
- Lower MED = preferred path (receives all traffic)
- Higher MED = less preferred (carries traffic only if the preferred route disappears)
- MED compared only among routes from same AS
Three servers with different capacities:
#!/usr/bin/env python3
"""
MED-based load distribution
Different servers announce with different metrics
"""
import sys
import time
SERVICE_IP = "100.10.0.100"
# Server capacity configuration
# High-end server: MED 50
# Mid-range server: MED 100
# Low-end server: MED 150
SERVER_MED = 100 # Set per server
time.sleep(2)
while True:
sys.stdout.write(
f"announce route {SERVICE_IP}/32 next-hop self med {SERVER_MED}\n"
)
sys.stdout.flush()
    time.sleep(30)  # Refresh every 30 seconds
Result: the server with the lowest MED (here the high-end server, MED 50) receives all traffic; higher-MED servers are ordered backups
Adjust MED based on real-time CPU load:
#!/usr/bin/env python3
"""
Dynamic load-based traffic distribution
Higher CPU usage → higher MED → less preferred path
"""
import sys
import time
import psutil
SERVICE_IP = "100.10.0.100"
BASE_MED = 100
def calculate_med():
"""Calculate MED based on current system load"""
# Get CPU usage
cpu_percent = psutil.cpu_percent(interval=1)
# Get memory usage
mem = psutil.virtual_memory()
mem_percent = mem.percent
# Get connection count
connections = len(psutil.net_connections(kind='inet'))
# Calculate load factor
# CPU: 0-100% β 0-100 points
# Memory: 0-100% β 0-50 points
# Connections: 0-10000 β 0-50 points
load_factor = int(
cpu_percent +
(mem_percent * 0.5) +
(min(connections, 10000) / 10000 * 50)
)
# MED = BASE_MED + load_factor
# Low load: MED ~100
# High load: MED ~300
med = BASE_MED + load_factor
return med
time.sleep(2)
sys.stderr.write("[LOAD-BALANCER] Dynamic load balancer started\n")
while True:
med = calculate_med()
sys.stdout.write(
f"announce route {SERVICE_IP}/32 next-hop self med {med}\n"
)
sys.stdout.flush()
sys.stderr.write(f"[LOAD] Announced with MED={med}\n")
# Update every 30 seconds
    time.sleep(30)
How it works:
Server 1: 30% CPU → MED 130 → lowest metric → preferred path (carries the traffic)
Server 2: 60% CPU → MED 160 → next in line if Server 1 withdraws or its MED rises
Server 3: 90% CPU → MED 190 → last resort
Result: traffic automatically follows the least-loaded server; this is self-adjusting failover priority, not proportional distribution
Distribute different services across servers:
#!/usr/bin/env python3
"""
Multi-service load distribution
Each server is primary for different service IPs
"""
import sys
import time
# Service IP configuration
# Each server has different primary service
SERVICES = [
("100.10.0.10", 100), # Web service
("100.10.0.20", 150), # API service
("100.10.0.30", 200), # Database read replicas
]
# On Server 1: Web primary (100), API backup (150), DB backup (200)
# On Server 2: API primary (100), DB backup (150), Web backup (200)
# On Server 3: DB primary (100), Web backup (150), API backup (200)
time.sleep(2)
while True:
for service_ip, med in SERVICES:
sys.stdout.write(
f"announce route {service_ip}/32 next-hop self med {med}\n"
)
sys.stdout.flush()
    time.sleep(30)
Result: Even load distribution across servers
ECMP (Equal-Cost Multi-Path) allows routers to distribute traffic across multiple equal-cost paths.
1. Multiple servers announce same route:
Server 1 → announce 100.10.0.100/32
Server 2 → announce 100.10.0.100/32
Server 3 → announce 100.10.0.100/32
2. Router sees 3 equal-cost paths:
Router RIB:
100.10.0.100/32 via 192.168.1.10 (Server 1)
via 192.168.1.11 (Server 2)
via 192.168.1.12 (Server 3)
3. Router distributes traffic:
Flow hashing (src IP, dst IP, src port, dst port, protocol)
→ Hash determines which path
→ Same flow always goes to same server (connection persistence)
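To make the idea concrete, here is a minimal sketch of per-flow hashing; the exact header fields and hash function vary by vendor, and SHA-256 here is purely illustrative (real routers hash in hardware):
```python
#!/usr/bin/env python3
"""Illustrative per-flow ECMP hashing: same 5-tuple, same next-hop."""
import hashlib

next_hops = ["192.168.1.10", "192.168.1.11", "192.168.1.12"]  # the 3 servers

def pick_path(src_ip, dst_ip, src_port, dst_port, proto):
    # Hash the 5-tuple and map it onto one of the equal-cost paths.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return next_hops[digest % len(next_hops)]

# The same flow always lands on the same server (connection persistence):
print(pick_path("203.0.113.5", "100.10.0.100", 51515, 80, "tcp"))
print(pick_path("203.0.113.5", "100.10.0.100", 51515, 80, "tcp"))  # same result
# A different flow may land elsewhere:
print(pick_path("203.0.113.9", "100.10.0.100", 40000, 80, "tcp"))
```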
Cisco IOS-XR:
router bgp 65000
address-family ipv4 unicast
maximum-paths ibgp 8
maximum-paths ebgp 8
!
!
Juniper Junos:
protocols {
bgp {
group servers {
multipath;
}
}
}
Arista EOS:
router bgp 65000
maximum-paths 8
Simple health-check based announcement:
#!/usr/bin/env python3
"""
ECMP load balancing with health checks
All healthy servers announce same route
"""
import sys
import time
import socket
SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80
CHECK_INTERVAL = 5
def is_healthy():
"""Check if local service is healthy"""
try:
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(2)
result = sock.connect_ex(('127.0.0.1', SERVICE_PORT))
sock.close()
return result == 0
    except Exception:
return False
time.sleep(2)
announced = False
sys.stderr.write("[ECMP] Load balancer started\n")
while True:
healthy = is_healthy()
if healthy and not announced:
# Service healthy, announce route
sys.stdout.write(
f"announce route {SERVICE_IP}/32 next-hop self\n"
)
sys.stdout.flush()
sys.stderr.write(f"[ECMP] Service healthy, announcing route\n")
announced = True
elif not healthy and announced:
# Service failed, withdraw route
sys.stdout.write(
f"withdraw route {SERVICE_IP}/32 next-hop self\n"
)
sys.stdout.flush()
sys.stderr.write(f"[ECMP] Service failed, withdrawing route\n")
announced = False
time.sleep(CHECK_INTERVAL)
⚠️ Critical: BGP Cannot Do Weighted/Proportional Traffic Distribution
Reality:
- BGP + ECMP provides equal distribution across announced routes (flow-based hashing)
- There is NO way to make one server receive "twice as much traffic" as another via BGP
- MED does NOT provide proportional distribution (it's for primary/backup selection)
For proportional/weighted load balancing, use:
- Layer 7 Load Balancer (HAProxy, NGINX) with weighted backends
- DNS-based weighted round-robin (limited, client-side caching issues)
- Multi-tier architecture: ExaBGP → L4 load balancers → Layer 7 weighted distribution
Problem:
Server 1: 64 GB RAM, 16 CPU cores (high-capacity)
Server 2: 32 GB RAM, 8 CPU cores (medium-capacity)
Server 3: 16 GB RAM, 4 CPU cores (low-capacity)
What you CANNOT do:
- Make Server 1 receive 4x traffic of Server 3 via BGP
- Proportionally distribute traffic based on capacity
- Adjust traffic percentage dynamically
What you CAN do:
Option 1: ECMP with Load-Based Withdrawal
#!/usr/bin/env python3
"""
Binary health check - withdraw when overloaded
Each server announces same route, ECMP distributes equally
Overloaded servers withdraw to prevent failure
"""
import sys
import time
import psutil
SERVICE_IP = "100.10.0.100"
def is_overloaded():
cpu = psutil.cpu_percent(interval=1)
# Low-capacity server: withdraw at 80% CPU
# High-capacity server: withdraw at 95% CPU
threshold = 80 # Adjust per server capacity
return cpu > threshold
announced = False
time.sleep(2)
while True:
if is_overloaded():
if announced:
sys.stdout.write(f"withdraw route {SERVICE_IP}/32\n")
sys.stdout.flush()
announced = False
else:
if not announced:
sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop 192.0.2.1\n")
sys.stdout.flush()
announced = True
    time.sleep(5)
Result: Equal distribution, but overloaded servers drop out
Option 2: Multiple Service IPs
Announce 3 different service IPs, assign to servers based on capacity:
- 100.10.0.10 → All 3 servers announce (ECMP: equal split)
- 100.10.0.11 → Only Server 1 announces (100% to Server 1)
- 100.10.0.12 → Only Server 1 announces (100% to Server 1)
Client-side uses all 3 IPs (e.g., DNS returns all 3)
Rough approximation (assuming clients spread evenly across the three IPs): Server 1 receives all of .11 and .12 plus a third of .10, about 78%; Servers 2 and 3 receive roughly 11% each
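A quick sanity check of that arithmetic, under the stated assumption of uniform client spread across the three IPs:
```python
#!/usr/bin/env python3
"""Expected traffic split for the multiple-service-IP trick (illustrative)."""
from fractions import Fraction

# Assumption: DNS returns all three IPs and clients spread evenly across them.
announcements = {
    "100.10.0.10": ["server1", "server2", "server3"],  # ECMP splits this IP 3 ways
    "100.10.0.11": ["server1"],
    "100.10.0.12": ["server1"],
}

per_ip = Fraction(1, len(announcements))  # each IP carries 1/3 of client traffic
share = {}
for ip, servers in announcements.items():
    for server in servers:
        share[server] = share.get(server, Fraction(0)) + per_ip / len(servers)

for server in sorted(share):
    print(f"{server}: {float(share[server]):.0%}")
# server1: 78%, server2: 11%, server3: 11%
```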
Option 3: Multi-Tier with Layer 7
ExaBGP (BGP layer)
    ↓ ECMP (equal distribution)
HAProxy/NGINX Tier (multiple instances)
    ↓ Weighted backends (2:1:1 ratio)
Backend Servers (heterogeneous capacity)
This is the correct architecture for proportional distribution.
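For contrast, here is a minimal sketch of the weighted selection an L7 tier performs, using the smooth weighted round-robin approach popularized by NGINX; the backend names and 2:1:1 weights are hypothetical:
```python
#!/usr/bin/env python3
"""Smooth weighted round-robin: the proportional selection BGP cannot provide."""
from collections import Counter

backends = {"big": 2, "mid": 1, "small": 1}  # hypothetical 2:1:1 weights
current = {name: 0 for name in backends}
total_weight = sum(backends.values())

def pick():
    # Credit each backend its weight, pick the leader, then charge the leader
    # the total weight so the others catch up over subsequent rounds.
    for name, weight in backends.items():
        current[name] += weight
    chosen = max(current, key=current.get)
    current[chosen] -= total_weight
    return chosen

print(Counter(pick() for _ in range(400)))
# Counter({'big': 200, 'mid': 100, 'small': 100}) -- an exact 2:1:1 split
```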
Production pattern for hyperscale deployments:
┌────────────────────────────────────────┐
│ Tier 0: DNS (GeoDNS)                   │ ← Geographic distribution
│ Returns nearest datacenter IP          │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Tier 1: BGP + ExaBGP + ECMP            │ ← Network-level distribution
│ Edge routers use ECMP                  │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Tier 2: IPVS (L4 LB)                   │ ← Consistent hashing
│ Maglev scheduling minimizes            │   (connection persistence)
│ connection disruption                  │
└───────────────┬────────────────────────┘
                │
                ▼
┌────────────────────────────────────────┐
│ Tier 3: HAProxy (L7 LB)                │ ← Application routing
│ Host headers, URL paths, SSL term      │
└───────────────┬────────────────────────┘
                │
                ▼
         Backend Servers
Tier 1 Load Balancer Config:
# /etc/exabgp/multi-tier-lb.conf
neighbor 192.168.1.1 {
router-id 192.168.1.10;
local-address 192.168.1.10;
local-as 65001;
peer-as 65000;
family {
ipv4 unicast;
ipv6 unicast;
}
api {
processes [ tier1-announcer ];
}
}
process tier1-announcer {
run /etc/exabgp/tier1-announcer.py;
encoder text;
}
Health Check Script:
#!/usr/bin/env python3
"""
Multi-tier load balancer health check
Announces loopback IPs when IPVS/HAProxy are ready
"""
import sys
import os
import time
import netifaces
READY_FILE = '/etc/lb/v6-ready'
DISABLE_FILE = '/etc/lb/disable'
LOOPBACK_INTERFACE = 'lo'
CHECK_INTERVAL = 5
def get_loopback_ips():
"""Get all IPs configured on loopback interface"""
addrs = netifaces.ifaddresses(LOOPBACK_INTERFACE)
ips = []
# IPv4 addresses
if netifaces.AF_INET in addrs:
ips.extend([a['addr'] for a in addrs[netifaces.AF_INET]])
# IPv6 addresses
if netifaces.AF_INET6 in addrs:
ips.extend([a['addr'].split('%')[0] for a in addrs[netifaces.AF_INET6]])
return [ip for ip in ips if not ip.startswith('127.') and ip != '::1']
def is_service_ready():
"""Check if service should announce routes"""
return os.path.exists(READY_FILE) and not os.path.exists(DISABLE_FILE)
def check_ipvs_healthy():
"""Check IPVS is running and has healthy backends"""
try:
import subprocess
result = subprocess.run(
['ipvsadm', '-L', '-n'],
capture_output=True,
timeout=2
)
        # 'ActiveConn' appears in ipvsadm's column header, so this only
        # confirms the command ran; parse destination lines for a stricter check
        return result.returncode == 0 and b'ActiveConn' in result.stdout
    except Exception:
return False
def check_haproxy_healthy():
"""Check HAProxy has healthy backends"""
try:
import socket
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect('/var/run/haproxy.sock')
s.send(b'show stat\n')
stats = s.recv(8192).decode()
s.close()
# Check for UP backends
return 'UP' in stats
    except Exception:
return False
# Get service IPs to announce
service_ips = get_loopback_ips()
time.sleep(2)
sys.stderr.write(f"[TIER1] Multi-tier LB started, monitoring {len(service_ips)} IPs\n")
announced = False
while True:
# All checks must pass
ready = (
is_service_ready() and
check_ipvs_healthy() and
check_haproxy_healthy()
)
if ready and not announced:
# Announce all service IPs
for ip in service_ips:
if ':' in ip:
# IPv6
sys.stdout.write(f'announce route {ip}/128 next-hop self\n')
else:
# IPv4
sys.stdout.write(f'announce route {ip}/32 next-hop self\n')
sys.stdout.flush()
sys.stderr.write(f"[TIER1] Services healthy, announced {len(service_ips)} routes\n")
announced = True
elif not ready and announced:
# Withdraw all service IPs
for ip in service_ips:
if ':' in ip:
sys.stdout.write(f'withdraw route {ip}/128\n')
else:
sys.stdout.write(f'withdraw route {ip}/32\n')
sys.stdout.flush()
sys.stderr.write(f"[TIER1] Services unhealthy, withdrew routes\n")
announced = False
    time.sleep(CHECK_INTERVAL)
Maintenance workflow:
# Enter maintenance mode (gradual traffic drain)
touch /etc/lb/disable
# Wait for connections to drain (~60 seconds)
sleep 60
# Perform maintenance
systemctl restart ipvsadm
systemctl restart haproxy
# Exit maintenance mode
rm /etc/lb/disable
Check all dependencies before announcing:
#!/usr/bin/env python3
"""
Comprehensive health checking for load balancing
Checks web server, database, cache, disk, memory
"""
import sys
import time
import socket
import urllib.request
import psycopg2
import redis
SERVICE_IP = "100.10.0.100"
CHECK_INTERVAL = 5
def check_web_server():
"""Check web server responds"""
try:
response = urllib.request.urlopen('http://127.0.0.1/health', timeout=2)
return response.getcode() == 200
    except Exception:
return False
def check_database():
"""Check database is accessible"""
try:
conn = psycopg2.connect(
host='127.0.0.1',
database='mydb',
user='monitor',
password='secret',
connect_timeout=2
)
cursor = conn.cursor()
cursor.execute('SELECT 1')
result = cursor.fetchone()
conn.close()
return result[0] == 1
    except Exception:
return False
def check_redis():
"""Check Redis is accessible"""
try:
r = redis.Redis(host='127.0.0.1', port=6379, socket_timeout=2)
return r.ping()
    except Exception:
return False
def check_system_resources():
"""Check disk and memory"""
import shutil
import psutil
# Check disk space (at least 10% free)
stat = shutil.disk_usage('/')
free_percent = (stat.free / stat.total) * 100
if free_percent < 10:
return False
# Check memory (at least 1 GB free)
mem = psutil.virtual_memory()
if mem.available < 1024 * 1024 * 1024:
return False
return True
def comprehensive_health_check():
"""Run all health checks"""
checks = {
'web': check_web_server(),
'database': check_database(),
'redis': check_redis(),
'resources': check_system_resources(),
}
# Log individual check results
for name, result in checks.items():
status = "OK" if result else "FAIL"
sys.stderr.write(f"[HEALTH] {name}: {status}\n")
# All checks must pass
return all(checks.values())
time.sleep(2)
announced = False
while True:
healthy = comprehensive_health_check()
if healthy and not announced:
sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
sys.stderr.write("[HEALTH] All checks passed, announcing route\n")
announced = True
elif not healthy and announced:
sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
sys.stderr.write("[HEALTH] Health checks failed, withdrawing route\n")
announced = False
    time.sleep(CHECK_INTERVAL)
Step 1: Configure loopback IP on all servers
# Add service IP to loopback
ip addr add 100.10.0.100/32 dev lo
Step 2: ExaBGP configuration
# /etc/exabgp/lb.conf
neighbor 192.168.1.1 {
router-id 192.168.1.10;
local-address 192.168.1.10;
local-as 65001;
peer-as 65000;
family {
ipv4 unicast;
}
api {
processes [ lb-health ];
}
}
process lb-health {
run /etc/exabgp/lb-health.py;
encoder text;
}
Step 3: Health check script
#!/usr/bin/env python3
import sys
import time
import socket
SERVICE_IP = "100.10.0.100"
def is_healthy():
try:
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(2)
result = sock.connect_ex(('127.0.0.1', 80))
sock.close()
return result == 0
    except Exception:
return False
time.sleep(2)
announced = False
while True:
if is_healthy() and not announced:
sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
announced = True
elif not is_healthy() and announced:
sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
announced = False
    time.sleep(5)
Step 4: Start ExaBGP
exabgp /etc/exabgp/lb.conf
Step 5: Enable ECMP on router
router bgp 65000
maximum-paths 8
Step 6: Verify
# Check routes on router
show ip bgp 100.10.0.100
# Should see multiple paths
Always use loopback interface:
# Correct
ip addr add 100.10.0.100/32 dev lo
# Wrong (don't use physical interface)
# ip addr add 100.10.0.100/24 dev eth0
Why: Loopback IPs don't fail when interface goes down
# Cisco
router bgp 65000
maximum-paths ibgp 8
maximum-paths ebgp 8
# Juniper
set protocols bgp group servers multipath
# Arista
router bgp 65000
maximum-paths 8
Prevent route flapping:
RISE_THRESHOLD = 3 # 3 consecutive successes to announce
FALL_THRESHOLD = 2 # 2 consecutive failures to withdraw
rise_count = 0
fall_count = 0
announced = False
while True:
    healthy = check_health()
    if healthy:
        rise_count += 1
        fall_count = 0
        if rise_count >= RISE_THRESHOLD and not announced:
            announce_route()
            announced = True
    else:
        fall_count += 1
        rise_count = 0
        if fall_count >= FALL_THRESHOLD and announced:
            withdraw_route()
            announced = False
    time.sleep(5)  # pause between checks
Verify the BGP session before trusting health-based announcements:
import subprocess
def check_bgp_session():
"""Verify BGP session is established"""
result = subprocess.run(
['exabgpcli', 'show', 'neighbor', 'summary'],
capture_output=True
)
    return b'Established' in result.stdout
Log route changes for auditability:
import logging
logging.basicConfig(
filename='/var/log/exabgp-lb.log',
level=logging.INFO,
format='%(asctime)s [%(levelname)s] %(message)s'
)
def announce_route():
sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
    logging.info(f"ANNOUNCE: {SERVICE_IP}")
Key metrics to track:
1. Load Distribution:
- Requests per server
- Bandwidth per server
- Connection count per server
2. Health Checks:
- Success rate
- Latency
- Consecutive failures
3. BGP State:
- Session status
- Routes announced
- Routes withdrawn
4. System Metrics:
- CPU usage
- Memory usage
- Network throughput
#!/usr/bin/env python3
import time
from prometheus_client import start_http_server, Gauge, Counter
# Metrics
route_announced = Gauge('lb_route_announced', 'Route announcement status')
health_check_status = Gauge('lb_health_check_status', 'Health check result')
health_checks_total = Counter('lb_health_checks_total', 'Health checks', ['result'])
route_changes_total = Counter('lb_route_changes_total', 'Route changes', ['action'])
# Start metrics server
start_http_server(9100)
announced = False
while True:
healthy = check_health()
health_check_status.set(1 if healthy else 0)
health_checks_total.labels(result='success' if healthy else 'failure').inc()
if healthy and not announced:
announce_route()
route_announced.set(1)
route_changes_total.labels(action='announce').inc()
announced = True
elif not healthy and announced:
withdraw_route()
route_announced.set(0)
route_changes_total.labels(action='withdraw').inc()
        announced = False
    time.sleep(5)
Symptoms: One server receives all traffic despite ECMP
Check:
# Verify ECMP enabled
show ip bgp 100.10.0.100
# Should show "multipath" or multiple paths
# Check routing table
show ip route 100.10.0.100
# Should show multiple next-hops
Solutions:
# Enable ECMP
router bgp 65000
maximum-paths 8
# Verify BGP best path selection
show ip bgp 100.10.0.100 bestpath
Symptoms: Routes repeatedly announced/withdrawn
Diagnosis:
# Monitor BGP updates
show ip bgp neighbors 192.168.1.10 | include Last
Solutions:
- Implement rise/fall thresholds
- Increase health check interval
- Add retry logic
- Fix unstable service
Symptoms: Traffic continues to failed server
Check:
# Check BGP timers
show ip bgp neighbors 192.168.1.10
# Check health check frequency
tail -f /var/log/exabgp.log
Solutions:
- Reduce health check interval (5s recommended)
- Tune BGP timers (keepalive 10s, hold 30s)
- Enable BFD for fast failure detection (a worked failover budget follows below)
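As a rough worked example with assumed values matching the scripts and timers above, the failover budget breaks down as:
```python
#!/usr/bin/env python3
"""Back-of-the-envelope failover budget (assumed values; adjust to your setup)."""

CHECK_INTERVAL = 5   # seconds between health checks, as in the scripts above
FALL_THRESHOLD = 2   # consecutive failures before withdrawing
PROPAGATION = 1      # rough allowance for the withdraw to reach the router
HOLD_TIME = 30       # BGP hold timer: the bound when ExaBGP itself dies

# Graceful failure: the health check notices and withdraws the route.
detection = CHECK_INTERVAL * FALL_THRESHOLD
print(f"health-check failover: up to ~{detection + PROPAGATION}s")  # ~11s

# Hard crash: no withdraw is sent; the peer waits out the hold timer.
print(f"crash failover: up to ~{HOLD_TIME}s (BFD can cut this below 1s)")
```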
- Service High Availability - Complete HA guide
- Anycast Management - Anycast patterns
- Traffic Engineering - Advanced traffic control
- Monitoring - Monitoring setup
- Debugging - Troubleshooting guide
- Configuration Syntax - Config reference
- API Overview - API patterns
Ready to implement load balancing? See Quick Start →