Comprehensive observability stack for the NEPA microservices architecture, providing metrics, logs, traces, alerting, and SLA monitoring.
The observability stack consists of:
- Prometheus: Metrics collection and storage
- Grafana: Visualization and dashboards
- Loki: Log aggregation
- Promtail: Log shipping
- Jaeger: Distributed tracing
- Alertmanager: Alert management and routing
- OpenTelemetry: Instrumentation framework
# Start all observability services
docker-compose -f docker-compose.observability.yml up -d
# Verify services are running
docker-compose -f docker-compose.observability.yml ps

- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- Jaeger: http://localhost:16686
- Alertmanager: http://localhost:9093
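
Once the stack is up, the endpoints above can be probed programmatically instead of opened one by one. The following is a minimal sketch, not part of the NEPA codebase: `HEALTH_TARGETS`, `Probe`, and `checkStack` are hypothetical names, and the health paths are the ones each tool documents (Jaeger is probed on its admin port 14269 rather than the UI port). The probe is injected so the logic can be exercised without a running stack.

```typescript
// Health-check targets for the local stack. Hosts and ports follow this
// document; the names and structure here are illustrative assumptions.
interface HealthTarget {
  name: string;
  url: string;
}

const HEALTH_TARGETS: HealthTarget[] = [
  { name: "grafana", url: "http://localhost:3000/api/health" },
  { name: "prometheus", url: "http://localhost:9090/-/healthy" },
  { name: "alertmanager", url: "http://localhost:9093/-/healthy" },
  // Jaeger serves its health check at "/" on the admin port.
  { name: "jaeger", url: "http://localhost:14269/" },
];

// A probe resolves to the HTTP status code returned for a URL.
type Probe = (url: string) => Promise<number>;

// Probes every target in turn and reports true for any 2xx response.
async function checkStack(probe: Probe): Promise<Record<string, boolean>> {
  const results: Record<string, boolean> = {};
  for (const target of HEALTH_TARGETS) {
    try {
      const status = await probe(target.url);
      results[target.name] = status >= 200 && status < 300;
    } catch {
      // Connection refused, timeouts, and similar failures count as unhealthy.
      results[target.name] = false;
    }
  }
  return results;
}
```

On Node 18 or later, a real probe could be passed as `checkStack(async (url) => (await fetch(url)).status)`.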
Update your service to include observability:
import { OpenTelemetrySetup } from './observability/tracing/OpenTelemetrySetup';
import { createLogger } from './observability/logger/StructuredLogger';
import metricsCollector from './observability/metrics/MetricsCollector';
import TracingMiddleware from './observability/tracing/TracingMiddleware';
// Initialize tracing
const tracing = new OpenTelemetrySetup('user-service');
tracing.start();
// Initialize logger
const logger = createLogger('user-service');
// Add middleware
app.use(TracingMiddleware.middleware());
app.use(metricsCollector.middleware('user-service'));
// Expose metrics endpoint
app.get('/metrics', metricsCollector.getMetricsHandler());

import { createLogger } from './observability/logger/StructuredLogger';
const logger = createLogger('payment-service');
// Info logging
logger.info('Payment processed', { paymentId, amount });
// Error logging
logger.error('Payment failed', error, { paymentId });
// Audit logging
logger.audit('payment.created', 'payment', { paymentId, userId });
// Performance logging
logger.performance('process-payment', duration, { paymentId });
// Metric logging
logger.metric('payment.amount', amount, { currency: 'USD' });

import TracingMiddleware from './observability/tracing/TracingMiddleware';
// Trace database operations
await TracingMiddleware.traceDatabase(
  'findUser',
  'SELECT * FROM users WHERE id = $1',
  async () => await userClient.user.findUnique({ where: { id } })
);
// Trace HTTP calls
await TracingMiddleware.traceHttpCall(
  'POST',
  'https://api.stellar.org/payment',
  async () => await axios.post(url, data)
);
// Trace custom operations
await TracingMiddleware.traceOperation(
  'process-payment',
  async (span) => {
    span.setAttribute('payment.id', paymentId);
    return await processPayment(paymentId);
  }
);

import metricsCollector from './observability/metrics/MetricsCollector';
// Record payment
metricsCollector.recordPayment('success', 'STELLAR', 'payment-service');
// Record payment duration
metricsCollector.recordPaymentDuration(duration, 'success', 'STELLAR');
// Record bill creation
metricsCollector.recordBillCreated('electricity', 'billing-service');
// Set active users
metricsCollector.setActiveUsers(1250, 'user-service');
// Record database query
metricsCollector.recordDbQuery('SELECT', 'users', duration, 'user-service');
// Record event bus message
metricsCollector.recordEventBusMessage('payment.success', 'published', 'payment-service');
// Record saga execution
metricsCollector.recordSagaExecution('payment-saga', 'success', 'payment-service');

import slaMonitor from './observability/monitoring/SLAMonitor';
// Record request
slaMonitor.recordRequest('payment-service', responseTime, success);
// Check SLA
const { met, violations } = slaMonitor.checkSLA('payment-service');
// Get SLA report
const report = slaMonitor.getSLAReport();
// Start monitoring
slaMonitor.startMonitoring(60000); // Check every minute

import anomalyDetector from './observability/monitoring/AnomalyDetector';
// Add data point
anomalyDetector.addDataPoint('response_time', responseTime);
// Detect anomaly
const anomaly = anomalyDetector.detectAnomaly('response_time', currentValue);
if (anomaly.isAnomaly) {
  console.log('Anomaly detected:', anomaly.message);
}
// Detect trend
const { trend, slope } = anomalyDetector.detectTrend('error_rate');
// Predict next value
const predicted = anomalyDetector.predictNextValue('request_count', 1);
// Auto-monitor metric
anomalyDetector.monitorMetric(
  'payment_failures',
  async () => await getPaymentFailureCount(),
  60000
);

Alerts are configured in observability/config/alert-rules.yml:
- Service Health: Service down, high error rate, high response time
- Database Health: Connection pool exhausted, high query time, replication lag
- Payment Service: High failure rate, slow processing
- Resource Usage: High CPU, high memory, low disk space
- SLA Monitoring: Availability violations, response time violations
- Anomaly Detection: Unusual traffic patterns, unusual error rates
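
The Anomaly Detection alerts above depend on some definition of "unusual". This document does not say how `AnomalyDetector` decides, so the sketch below shows one common approach as an assumption: a z-score test against recent history, plus a least-squares slope for trend direction. The function names and the default threshold of 3 standard deviations are illustrative, not the project's actual implementation.

```typescript
// Arithmetic mean of a non-empty sample.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation of a non-empty sample.
function stdDev(values: number[]): number {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) * (v - m))));
}

interface AnomalyResult {
  isAnomaly: boolean;
  zScore: number;
}

// Flags `current` when it lies more than `threshold` standard deviations
// away from the mean of `history`. The threshold is an assumed default.
function detectAnomaly(history: number[], current: number, threshold = 3): AnomalyResult {
  if (history.length < 2) {
    // Too little data to judge; report nothing unusual.
    return { isAnomaly: false, zScore: 0 };
  }
  const sd = stdDev(history);
  if (sd === 0) {
    // A perfectly flat history: any deviation at all is treated as anomalous.
    const diff = current - history[0];
    return { isAnomaly: diff !== 0, zScore: diff === 0 ? 0 : Math.sign(diff) * Infinity };
  }
  const zScore = (current - mean(history)) / sd;
  return { isAnomaly: Math.abs(zScore) > threshold, zScore };
}

// Least-squares slope of the values against their index positions 0..n-1.
// Positive means the metric is rising, negative means it is falling.
function trendSlope(values: number[]): number {
  const n = values.length;
  if (n < 2) return 0;
  const xMean = (n - 1) / 2;
  const yMean = mean(values);
  let numerator = 0;
  let denominator = 0;
  values.forEach((y, x) => {
    numerator += (x - xMean) * (y - yMean);
    denominator += (x - xMean) * (x - xMean);
  });
  return numerator / denominator;
}
```

A z-score test is cheap and easy to reason about, but it assumes roughly stable, non-seasonal metrics; traffic with strong daily cycles usually needs a baseline per time window.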
Pre-configured Grafana dashboards:
- Service Overview: Overall system health
- Service Details: Per-service metrics
- Database Performance: Database metrics
- Payment Analytics: Payment-specific metrics
- SLA Dashboard: SLA compliance tracking
- Error Tracking: Error rates and patterns
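
The SLA Dashboard entry above tracks compliance figures such as availability and response time. As a rough illustration of how such figures could be derived from recorded requests, here is a minimal sketch. It is not the `SLAMonitor` implementation: the types, the nearest-rank percentile, and the example targets (99.9% availability, 500 ms p95) are assumptions chosen for the example. The `met` and `violations` fields mirror the shape returned by `checkSLA` earlier in this document.

```typescript
interface RequestSample {
  responseTimeMs: number;
  success: boolean;
}

// Hypothetical SLA target shape; real targets would come from configuration.
interface SlaTarget {
  availability: number; // fraction of successful requests, e.g. 0.999
  p95ResponseTimeMs: number; // upper bound on the 95th percentile
}

interface SlaResult {
  met: boolean;
  availability: number;
  p95ResponseTimeMs: number;
  violations: string[];
}

// Nearest-rank percentile: the smallest value with at least p% of the
// samples at or below it. Returns 0 for an empty sample.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Compares observed availability and p95 latency against the target.
function evaluateSla(samples: RequestSample[], target: SlaTarget): SlaResult {
  if (samples.length === 0) {
    // No traffic in the window: nothing to violate.
    return { met: true, availability: 1, p95ResponseTimeMs: 0, violations: [] };
  }
  const successes = samples.filter((s) => s.success).length;
  const availability = successes / samples.length;
  const p95 = percentile(samples.map((s) => s.responseTimeMs), 95);

  const violations: string[] = [];
  if (availability < target.availability) violations.push("availability");
  if (p95 > target.p95ResponseTimeMs) violations.push("response_time");

  return { met: violations.length === 0, availability, p95ResponseTimeMs: p95, violations };
}
```

With 99 successful requests at 100 ms and one failure at 900 ms, `evaluateSla(samples, { availability: 0.999, p95ResponseTimeMs: 500 })` reports an availability violation only, because the single slow request falls above the 95th percentile.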
# Jaeger
JAEGER_ENDPOINT=http://localhost:14268/api/traces
# Prometheus
PROMETHEUS_URL=http://localhost:9090
# Loki
LOKI_URL=http://localhost:3100
# Alertmanager
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
PAGERDUTY_SERVICE_KEY=your-pagerduty-key
SMTP_USERNAME=alerts@nepa.io
SMTP_PASSWORD=your-password
# Service
SERVICE_NAME=user-service
SERVICE_VERSION=1.0.0
LOG_LEVEL=info

- Always use structured logging with correlation IDs
- Trace all external calls (HTTP, database, message queue)
- Record business metrics (payments, bills, users)
- Set up alerts for critical metrics
- Monitor SLAs continuously
- Review dashboards regularly
- Investigate anomalies promptly
- Keep logs for 30 days minimum
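
To make the first practice above concrete, the sketch below shows one way a correlation ID could be carried from an incoming request into every log line. It is an illustration only and does not describe how `StructuredLogger` works internally: the `x-correlation-id` header name, the ID format, and all function names are assumptions.

```typescript
type RequestHeaders = Record<string, string | undefined>;

// Assumed header name; any name works as long as every service agrees on it.
const CORRELATION_HEADER = "x-correlation-id";

// Generates a fallback ID. Not cryptographically strong; it only needs to be
// unique enough to group log lines for one request in this sketch.
function newCorrelationId(): string {
  return Date.now().toString(36) + "-" + Math.random().toString(36).slice(2, 10);
}

// Reuses the caller's correlation ID when present, otherwise creates one.
function resolveCorrelationId(headers: RequestHeaders): string {
  const incoming = headers[CORRELATION_HEADER];
  return incoming && incoming.length > 0 ? incoming : newCorrelationId();
}

interface LogEntry {
  timestamp: string;
  level: string;
  service: string;
  correlationId: string;
  message: string;
  context: Record<string, unknown>;
}

// Builds a structured entry; every field is machine-readable.
function buildLogEntry(
  service: string,
  level: string,
  message: string,
  correlationId: string,
  context: Record<string, unknown> = {}
): LogEntry {
  return {
    timestamp: new Date().toISOString(),
    level,
    service,
    correlationId,
    message,
    context,
  };
}

// Serializes the entry as a single JSON line, a format log shippers
// such as Promtail can forward and Loki can parse at query time.
function toLogLine(entry: LogEntry): string {
  return JSON.stringify(entry);
}
```

Forwarding the same ID on outgoing HTTP calls lets a single Loki query reconstruct one request's path across services.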
# Check if metrics endpoint is accessible
curl http://localhost:3001/metrics
# Check Prometheus targets
open http://localhost:9090/targets

# Check Jaeger health
curl http://localhost:14269/
# Verify JAEGER_ENDPOINT is set correctly
echo $JAEGER_ENDPOINT

# Check Promtail status
docker logs nepa-promtail
# Verify log file paths in promtail.yml

# Backup Prometheus data
docker run --rm -v nepa_prometheus-data:/data -v $(pwd):/backup \
  alpine tar czf /backup/prometheus-backup.tar.gz /data

# Clean logs older than 30 days
find ./logs -name "*.log" -mtime +30 -delete

# Edit alert rules
vim observability/config/alert-rules.yml
# Reload Prometheus configuration (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

- All services expose a /metrics endpoint
- All services send traces to Jaeger
- All services use structured logging
- Alerts configured for critical metrics
- Dashboards created for key metrics
- SLA targets defined
- Anomaly detection enabled
- Log retention configured
- Backup strategy in place
- Team trained on observability tools