A containerized monitoring stack for quickly deploying a complete server-monitoring environment built on industry-standard tools. It provides real-time visibility into your infrastructure with minimal setup time.
- [Overview](#overview)
- [Features](#features)
- [Architecture](#architecture)
- [Components](#components)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Dashboard Templates](#dashboard-templates)
- [Security Considerations](#security-considerations)
- [Maintenance](#maintenance)
- [Troubleshooting](#troubleshooting)
- [Performance Tuning](#performance-tuning)
- [Contributing](#contributing)
- [License](#license)
## Overview

This monitoring stack provides a comprehensive solution for monitoring server performance, application health, and system metrics. It is designed to be easy to deploy and configure, making it well suited to both development environments and production systems.
The stack leverages Docker containers to provide a consistent, reproducible environment that can be deployed on any infrastructure that supports Docker. All components are preconfigured to work together out of the box, minimizing setup time and complexity.
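For orientation, this is roughly the shape of the `docker-compose.yml` the stack is built around. This is a trimmed sketch using the service names and ports referenced throughout this README; the file in the repository is more complete (it also defines cAdvisor and persistent volumes):

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus   # prometheus.yml and rule files
    networks:
      - monitoring-network

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"                    # host port 3001 -> container port 3000
    networks:
      - monitoring-network

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    networks:
      - monitoring-network

networks:
  monitoring-network:
```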
## Features

- **Real-time metrics collection** - Gather CPU, memory, disk, and network statistics from hosts and containers
- **Customizable dashboards** - Create visual representations of your system's performance with Grafana
- **Alerting system** - Get notified when metrics exceed defined thresholds via email, Slack, or other channels
- **Long-term storage** - Retain historical metrics for trend analysis and capacity planning
- **Container monitoring** - Track container resource usage and health metrics
- **Low resource footprint** - Optimized for minimal impact on monitored systems
- **API endpoints** - Integrate with other tools and services through the Prometheus HTTP API (see the example after this list)
- **Extensible architecture** - Add custom exporters for additional metrics collection
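The Prometheus HTTP API mentioned above can be queried directly. For example, this instant query returns the current one-minute load average for every scraped host (the endpoint and metric name are standard Prometheus/node_exporter; verify the port against your deployment):

```bash
# Instant query against the Prometheus API; returns JSON
curl -s 'http://localhost:9090/api/v1/query?query=node_load1'
```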
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Docker Network                          │
│                                                                 │
│  ┌─────────────┐       ┌───────────────┐     ┌──────────────┐   │
│  │             │       │               │     │              │   │
│  │  Prometheus │◄──────┤ Node Exporter │     │   Grafana    │   │
│  │  (Metrics   │       │ (Host Metrics)│     │ (Dashboards) │   │
│  │  Storage)   │       │               │     │              │   │
│  │             │       └───────────────┘     │              │   │
│  │             │                             │              │   │
│  │             │       ┌───────────────┐     │              │   │
│  │             │◄──────┤   cAdvisor    │     │              │   │
│  │             │       │  (Container   │     │              │   │
│  │             │       │   Metrics)    │     │              │   │
│  │             │       └───────────────┘     │              │   │
│  │             │◄────────────────────────────┤              │   │
│  │             │            Query            │              │   │
│  └─────────────┘                             └──────────────┘   │
│    Port 9090                                    Port 3001       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
This architecture provides:

- **Centralized metrics collection** - Prometheus scrapes metrics from all exporters (see the scrape configuration sketch after this list)
- **Separation of concerns** - Each component has a single, well-defined responsibility
- **Scalability** - Add more exporters without changing the core architecture
- **Resilience** - Components can be restarted independently without data loss
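In Prometheus terms, the arrows pointing into Prometheus correspond to scrape jobs. A minimal `prometheus/prometheus.yml` expressing this diagram might look like the following (job names are illustrative; the configuration shipped with the repository may differ):

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'            # host metrics from Node Exporter
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'        # container metrics from cAdvisor
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'prometheus'      # Prometheus monitoring itself
    static_configs:
      - targets: ['localhost:9090']
```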
## Components

The monitoring stack consists of the following components:
### Prometheus

Time series database for metrics storage and retrieval. Prometheus acts as the central component, collecting and storing metrics from various sources (exporters). It provides a powerful query language (PromQL) for data analysis.
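As a taste of PromQL, the expression below computes the percentage of memory in use on each monitored host from standard node_exporter gauges; paste it into the Prometheus web UI at http://localhost:9090/graph to try it:

```promql
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```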
### Grafana

Visualization and dashboarding tool that connects to Prometheus to display metrics in customizable dashboards. Grafana provides advanced visualization options, alerting capabilities, and user management.
### Node Exporter

System metrics collection agent that exports hardware and OS metrics from the host system. Node Exporter provides detailed information about CPU, memory, disk, network, and other system resources.
### cAdvisor

Container metrics collection agent that exports resource usage and performance data from running containers. cAdvisor provides visibility into container CPU, memory, network, and disk usage.
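Both exporters publish their data over HTTP at a `/metrics` endpoint in the Prometheus text exposition format, which is what Prometheus scrapes. A fragment looks like this (values are illustrative):

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="user"} 2345.67
```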
## Quick Start

To deploy the monitoring stack, you'll need:
- Docker Engine (version 20.10.0 or later)
- Docker Compose (version 2.0.0 or later)
- At least 1GB of RAM available for the stack
- Ports 3001 (Grafana), 9090 (Prometheus), and 8080 (cAdvisor) available
1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/monitoring-stack.git
   cd monitoring-stack
   ```

2. (Optional) Configure environment variables:

   ```bash
   # Copy the example environment file
   cp .env.example .env
   # Edit the environment variables as needed
   nano .env
   ```

3. Start the monitoring stack:

   ```bash
   docker-compose up -d
   ```

4. Verify that all services are running:

   ```bash
   docker-compose ps
   ```

   You should see output similar to:

   ```
   NAME            COMMAND                  SERVICE         STATUS    PORTS
   cadvisor        "/usr/bin/cadvisor -…"   cadvisor        running   0.0.0.0:8080->8080/tcp
   grafana         "/run.sh"                grafana         running   0.0.0.0:3001->3000/tcp
   node-exporter   "/bin/node_exporter"     node-exporter   running   0.0.0.0:9100->9100/tcp
   prometheus      "/bin/prometheus --c…"   prometheus      running   0.0.0.0:9090->9090/tcp
   ```
After deployment, you can access the following interfaces:
- Grafana: http://localhost:3001 (default credentials: admin/yourpassword)
- Prometheus: http://localhost:9090
- cAdvisor: http://localhost:8080
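A quick way to confirm the core services are responding from the command line (both endpoints exist in stock Prometheus and Grafana):

```bash
# Prometheus liveness endpoint
curl -s http://localhost:9090/-/healthy
# Grafana health API
curl -s http://localhost:3001/api/health
```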
To monitor additional servers:
1. Install Node Exporter on each target server:

   ```bash
   docker run -d --restart=unless-stopped --name=node-exporter \
     -p 9100:9100 \
     -v /proc:/host/proc:ro \
     -v /sys:/host/sys:ro \
     -v /:/rootfs:ro \
     --net="host" \
     prom/node-exporter:latest \
     --path.procfs=/host/proc \
     --path.sysfs=/host/sys \
     --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($|/)"
   ```

   (With `--net="host"` the `-p 9100:9100` mapping is redundant, since the container shares the host's network stack.)

2. Add the server to `prometheus/prometheus.yml`:

   ```yaml
   # Under scrape_configs:
     - job_name: 'remote-node'
       static_configs:
         - targets: ['your-server-ip:9100']
           labels:
             instance: 'server-name'
   ```

3. Reload the Prometheus configuration (this endpoint only works if Prometheus was started with the `--web.enable-lifecycle` flag):

   ```bash
   curl -X POST http://localhost:9090/-/reload
   ```
To configure alerting:
1. Create or edit alert rules in `prometheus/alert.rules.yml`:

   ```yaml
   groups:
     - name: example
       rules:
         - alert: HighCPULoad
           expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
           for: 5m
           labels:
             severity: warning
           annotations:
             summary: "High CPU load (instance {{ $labels.instance }})"
             description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
   ```

2. Update your Prometheus configuration to include the alert rules:

   ```yaml
   # In prometheus.yml
   rule_files:
     - 'alert.rules.yml'
   ```

3. Configure AlertManager in `alertmanager/config.yml`:

   ```yaml
   route:
     group_by: ['alertname']
     group_wait: 30s
     group_interval: 5m
     repeat_interval: 1h
     receiver: 'email-notifications'

   receivers:
     - name: 'email-notifications'
       email_configs:
         - to: '[email protected]'
           from: '[email protected]'
           smarthost: 'smtp.example.com:587'
           auth_username: 'smtp-user'
           auth_password: 'smtp-password'
   ```

4. Add AlertManager to your docker-compose.yml and restart the stack:

   ```yaml
     alertmanager:
       image: prom/alertmanager:latest
       container_name: alertmanager
       restart: unless-stopped
       ports:
         - "9093:9093"
       volumes:
         - ./alertmanager:/etc/alertmanager
       command:
         - '--config.file=/etc/alertmanager/config.yml'
         - '--storage.path=/alertmanager'
       networks:
         - monitoring-network
   ```
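5. Point Prometheus at AlertManager. This step is easy to miss but required; without it, fired alerts never leave Prometheus (the service name and port match the compose snippet above):

   ```yaml
   # In prometheus.yml
   alerting:
     alertmanagers:
       - static_configs:
           - targets: ['alertmanager:9093']
   ```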
## Configuration

The stack uses environment variables for configuration:

- `GF_ADMIN_USER`: Grafana admin username (default: admin)
- `GF_ADMIN_PASSWORD`: Grafana admin password (default: yourpassword)

Additional Grafana configuration options can be added as environment variables with the `GF_` prefix.
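For example, Grafana's `GF_*` variables can be set directly on the grafana service in docker-compose.yml. The two below are real Grafana settings; the excerpt itself is an illustrative sketch:

```yaml
  grafana:
    environment:
      - GF_SERVER_ROOT_URL=https://monitoring.example.com   # public URL when behind a proxy
      - GF_USERS_ALLOW_SIGN_UP=false                        # disable self-service sign-up
```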
To monitor specific applications or services, add their respective exporters to the docker-compose.yml file:

```yaml
  mysql-exporter:
    image: prom/mysqld-exporter:latest
    container_name: mysql-exporter
    restart: unless-stopped
    ports:
      - "9104:9104"
    environment:
      - DATA_SOURCE_NAME=user:password@(mysql:3306)/
    networks:
      - monitoring-network
```
Then add the exporter to your Prometheus configuration:

```yaml
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
```
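After reloading Prometheus, confirm the new target is healthy either on the Targets page (http://localhost:9090/targets) or via the API:

```bash
# Lists all scrape targets and their health
curl -s http://localhost:9090/api/v1/targets
```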
## Dashboard Templates

The stack comes with pre-configured dashboards for:
- System Overview: CPU, memory, disk, network metrics for hosts
- Container Performance: Resource usage metrics for Docker containers
- Node Exporter Full: Comprehensive host metrics dashboard
- Prometheus Stats: Monitoring of the monitoring system itself
To import additional dashboards:

1. Access Grafana at http://localhost:3001
2. Go to "Dashboards" > "Import"
3. Enter the dashboard ID or upload the JSON file
4. Select the Prometheus data source
5. Click "Import"
Popular dashboard IDs:
- Node Exporter Full: 1860
- Docker & System Monitoring: 893
- Prometheus 2.0 Stats: 10000
## Security Considerations

- Change default credentials for all services
- Consider setting up OAuth or LDAP authentication for Grafana
- Implement API authentication for Prometheus
- Use a reverse proxy (like NGINX) for TLS termination
- Restrict access to management ports
- Configure firewalls to limit access to monitoring services
- Implement regular backups of Prometheus and Grafana data
- Consider encrypting sensitive data in configuration files
- Scrub sensitive information from metrics and logs
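As a concrete example of the reverse-proxy recommendation above, a minimal NGINX configuration that redirects HTTP to HTTPS and proxies Grafana could look like the following (the hostname and certificate paths are placeholders):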
```nginx
server {
    listen 80;
    server_name monitoring.example.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name monitoring.example.com;

    ssl_certificate     /etc/letsencrypt/live/monitoring.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/monitoring.example.com/privkey.pem;

    location / {
        proxy_pass http://localhost:3001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```
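The same proxy can also gate Prometheus, which has no authentication of its own in this setup. A sketch using HTTP basic auth on a hypothetical dedicated hostname; the htpasswd file is an assumption you must create first (e.g. `htpasswd -c /etc/nginx/.htpasswd admin`):

```nginx
server {
    listen 443 ssl;
    server_name prometheus.example.com;   # hypothetical hostname

    ssl_certificate     /etc/letsencrypt/live/monitoring.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/monitoring.example.com/privkey.pem;

    auth_basic           "Prometheus";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://localhost:9090;
        proxy_set_header Host $host;
    }
}
```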
## Maintenance

To back up the monitoring stack:

```bash
# Stop the stack
docker-compose stop

# Back up data volumes
tar -czvf prometheus-data-backup.tar.gz /path/to/prometheus_data
tar -czvf grafana-data-backup.tar.gz /path/to/grafana_data

# Restart the stack
docker-compose start
```
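To run backups on a schedule, a cron entry along these lines could work. The installation path and backup destination are assumptions to adjust for your layout, and note that `%` must be escaped as `\%` in crontab entries:

```bash
# Nightly at 02:00: stop the stack, archive Prometheus data, start it again
0 2 * * * cd /opt/monitoring-stack && docker-compose stop && tar -czf /backups/prometheus-$(date +\%F).tar.gz /path/to/prometheus_data && docker-compose start
```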
To restore from backup:

```bash
# Stop the stack
docker-compose stop

# Restore data volumes
tar -xzvf prometheus-data-backup.tar.gz -C /path/to/restore
tar -xzvf grafana-data-backup.tar.gz -C /path/to/restore

# Update volume paths in docker-compose.yml if necessary

# Restart the stack
docker-compose start
```
To update the monitoring stack components:

```bash
# Pull the latest images
docker-compose pull

# Restart with the new images
docker-compose up -d
```
For major version upgrades, check the release notes for each component to ensure compatibility.
## Troubleshooting

### Grafana shows no data

**Symptoms:** Grafana dashboards show "No data" or connection errors

**Solution:**

1. Check if Prometheus is running:

   ```bash
   docker-compose ps prometheus
   ```

2. Verify network connectivity:

   ```bash
   docker exec -it grafana ping prometheus
   ```

3. Check the Grafana data source configuration:
   - URL should be `http://prometheus:9090`
   - Access should be set to `Server (default)`
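If targets and connectivity look fine, it is also worth validating the Prometheus configuration itself; `promtool` ships inside the official Prometheus image, and the path below is the image's default config location:

```bash
docker exec -it prometheus promtool check config /etc/prometheus/prometheus.yml
```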
### Prometheus uses too much disk space

**Symptoms:** Disk space fills up rapidly on the host system

**Solution:**

1. Check Prometheus storage usage:

   ```bash
   docker exec -it prometheus du -sh /prometheus
   ```

2. Shorten the retention period. Retention is set with a startup flag rather than in prometheus.yml, so adjust the Prometheus command in docker-compose.yml:

   ```yaml
     prometheus:
       command:
         - '--config.file=/etc/prometheus/prometheus.yml'
         - '--storage.tsdb.retention.time=15d'
   ```
3. Consider a downsampling strategy for older metrics (see the recording-rule sketch below)
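Prometheus has no built-in downsampling (tools like Thanos add it), but recording rules can precompute coarse aggregates so dashboards query one cheap series instead of many raw ones. A sketch, using a hypothetical rule name:

```yaml
groups:
  - name: downsampling
    rules:
      # Store the 5-minute average CPU utilisation per instance under a new series name
      - record: instance:node_cpu_utilisation:avg5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```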
### High resource usage

**Symptoms:** High CPU or memory usage on the host system

**Solution:**

1. Reduce the scrape frequency in prometheus.yml:

   ```yaml
   global:
     scrape_interval: 30s  # Default is 15s
   ```

2. Limit container resources in docker-compose.yml:

   ```yaml
     prometheus:
       # ...
       deploy:
         resources:
           limits:
             cpus: '0.5'
             memory: 1G
   ```
### Viewing logs

View logs for troubleshooting:

```bash
# View logs for all services
docker-compose logs

# View logs for a specific service
docker-compose logs -f prometheus

# View logs with timestamps
docker-compose logs --timestamps
```
## Performance Tuning

For larger deployments, consider the following optimizations:

1. Adjust scrape intervals based on metric importance:

   ```yaml
   scrape_configs:
     - job_name: 'critical-systems'
       scrape_interval: 10s
       # ...
     - job_name: 'non-critical-systems'
       scrape_interval: 60s
       # ...
   ```

2. Use the `sample_limit` parameter to prevent excessive cardinality. It is set per scrape job, so it belongs inside a scrape_config entry:

   ```yaml
   scrape_configs:
     - job_name: 'node'
       sample_limit: 10000
   ```

3. Implement federation for large-scale deployments:

   ```yaml
   scrape_configs:
     - job_name: 'prometheus-federation'
       honor_labels: true
       metrics_path: '/federate'
       params:
         'match[]':
           - '{job="node"}'
       static_configs:
         - targets:
             - 'prometheus-secondary-1:9090'
             - 'prometheus-secondary-2:9090'
   ```
For better Grafana performance:
- Limit the time range of dashboards (default view should be last 6-12 hours)
- Use appropriate aggregation functions (rate, increase, avg_over_time)
- Set reasonable refresh intervals (30s or more)
- Consider breaking complex dashboards into multiple simpler dashboards
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch: `git checkout -b feature/amazing-feature`
3. Commit your changes: `git commit -m 'Add some amazing feature'`
4. Push to the branch: `git push origin feature/amazing-feature`
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.