diff --git a/README.md b/README.md index 0135848cc7..eb70ad93b6 100644 --- a/README.md +++ b/README.md @@ -54,33 +54,44 @@ accurate energy consumption monitoring for cloud-native workloads. ## 🚀 Getting Started -> **📖 For comprehensive installation instructions, troubleshooting, and advanced deployment options, see our [Installation Guide](docs/user/installation.md)** +**New to Kepler?** Follow our [**📖 Getting Started Guide**](docs/user/getting-started.md) for quick Kubernetes cluster deployment, or see our [**🧑‍💻 Developer Getting Started Guide**](docs/developer/getting-started.md) for local development with dashboards. ### ⚡ Quick Start Choose your preferred method: ```bash -# 💻 Local Development -make build && sudo ./bin/kepler +# 🎯 Deploy to Kubernetes Cluster (Recommended for users) +helm install kepler manifests/helm/kepler/ --namespace kepler --create-namespace -# ✨ Docker Compose (with Prometheus & Grafana) -cd compose/dev && docker-compose up -d +# 🧑‍💻 Local Development with Dashboards +cd compose/dev && docker compose up -d +# Access Grafana: http://localhost:23000 (admin/admin) -# 🐳 Kubernetes -helm install kepler manifests/helm/kepler/ --namespace kepler --create-namespace +# 🏗️ Local Kubernetes Development +make cluster-up && make deploy + +# 💻 Build from Source +make build && sudo ./bin/kepler ``` +> **📖 For detailed installation instructions, troubleshooting, and advanced deployment options, see our [Installation Guide](docs/user/installation.md)** + ## 📖 Documentation ### User Documentation -- **[Installation Guide](docs/user/installation.md)** - Detailed installation instructions for all deployment methods -- **[Configuration Guide](docs/user/configuration.md)** - Configuration options and examples +📋 **[User Guide Index](docs/user/README.md)** - Complete navigation hub for all user documentation + +- **[Getting Started Guide](docs/user/getting-started.md)** - Quick Kubernetes cluster deployment +- 
**[Installation Guide](docs/user/installation.md)** - Production deployment methods and enterprise integration +- **[Configuration Guide](docs/user/configuration.md)** - Configuration options and customization examples +- **[Troubleshooting Guide](docs/user/troubleshooting.md)** - Comprehensive problem-solving and debugging guide - **[Metrics Documentation](docs/user/metrics.md)** - Available metrics and their descriptions ### Developer Documentation +- **[Developer Getting Started Guide](docs/developer/getting-started.md)** - Local development setup with Docker Compose, dashboards, and building from source - **[Architecture Documentation](docs/developer/design/architecture/)** - Complete architectural documentation including design principles, system components, data flow, concurrency model, and deployment patterns - **[Power Attribution Guide](docs/developer/power-attribution-guide.md)** - How Kepler measures and attributes power consumption - **[Developer Documentation](docs/developer/)** - Contributing guidelines and development workflow diff --git a/docs/index.md b/docs/README.md similarity index 85% rename from docs/index.md rename to docs/README.md index 616b667b2b..aeeae20d59 100644 --- a/docs/index.md +++ b/docs/README.md @@ -8,6 +8,7 @@ Welcome to the Kepler documentation. Kepler is a Prometheus exporter that measur Documentation for users deploying and operating Kepler: +- [Getting Started Guide](user/getting-started.md) - Quick Kubernetes cluster deployment - [Installation Guide](user/installation.md) - How to install and deploy Kepler - [Configuration Guide](user/configuration.md) - Configuring Kepler for your environment - [Metrics Reference](user/metrics.md) - Available Prometheus metrics exported by Kepler @@ -22,7 +23,7 @@ Documentation for developers contributing to Kepler: ## Quick Start -For a quick start, see the [Installation Guide](user/installation.md) and the main project README. 
+For a quick start, see the [Getting Started Guide](user/getting-started.md) and the main project README. ## Contributing diff --git a/docs/developer/index.md b/docs/developer/README.md similarity index 93% rename from docs/developer/index.md rename to docs/developer/README.md index b1a7f9b392..12c9cb010a 100644 --- a/docs/developer/index.md +++ b/docs/developer/README.md @@ -24,6 +24,7 @@ This section contains documentation for developers working on Kepler. ## Development Workflow +- [Developer Getting Started](getting-started.md) - Building from source and setting up development environments - [Pre-commit Setup](pre-commit.md) - Setting up pre-commit hooks for code quality ## Release Management diff --git a/docs/developer/design/architecture/index.md b/docs/developer/design/architecture/README.md similarity index 98% rename from docs/developer/design/architecture/index.md rename to docs/developer/design/architecture/README.md index dc347166d3..a4760f3a5e 100644 --- a/docs/developer/design/architecture/index.md +++ b/docs/developer/design/architecture/README.md @@ -84,7 +84,7 @@ For specific implementation work: ## Related Documentation - **[Power Attribution Guide](../../power-attribution-guide.md)**: Detailed explanation of power calculation methodology -- **[Development Setup](../../index.md)**: Setting up the development environment +- **[Development Setup](../../README.md)**: Setting up the development environment - **[User Configuration Guide](../../../user/configuration.md)**: End-user configuration options --- diff --git a/docs/developer/proposal/index.md b/docs/developer/proposal/README.md similarity index 100% rename from docs/developer/proposal/index.md rename to docs/developer/proposal/README.md diff --git a/docs/user/README.md b/docs/user/README.md new file mode 100644 index 0000000000..39136a74e7 --- /dev/null +++ b/docs/user/README.md @@ -0,0 +1,198 @@ +# Kepler User Documentation + +Welcome to Kepler user documentation! 
This directory contains everything you need to deploy, configure, and monitor energy consumption with Kepler. + +## 🗺️ Documentation Overview + +| Guide | Purpose | Target Audience | Time Required | +|-------|---------|-----------------|---------------| +| **[Getting Started](getting-started.md)** | Quick Kubernetes deployment | New users, cluster operators | 5-10 minutes | +| **[Installation](installation.md)** | Production deployment | DevOps, SRE, platform teams | 30-60 minutes | +| **[Configuration](configuration.md)** | Customize Kepler settings | Advanced users, ops teams | As needed | +| **[Metrics](metrics.md)** | Understand available metrics | Monitoring teams, developers | Reference | +| **[Troubleshooting](troubleshooting.md)** | Diagnose and fix issues | All users | As needed | + +## 🚀 Quick Start + +**New to Kepler?** Start here: + +1. **[Getting Started Guide](getting-started.md)** - Deploy Kepler to your Kubernetes cluster in under 10 minutes +2. **[Access your metrics](getting-started.md#step-4-access-metrics)** - Verify energy data collection +3. **Choose your next step** based on your needs: + - Production deployment → [Installation Guide](installation.md) + - Customize settings → [Configuration Guide](configuration.md) + - Having issues? → [Troubleshooting Guide](troubleshooting.md) + +**Want to try Kepler locally first?** See our [**🧑‍💻 Developer Getting Started Guide**](../developer/getting-started.md) for Docker Compose setup with pre-configured dashboards.
+ +## 📋 Choose Your Path + +### 🎯 I want to deploy Kepler to my cluster + +**→ [Getting Started Guide](getting-started.md)** + +- Quick Helm installation (5 minutes) +- Deploy to existing Kubernetes cluster +- Verify energy metrics collection +- Production-ready deployment path + +### 🏗️ I need to deploy Kepler in production + +**→ [Installation Guide](installation.md)** + +- Helm installation (recommended) +- kubectl/kustomize deployment +- Enterprise integration (RBAC, network policies) +- Multi-cluster and high availability setup + +### ⚙️ I need to customize Kepler configuration + +**→ [Configuration Guide](configuration.md)** + +- All configuration options explained +- Command-line flags vs config file +- Monitoring, logging, and export settings +- Kubernetes integration options +- Development features (fake CPU meter) + +### 📊 I want to understand Kepler metrics + +**→ [Metrics Reference](metrics.md)** + +- Complete metrics catalog +- Node, container, process, VM, and pod level metrics +- Metric types and labels +- Power vs energy measurements +- Integration with Prometheus + +### 🔍 I'm having problems with Kepler + +**→ [Troubleshooting Guide](troubleshooting.md)** + +- Quick health checks +- Docker Compose issues +- Kubernetes deployment problems +- Configuration and metrics issues +- Advanced debugging techniques + +## 🛤️ Learning Progression + +### Beginner Path + +1. **[Getting Started](getting-started.md)** - Deploy to Kubernetes cluster +2. **[Understanding Metrics](metrics.md)** - Learn what you're measuring +3. **[Basic Configuration](configuration.md#configuration-methods)** - Simple customization + +### Local Development Path + +1. **[Developer Getting Started Guide](../developer/getting-started.md)** - Docker Compose with dashboards +2. **[Getting Started](getting-started.md)** - Deploy to cluster when ready +3. **[Configuration Guide](configuration.md)** - Customize for your needs + +### Intermediate Path + +1. 
**[Installation Guide](installation.md)** - Production deployment +2. **[Advanced Configuration](configuration.md#configuration-options-in-detail)** - Fine-tune settings +3. **[Troubleshooting](troubleshooting.md)** - Handle common issues + +### Advanced Path + +1. **[Enterprise Integration](installation.md#enterprise-integration)** - RBAC, security +2. **[Performance Tuning](configuration.md#monitor-configuration)** - Optimize for scale +3. **[Advanced Debugging](troubleshooting.md#advanced-debugging)** - Deep troubleshooting + +## 🎯 Common Use Cases + +### "I want to see energy monitoring in action" + +**Solution:** [Developer Getting Started - Docker Compose](../developer/getting-started.md#docker-compose-development-setup) + +- Complete monitoring stack with dashboards +- Local development environment +- 5-minute setup + +### "I need Kepler in my Kubernetes cluster" + +**Solution:** [Getting Started - Helm Installation](getting-started.md#quick-installation-with-helm) + +- Quick cluster deployment (5 minutes) +- Then [Installation Guide](installation.md#1-helm-installation-recommended) for production config +- Integrates with existing monitoring + +### "Power metrics are missing or incorrect" + +**Solution:** [Troubleshooting - Metrics Issues](troubleshooting.md#metrics-and-monitoring-issues) + +- Hardware support checks +- Fake CPU meter for testing +- Attribution troubleshooting + +### "I want to customize how Kepler works" + +**Solution:** [Configuration Guide](configuration.md) + +- All configuration options +- Environment-specific settings +- Development vs production configs + +### "Kepler isn't working as expected" + +**Solution:** [Troubleshooting Guide](troubleshooting.md) + +- Quick diagnostics +- Platform-specific issues +- Step-by-step problem solving + +## 📚 Related Documentation + +### Developer Resources + +- **[Developer Documentation](../developer/)** - Contributing, development setup +- **[Architecture 
Guide](../developer/design/architecture/)** - How Kepler works internally +- **[API Documentation](../developer/)** - Technical implementation details + +### Project Resources + +- **[Main README](../../README.md)** - Project overview +- **[Contributing Guide](../../CONTRIBUTING.md)** - How to contribute +- **[Governance](../../GOVERNANCE.md)** - Project governance + +## 🆘 Getting Help + +### Self-Service Resources + +1. **[Troubleshooting Guide](troubleshooting.md)** - Most common issues covered +2. **[Configuration Reference](configuration.md)** - All settings explained +3. **[Metrics Documentation](metrics.md)** - Understanding the data + +### Community Support + +- **🐛 GitHub Issues:** [Report bugs or request features](https://github.com/sustainable-computing-io/kepler/issues) +- **💬 GitHub Discussions:** [Ask questions and share experiences](https://github.com/sustainable-computing-io/kepler/discussions) +- **🗨️ CNCF Slack:** [Real-time community chat](https://cloud-native.slack.com/archives/C06HYDN4A01) + +### Before Asking for Help + +1. Check the [Troubleshooting Guide](troubleshooting.md) for your issue +2. Search [existing issues](https://github.com/sustainable-computing-io/kepler/issues) +3. Gather logs and configuration (see [troubleshooting checklist](troubleshooting.md#before-asking-for-help)) + +## 🔄 Documentation Updates + +This documentation is actively maintained. If you find: + +- **Outdated information** - Please [open an issue](https://github.com/sustainable-computing-io/kepler/issues/new) +- **Missing content** - Contributions welcome via [pull request](https://github.com/sustainable-computing-io/kepler/pulls) +- **Unclear instructions** - Let us know in [discussions](https://github.com/sustainable-computing-io/kepler/discussions) + +## 📈 What's Next?
+ +After mastering the user guides: + +- **[Join the community](https://github.com/sustainable-computing-io/kepler/discussions)** - Share your experiences +- **[Contribute improvements](../../CONTRIBUTING.md)** - Help make Kepler better +- **[Try advanced features](../developer/)** - Explore cutting-edge capabilities + +--- + +**Happy energy monitoring with Kepler!** ⚡ diff --git a/docs/user/configuration.md b/docs/user/configuration.md index 1c803e33f0..785a24c00f 100644 --- a/docs/user/configuration.md +++ b/docs/user/configuration.md @@ -299,6 +299,46 @@ dev: - `enabled`: Set to `true` to enable fake CPU meter - `zones`: Specific zones to enable, empty enables all +## 🔧 Fake CPU Meter Configuration + +The fake CPU meter is a development/testing feature that generates synthetic power data +when real hardware power measurements aren't available. + +### When to Use Fake CPU Meter + +- **Virtual Machines** - VMs typically don't expose hardware power sensors +- **Non-Intel Systems** - AMD or ARM systems without RAPL support +- **Cloud Instances** - Most cloud VMs don't have access to power sensors +- **Development/Testing** - When you want to see Kepler working without hardware +- **CI/CD Environments** - Automated testing of Kepler functionality + +### Configuration Options + +```yaml +dev: + fake-cpu-meter: + enabled: true # Enable synthetic power measurements + zones: [] # Zones to simulate (empty = all default zones) +``` + +### What Fake CPU Meter Provides + +The fake CPU meter generates realistic synthetic data: + +- **Simulated CPU power usage** - Based on actual CPU utilization patterns +- **Consistent metric structure** - All the same metrics as real hardware +- **Realistic value ranges** - Power values similar to actual hardware (10-200W typical) +- **Proper workload attribution** - Correctly attributes power to containers/processes +- **Time-based variation** - Power values change realistically with load + +### Limitations and Considerations + +⚠️ 
**Important Limitations:** + +- **Not real measurements** - Data is a synthetic approximation, not actual power consumption +- **Development only** - Never use for production monitoring, billing, or real optimization decisions +- **No hardware insights** - Won't help identify actual hardware-specific power characteristics + ## 📖 Further Reading For more details see the [config file](../../hack/config.yaml) diff --git a/docs/user/getting-started.md b/docs/user/getting-started.md new file mode 100644 index 0000000000..3c9fc56da9 --- /dev/null +++ b/docs/user/getting-started.md @@ -0,0 +1,168 @@ +# Getting Started with Kepler + +Get up and running with Kepler in your Kubernetes cluster in minutes! This +guide walks you through deploying Kepler for energy consumption monitoring. + +## What is Kepler? + +Kepler (Kubernetes-based Efficient Power Level Exporter) measures energy +consumption at the container, pod, and node level. By the end of this guide, +you'll have Kepler running in your Kubernetes cluster and collecting energy metrics.
+ +## What You'll Accomplish + +- ✅ Deploy Kepler to your Kubernetes cluster using Helm +- ✅ Verify Kepler is collecting energy consumption metrics +- ✅ Access metrics via port-forward or integrate with your monitoring stack +- ✅ Understand next steps for dashboards and production configuration + +## Prerequisites + +- **Kubernetes cluster** (v1.20+) with kubectl access +- **Helm 3.0+** installed +- **Admin permissions** for cluster installation +- **Intel RAPL support** on cluster nodes (or see [fake CPU meter](#running-without-hardware-support) for testing) + +## Quick Installation with Helm + +### Step 1: Clone Repository + +```bash +# Clone the repository +git clone https://github.com/sustainable-computing-io/kepler.git +cd kepler +``` + +### Step 2: Install Kepler + +```bash +# Install Kepler with Helm +helm install kepler manifests/helm/kepler/ \ + --namespace kepler \ + --create-namespace +``` + +This deploys Kepler as a DaemonSet to all nodes in your cluster. + +### Step 3: Verify Installation + +```bash +# Check Kepler pods are running +kubectl get pods -n kepler + +# Should show kepler-* pods in Running state on each node +kubectl get daemonset -n kepler + +# Check for any issues +kubectl describe pods -n kepler +``` + +### Step 4: Access Metrics + +```bash +# Port forward to access Kepler metrics +kubectl port-forward -n kepler svc/kepler 28282:28282 & + +# View available metrics +curl http://localhost:28282/metrics | grep kepler_node_cpu_watts + +# Check that metrics are being collected +curl -s http://localhost:28282/metrics | grep -c "kepler_" +``` + +You should see metrics like: + +- `kepler_node_cpu_watts` - CPU power consumption at node level +- `kepler_container_cpu_watts` - Per-container power usage +- `kepler_pod_cpu_watts` - Per-pod power usage + +--- + +## Understanding Your Metrics + +Once Kepler is running, you'll see these key metric types: + +### 🔋 Power Metrics (Watts) + +- **kepler_node_cpu_watts** - Instantaneous CPU power 
consumption +- **kepler_container_cpu_watts** - Power attributed to containers +- **kepler_pod_cpu_watts** - Power attributed to pods + +### ⚡ Energy Metrics (Joules) + +- **kepler_node_cpu_joules_total** - Cumulative energy consumed +- **kepler_container_cpu_joules_total** - Energy per container over time + +### 💡 Key Concepts + +- **Watts (W)** - Instantaneous power consumption (like a speedometer) +- **Joules (J)** - Total energy consumed over time (like an odometer) +- **Attribution** - How Kepler estimates which workload used energy + +--- + +## Running Without Hardware Support + +⚠️ **WARNING: For Development/Testing/Experimental Purposes Only** + +If you're running on a cluster without Intel RAPL support (VMs, non-Intel hardware, +cloud instances), you can enable the fake CPU meter to generate synthetic +power data for experimentation. + +**Important:** This feature is experimental and generates simulated data only. + +Never use it for production monitoring or real optimization decisions. + +For complete fake CPU meter setup instructions, see +the **[Fake CPU Meter Configuration section](configuration.md#fake-cpu-meter-configuration)** in our Configuration Guide. + +--- + +## Local Development Setup + +🧑‍💻 **Want to run Kepler locally for development?** + +This guide focuses on deploying Kepler to Kubernetes clusters. For local +development with Docker Compose, complete monitoring stacks, and development +workflows, see our comprehensive **[Developer Getting Started Guide](../developer/getting-started.md)**. + +The developer guide includes: + +- **Docker Compose setup** with Prometheus & Grafana dashboards +- **make cluster-up** for local Kubernetes development +- **Building from source** and development workflows +- **Local testing** with fake CPU meter + +--- + +## Next Steps + +🎉 **Congratulations!** You now have Kepler running and collecting energy metrics. + +### Immediate Next Steps + +1. 
**📊 Set up dashboards** - Integrate with Grafana to visualize your energy data +2. **📈 Configure monitoring** - Connect to your existing Prometheus setup +3. **🔧 Verify metrics** - Ensure data collection is working properly + +### Ready for More? + +1. **🏗️ Production Deployment** - Ready for production? See our [Installation Guide](installation.md) for advanced Helm configurations, enterprise integration, and production best practices +2. **⚙️ Advanced Configuration** - Want to customize Kepler? Check the [Configuration Guide](configuration.md) for all configuration options +3. **📊 Metrics Deep Dive** - Understand all available metrics in the [Metrics Reference](metrics.md) +4. **🔍 Having Problems?** - Check our [Troubleshooting Guide](troubleshooting.md) for common issues and solutions + +### Join the Community + +- **🐛 Issues:** [GitHub Issues](https://github.com/sustainable-computing-io/kepler/issues) +- **💬 Discussions:** [GitHub Discussions](https://github.com/sustainable-computing-io/kepler/discussions) +- **🗨️ Slack:** [#kepler in CNCF Slack](https://cloud-native.slack.com/archives/C06HYDN4A01) + +### Want to Contribute? + +- **🔧 Developer docs:** [docs/developer/](../developer/) +- **📋 Contributing guide:** [CONTRIBUTING.md](../../CONTRIBUTING.md) + +--- + +⚡ Happy energy monitoring with Kepler! ⚡ diff --git a/docs/user/installation.md b/docs/user/installation.md index 5e70b4c166..2fe11ae53c 100644 --- a/docs/user/installation.md +++ b/docs/user/installation.md @@ -1,21 +1,20 @@ # Kepler Installation Guide -This guide covers different methods to install and run Kepler (Kubernetes-based Efficient Power Level Exporter) for monitoring energy consumption metrics. +This guide covers production-ready installation methods for deploying Kepler +to Kubernetes clusters. For local development and testing setups, +see our [**Developer Getting Started Guide**](../developer/getting-started.md).
## Prerequisites -- **For Local Installation**: Go 1.21+ and sudo access for hardware sensor access -- **For Kubernetes**: Kubernetes cluster (v1.20+) with kubectl configured -- **For Helm**: Helm 3.0+ installed +- **Kubernetes cluster** (v1.20+) with kubectl configured +- **Admin access** for creating namespaces and RBAC resources +- **Helm 3.0+** (recommended) or kubectl with kustomize support -## Installation Methods +## Deployment Methods -### 1. Helm Chart Installation (Recommended for Kubernetes) +### 1. Helm Installation (Recommended) -#### Prerequisites for Helm - -- Helm 3.0+ -- Kubernetes cluster with kubectl configured +Helm provides the most flexible and user-friendly way to deploy Kepler to production Kubernetes clusters. #### Install from Source @@ -103,79 +102,15 @@ helm upgrade kepler manifests/helm/kepler/ -n kepler helm uninstall kepler -n kepler ``` -### 2. Local Installation +### 2. kubectl/Kustomize Deployment -#### Building from Source +For environments requiring manual control or GitOps integration: ```bash -# Clone the repository +# Clone the repository for manifest access git clone https://github.com/sustainable-computing-io/kepler.git cd kepler -# Build Kepler -make build - -# Run Kepler (requires sudo for hardware access) -sudo ./bin/kepler --config.file hack/config.yaml -``` - -#### Configuration - -Kepler can be configured using YAML files or CLI flags. The default configuration is in `hack/config.yaml`: - -```bash -# Run with custom configuration -sudo ./bin/kepler --config.file /path/to/your/config.yaml - -# Run with CLI flags -sudo ./bin/kepler --log.level=debug --exporter.stdout -``` - -**Access Points:** - -- Metrics: - -### 3. 
Docker Compose (Recommended for Development) - -The Docker Compose setup provides a complete monitoring stack with Kepler, Prometheus, and Grafana: - -```bash -cd compose/dev - -# Start the complete stack -docker compose up --build -d - -# View logs -docker compose logs -f kepler-dev - -# Stop the stack -docker compose down --volumes -``` - -**Access Points:** - -- Kepler Metrics: -- Prometheus: -- Grafana: (admin/admin) - -### 4. Kubernetes with Kustomize - -#### Quick Setup with Kind - -```bash -# Create a local cluster with monitoring stack -make cluster-up - -# Deploy Kepler -make deploy - -# Clean up -make cluster-down -``` - -#### Manual Kubernetes Deployment - -```bash # Deploy using kustomize kubectl kustomize manifests/k8s | \ sed -e "s||quay.io/sustainable_computing_io/kepler:latest|g" | \ @@ -188,14 +123,32 @@ kubectl get pods -n kepler kubectl port-forward -n kepler svc/kepler 28282:28282 ``` -#### Custom Image Deployment +#### Custom Kustomization -```bash -# Build and push custom image -make image push IMG_BASE=your-registry.com/yourorg VERSION=v1.0.0 +For advanced users requiring specific configurations: + +```yaml +# kustomization.yaml +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +namespace: kepler -# Deploy with custom image -make deploy IMG_BASE=your-registry.com/yourorg VERSION=v1.0.0 +resources: +- https://github.com/sustainable-computing-io/kepler/manifests/k8s + +patchesStrategicMerge: +- resource-limits.yaml + +images: +- name: quay.io/sustainable_computing_io/kepler + newTag: v0.10.0 +``` + +Deploy with custom kustomization: + +```bash +kubectl apply -k . 
``` ## Verification @@ -276,11 +229,26 @@ serviceMonitor: scrapeTimeout: 10s ``` -### Environment-Specific Settings +### Production Considerations + +- **Hardware Requirements**: Intel RAPL support (most Intel processors since 2011) +- **Security Context**: Kepler requires privileged access for hardware monitoring +- **Resource Planning**: Minimum 100m CPU, 200Mi memory per node +- **Monitoring Integration**: Configure ServiceMonitor for Prometheus scraping +- **High Availability**: Deploy across multiple nodes with appropriate tolerations + +## Local Development Setup + +🧑‍💻 **Want to try Kepler locally first?** + +This guide focuses on production Kubernetes deployments. For local development, testing, and learning setups including Docker Compose with dashboards, see our comprehensive [**Developer Getting Started Guide**](../developer/getting-started.md). + +The developer guide includes: -- **Development**: Use fake CPU meter when RAPL unavailable -- **Production**: Ensure nodes have Intel RAPL support -- **Cloud**: May need different privilege configurations +- **Docker Compose** setup with Prometheus & Grafana dashboards +- **make cluster-up** for local Kubernetes development +- **Building from source** and development workflows +- **Fake CPU meter** configuration for systems without RAPL support ## Troubleshooting @@ -303,9 +271,6 @@ kubectl describe pod -n kepler -l app.kubernetes.io/name=kepler # Check node hardware kubectl exec -n kepler -it -- ls /sys/class/powercap/intel-rapl -# Test with fake meter (development) -helm upgrade kepler manifests/helm/kepler/ -n kepler \ - --set env.KEPLER_FAKE_CPU_METER=true ``` ### Getting Help @@ -319,8 +284,7 @@ helm upgrade kepler manifests/helm/kepler/ -n kepler \ After successful installation: -1. **Set up Prometheus**: Configure scraping of Kepler metrics -2. **Install Grafana**: Use pre-built dashboards for visualization -3. **Configure Alerts**: Set up energy consumption alerts -4. 
**Explore Metrics**: Learn about available energy metrics -5. **Optimize Workloads**: Use insights to improve energy efficiency +1. **📊 Set up monitoring** - Configure Prometheus scraping and Grafana dashboards +2. **🔧 Customize configuration** - Review [Configuration Guide](configuration.md) for environment-specific settings +3. **📈 Analyze metrics** - Learn about available metrics in our [Metrics Reference](metrics.md) +4. **🚨 Plan troubleshooting** - Familiarize yourself with our [Troubleshooting Guide](troubleshooting.md) diff --git a/docs/user/troubleshooting.md b/docs/user/troubleshooting.md new file mode 100644 index 0000000000..6af2cecf07 --- /dev/null +++ b/docs/user/troubleshooting.md @@ -0,0 +1,575 @@ +# Kepler Troubleshooting Guide + +This guide helps you diagnose and fix common issues when running Kepler. Whether +you're using Docker Compose, Kubernetes, or having configuration problems, +you'll find solutions here. + +## 🩺 Quick Health Check + +Start with these commands to quickly assess Kepler's status: + +### Docker Compose Health Check + +```bash +# Check service status +docker compose ps + +# Check Kepler logs +docker compose logs kepler-dev + +# Test metrics endpoint +curl -f http://localhost:28282/metrics >/dev/null && echo "✅ Metrics OK" || echo "❌ Metrics failed" + +# Test Grafana access +curl -f http://localhost:23000 >/dev/null && echo "✅ Grafana OK" || echo "❌ Grafana failed" +``` + +### Kubernetes Health Check + +```bash +# Check pod status +kubectl get pods -n kepler + +# Check logs for errors +kubectl logs -n kepler -l app.kubernetes.io/name=kepler --tail=50 + +# Test metrics endpoint +kubectl port-forward -n kepler svc/kepler 28282:28282 & +curl -f http://localhost:28282/metrics >/dev/null && echo "✅ Metrics OK" || echo "❌ Metrics failed" +``` + +--- + +## 🐳 Docker Compose Issues + +### Services Won't Start + +**Symptoms:** `docker compose ps` shows services as "Exit 1" or "Restarting" + +**Diagnosis:** + +```bash +# 
Check what's failing +docker compose ps + +# Check logs for the failing service +docker compose logs [service-name] + +# Check port conflicts +netstat -tlnp | grep -E ":(23000|28282|9090)" +``` + +**Solutions:** + +1. **Port Conflicts:** + +```bash +# Stop conflicting services +sudo lsof -i :23000 # Find what's using port 23000 +sudo kill -9 # Kill the process + +# Or change ports in compose.yaml +``` + +1. **Insufficient Resources:** + +```bash +# Check available resources +docker system df +docker system prune # Free up space if needed + +# Check memory usage +free -h +``` + +1. **Permission Issues:** + +```bash +# Fix docker permissions (Linux) +sudo usermod -aG docker $USER +newgrp docker +``` + +### No Metrics Data + +**Symptoms:** Kepler starts but `/metrics` endpoint is empty or shows no `kepler_*` metrics + +**Diagnosis:** + +```bash +# Check if hardware is supported +docker compose exec kepler-dev ls /sys/class/powercap/intel-rapl/ 2>/dev/null || echo "No RAPL support" + +# Check fake meter status +docker compose logs kepler-dev | grep -i fake +``` + +**Solutions:** + +1. **Enable Fake CPU Meter (for VMs/testing):** + +```bash +# Edit config to enable fake meter +sed -i 's/enabled: false/enabled: true/' kepler-dev/etc/kepler/config.yaml + +# Restart Kepler +docker compose restart kepler-dev +``` + +1. **Hardware Access Issues:** + +```bash +# Check container privileges +docker compose exec kepler-dev ls -la /proc/stat +docker compose exec kepler-dev ls -la /sys/class/powercap/ +``` + +### Dashboard Shows No Data + +**Symptoms:** Grafana loads but dashboards are empty + +**Diagnosis:** + +```bash +# Check if Prometheus is scraping Kepler +curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kepler")' + +# Check if metrics are available +curl http://localhost:28282/metrics | grep kepler_node_cpu_watts +``` + +**Solutions:** + +1. 
**Prometheus Configuration:** + +```bash +# Restart Prometheus to reload config +docker compose restart prometheus + +# Check scrape targets in Prometheus UI +# Open http://localhost:9090/targets +``` + +1. **Wait for Data Collection:** + +```bash +# Kepler needs time to collect data (wait 1-2 minutes) +# Check metrics are increasing +curl http://localhost:28282/metrics | grep kepler_node_cpu_joules_total +``` + +--- + +## ☸️ Kubernetes Issues + +### Pods Not Starting + +**Symptoms:** `kubectl get pods -n kepler` shows pods in `Pending`, `CrashLoopBackOff`, or `Error` + +**Diagnosis:** + +```bash +# Check pod status details +kubectl describe pods -n kepler + +# Check events +kubectl get events -n kepler --sort-by=.metadata.creationTimestamp + +# Check node resources +kubectl top nodes +``` + +**Solutions:** + +1. **Resource Constraints:** + +```bash +# Check node resources +kubectl describe node | grep -A 5 "Allocated resources" + +# Adjust resource requests in the DaemonSet +kubectl edit daemonset -n kepler kepler +``` + +1. **Security/Privilege Issues:** + +```bash +# Verify security context +kubectl get daemonset -n kepler -o yaml | grep -A 10 securityContext + +# Check if privileged access is enabled +kubectl get daemonset -n kepler -o yaml | grep privileged +``` + +1. **Node Selector/Tolerations:** + +```bash +# Check node labels and taints +kubectl describe nodes | grep -E "(Taints|Labels)" + +# Update tolerations if needed +kubectl patch daemonset kepler -n kepler -p '{"spec":{"template":{"spec":{"tolerations":[{"operator":"Exists"}]}}}}' +``` + +### Permission Denied Errors + +**Symptoms:** Kepler pods show permission errors in logs + +**Diagnosis:** + +```bash +# Check common permission issues +kubectl logs -n kepler -l app.kubernetes.io/name=kepler | grep -i "permission denied" + +# Check security context +kubectl get pods -n kepler -o yaml | grep -A 10 securityContext +``` + +**Solutions:** + +1. 
**Fix Security Context:** + +```bash +# Ensure privileged access for hardware access +kubectl patch daemonset kepler -n kepler -p '{"spec":{"template":{"spec":{"containers":[{"name":"kepler","securityContext":{"privileged":true}}]}}}}' +``` + +1. **Check Host Path Permissions:** + +```bash +# Verify host paths are accessible (replace <pod-name> with a Kepler pod from `kubectl get pods -n kepler`) +kubectl exec -n kepler -it <pod-name> -- ls -la /host/proc/stat +kubectl exec -n kepler -it <pod-name> -- ls -la /host/sys/class/powercap/ +``` + +### No Hardware Support + +**Symptoms:** Logs show "No hardware support found" or similar + +**Diagnosis:** + +```bash +# Check for hardware power measurement support (replace <pod-name> with a Kepler pod) +kubectl exec -n kepler -it <pod-name> -- find /sys/class/powercap/ -name "energy_uj" | head -5 + +# Check CPU info +kubectl exec -n kepler -it <pod-name> -- grep -i "model name" /proc/cpuinfo | head -1 +``` + +**Solutions:** + +1. **Enable Fake CPU Meter for Testing:** + +```bash +# Update configmap to enable fake meter +kubectl edit configmap kepler-config -n kepler + +``` + +1. Update `config.yaml` to enable the fake meter as follows + +```yaml +dev: + fake-cpu-meter: + enabled: true + zones: [] # Zones to be enabled, empty enables all default zones + +``` + +1. Restart pods to pick up changes + +```bash +kubectl rollout restart daemonset/kepler -n kepler +``` + +--- + +## 📊 Metrics and Monitoring Issues + +### Missing or Zero Metrics + +**Symptoms:** Some metrics are missing or always show zero values + +**Diagnosis:** + +```bash +# Check available metrics +curl -s http://localhost:28282/metrics | grep ^kepler | wc -l + +# Check specific metric patterns +curl -s http://localhost:28282/metrics | grep -E "(kepler_node|kepler_container|kepler_process)_cpu_watts" + +# Check metric labels +curl -s http://localhost:28282/metrics | grep kepler_node_cpu_watts | head -3 +``` + +**Solutions:** + +1.
**Enable More Metric Levels:** + +```bash +# For Docker Compose - check config.yaml +grep -A 10 "metricsLevel:" kepler-dev/etc/kepler/config.yaml + +# For Kubernetes - update configmap +kubectl get configmap kepler-config -n kepler -o yaml +``` + +1. **Wait for Data Collection:** + +```bash +# Metrics need time to accumulate +# Check that counters are increasing over time +curl -s http://localhost:28282/metrics | grep kepler_node_cpu_joules_total +sleep 30 +curl -s http://localhost:28282/metrics | grep kepler_node_cpu_joules_total +``` + +### High Memory Usage + +**Symptoms:** Kepler consumes excessive memory + +**Diagnosis:** + +```bash +# Check memory usage +docker stats kepler-dev # For Docker Compose +kubectl top pods -n kepler # For Kubernetes + +# Check terminated workload tracking +curl -s http://localhost:28282/metrics | grep kepler_terminated_ +``` + +**Solutions:** + +1. **Adjust Terminated Workload Limits:** + +```bash +# Edit configuration to limit terminated workload tracking +# Set maxTerminated to lower value (default: 500) +# Set minTerminatedEnergyThreshold higher (default: 10) +``` + +1. **Reduce Monitoring Scope:** + +```bash +# Disable process-level monitoring if not needed +# Edit config to remove "process" from metricsLevel +``` + +### Inconsistent Power Attribution + +**Symptoms:** Power values seem incorrect or inconsistent + +**Diagnosis:** + +```bash +# Check if fake meter is inadvertently enabled +curl -s http://localhost:28282/metrics | grep kepler_build_info + +# Verify hardware support +ls -la /sys/class/powercap/intel-rapl*/energy_uj 2>/dev/null || echo "No RAPL support" +``` + +**Solutions:** + +1. **Ensure Real Hardware Mode:** + +```bash +# Disable fake CPU meter if accidentally enabled +# Check config.yaml for dev.fake-cpu-meter.enabled: false +``` + +1. 
**Calibrate Expectations:** + +```bash +# Power attribution is estimated based on resource usage +# Values are approximate, not precise measurements +# Compare trends rather than absolute values +``` + +--- + +## ⚙️ Configuration Issues + +### Configuration Not Loading + +**Symptoms:** Changes to config.yaml don't take effect + +**Diagnosis:** + +```bash +# Check if config file is being read +# Look for config-related log messages +docker compose logs kepler-dev | grep -i config # Docker Compose +kubectl logs -n kepler -l app.kubernetes.io/name=kepler | grep -i config # Kubernetes +``` + +**Solutions:** + +1. **Docker Compose:** + +```bash +# Restart after config changes +docker compose restart kepler-dev + +# Verify config file is mounted correctly +docker compose exec kepler-dev cat /etc/kepler/config.yaml +``` + +1. **Kubernetes:** + +```bash +# Restart pods after configmap changes +kubectl rollout restart daemonset/kepler -n kepler + +# Verify configmap is updated +kubectl get configmap kepler-config -n kepler -o yaml +``` + +### Invalid YAML Configuration + +**Symptoms:** Kepler fails to start with YAML parsing errors + +**Diagnosis:** + +```bash +# Check YAML syntax +python -c "import yaml; yaml.safe_load(open('config.yaml'))" 2>&1 || echo "Invalid YAML" + +# Check logs for specific parsing errors +docker compose logs kepler-dev | grep -i "yaml\|parse\|config" +``` + +**Solutions:** + +1. **Validate YAML:** + +```bash +# Use an online YAML validator or (requires PyYAML; prints nothing when the file parses): +python -c "import sys, yaml; yaml.safe_load(open(sys.argv[1]))" config.yaml +``` + +1.
**Reset to Default:** + +```bash +# Copy default config from repository +curl -o config.yaml https://raw.githubusercontent.com/sustainable-computing-io/kepler/main/hack/config.yaml +``` + +--- + +## 🔍 Advanced Debugging + +### Enable Debug Logging + +```bash +# Docker Compose - edit config.yaml +sed -i 's/level: info/level: debug/' kepler-dev/etc/kepler/config.yaml +docker compose restart kepler-dev + +# Kubernetes - update configmap +kubectl patch configmap kepler-config -n kepler --type merge -p '{"data":{"config.yaml":"log:\n level: debug\n format: text\n..."}}' +kubectl rollout restart daemonset/kepler -n kepler +``` + +### Use pprof for Performance Analysis + +```bash +# Dump goroutines (pprof is usually enabled by default) +curl "http://localhost:28282/debug/pprof/goroutine?debug=1" + +# Get heap profile +curl -o heap.prof http://localhost:28282/debug/pprof/heap + +# Analyze with go tool (if available) +go tool pprof heap.prof +``` + +### Network Connectivity Issues + +```bash +# Test internal connectivity (replace <pod-name> with a Kepler pod from `kubectl get pods -n kepler`) +kubectl exec -n kepler -it <pod-name> -- netstat -tlnp + +# Check DNS resolution +kubectl exec -n kepler -it <pod-name> -- nslookup kubernetes.default.svc.cluster.local + +# Test external connectivity (if needed) +kubectl exec -n kepler -it <pod-name> -- curl -I https://google.com +``` + +--- + +## 🆘 Getting Help + +If you're still experiencing issues after trying these solutions: + +### Before Asking for Help + +1. **Gather Information:** + +```bash +# System information +uname -a +docker --version # or kubectl version +``` + +1. **Collect Logs:** + +```bash +# Save relevant logs +docker compose logs kepler-dev > kepler-logs.txt # Docker Compose +kubectl logs -n kepler -l app.kubernetes.io/name=kepler > kepler-logs.txt # Kubernetes +``` + +1.
**Configuration:** + +```bash +# Export current configuration +docker compose exec kepler-dev cat /etc/kepler/config.yaml > current-config.yaml # Docker Compose +kubectl get configmap kepler-config -n kepler -o yaml > current-config.yaml # Kubernetes +``` + +### Community Resources + +- **🐛 GitHub Issues:** [Create an issue](https://github.com/sustainable-computing-io/kepler/issues/new) with logs and configuration +- **💬 GitHub Discussions:** [Ask questions](https://github.com/sustainable-computing-io/kepler/discussions) +- **🗨️ CNCF Slack:** Join [#kepler](https://cloud-native.slack.com/archives/C06HYDN4A01) for real-time help +- **📚 Documentation:** [Full documentation](https://sustainable-computing.io/kepler/docs/) + +### Issue Report Template + +When reporting issues, please include: + +```text +**Environment:** +- OS: [Linux/macOS/Windows] +- Deployment: [Docker Compose/Kubernetes] +- Kepler version: [from kepler_build_info metric] + +**Problem:** +[Describe what's not working] + +**Expected:** +[What you expected to happen] + +**Logs:** +[Paste relevant log output] + +**Configuration:** +[Paste relevant configuration] +``` + +--- + +## 📚 Related Documentation + +- **[Getting Started Guide](getting-started.md)** - Basic setup instructions +- **[Installation Guide](installation.md)** - Production deployment +- **[Configuration Guide](configuration.md)** - Detailed configuration options +- **[Metrics Reference](metrics.md)** - Available metrics documentation + +--- + +Happy troubleshooting! 🔧