| 1 | +# EP-004: ACPI Support for Platform Power/Energy Metrics |
| 2 | + |
| 3 | +**Status**: Draft |
| 4 | +**Author**: Ray Huang |
| 5 | +**Created**: 2025-08-28 |
| 6 | +**Last Updated**: 2025-08-28 |
| 7 | + |
| 8 | +## Summary |
| 9 | + |
| 10 | +The proposal aims to add ACPI support for collecting power metrics in Kepler, specifically for nodes that do not have Redfish capabilities. |
| 11 | +This enhancement will allow users to gather platform power metrics from a wider range of hardware, improving the flexibility of Kepler in diverse hardware environments. |
| 12 | + |
| 13 | +## Problem Statement |
| 14 | + |
| 15 | +Not all nodes expose Redfish for platform power metrics. This proposal adds another option to Kepler so that users can collect platform power metrics through ACPI on nodes that do not support Redfish.
| 16 | + |
| 17 | +### Current Limitations |
| 18 | + |
| 19 | +1. Nodes without Redfish support cannot get platform power metrics with Kepler. |
| 20 | +2. RAPL may not be available on all hardware, limiting the ability to monitor power metrics. |
| 21 | +3. Redfish support is unable to provide platform energy metrics. |
| 22 | + |
| 23 | +## Goals |
| 24 | + |
| 25 | +- **Primary**: Add ACPI power meter monitoring option to Kepler. |
| 26 | +- **Seamless Integration**: Integrate with existing Kepler architecture and service patterns |
| 27 | +- **Standard Metrics**: Provide platform power metrics via Prometheus following Kepler conventions |
| 28 | +- **Multi-Environment Support**: Support Kubernetes and standalone deployments |
| 29 | + |
| 30 | +## Non-Goals |
| 31 | + |
| 32 | +- Replace Redfish support (ACPI is complementary).
| 33 | +- Support all ACPI features, such as setting power limits (the focus is on reading power_meter).
| 34 | +- Integration with virtualized environments (e.g., VMs without ACPI pass-through). |
| 35 | +- Backward compatibility with legacy ACPI versions (<4.0) or non-Linux systems. |
| 36 | + |
| 37 | +## Requirements |
| 38 | + |
| 39 | +### Functional Requirements |
| 40 | + |
| 41 | +- Use the [Go sysfs package](https://github.com/prometheus/procfs), which the RAPL reader already uses, to read the ACPI power_meter so that no additional dependencies are needed.
| 42 | +- Generate `kepler_platform_watts{source="acpi"}` and |
| 43 | + `kepler_platform_joules_total{source="acpi"}` metrics |
| 44 | +- Follow Kepler's configuration patterns and coding conventions |
| 45 | + |
| 46 | +### Non-Functional Requirements |
| 47 | + |
| 48 | +- **Performance**: Try to minimize overhead during metric collection. |
| 49 | +- **Reliability**: Handle module loading failures and sysfs read errors without crashing Kepler. |
| 50 | +- **Security**: Same as the RAPL reader: only read access to the host's sysfs is required.
| 51 | +- **Maintainability**: Modular code structure. |
| 52 | +- **Testability**: Mock sysfs for unit/integration tests. |
| 53 | + |
| 54 | +## Proposed Solution |
| 55 | + |
| 56 | +### High-Level Architecture |
| 57 | + |
| 58 | +Add an ACPI option to Kepler as an alternative to Redfish.
| 59 | +ACPI platform power metrics are collected by reading sysfs. If the Redfish platform collector is merged, the ACPI implementation can reuse it.
| 60 | + |
| 61 | +```text |
| 62 | +┌─────────────────┐    ┌───────────────────┐
| 63 | +│ CPU Power       │    │ Platform Power    │
| 64 | +│ (RAPL)          │    │ (Redfish or ACPI) │
| 65 | +└─────────────────┘    └───────────────────┘
| 66 | +         │                        │
| 67 | +         ▼                        ▼
| 68 | +┌─────────────────┐    ┌────────────────────┐
| 69 | +│ Power Monitor   │    │ Power Monitor      │
| 70 | +│ (Attribution)   │    │ (For ACPI,         │
| 71 | +│                 │    │  no attribution)   │
| 72 | +└─────────────────┘    └────────────────────┘
| 73 | +         │                        │
| 74 | +         ▼                        ▼
| 75 | +┌─────────────────┐    ┌────────────────────┐
| 76 | +│ Power Collector │    │ Platform Collector │
| 77 | +│                 │    │                    │
| 78 | +└─────────────────┘    └────────────────────┘
| 79 | +         └──────────┬─────────────┘
| 80 | +                    ▼
| 81 | +           ┌──────────────────┐
| 82 | +           │    Prometheus    │
| 83 | +           │     Exporter     │
| 84 | +           └──────────────────┘
| 85 | +``` |
| 86 | + |
| 87 | +### Aggregate power metrics |
| 88 | + |
| 89 | +According to [the kernel documentation for the ACPI power meter driver](https://docs.kernel.org/hwmon/acpi_power_meter.html#special-features), there may be more than one power meter available per platform (`/sys/bus/acpi/drivers/power_meter/ACPI000D:XX/power[1-*]_average`).
| 90 | +To acquire power metrics for the whole platform, I will aggregate metrics from all available power meters.
| 91 | + |
| 92 | +## Detailed Design |
| 93 | + |
| 94 | +### Package Structure |
| 95 | + |
| 96 | +```text |
| 97 | +cmd/ |
| 98 | +├── kepler/ |
| 99 | +│   └── main.go # Add ACPI power monitor when enabled
| 100 | +config/
| 101 | +└── config.go # Add config for ACPI
| 102 | +internal/
| 103 | +├── device/
| 104 | +│   └── platform/
| 105 | +│       └── acpi/
| 106 | +│           ├── testdata/
| 107 | +│           │   └── sys/bus/acpi/drivers/power_meter/...
| 108 | +│           ├── service.go # Power monitor for ACPI implementation
| 109 | +│           ├── acpi_power_meter_test.go
| 110 | +│           └── acpi_power_meter.go # Implements the power_meter interface, which reads sysfs to get metrics
| 111 | +└── exporter/prometheus/collector/
| 112 | +    └── platform_collector.go # Platform power metrics collector
| 113 | +``` |
| 114 | + |
| 115 | +### API/Interface Changes |
| 116 | + |
| 117 | +Two types are added: `acpiPowerMeter`, which implements the power meter interface by reading sysfs, and `Service`, which acts as the power monitor for ACPI.
| 118 | + |
| 119 | +```go |
| 120 | +// acpi_power_meter.go - Implements powerMeter interface |
| 121 | +type acpiPowerMeter struct { |
| 122 | + logger *slog.Logger |
| 123 | + devicePath string |
| 124 | +} |
| 125 | + |
| 126 | +// Service for ACPI power meter |
| 127 | +type Service struct { |
| 128 | + logger *slog.Logger |
| 129 | + powerMeter acpiPowerMeter |
| 130 | +} |
| 131 | +``` |
| 132 | + |
| 133 | +## Configuration |
| 134 | + |
| 135 | +### Main Configuration Changes |
| 136 | + |
| 137 | +`ACPI.Enabled` defaults to `false`.
| 138 | + |
| 139 | +```go |
| 140 | +type Platform struct { |
| 141 | + Redfish Redfish `yaml:"redfish"` |
| 142 | + ACPI ACPI `yaml:"acpi"` |
| 143 | +} |
| 144 | + |
| 145 | +type ACPI struct { |
| 146 | + Enabled *bool `yaml:"enabled"` |
| 147 | +} |
| 148 | +``` |
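
Because `Enabled` is a `*bool`, an unset value has to be resolved to the documented default. The sketch below shows one way this might be done; `IsEnabled` is a hypothetical helper that is not part of the proposal, and Kepler's existing configuration code may already handle defaults differently.

```go
package main

// ACPI mirrors the configuration struct shown above.
type ACPI struct {
	Enabled *bool `yaml:"enabled"`
}

// IsEnabled is a hypothetical helper for this sketch: a nil (unset) flag is
// treated as disabled, which matches the documented default of false.
func (a ACPI) IsEnabled() bool {
	return a.Enabled != nil && *a.Enabled
}
```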
| 149 | + |
| 150 | +**CLI flags** |
| 151 | + |
| 152 | +``` |
| 153 | +--platform.acpi.enabled=true # Enable ACPI monitoring |
| 154 | +``` |
| 155 | + |
| 156 | +### Configuration File
| 157 | + |
| 158 | +```yaml |
| 159 | +# add lines to /etc/kepler/config.yaml |
| 160 | +platform: |
| 161 | + acpi: |
| 162 | + enabled: false |
| 163 | +``` |
| 164 | +
|
| 165 | +### Security Considerations |
| 166 | +
|
| 167 | +ACPI monitoring requires read access to sysfs, which Kepler already requests for the RAPL reader.
| 168 | +
|
| 169 | +## Deployment Examples |
| 170 | +
|
| 171 | +### Kubernetes Environment |
| 172 | +
|
| 173 | +```yaml |
| 174 | +apiVersion: apps/v1 |
| 175 | +kind: DaemonSet |
| 176 | +metadata: |
| 177 | + name: kepler |
| 178 | +spec: |
| 179 | + template: |
| 180 | + spec: |
| 181 | + containers: |
| 182 | + - name: kepler |
| 183 | + args: |
| 184 | + - --kube.enable=true |
| 185 | + - --kube.node-name=$(NODE_NAME) |
| 186 | + - --platform.acpi.enabled=true |
| 187 | + env: |
| 188 | + - name: NODE_NAME |
| 189 | + valueFrom: |
| 190 | + fieldRef: |
| 191 | + fieldPath: spec.nodeName |
| 192 | +``` |
| 193 | +
|
| 194 | +### Standalone Deployment |
| 195 | +
|
| 196 | +```bash |
| 197 | +# Run Kepler with ACPI support |
| 198 | +./kepler --platform.acpi.enabled=true |
| 199 | +``` |
| 200 | + |
| 201 | +## Testing Strategy |
| 202 | + |
| 203 | +### Test Coverage |
| 204 | + |
| 205 | +- **Unit Tests**: |
| 206 | + - Verify acpiPowerMeter correctly reads and parses a single power1_average file. |
| 207 | + - Test aggregation of multiple power meters (power1_average, power2_average) into a single metric. |
| 208 | + - Simulate sysfs file read failures and verify error handling. |
| 209 | + - Test calculation of `kepler_platform_joules_total` by multiplying wattage with elapsed time. |
| 210 | +- **Integration Tests**: |
| 211 | + - Validate that enabling `ACPI.Enabled=true` initializes the ACPI power meter service. |
| 212 | + - Ensure metrics are correctly labeled with `source="acpi"` and exposed via the Prometheus exporter. |
| 213 | +- **E2E Tests**: |
| 214 | + - Deploy Kepler in a Kubernetes cluster and verify metrics are available in Prometheus. |
| 215 | + - Test a node without ACPI support and ensure Kepler falls back to other available collectors (e.g., RAPL). |
| 216 | + |
| 217 | +### Test Infrastructure |
| 218 | + |
| 219 | +- Testdata Structure: Mimic real sysfs structure under `internal/device/platform/acpi/testdata/` |
| 220 | + - Fake multiple power meters (`ACPI000D:xx`) or multiple measurements (`power*_average`) for testing |
| 221 | + - e.g., `testdata/sys/bus/acpi/drivers/power_meter/ACPI000D:00/power1_average`, with sample values for power readings (e.g., 450500000 microwatts for 450.5 watts). |
| 222 | + |
| 223 | +## Migration and Compatibility |
| 224 | + |
| 225 | +### Backward Compatibility |
| 226 | +No conflict with existing RAPL support. |
| 227 | + |
| 228 | +### Migration Path |
| 229 | + |
| 230 | +1. **Phase 1**: Update Kepler to the new version with ACPI support. |
| 231 | +2. **Phase 2**: Configure `platform.acpi.enabled=true` in `/etc/kepler/config.yaml` or via CLI. |
| 232 | +3. **Phase 3**: Deploy updated DaemonSet or restart standalone Kepler. |
| 233 | + |
| 234 | +### Rollback Strategy |
| 235 | + |
| 236 | +Set `ACPI.Enabled` to false and restart the Kepler service. |
| 237 | + |
| 238 | +## Metrics Output |
| 239 | + |
| 240 | +ACPI natively reports **average** power, so we can derive `kepler_platform_joules_total` by multiplying `power[1-*]_average` (in microwatts) by `power[1-*]_average_interval` (in milliseconds) and converting the result to joules.
| 241 | + |
| 242 | +```prometheus |
| 243 | +# Platform power metrics with new source |
| 244 | +kepler_platform_watts{source="acpi",node_name="worker-1"} 450.5 |
| 245 | +kepler_platform_joules_total{source="acpi",node_name="worker-1"} 123456.789 |
| 246 | +
|
| 247 | +# Existing CPU power metrics (unchanged) |
| 248 | +kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2 |
| 249 | +kepler_node_cpu_joules_total{zone="package",node_name="worker-1"} 89234.567 |
| 250 | +``` |
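
The energy derivation in this section could look roughly like the sketch below. It assumes the hwmon units (microwatts for `power[1-*]_average`, milliseconds for `power[1-*]_average_interval`) and that Kepler samples once per averaging interval; `joulesFromAverage` and `energyCounter` are hypothetical names used only for illustration.

```go
package main

// joulesFromAverage converts one ACPI reading into energy in joules.
// averageMicrowatts is the value of power[1-*]_average (microwatts) and
// intervalMillis is the value of power[1-*]_average_interval (milliseconds).
func joulesFromAverage(averageMicrowatts, intervalMillis uint64) float64 {
	watts := float64(averageMicrowatts) / 1e6
	seconds := float64(intervalMillis) / 1e3
	return watts * seconds
}

// energyCounter accumulates joules so that the exported value only ever
// increases, as expected from a counter such as kepler_platform_joules_total.
type energyCounter struct {
	totalJoules float64
}

// add records one reading and returns the running total in joules.
// It assumes exactly one reading per averaging interval.
func (c *energyCounter) add(averageMicrowatts, intervalMillis uint64) float64 {
	c.totalJoules += joulesFromAverage(averageMicrowatts, intervalMillis)
	return c.totalJoules
}
```

For example, a reading of 450500000 microwatts over a 1000 ms interval corresponds to 450.5 W for one second, which is 450.5 J.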
| 251 | + |
| 252 | +## Implementation Plan |
| 253 | + |
| 254 | +### Phase 1: Foundation |
| 255 | + |
| 256 | +- Implement ACPI-related configuration and validation |
| 257 | +- Implement ACPI power meter service structure |
| 258 | + |
| 259 | +### Phase 2: Core Functionality |
| 260 | + |
| 261 | +- Read and parse ACPI power meter data |
| 262 | +- Prometheus exporter with ACPI source |
| 263 | + |
| 264 | +### Phase 3: Testing and Documentation |
| 265 | + |
| 266 | +- Kubernetes deployment testing |
| 267 | +- ACPI-related testing |
| 268 | +- User documentation and deployment guides |
| 269 | + |
| 270 | +## Risks and Mitigations |
| 271 | + |
| 272 | +### Technical risks |
| 273 | + |
| 274 | +- **Several measurements under one power meter**: Aggregate all `power*_average` values into a single reading for that power meter.
| 275 | +- **Several ACPI power meters detected**: Implement aggregation logic to combine readings from multiple ACPI power meters (`ACPI000D:xx`).
| 276 | + |
| 277 | +### Operational risks |
| 278 | + |
| 279 | +- **No kernel module or device available**: Log an error message and fall back to operating without ACPI enabled.
| 280 | + |
| 281 | +## Success Metrics |
| 282 | + |
| 283 | +- **Functional Metric**: All nodes with `ACPI000D:xx` available will be able to expose platform power metrics. |
| 284 | +- **Adoption Metric**: Documentation enables successful deployment by operations teams. |
| 285 | + |
| 286 | +## Open Questions |
| 287 | + |
| 288 | +1. **Multiple sources for same metrics**: Redfish and ACPI can both provide platform power metrics. Should I prioritize one over the other, or can Prometheus handle multiple sources automatically? |
| 289 | + |
| 290 | +2. **Platform power metrics**: This question is related to the first one. ACPI reports average power rather than instantaneous power like Redfish. Should we use the same metric name in Prometheus, or should I use something like `kepler_platform_average_watts` for ACPI?
| 291 | +
| 292 | +3. **About the E2E test**: I think the GitHub runner doesn't expose ACPI metrics. How should we test in that case?