Commit d1e84de

docs(proposal): add ACPI support for Kepler
Signed-off-by: Ray Huang <[email protected]>
1 parent 6d9c50c commit d1e84de

2 files changed: +294 -1 lines changed
Lines changed: 292 additions & 0 deletions
@@ -0,0 +1,292 @@
# EP-004: ACPI Support for Platform Power/Energy Metrics

**Status**: Draft
**Author**: Ray Huang
**Created**: 2025-08-28
**Last Updated**: 2025-08-28

## Summary

This proposal adds ACPI support for collecting platform power metrics in Kepler, targeting nodes that lack Redfish capabilities.
This lets users gather platform power metrics from a wider range of hardware, making Kepler more flexible across diverse hardware environments.

## Problem Statement

Not all nodes expose platform power metrics through Redfish. This proposal adds another option to Kepler: collecting platform power metrics through ACPI on nodes that do not support Redfish.

### Current Limitations

1. Nodes without Redfish support cannot report platform power metrics through Kepler.
2. RAPL may not be available on all hardware, which limits the ability to monitor power metrics.
3. The Redfish support does not provide platform energy metrics.

## Goals

- **Primary**: Add an ACPI power meter monitoring option to Kepler.
- **Seamless Integration**: Integrate with the existing Kepler architecture and service patterns.
- **Standard Metrics**: Provide platform power metrics via Prometheus, following Kepler conventions.
- **Multi-Environment Support**: Support Kubernetes and standalone deployments.

## Non-Goals

- Replacing Redfish support. ACPI is complementary.
- Supporting all ACPI features, such as setting power limits. The focus is on reading `power_meter`.
- Integration with virtualized environments (e.g., VMs without ACPI pass-through).
- Backward compatibility with legacy ACPI versions (<4.0) or non-Linux systems.

## Requirements

### Functional Requirements

- Use the [Go sysfs package](https://github.com/prometheus/procfs), which the RAPL reader already uses, to read the ACPI `power_meter`, so that no additional dependency is needed.
- Generate `kepler_platform_watts{source="acpi"}` and
  `kepler_platform_joules_total{source="acpi"}` metrics.
- Follow Kepler's configuration patterns and coding conventions.

### Non-Functional Requirements

- **Performance**: Minimize overhead during metric collection.
- **Reliability**: Handle module loading failures and sysfs read errors without crashing Kepler.
- **Security**: Same as the RAPL reader: only read access to the host's sysfs is needed.
- **Maintainability**: Modular code structure.
- **Testability**: Mock sysfs for unit/integration tests.

## Proposed Solution

### High-Level Architecture

Add an ACPI option to Kepler as an alternative to Redfish.
The ACPI reader collects platform power metrics by reading sysfs. The platform collector can reuse the one built for Redfish, if that work is merged.

```text
┌─────────────────┐    ┌────────────────────┐
│    CPU Power    │    │   Platform Power   │
│     (RAPL)      │    │ (Redfish or ACPI)  │
└─────────────────┘    └────────────────────┘
         │                       │
         ▼                       ▼
┌─────────────────┐    ┌────────────────────┐
│  Power Monitor  │    │   Power Monitor    │
│  (Attribution)  │    │    (For ACPI,      │
│                 │    │   no attribution)  │
└─────────────────┘    └────────────────────┘
         │                       │
         ▼                       ▼
┌─────────────────┐    ┌────────────────────┐
│ Power Collector │    │ Platform Collector │
│                 │    │                    │
└─────────────────┘    └────────────────────┘
         └───────────┬───────────┘

            ┌──────────────────┐
            │    Prometheus    │
            │     Exporter     │
            └──────────────────┘
```

### Aggregate power metrics

According to [the kernel documentation for the ACPI power meter driver](https://docs.kernel.org/hwmon/acpi_power_meter.html#special-features), a platform may expose more than one power meter (`/sys/bus/acpi/drivers/power_meter/ACPI000D:XX/power[1-*]_average`).
To obtain power metrics for the whole platform, Kepler will aggregate the readings from all available power meters.

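The aggregation described above could look roughly like the following sketch. It is not the proposed implementation: it uses only the standard library instead of the procfs/sysfs package, the function names `parseMicrowatts` and `totalWatts` are hypothetical, and it assumes that summing every `power*_average` file (reported in microwatts by the hwmon interface) is the right way to combine meters.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseMicrowatts converts the text of a power*_average file to microwatts.
func parseMicrowatts(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// totalWatts sums power*_average across every ACPI000D:* device under driverDir.
// The pattern does not match power*_average_interval, which has a different suffix.
func totalWatts(driverDir string) (float64, error) {
	pattern := filepath.Join(driverDir, "ACPI000D:*", "power*_average")
	files, err := filepath.Glob(pattern)
	if err != nil {
		return 0, err
	}
	if len(files) == 0 {
		return 0, fmt.Errorf("no ACPI power meter reading under %s", driverDir)
	}
	var sum uint64
	for _, f := range files {
		raw, err := os.ReadFile(f)
		if err != nil {
			return 0, fmt.Errorf("read %s: %w", f, err)
		}
		uw, err := parseMicrowatts(string(raw))
		if err != nil {
			return 0, fmt.Errorf("parse %s: %w", f, err)
		}
		sum += uw
	}
	return float64(sum) / 1e6, nil // microwatts to watts
}

func main() {
	watts, err := totalWatts("/sys/bus/acpi/drivers/power_meter")
	if err != nil {
		fmt.Println("ACPI power meter unavailable:", err)
		return
	}
	fmt.Printf("platform power: %.3f W\n", watts)
}
```

Returning an error when no reading is found keeps the "no device" case distinct from a genuine 0 W reading.
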
## Detailed Design

### Package Structure

```text
cmd/
├── kepler/
│   └── main.go                           # Add ACPI power monitor when enabled
config/
└── config.go                             # Add config for ACPI
internal/
├── device/
│   └── platform/
│       └── acpi/
│           ├── testdata/
│           │   └── sys/bus/acpi/drivers/power_meter/...
│           ├── service.go                # Power monitor for ACPI implementation
│           ├── acpi_power_meter_test.go
│           └── acpi_power_meter.go       # Implements the power_meter interface; reads sysfs to get metrics
└── exporter/prometheus/collector/
    └── platform_collector.go             # Platform power metrics collector
```

### API/Interface Changes

The proposal introduces two new types: `acpiPowerMeter`, which implements the `powerMeter` interface, and `Service`, which wraps it.

```go
// acpi_power_meter.go - Implements powerMeter interface
type acpiPowerMeter struct {
	logger     *slog.Logger
	devicePath string // sysfs path of the ACPI power meter
}

// Service for ACPI power meter
type Service struct {
	logger     *slog.Logger
	powerMeter acpiPowerMeter
}
```

## Configuration

### Main Configuration Changes

`ACPI.Enabled` defaults to `false`.

```go
type Platform struct {
	Redfish Redfish `yaml:"redfish"`
	ACPI    ACPI    `yaml:"acpi"`
}

type ACPI struct {
	Enabled *bool `yaml:"enabled"`
}
```

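Because `Enabled` is a `*bool`, an unset value is distinguishable from an explicit `false`. One way to resolve it to the stated default is sketched below; the `IsEnabled` helper is hypothetical and not part of the proposal.

```go
package main

import "fmt"

// ACPI mirrors the config struct proposed above.
type ACPI struct {
	Enabled *bool `yaml:"enabled"`
}

// IsEnabled is a hypothetical helper: an unset (nil) value means disabled.
func (a ACPI) IsEnabled() bool {
	return a.Enabled != nil && *a.Enabled
}

func main() {
	on := true
	fmt.Println(ACPI{}.IsEnabled(), ACPI{Enabled: &on}.IsEnabled()) // prints: false true
}
```
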
**CLI flags**

```
--platform.acpi.enabled=true # Enable ACPI monitoring
```

### Configuration File (if applicable)

```yaml
# add lines to /etc/kepler/config.yaml
platform:
  acpi:
    enabled: false
```

### Security Considerations

This feature requires read access to sysfs, which Kepler already requests for the RAPL reader.

## Deployment Examples

### Kubernetes Environment

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kepler
spec:
  template:
    spec:
      containers:
        - name: kepler
          args:
            - --kube.enable=true
            - --kube.node-name=$(NODE_NAME)
            - --platform.acpi.enabled=true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
```

### Standalone Deployment

```bash
# Run Kepler with ACPI support
./kepler --platform.acpi.enabled=true
```

## Testing Strategy

### Test Coverage

- **Unit Tests**:
  - Verify that `acpiPowerMeter` correctly reads and parses a single `power1_average` file.
  - Test aggregation of multiple power meters (`power1_average`, `power2_average`) into a single metric.
  - Simulate sysfs file read failures and verify error handling.
  - Test calculation of `kepler_platform_joules_total` by multiplying wattage by elapsed time.
- **Integration Tests**:
  - Validate that setting `ACPI.Enabled=true` initializes the ACPI power meter service.
  - Ensure metrics are correctly labeled with `source="acpi"` and exposed via the Prometheus exporter.
- **E2E Tests**:
  - Deploy Kepler in a Kubernetes cluster and verify that metrics are available in Prometheus.
  - Test a node without ACPI support and ensure Kepler falls back to other available collectors (e.g., RAPL).

### Test Infrastructure

- Testdata structure: mimic the real sysfs structure under `internal/device/platform/acpi/testdata/`.
- Fake multiple power meters (`ACPI000D:xx`) or multiple measurements (`power*_average`) for testing.
  - e.g., `testdata/sys/bus/acpi/drivers/power_meter/ACPI000D:00/power1_average`, with sample values for power readings (e.g., 450500000 microwatts for 450.5 watts).

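As an alternative to checked-in testdata files, the same layout could be generated in a temporary directory at test time. The sketch below illustrates the idea with the standard library only; `writeFakeReading` and `roundTrip` are hypothetical names, and real tests would live in `acpi_power_meter_test.go` and use the `testing` package.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// writeFakeReading creates <root>/sys/bus/acpi/drivers/power_meter/<device>/<file>
// containing value, and returns the path of the file it wrote.
func writeFakeReading(root, device, file string, value uint64) (string, error) {
	dir := filepath.Join(root, "sys", "bus", "acpi", "drivers", "power_meter", device)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	path := filepath.Join(dir, file)
	data := strconv.FormatUint(value, 10) + "\n" // sysfs values end with a newline
	return path, os.WriteFile(path, []byte(data), 0o644)
}

// roundTrip writes one fake reading into a temporary tree and reads it back.
func roundTrip(value uint64) (uint64, error) {
	root, err := os.MkdirTemp("", "acpi-testdata-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)

	path, err := writeFakeReading(root, "ACPI000D:00", "power1_average", value)
	if err != nil {
		return 0, err
	}
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	got, err := roundTrip(450500000) // 450.5 W expressed in microwatts
	fmt.Println(got, err)
}
```
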
## Migration and Compatibility

### Backward Compatibility

The change does not conflict with the existing RAPL support.

### Migration Path

1. **Phase 1**: Update Kepler to the new version with ACPI support.
2. **Phase 2**: Configure `platform.acpi.enabled=true` in `/etc/kepler/config.yaml` or via CLI.
3. **Phase 3**: Deploy the updated DaemonSet or restart standalone Kepler.

### Rollback Strategy

Set `ACPI.Enabled` to `false` and restart the Kepler service.

## Metrics Output

ACPI natively reports **average** power, so `kepler_platform_joules_total` can be derived by multiplying `power[1-*]_average` by `power[1-*]_average_interval`.

```prometheus
# Platform power metrics with new source
kepler_platform_watts{source="acpi",node_name="worker-1"} 450.5
kepler_platform_joules_total{source="acpi",node_name="worker-1"} 123456.789

# Existing CPU power metrics (unchanged)
kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2
kepler_node_cpu_joules_total{zone="package",node_name="worker-1"} 89234.567
```

252+
## Implementation Plan
253+
254+
### Phase 1: Foundation
255+
256+
- Implement ACPI-related configuration and validation
257+
- Implement ACPI power meter service structure
258+
259+
### Phase 2: Core Functionality
260+
261+
- Read and parse ACPI power meter data
262+
- Prometheus exporter with ACPI source
263+
264+
### Phase 3: Testing and Documentation
265+
266+
- Kubernetes deployment testing
267+
- ACPI-related testing
268+
- User documentation and deployment guides
269+
## Risks and Mitigations

### Technical risks

- **Several measurements under one power meter**: Aggregate the `power*_average` values into a single reading for that power meter.
- **Several ACPI power meters detected**: Implement aggregation logic to combine readings from multiple ACPI power meters (`ACPI000D:xx`).

### Operational risks

- **No kernel module or device available**: Log an error message and fall back to operating without ACPI enabled.

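The operational fallback could be handled with a startup probe along these lines. This is only a sketch: `findMeters` and `acpiActive` are hypothetical names, and detecting devices by globbing for `ACPI000D:*` directories is an assumption rather than the proposed mechanism.

```go
package main

import (
	"log/slog"
	"path/filepath"
)

// findMeters lists ACPI power meter devices under driverDir.
func findMeters(driverDir string) []string {
	// Glob only returns an error for a malformed pattern; a missing directory yields no matches.
	devices, _ := filepath.Glob(filepath.Join(driverDir, "ACPI000D:*"))
	return devices
}

// acpiActive reports whether ACPI collection should run. When ACPI is enabled
// but no device is present, it logs an error and lets Kepler continue without it.
func acpiActive(enabled bool, driverDir string) bool {
	if !enabled {
		return false
	}
	devices := findMeters(driverDir)
	if len(devices) == 0 {
		slog.Error("ACPI enabled but no power meter found; continuing without ACPI", "dir", driverDir)
		return false
	}
	slog.Info("ACPI power meters detected", "count", len(devices))
	return true
}

func main() {
	acpiActive(true, "/sys/bus/acpi/drivers/power_meter")
}
```
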
## Success Metrics

- **Functional Metric**: Every node that exposes an `ACPI000D:xx` device can report platform power metrics.
- **Adoption Metric**: Documentation enables successful deployment by operations teams.

## Open Questions

1. **Multiple sources for the same metrics**: Redfish and ACPI can both provide platform power metrics. Should I prioritize one over the other, or can Prometheus handle multiple sources automatically?

2. **Platform power metrics**: This question is related to the first one. ACPI reports average power rather than instantaneous power like Redfish. Should we use the same metric name in Prometheus, or should I use something like `kepler_platform_average_watts` for ACPI?

3. **About the E2E test**: I think the GitHub runner doesn't expose ACPI metrics. How do we run the tests in that case?

docs/developer/proposal/index.md

Lines changed: 2 additions & 1 deletion
@@ -6,7 +6,8 @@ This directory contains Enhancement Proposals (EPs) for major features and chang

 | ID | Title | Status | Author | Created |
 |----|-------|--------|--------|---------|
-| [EP-000](EP_TEMPLATE.md) | Enhancement Proposal Template | Accepted |Sunil Thaha | 2025-01-18 |
+| [EP-000](EP_TEMPLATE.md) | Enhancement Proposal Template | Accepted | Sunil Thaha | 2025-01-18 |
+| [EP-004](EP_004-ACPI-support.md) | ACPI Support for Platform Power/Energy Metrics | Draft | Ray Huang | 2025-08-28 |

 ## Proposal Status
