KaiyiLiu1234
Collaborator

Introduces an enhancement proposal for adding machine learning models that estimate Kepler power metrics in virtual machine environments where hardware power measurement interfaces such as RAPL are not available.

Signed-off-by: Kaiyi Liu <[email protected]>
@github-actions github-actions bot added the docs Documentation changes label Aug 25, 2025
Contributor

🔆🔆🔆 Validating 🔆🔆🔆
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Profiling reports are ready to be viewed

⚠️ Variability in pprof CPU and Memory profiles
When comparing pprof profiles of Kepler versions, expect variability in CPU and memory. Focus only on significant, consistent differences.

💻 CPU Comparison with base Kepler
File: kepler
Type: cpu
Time: 2025-08-25 11:29:51 UTC
Duration: 120s, Total samples = 450ms ( 0.37%)
Active filters:
   show=github.com/sustainable-computing-io
Showing nodes accounting for -70ms, 15.56% of 450ms total
      flat  flat%   sum%        cum   cum%
         0     0%     0%      -40ms  8.89%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).scheduleNextCollection.func1
     -30ms  6.67%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).collectProcessMetrics
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).calculatePower
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).refreshSnapshot
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh.func1
     -10ms  2.22%  8.89%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh
         0     0%  8.89%      -20ms  4.44%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).Collect
         0     0%  8.89%      -20ms  4.44%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshProcesses
         0     0%  8.89%       10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).Snapshot
         0     0%  8.89%       10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).ensureFreshData
     -10ms  2.22% 11.11%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.(*TerminatedResourceTracker[go.shape.*uint8]).Add
      10ms  2.22%  8.89%       10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.newProcess (inline)
     -10ms  2.22% 11.11%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).AllProcs
     -10ms  2.22% 13.33%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*procWrapper).CPUTime
     -10ms  2.22% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*procWrapper).Cgroups
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).updateProcessCache
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.computeTypeInfoFromProc.func1
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.containerInfoFromProc
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.populateProcessFields
💾 Memory Comparison with base Kepler (Inuse)
File: kepler
Type: inuse_space
Time: 2025-08-25 11:31:51 UTC
Duration: 120.01s, Total samples = 4780.43kB 
Active filters:
   show=github.com/sustainable-computing-io
Showing nodes accounting for -1027.99kB, 21.50% of 4780.43kB total
      flat  flat%   sum%        cum   cum%
         0     0%     0% -1027.99kB 21.50%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).Collect
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).Snapshot
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).calculatePower
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).ensureFreshData
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).refreshSnapshot
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh.func1
 -516.01kB 10.79% 10.79%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).AllProcs
         0     0% 10.79%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh
         0     0% 10.79%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshProcesses
 -511.98kB 10.71% 21.50%  -511.98kB 10.71%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).collectProcessMetrics
💾 Memory Comparison with base Kepler (Alloc)
File: kepler
Type: alloc_space
Time: 2025-08-25 11:31:51 UTC
Duration: 120.01s, Total samples = 37784.13kB 
Active filters:
   show=github.com/sustainable-computing-io
Showing nodes accounting for -7187.87kB, 19.02% of 37784.13kB total
Dropped 2 nodes (cum <= 188.92kB)
      flat  flat%   sum%        cum   cum%
         0     0%     0% -3588.92kB  9.50%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh
         0     0%     0% -3588.92kB  9.50%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshProcesses
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).calculatePower
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).refreshSnapshot
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh.func1
         0     0%     0% -3069.68kB  8.12%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).Snapshot
-2571.04kB  6.80%  6.80% -2571.04kB  6.80%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).CPUUsageRatio
         0     0%  6.80% -2571.04kB  6.80%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh.func3
         0     0%  6.80% -2571.04kB  6.80%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshNode
-2560.90kB  6.78% 13.58% -2560.90kB  6.78%  github.com/sustainable-computing-io/kepler/internal/resource.(*procWrapper).CPUTime
         0     0% 13.58% -2560.90kB  6.78%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).updateProcessCache
         0     0% 13.58% -2560.90kB  6.78%  github.com/sustainable-computing-io/kepler/internal/resource.populateProcessFields
         0     0% 13.58% -2051.55kB  5.43%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).scheduleNextCollection.func1
         0     0% 13.58% -2045.54kB  5.41%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).Collect
 -516.01kB  1.37% 14.95% -1540.17kB  4.08%  github.com/sustainable-computing-io/kepler/internal/monitor.(*Snapshot).Clone
         0     0% 14.95% -1529.51kB  4.05%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).ensureFreshData
-1028.02kB  2.72% 17.67% -1028.02kB  2.72%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).AllProcs
         0     0% 17.67%  1026.38kB  2.72%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*cpuInfoCollector).Collect
 1026.38kB  2.72% 14.95%  1026.38kB  2.72%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*realProcFS).CPUInfo
 -512.02kB  1.36% 16.31% -1024.16kB  2.71%  github.com/sustainable-computing-io/kepler/internal/monitor.(*Process).Clone (inline)
-1024.16kB  2.71% 19.02% -1024.16kB  2.71%  github.com/sustainable-computing-io/kepler/internal/monitor.newProcess (inline)
 1024.06kB  2.71% 16.31%  1024.06kB  2.71%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).collectNodeMetrics
         0     0% 16.31%     -514kB  1.36%  github.com/sustainable-computing-io/kepler/internal/resource.computeTypeInfoFromProc.func1
    -514kB  1.36% 17.67%     -514kB  1.36%  github.com/sustainable-computing-io/kepler/internal/resource.containerInfoFromCgroupPaths
         0     0% 17.67%     -514kB  1.36%  github.com/sustainable-computing-io/kepler/internal/resource.containerInfoFromProc
 -512.14kB  1.36% 19.02%  -512.14kB  1.36%  maps.Copy[go.shape.map[github.com/sustainable-computing-io/kepler/internal/device.EnergyZone]github.com/sustainable-computing-io/kepler/internal/monitor.Usage,go.shape.map[github.com/sustainable-computing-io/kepler/internal/device.EnergyZone]github.com/sustainable-computing-io/kepler/internal/monitor.Usage,go.shape.interface { Energy ; Index int; MaxEnergy github.com/sustainable-computing-io/kepler/internal/device.Energy; Name string; Path string },go.shape.struct { EnergyTotal github.com/sustainable-computing-io/kepler/internal/device.Energy; Power github.com/sustainable-computing-io/kepler/internal/device.Power }] (inline)

⬇️ Download the Profiling artifacts from the Actions Summary page

📦 Artifact name: profile-artifacts-2291

🔧 Or use GitHub CLI to download artifacts:

gh run download 17207440019 -n profile-artifacts-2291


## Problem Statement

Virtual machines lack direct access to the hardware power measurement interfaces (RAPL, IPMI, etc.) that energy monitoring relies on in cloud and virtualized environments. As a result, Kepler deployments in VMs cannot currently provide accurate power consumption estimates, which leaves a significant gap in energy monitoring for virtualized infrastructure.
Collaborator

Can we also mention that providers usually disable the PMU for VMs, so VMs do not expose any hardware performance counters either?

- **Primary Goal**: Develop zone-specific machine learning models (package, core, DRAM, uncore) for CPU power estimation in VMs
- **Secondary Goal**: Create a production-ready deployment system for VM power models in Go environments
- **Tertiary Goal**: Establish best practices for VM power modeling including CPU pinning and isolation requirements
- **Performance Goal**: Achieve <10% mean absolute percentage error (MAPE) compared to bare-metal measurements
Collaborator

prefer RMS error over MAPE
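
To illustrate the difference between the two metrics, here is a minimal Go sketch that computes both on the same data. The function names `MAPE` and `RMSE` and the sample wattages are hypothetical and are not taken from the proposal. One plausible reason to prefer RMS error for power models is that MAPE divides by the measured value, so a small absolute error on a near-idle zone produces a large percentage.

```go
package main

import (
	"fmt"
	"math"
)

// MAPE returns the mean absolute percentage error (in percent).
// Samples whose actual value is zero are skipped because the relative
// error is undefined there. Returns NaN for empty or mismatched input.
func MAPE(actual, predicted []float64) float64 {
	if len(actual) != len(predicted) || len(actual) == 0 {
		return math.NaN()
	}
	var sum float64
	var n int
	for i := range actual {
		if actual[i] == 0 {
			continue
		}
		sum += math.Abs((actual[i] - predicted[i]) / actual[i])
		n++
	}
	if n == 0 {
		return math.NaN()
	}
	return 100 * sum / float64(n)
}

// RMSE returns the root-mean-square error in the same unit as the inputs.
// Returns NaN for empty or mismatched input.
func RMSE(actual, predicted []float64) float64 {
	if len(actual) != len(predicted) || len(actual) == 0 {
		return math.NaN()
	}
	var sum float64
	for i := range actual {
		d := actual[i] - predicted[i]
		sum += d * d
	}
	return math.Sqrt(sum / float64(len(actual)))
}

func main() {
	// Hypothetical zone power in watts: two near-idle samples, two loaded.
	actual := []float64{2, 2, 100, 100}
	predicted := []float64{3, 3, 100, 100}
	fmt.Printf("MAPE = %.1f%%\n", MAPE(actual, predicted))
	fmt.Printf("RMSE = %.3f W\n", RMSE(actual, predicted))
}
```

In this made-up case a 1 W miss on the two idle samples yields a MAPE of 25%, while the RMSE stays below 1 W and is reported in watts.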

### Functional Requirements

- **FR1**: Train separate ML models for each power zone (package, core, DRAM, uncore)
- **FR2**: Use only VM-accessible OS and memory counters as input features
Collaborator

The code seems to show it, but please add a section listing the features used for model training.

kepler_vm_last_training_timestamp{zone="package"} 1692984532
```
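
As a small aid for reading the metric above, the following Go sketch decodes the sample value. It assumes the gauge holds Unix seconds, which the metric name and Prometheus convention suggest but the excerpt does not state. The helper names `TrainingTime` and `ModelAgeDays` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// TrainingTime formats a gauge value, assumed to be Unix seconds,
// as an RFC 3339 timestamp in UTC.
func TrainingTime(unixSeconds int64) string {
	return time.Unix(unixSeconds, 0).UTC().Format(time.RFC3339)
}

// ModelAgeDays returns how many days lie between the training timestamp
// and nowUnix, both given in Unix seconds.
func ModelAgeDays(unixSeconds, nowUnix int64) float64 {
	return float64(nowUnix-unixSeconds) / 86400
}

func main() {
	const sample = 1692984532 // value from the metric excerpt
	fmt.Println("last trained:", TrainingTime(sample))
	fmt.Printf("age: %.1f days\n", ModelAgeDays(sample, time.Now().Unix()))
}
```

For the sample value this decodes to 2023-08-25T17:28:52Z.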

## Implementation Plan
Collaborator

Typically, enhancement proposals do not contain an implementation plan.

Collaborator

Please add some detail about:

  • Model architecture: the model type (regression, neural network, XGBoost, etc.), the list of input features, and the feature engineering approach
  • Training data requirements: the training data source, any preprocessing applied to the data, and the required training data size
  • How to detect and prevent model overfitting
  • Hyperparameters

Is there any dependence on the number of vCPUs? If not, why not? If yes, what does this mean for model training and selection?
Consider a typical case of a 128-CPU bare-metal machine running multiple VMs, some with 4 vCPUs and some with 8 vCPUs.
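
To make the vCPU question concrete, the sketch below works through the arithmetic for the 128-CPU example. It assumes 1:1 vCPU pinning with no overcommit, and the function `VCPUShare` is hypothetical. It is a naive proportional baseline for framing the discussion, not something the proposal specifies.

```go
package main

import "fmt"

// VCPUShare returns the fraction of host CPUs occupied by a VM's vCPUs,
// assuming 1:1 pinning and no overcommit (an assumption of this sketch).
// It returns 0 for invalid input.
func VCPUShare(vcpus, hostCPUs int) float64 {
	if hostCPUs <= 0 || vcpus < 0 {
		return 0
	}
	return float64(vcpus) / float64(hostCPUs)
}

func main() {
	const hostCPUs = 128 // bare-metal host from the review comment
	for _, vcpus := range []int{4, 8} {
		share := VCPUShare(vcpus, hostCPUs)
		fmt.Printf("%d vCPUs -> %.4f of host CPUs (%.3f%%)\n", vcpus, share, share*100)
	}
}
```

A fully busy 4-vCPU VM and a fully busy 8-vCPU VM both report 100% utilization from inside the guest, yet they occupy 3.125% and 6.25% of the host respectively. A model trained on guest-visible utilization alone would therefore probably need the vCPU count as a feature, or separate models per VM size.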
