docs(proposal): add EP-003 for VM CPU Power Modeling for Kepler #2291
Introduces an enhancement proposal for adding machine learning models that estimate Kepler power metrics in virtual machine environments where hardware power measurement interfaces such as RAPL are not available.

Signed-off-by: Kaiyi Liu <[email protected]>
💻 CPU Comparison with base Kepler
💾 Memory Comparison with base Kepler (Inuse)
💾 Memory Comparison with base Kepler (Alloc)
⬇️ Download the profiling artifacts from the Actions Summary page
📦 Artifact name:
🔧 Or use the GitHub CLI to download artifacts: `gh run download 17207440019 -n profile-artifacts-2291`
## Problem Statement

Virtual machines lack direct access to hardware power measurement interfaces (RAPL, IPMI, etc.) that are essential for energy monitoring in cloud and virtualized environments. Current Kepler deployments in VMs cannot provide accurate power consumption estimates because they cannot access the underlying hardware power consumption data. This creates a significant gap in energy monitoring capabilities for the growing virtualized infrastructure landscape.
Can we mention that providers usually disable the PMU for VMs, so VMs do not expose any hardware performance counters either?
- **Primary Goal**: Develop zone-specific machine learning models (package, core, DRAM, uncore) for CPU power estimation in VMs
- **Secondary Goal**: Create a production-ready deployment system for VM power models in Go environments
- **Tertiary Goal**: Establish best practices for VM power modeling including CPU pinning and isolation requirements
- **Performance Goal**: Achieve <10% mean absolute percentage error compared to baremetal measurements
prefer RMS error over MAPE
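
To make the difference between the two metrics concrete, here is a minimal sketch on made-up numbers (the wattage values are illustrative assumptions, not measurements from the proposal). MAPE divides each error by the true value, so a small absolute error at near-idle power dominates the score, whereas RMSE is reported in watts and treats the same absolute error equally at every load level.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error in percent (undefined if any actual value is 0)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error, in the same unit as the inputs (watts here)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical package power readings in watts: idle, medium load, high load.
actual = [2.0, 50.0, 100.0]
# Every prediction is off by exactly 1 W.
predicted = [3.0, 51.0, 101.0]

print(f"MAPE: {mape(actual, predicted):.2f}%")
print(f"RMSE: {rmse(actual, predicted):.2f} W")
```

Every prediction has the same 1 W error, yet the idle sample alone contributes a 50% error to the MAPE average, while the RMSE stays at 1 W.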
### Functional Requirements

- **FR1**: Train separate ML models for each power zone (package, core, DRAM, uncore)
- **FR2**: Use only VM-accessible OS and memory counters as input features
The code seems to show it, but please add a section listing the features used for model training.
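
As an illustration of what "VM-accessible OS counters" could look like in practice, the sketch below derives CPU-time ratios from two samples of the aggregate `cpu` line in `/proc/stat`, which a guest can read without PMU or RAPL access. The helper names and the choice of features are assumptions made for this example; the proposal's actual feature list is not shown here.

```python
# Column order of the aggregate "cpu" line in /proc/stat (first eight fields).
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]

def parse_cpu_line(stat_text):
    """Return the aggregate CPU time counters from /proc/stat content as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")

def utilization_features(prev, curr):
    """Turn two counter snapshots into per-interval ratios (hypothetical feature set)."""
    deltas = {name: curr[name] - prev[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    return {f"{name}_ratio": delta / total for name, delta in deltas.items()}

# Two hypothetical snapshots; on a real system, read the text from /proc/stat.
before = parse_cpu_line("cpu  100 0 50 800 10 0 5 35\ncpu0 100 0 50 800 10 0 5 35\n")
after = parse_cpu_line("cpu  160 0 70 900 10 0 5 55\ncpu0 160 0 70 900 10 0 5 55\n")
features = utilization_features(before, after)
print(features)
```

Ratios rather than raw counters keep the features independent of the sampling interval; whether they should also be scaled by vCPU count is an open design question for the proposal.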
kepler_vm_last_training_timestamp{zone="package"} 1692984532
```

## Implementation Plan
Typically, enhancement proposals do not contain an implementation plan.
Please add some detail about:
- Model architecture: the model type (regression, neural network, XGBoost, etc.), the list of input features, and the feature engineering approach
- Training data requirements: the training data source, any preprocessing applied to the data, and the training data size requirements
- How to detect and prevent model overfitting
- Hyperparameters

Is there any dependence on the number of vCPUs? If not, why not? If yes, what does this mean for model training and selection?
Consider a typical case of a 128-CPU baremetal machine running multiple VMs, some with 4 vCPUs and some with 8 vCPUs.
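
On the overfitting question above, one common check is to hold out part of the training data and compare errors on the two sets. The sketch below does this for a single-feature linear model fitted in closed form. The synthetic data, the 80/20 split, and the 1.5x threshold are all assumptions chosen for illustration, not values from the proposal.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the fitted line on the given samples."""
    return math.sqrt(sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs))

def holdout_check(xs, ys, train_fraction=0.8, max_ratio=1.5):
    """Fit on the first part of the data and flag a large train/validation gap."""
    split = int(len(xs) * train_fraction)
    slope, intercept = fit_line(xs[:split], ys[:split])
    train_err = rmse(xs[:split], ys[:split], slope, intercept)
    val_err = rmse(xs[split:], ys[split:], slope, intercept)
    return train_err, val_err, val_err > max_ratio * train_err

# Synthetic data: power (W) = 20 W idle + 80 W * CPU utilization + Gaussian noise.
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [20.0 + 80.0 * x + rng.gauss(0.0, 2.0) for x in xs]

train_err, val_err, overfit = holdout_check(xs, ys)
print(f"train RMSE={train_err:.2f} W, validation RMSE={val_err:.2f} W, overfit={overfit}")
```

A validation error far above the training error suggests the model has memorized the training set. For the vCPU question, the same check could be run with the held-out set drawn from VM sizes that were not seen during training.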