Model Server Discussion
- Ratio power model
  - use one representative metric to divide the measured power consumption of each component (CPU, DRAM, GPU).
- Parameterized power model
  - apply linear-regression weights, learned from empirical results, to multiple relevant metrics for each component.
- ML power model
  - apply a learning/regression model to multiple relevant metrics to estimate each component's power under the profiled conditions (e.g., architecture).
- Generalized dynamic power model
  - apply a learning/regression model to the available metrics to estimate the total power of all components, regardless of the running environment.
- ratio power model to present each pod's power share on each component
- parameterized power model to estimate node-level power for each component when no power measurement is available
- generalized dynamic power model to present the total dynamic power based only on pod resource usage
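To make the ratio power model concrete, a minimal sketch of the per-component division is shown below; the function and metric names are illustrative assumptions, not Kepler's actual code.

```python
# Hedged sketch of the ratio power model: a component's measured power is
# divided among pods in proportion to one representative metric
# (e.g., cpu_cycles for CPU, cache misses for DRAM).
def ratio_power(component_power, pod_metric):
    """Split component_power by each pod's share of the representative metric."""
    total = sum(pod_metric.values())
    if total == 0:
        # no usage reported: fall back to an even split
        return {pod: component_power / len(pod_metric) for pod in pod_metric}
    return {pod: component_power * value / total for pod, value in pod_metric.items()}

# Example: 60 J of CPU package energy split by cpu_cycles
print(ratio_power(60.0, {"pod-a": 2_000_000, "pod-b": 1_000_000}))
# {'pod-a': 40.0, 'pod-b': 20.0}
```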
pros:
- a more complex model can discover hidden information
cons:
- requires enough training data for each profile
purposes:
- providing weights for the parameterized power model (or ML power model) and the model for generalized dynamic power
- performing online training
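To illustrate the weights that the model server would provide for the parameterized power model, a sketch of applying a linear form per component follows; the metric names and weight values are made up for illustration.

```python
# Hedged sketch of applying parameterized (linear-regression) weights to
# multiple relevant metrics for one component; weights and metric names
# here are illustrative only.
def parameterized_power(metrics, weights, intercept):
    """Estimate component power as intercept + sum(weight_i * metric_i)."""
    return intercept + sum(weights[name] * value for name, value in metrics.items())

core_weights = {"cpu_cycles": 2.1e-9, "cpu_instructions": 0.7e-9}  # learned offline
estimate = parameterized_power(
    {"cpu_cycles": 3_000_000_000, "cpu_instructions": 5_000_000_000},
    core_weights,
    intercept=1.5,
)  # unit follows the learned target (e.g., watts)
```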
flowchart LR;
exporter -- unix.sock --> estimator -- ? --> model-server
- parameterized power model:
flowchart LR;
exporter -- API routes --> model-server
- ML power model and generalized dynamic power model:
- with API routes
  flowchart LR; exporter -- unix.sock --> estimator -- API routes --> model-server
- with shared storage system (NFS, ceph, fuse, rook.io?)
  flowchart LR; exporter -- unix.sock --> estimator --> model-storage ; model-server --> model-storage;
| Approach | Pros | Cons |
| --- | --- | --- |
| API route | no third-party dependency | need to handle synchronization ourselves to keep the model up to date after training; need to optimize data (model) transmission ourselves; data passes through the network stack even within the same node |
| Shared storage | leaves file synchronization to a mature shared storage system | requires third-party setup (and might add overhead to the cluster); does not scale with some storage systems |
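For the API-route approach, the estimator has to pull (and periodically refresh) the trained model from the model-server. A minimal sketch, assuming a hypothetical /model endpoint (the route name and query parameters are not the actual server API):

```python
# Hedged sketch of an estimator pulling a trained model from the model-server
# over an API route; the /model endpoint and its parameters are assumptions
# for illustration, not the real model-server routes.
import json
import urllib.request

MODEL_SERVER_URL = "http://kepler-model-server:8100"  # assumed address

def fetch_model(output_type="dynamic_power", feature_group="CounterOnly"):
    """Request the latest trained model for the given output and feature group."""
    url = f"{MODEL_SERVER_URL}/model?output={output_type}&input={feature_group}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())
```

With shared storage, the refresh problem moves to the storage layer instead: the estimator simply re-reads the model file from the mounted volume.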
flowchart LR;
query-data --> pipelines --> models
Each pipeline is composed of the following (a minimal sketch follows the activation notes below):
- input: query metric (e.g., node_energy_stat) and target columns (e.g., cpu_cycles, pkg_energy_in_joules)
- training function: Keras layers, scikit-learn regressor, ...
- output: pod dynamic power, node power for core/dram
- all information is provided: all pipelines are activated
- frequency and architecture are not reported: ML power model pipelines cannot be trained
- only some usage measurements are enabled: only the pipelines that rely on those measurements are activated
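A minimal sketch of one pipeline with such an input/train/output shape is shown below; the prom_client indexing (query name to dataframe) and the metric names are assumptions based on the description above, not the actual model-server code.

```python
# Hedged sketch of a pipeline: input query/metrics, a training function,
# and a produced model. prom_client is assumed to behave like a mapping of
# query name -> pandas DataFrame; metric names are illustrative only.
from sklearn.linear_model import LinearRegression

class CounterOnlyLinearPipeline:
    usage_query = "node_energy_stat"
    usage_metrics = ["cpu_cycles", "cpu_instructions", "cache_misses"]
    target_column = "pkg_energy_in_joules"

    def __init__(self):
        self.model = LinearRegression()

    def train(self, prom_client):
        # corresponding dataframe from prom_client[usage_query][usage_metrics]
        features = prom_client[self.usage_query][self.usage_metrics]
        target = prom_client[self.usage_query][self.target_column]
        self.model.fit(features, target)
        return self.model
```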
- define pipeline
  - input (usage metrics (e.g., cpu_cycles) + system metrics (e.g., frequency, architecture))
  - output (parameterized model, ML power model, dynamic power model)
- implement the train function, using the corresponding dataframes from prom_client[usage_query][usage_metrics] and prom_client[system_query][system_metrics]:
  def train(self, prom_client):
- set the model path to [output]/[input]/[model name] (currently: [input]/[model_name])
  - [input]: Full, WorkloadOnly, CgroupOnly, CounterOnly, KubeletOnly
  - [output]: power, dynamic_power
  - [model_name]: produced model name; one pipeline can produce multiple models. For example, a scikit pipeline can produce linear regression, polynomial regression, gradient boosting regression, and KNN regression at the same time (see the sketch after this checklist).
- implement an estimator to apply each output
- implement API routes for the ML archive
- test/validation of new pipelines
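As a sketch of the multi-model point above, a scikit-based pipeline could fit several regressors in one train call and save each under [output]/[input]/[model_name]; the directory layout and file name below follow the naming convention above and are assumptions, not the actual implementation.

```python
# Hedged sketch of one pipeline producing several models and saving them
# under [output]/[input]/[model_name]; layout and file names are illustrative.
import os
import joblib
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

REGRESSORS = {
    "linear_regression": LinearRegression(),
    "polynomial_regression": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "gradient_boosting_regression": GradientBoostingRegressor(),
    "knn_regression": KNeighborsRegressor(),
}

def train_and_save(features, target, output="dynamic_power", feature_group="CounterOnly"):
    """Fit each regressor and persist it to [output]/[input]/[model_name]/model.pkl."""
    for model_name, regressor in REGRESSORS.items():
        regressor.fit(features, target)
        model_dir = os.path.join(output, feature_group, model_name)
        os.makedirs(model_dir, exist_ok=True)
        joblib.dump(regressor, os.path.join(model_dir, "model.pkl"))
```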
- support large models with an s3/COS link
- distributed training cluster
- h5, SavedModel, tflite
- extend API routes for ML-based models, for node, for pod
- shared storage (future)
- ...