Model Server Discussion
- Ratio power model
  - use one representative metric to divide the measured power consumption of each component (CPU, DRAM, GPU).
- Parameterized power model
  - apply linear-regression weights, learned from empirical results, to multiple relevant metrics for each component.
- ML power model
  - apply a learning/regression model to multiple relevant metrics to estimate each component's power under the profiled conditions (e.g., architecture).
- Generalized dynamic power model
  - apply a learning/regression model to the available metrics to estimate the total power of all components, regardless of the running environment.
- ratio power model to present each pod's power share on each component
- parameterized power model to estimate node-level power for each component when no power measurement is available
- generalized dynamic power model to present the total dynamic power based only on pod resource usage
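To make the ratio power model concrete, a minimal sketch of the per-component division is shown below; the function and metric names are illustrative assumptions, not Kepler's actual code.

```python
# Hedged sketch of the ratio power model: a component's measured power is
# divided among pods in proportion to one representative metric
# (e.g., cpu_cycles for CPU, cache misses for DRAM).
def ratio_power(component_power, pod_metric):
    """Split component_power by each pod's share of the representative metric."""
    total = sum(pod_metric.values())
    if total == 0:
        # no usage reported: fall back to an even split
        return {pod: component_power / len(pod_metric) for pod in pod_metric}
    return {pod: component_power * value / total for pod, value in pod_metric.items()}

# Example: 60 J of CPU package energy split by cpu_cycles
print(ratio_power(60.0, {"pod-a": 2_000_000, "pod-b": 1_000_000}))
# {'pod-a': 40.0, 'pod-b': 20.0}
```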
pros:
- a more complex model can discover hidden information
cons:
- requires enough training data for each profile
purposes:
- providing weights for the parameterized power model (or ML power model) and the model for generalized dynamic power
- performing online training
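To illustrate the weights that the model server would provide for the parameterized power model, a sketch of applying a linear form per component follows; the metric names and weight values are made up for illustration.

```python
# Hedged sketch of applying parameterized (linear-regression) weights to
# multiple relevant metrics for one component; weights and metric names
# here are illustrative only.
def parameterized_power(metrics, weights, intercept):
    """Estimate component power as intercept + sum(weight_i * metric_i)."""
    return intercept + sum(weights[name] * value for name, value in metrics.items())

core_weights = {"cpu_cycles": 2.1e-9, "cpu_instructions": 0.7e-9}  # learned offline
estimate = parameterized_power(
    {"cpu_cycles": 3_000_000_000, "cpu_instructions": 5_000_000_000},
    core_weights,
    intercept=1.5,
)  # unit follows the learned target (e.g., watts)
```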
flowchart LR;
exporter -- unix.sock --> estimator -- ? --> model-server
- parameterized power model:
flowchart LR;
exporter -- API routes --> model-server
- ML power model and generalized dynamic power model:
- with API routes
  flowchart LR; exporter -- unix.sock --> estimator -- API routes --> model-server
- with shared storage system (NFS, ceph, fuse, rook.io?)
  flowchart LR; exporter -- unix.sock --> estimator --> model-storage ; model-server --> model-storage;
| Approach | Pros | Cons |
| --- | --- | --- |
| API route | no third-party dependency | need to handle synchronization ourselves to keep the model up to date after training; need to optimize data (model) transmission ourselves; data passes through the network stack even within the same node |
| Shared storage | leaves file synchronization to a mature shared storage system | requires third-party setup (and might add overhead to the cluster); does not scale with some storage systems |
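For the API-route approach, the estimator has to pull (and periodically refresh) the trained model from the model-server. A minimal sketch, assuming a hypothetical /model endpoint (the route name and query parameters are not the actual server API):

```python
# Hedged sketch of an estimator pulling a trained model from the model-server
# over an API route; the /model endpoint and its parameters are assumptions
# for illustration, not the real model-server routes.
import json
import urllib.request

MODEL_SERVER_URL = "http://kepler-model-server:8100"  # assumed address

def fetch_model(output_type="dynamic_power", feature_group="CounterOnly"):
    """Request the latest trained model for the given output and feature group."""
    url = f"{MODEL_SERVER_URL}/model?output={output_type}&input={feature_group}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())
```

With shared storage, the refresh problem moves to the storage layer instead: the estimator simply re-reads the model file from the mounted volume.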
flowchart LR;
query-data --> pipelines --> models
Each pipeline is composed of the following (a minimal sketch follows the activation notes below):
- input: query metric (e.g., node_energy_stat) and target columns (e.g., cpu_cycles, pkg_energy_in_joules)
- training function: Keras layers, scikit-learn regressor, ...
- output: pod dynamic power, node power for core/dram
- all information is provided: all pipelines are activated
- frequency and architecture are not reported: ML power model pipelines cannot be trained
- only some usage measurements are enabled: only the pipelines that rely on those measurements are activated
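A minimal sketch of one pipeline with such an input/train/output shape is shown below; the prom_client indexing (query name to dataframe) and the metric names are assumptions based on the description above, not the actual model-server code.

```python
# Hedged sketch of a pipeline: input query/metrics, a training function,
# and a produced model. prom_client is assumed to behave like a mapping of
# query name -> pandas DataFrame; metric names are illustrative only.
from sklearn.linear_model import LinearRegression

class CounterOnlyLinearPipeline:
    usage_query = "node_energy_stat"
    usage_metrics = ["cpu_cycles", "cpu_instructions", "cache_misses"]
    target_column = "pkg_energy_in_joules"

    def __init__(self):
        self.model = LinearRegression()

    def train(self, prom_client):
        # corresponding dataframe from prom_client[usage_query][usage_metrics]
        features = prom_client[self.usage_query][self.usage_metrics]
        target = prom_client[self.usage_query][self.target_column]
        self.model.fit(features, target)
        return self.model
```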
- define pipeline
  - input (usage metrics (e.g., cpu_cycles) + system metrics (e.g., frequency, architecture))
  - output (parameterized model, ML power model, dynamic power model)
- implement the train function, using the corresponding dataframes from prom_client[usage_query][usage_metrics] and prom_client[system_query][system_metrics]:
  def train(self, prom_client):
- set the model path to [output]/[input]/[model name] (currently: [input]/[model_name])
  - [input]: Full, WorkloadOnly, CgroupOnly, CounterOnly, KubeletOnly
  - [output]: power, dynamic_power
  - [model_name]: produced model name; one pipeline can produce multiple models. For example, a scikit pipeline can produce linear regression, polynomial regression, gradient boosting regression, and KNN regression at the same time (see the sketch after this checklist).
- implement an estimator to apply each output
- implement API routes for the ML archive
- test/validation of new pipelines
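As a sketch of the multi-model point above, a scikit-based pipeline could fit several regressors in one train call and save each under [output]/[input]/[model_name]; the directory layout and file name below follow the naming convention above and are assumptions, not the actual implementation.

```python
# Hedged sketch of one pipeline producing several models and saving them
# under [output]/[input]/[model_name]; layout and file names are illustrative.
import os
import joblib
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

REGRESSORS = {
    "linear_regression": LinearRegression(),
    "polynomial_regression": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "gradient_boosting_regression": GradientBoostingRegressor(),
    "knn_regression": KNeighborsRegressor(),
}

def train_and_save(features, target, output="dynamic_power", feature_group="CounterOnly"):
    """Fit each regressor and persist it to [output]/[input]/[model_name]/model.pkl."""
    for model_name, regressor in REGRESSORS.items():
        regressor.fit(features, target)
        model_dir = os.path.join(output, feature_group, model_name)
        os.makedirs(model_dir, exist_ok=True)
        joblib.dump(regressor, os.path.join(model_dir, "model.pkl"))
```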
- support large models with an s3/COS link
- distributed training cluster
- h5, SavedModel, tflite
- extend API routes for ML-based models, for node, for pod
- shared storage (future)
- ...