
Model Server Discussion


Power models on Kepler

  • Ratio power model
    • divide each component's measured power (CPU, DRAM, GPU) among pods in proportion to one representative metric (see the sketch after this list).
  • Parameterized power model
    • apply linear-regression weights, learned from empirical results, to multiple relevant metrics for each component.
  • ML power model
    • apply a learning/regression model to multiple relevant metrics to estimate each component's power under the profiled conditions (e.g., architecture).
  • Generalized dynamic power model
    • apply a learning/regression model to whatever metrics are available to estimate total power across all components, regardless of the running environment.
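For concreteness, here is a minimal sketch of the ratio model's division step, assuming one representative metric per component; all names are illustrative, not Kepler's actual code:

```python
# Minimal sketch of the ratio power model (names are illustrative, not
# Kepler's actual code): a pod's share of a component's measured power is
# proportional to its share of one representative metric for that component.

def ratio_power(component_power_watts: float, pod_metric: float,
                total_metric: float) -> float:
    """Divide a component's measured power among pods by metric ratio."""
    if total_metric == 0:
        return 0.0
    return component_power_watts * (pod_metric / total_metric)

# Example: the CPU package draws 40 W and pod A used 2e9 of 8e9 cpu_cycles,
# so pod A is attributed 10 W of CPU power.
pod_a_cpu_power = ratio_power(40.0, 2e9, 8e9)  # -> 10.0
```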

Kepler currently uses

  • the ratio power model to present each pod's power share on each component
  • the parameterized power model to estimate node-level power for each component when no power measurement is available
  • the generalized dynamic power model to present total dynamic power based only on pod resource usage

⚠️ Are we going to change the parameterized power model to an ML power model? (pros/cons)

pros:

  • a more sophisticated model can discover hidden relationships in the data

cons:

  • requires sufficient training data for each profile

Model Server

purposes:

  1. providing weights for the parameterized power model (or ML power model) and a model for the generalized dynamic power model
  2. performing online training

providing weights and models

```mermaid
flowchart LR;
   exporter -- unix.sock --> estimator -- ? --> model-server
```
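The exporter-to-estimator hop over the unix socket is already in place; the "?" hop to the model server is the open question below. A minimal sketch of the first hop, assuming an illustrative socket path and JSON message shape:

```python
# Hypothetical sketch of the exporter -> estimator hop over a unix socket.
# The socket path and JSON message shape are assumptions for illustration.
import json
import socket

SOCKET_PATH = "/tmp/estimator.sock"  # assumed path

def request_power(usage_metrics: dict) -> dict:
    """Send per-pod usage metrics to the estimator; read back power values."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SOCKET_PATH)
        s.sendall(json.dumps({"metrics": usage_metrics}).encode())
        return json.loads(s.recv(4096).decode())
```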

⚠️ How to share models between Kepler (and the estimator) and the model server?

  • parameterized power model:

    ```mermaid
    flowchart LR;
       exporter -- API routes --> model-server
    ```
  • ML power model and generalized dynamic power model:
    • with API routes

      ```mermaid
      flowchart LR;
      exporter -- unix.sock --> estimator -- API routes --> model-server
      ```

    • with a shared storage system (NFS, Ceph, FUSE, rook.io?)

      ```mermaid
      flowchart LR;
      exporter -- unix.sock --> estimator --> model-storage ;
      model-server --> model-storage;
      ```
| approach | pros | cons |
| --- | --- | --- |
| API route | no third-party dependency | need to handle synchronization ourselves for an up-to-date model after training<br>need to optimize data (model) transmission ourselves<br>data passes through the network stack even within the same node |
| shared storage | leaves file synchronization work to a mature shared storage system | requires third-party setup (and might add overhead to the cluster)<br>does not scale with some storage systems |
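For the API route option above, a minimal sketch of the estimator pulling a trained model from the model server over HTTP; the service address, route, and archive layout are assumptions, not a settled API:

```python
# Hypothetical sketch of the API route option: the estimator downloads the
# latest trained model archive from the model server. The address, route,
# and archive layout are assumptions, not a settled API.
import requests

MODEL_SERVER = "http://kepler-model-server:8100"  # assumed service address

def fetch_model(output_type: str, input_type: str, dest_path: str) -> None:
    """Download the model archive for the given [output]/[input] pair."""
    resp = requests.get(f"{MODEL_SERVER}/model/{output_type}/{input_type}")
    resp.raise_for_status()
    with open(dest_path, "wb") as f:
        f.write(resp.content)

# e.g., fetch the counter-only dynamic power model:
# fetch_model("dynamic_power", "CounterOnly", "/tmp/model.zip")
```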

performing online training

Training pipeline

⚠️ What is a training pipeline? How do the pipelines differ from one another?

```mermaid
flowchart LR;
   query-data --> pipelines --> models
```

Each pipeline is composed of a different combination of the following (a minimal sketch follows this list):

  • input: query metric (e.g., node_energy_stat) and target columns (e.g., cpu_cycles, pkg_energy_in_joules)
  • training function: Keras layers, a scikit-learn regressor, ...
  • output: pod dynamic power, node core/DRAM power
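Putting those three pieces together, a hypothetical sketch of a pipeline's shape; the class and field names are illustrative, not the model server's actual code:

```python
# Hypothetical sketch of a pipeline's composition: an input query spec,
# a pluggable training function, and a named output. Names are illustrative.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Pipeline:
    query: str                  # input query metric, e.g. "node_energy_stat"
    feature_columns: List[str]  # target columns, e.g. ["cpu_cycles"]
    label_column: str           # e.g. "pkg_energy_in_joules"
    train_fn: Callable          # Keras layers, scikit-learn regressor, ...
    output: str                 # e.g. "dynamic_power"

    def run(self, df):
        """Fit the training function on the queried dataframe."""
        return self.train_fn(df[self.feature_columns], df[self.label_column])
```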

⚠️ What are the use cases?

  • all information is provided: all pipelines are activated
  • frequency and architecture are not reported: ML power model pipelines cannot be trained
  • only some usage measurements are enabled: only the pipelines that rely on those measurements are activated

⚠️ How to develop a custom pipeline? To develop one,

  1. define the pipeline
    • input (usage metrics (e.g., cpu_cycles) + system metrics (e.g., frequency, architecture))
    • output (parameterized model, ML power model, dynamic power model)
  2. implement the train function (a fuller sketch follows this list)

     ```python
     def train(self, prom_client):
         ...
     ```

     using the corresponding dataframes from prom_client[usage_query][usage_metrics] and prom_client[system_query][system_metrics]
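A fuller hypothetical sketch of step 2, assuming the prom_client indexing convention above and a scikit-learn regressor; the attribute and column names are placeholders:

```python
# Hypothetical train() for a custom pipeline, assuming the prom_client
# indexing convention above. Attribute and column names are placeholders.
from sklearn.linear_model import LinearRegression

def train(self, prom_client):
    """Fit a regressor: usage + system metrics against measured power."""
    usage_df = prom_client[self.usage_query][self.usage_metrics]
    system_df = prom_client[self.system_query][self.system_metrics]
    features = usage_df.join(system_df)  # align the two frames on index
    labels = prom_client[self.usage_query][self.label_metric]
    self.model = LinearRegression().fit(features, labels)
    return self.model
```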

⚠️ TO-DO:

  • set the model path to [output]/[input]/[model_name] (current: [input]/[model_name])
    • [input]: Full, WorkloadOnly, CgroupOnly, CounterOnly, KubeletOnly
    • [output]: power, dynamic_power
    • [model_name]: produced model name; one pipeline can produce multiple models. For example, the scikit-learn pipeline can produce linear regression, polynomial regression, gradient-boosting regression, and KNN regression models at the same time.
  • implement estimator to apply each output
  • implement API routes for the ML archive
  • test/validation of new pipeline
  • support large models with an S3/COS link
  • distributed training cluster
  • support multiple model formats: h5, SavedModel, TFLite
  • extend the API route for ML-based models, for node, and for pod
  • shared storage (future)
  • ...