Add neuron-device-plugin load test to the test bed #499

shvbsle · 2025-03-24T12:26:16Z

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

hakuna-matatah · 2025-03-28T18:45:11Z

tests/assets/neuron/config.yaml

+{{$neuronResourcesPerPod := DefaultParam .CL2_NEURON_RESOURCES_PER_POD 64}}
+{{$neuronPods := DefaultParam .CL2_NEURON_PODS .Nodes}}
+
+name: neuron-workers


Have you evaluated whether we can leverage https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/scheduler-throughput/config.yaml instead of adding another config?

I have not. I will take a look and get back to you on that one.

I cannot directly use that config since the label selectors and default-pod manifest would have to be changed anyway.

I think the current approach of maintaining a separate config for neuron is cleaner and gives us more control over the load tests.

tests/assets/neuron/pod.yaml

tests/tekton-resources/pipelines/eks/awscli-cl2-load-with-addons-slos.yaml

tests/tekton-resources/tasks/generators/clusterloader/load-neuron-device-plugin.yaml

tests/tekton-resources/pipelines/eks/awscli-cl2-load-with-addons-slos.yaml

load test result outcome

shvbsle · 2025-04-18T17:22:15Z

Marking as draft because I've synced the PR with the latest changes from the main branch, which moves the pipelines to SMNG. Will mark as ready once I do another test run of the pipeline with the latest changes.

shvbsle · 2025-04-19T02:12:41Z

Confirmed that this works on self-managed node groups as well.

daemonsets are ready. This ensures that the load tests don't start prematurely and inflate the pod startup latency numbers. Removed neuron-scheduler since it is not being used.

hakuna-matatah · 2025-04-22T14:09:23Z

tests/tekton-resources/tasks/generators/clusterloader/load-neuron-device-plugin.yaml

+      default: ""
+    - name: neuron-pod-url
+      description: "URL for the Neuron pod specification file for loadtest"
+      default: ""


An empty default should serve a purpose; if it doesn't, the default shouldn't be defined at all. If someone runs this task by itself with the defaults, it will fail.

The idea of a default is that the task can run and succeed when no value is provided.

Also, try to avoid defining param values at the pipeline level if they can be set at the task level, so that we don't have to change every pipeline that uses the task, now or in the future.

This comment applies everywhere you have an empty string as a default that doesn't serve any purpose.
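As a rough sketch of what this could look like (not a definitive fix), the param below keeps the name and description from the diff but drops the empty default. In Tekton, a param declared without a `default` is required, so a run that omits it is rejected up front instead of failing later on an empty URL. The comments are illustrative additions.

```yaml
params:
  # No `default` key: the caller must supply this value explicitly.
  # Only add a default if there is a real value that lets the task succeed on its own.
  - name: neuron-pod-url
    description: "URL for the Neuron pod specification file for loadtest"
```

If a sensible task-level default does exist, setting it here rather than in each pipeline keeps the pipelines unchanged when the value moves.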

Thank you.

hakuna-matatah · 2025-04-22T20:43:15Z

tests/assets/neuron/config.yaml

+    Params:
+      action: start
+      labelSelector: group = neuron-worker
+      threshold: 60s


The upstream SLO is <= 5s.
Why are we using 60s here?
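One possible way to reconcile the two, sketched under assumptions: make the threshold a template parameter that defaults to the upstream 5s value and can be overridden per run if neuron pods genuinely need longer. The `DefaultParam` helper is the one already used at the top of this config; the variable name `$neuronPodStartupLatencyThreshold` and the override name `CL2_NEURON_POD_STARTUP_LATENCY_THRESHOLD` are hypothetical.

```yaml
# Hypothetical override name; "5s" mirrors the upstream SLO mentioned above.
{{$neuronPodStartupLatencyThreshold := DefaultParam .CL2_NEURON_POD_STARTUP_LATENCY_THRESHOLD "5s"}}

    Params:
      action: start
      labelSelector: group = neuron-worker
      threshold: {{$neuronPodStartupLatencyThreshold}}
```

This keeps the stricter SLO as the default while leaving any relaxed value visible as an explicit override.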

shvbsle added 4 commits March 24, 2025 12:24

add neuron plugin to the test bed

72ce8b9

updated installation tasks and load-test tasks

92d1d25

integrated neuron installation plugins with the main pipeline

f85a86a

exposed more params in the pipeline

65bcba2

shvbsle marked this pull request as ready for review March 26, 2025 04:53

shvbsle changed the title ~~WIP: add neuron load test to the test bed~~ Add neuron-device-plugin load test to the test bed Mar 26, 2025