-
Notifications
You must be signed in to change notification settings - Fork 42
Add neuron-device-plugin load test to the test bed #499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
{{$neuronResourcesPerPod := DefaultParam .CL2_NEURON_RESOURCES_PER_POD 64}} | ||
{{$neuronPods := DefaultParam .CL2_NEURON_PODS .Nodes}} | ||
|
||
name: neuron-workers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you evaluated if we can leverage https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/scheduler-throughput/config.yaml instead of additional config ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not. I will take a look and get back to you on that one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot directly use that config since the label selectors and default-pod manifest would have to be changed anyway.
I think the current approach of maintaining a separate config for neuron is cleaner and gives us more more control over the load-tests
tests/tekton-resources/pipelines/eks/awscli-cl2-load-with-addons-slos.yaml
Show resolved
Hide resolved
tests/tekton-resources/tasks/generators/clusterloader/load-neuron-device-plugin.yaml
Show resolved
Hide resolved
tests/tekton-resources/tasks/generators/clusterloader/load-neuron-device-plugin.yaml
Show resolved
Hide resolved
tests/tekton-resources/pipelines/eks/awscli-cl2-load-with-addons-slos.yaml
Show resolved
Hide resolved
marking as draft because I've synced the PR with the latest changes from main branch which moves the pipelines to SMNG. Will mark as ready once I do another test-run of the pipeline with latest changes. |
Confirmed that this works on Self managed node groups as well |
daemonsets are ready. This ensures that the load-tests don't start prematurely and inflate the pod startup latency numbers. Removed neuron-scheduler since it is not being used
default: "" | ||
- name: neuron-pod-url | ||
description: "URL for the Neuron pod specification file for loadtest" | ||
default: "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be value to empty defaults, if there is none, these shouldn't be defined. If someone runs this task by itself with defaults it will fail.
The idea of defaults is that, when not provided anything, it should be able to run and succeed without issues.
Also, try to avoid param values defined at pipeline if it can be set at task level, so that we can avoid changing all pipelines leveraging tasks now or in future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment applies everywhere you have empty string for default that doesn't serve any purpose.
Thank you.
Params: | ||
action: start | ||
labelSelector: group = neuron-worker | ||
threshold: 60s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upstream SLOs is <=5sec.
Wondering why are we doing 60sec
?
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.