Add neuron-device-plugin load test to the test bed #499

Open: wants to merge 14 commits into base: main

@shvbsle (Contributor) commented Mar 24, 2025

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@shvbsle shvbsle marked this pull request as ready for review March 26, 2025 04:53
@shvbsle shvbsle changed the title WIP: add neuron load test to the test bed Add neuron-device-plugin load test to the test bed Mar 26, 2025
{{$neuronResourcesPerPod := DefaultParam .CL2_NEURON_RESOURCES_PER_POD 64}}
{{$neuronPods := DefaultParam .CL2_NEURON_PODS .Nodes}}

name: neuron-workers
Contributor

Contributor Author

I have not. I will take a look and get back to you on that one.

Contributor Author

I cannot use that config directly, since the label selectors and the default pod manifest would have to change anyway.

I think the current approach of maintaining a separate config for Neuron is cleaner and gives us more control over the load tests.

@shvbsle shvbsle marked this pull request as draft April 11, 2025 17:35
@shvbsle shvbsle marked this pull request as ready for review April 17, 2025 00:40
@shvbsle shvbsle marked this pull request as draft April 18, 2025 17:20
@shvbsle (Contributor Author) commented Apr 18, 2025

Marking as draft because I've synced the PR with the latest changes from the main branch, which move the pipelines to SMNG (self-managed node groups). I will mark it as ready once I do another test run of the pipeline with the latest changes.

@shvbsle (Contributor Author) commented Apr 19, 2025

Confirmed that this works on self-managed node groups as well.

@shvbsle shvbsle marked this pull request as ready for review April 19, 2025 02:12
daemonsets are ready. This ensures that the load-tests don't start
prematurely and inflate the pod startup latency numbers. Removed
neuron-scheduler since it is not being used
default: ""
- name: neuron-pod-url
description: "URL for the Neuron pod specification file for loadtest"
default: ""
Contributor

An empty default should serve a purpose; if it doesn't, it shouldn't be defined. If someone runs this task by itself with the defaults, it will fail.

The idea of a default is that, when nothing is provided, the task should still run and succeed without issues.

Also, try to avoid setting param values at the pipeline level if they can be set at the task level, so that we avoid changing every pipeline that uses the task, now or in the future.
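
A minimal sketch of the distinction described above, assuming a Tekton `Task`. The task name, the `neuron-pods-per-node` param, and the step are hypothetical and only illustrate the pattern; `neuron-pod-url` is the param from this PR. A param declared without `default` is required, so a run that omits it is rejected up front instead of failing later on an empty string.

```yaml
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: example-neuron-load-test    # hypothetical task name
spec:
  params:
    # Required: no sensible default exists, so none is declared.
    - name: neuron-pod-url
      type: string
      description: "URL for the Neuron pod specification file for loadtest"
    # Optional: the default is a value the task can actually run with.
    - name: neuron-pods-per-node    # hypothetical param
      type: string
      description: "Number of Neuron pods to schedule per node"
      default: "1"
  steps:
    - name: show-params             # placeholder step for illustration
      image: busybox
      script: |
        echo "pod spec: $(params.neuron-pod-url)"
        echo "pods per node: $(params.neuron-pods-per-node)"
```

Keeping the working default on the task means pipelines that reference the task only need to pass the required value.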

Contributor

This comment applies everywhere you have an empty string as a default that doesn't serve any purpose.

Thank you.

Params:
action: start
labelSelector: group = neuron-worker
threshold: 60s
Contributor

The upstream SLO is <= 5s.
Why are we using 60s here?
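
One way to keep the threshold adjustable without editing the config is to route it through `DefaultParam`, the same helper the PR already uses for `CL2_NEURON_PODS`. This is only a sketch: the override name `CL2_NEURON_POD_STARTUP_LATENCY_THRESHOLD`, the step name, and the identifier are hypothetical, and `Method: PodStartupLatency` is an assumption, since the snippet under review shows only the `Params` block.

```yaml
# Hypothetical override; the 60s fallback mirrors the value in this PR.
{{$podStartupLatencyThreshold := DefaultParam .CL2_NEURON_POD_STARTUP_LATENCY_THRESHOLD "60s"}}

steps:
- name: Starting pod startup latency measurement   # hypothetical step name
  measurements:
  - Identifier: NeuronPodStartupLatency            # hypothetical identifier
    Method: PodStartupLatency                      # assumed measurement method
    Params:
      action: start
      labelSelector: group = neuron-worker
      threshold: {{$podStartupLatencyThreshold}}
```

With this shape, a run could be held to the stricter upstream value by supplying `5s` for the override, while the config keeps the more lenient fallback.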
