Support Kubeflow TrainJob (v2) #3884
Comments
TrainJob -> JobSet -> (Job A + Job B). I think you'd want to explore integration with JobSet before you start with Jobs. What exactly is the end state of the integration? Do we want DependsOn, or are we fine with support via JobSet? If we want to support DependsOn, I think the Kubeflow community should focus on implementing that in JobSet. Whatever we add we will have to support, so "short term" features would end up being long-term features, as we can't really drop support once customers use it.
I did not indicate that we aim to implement Even if Kubeflow uses the JobSet After Kueue supports the separate admission Job, the TrainJob Kueue integration will be implemented as a JobSet-dependent integration.
What would you like to be added:
I would like to support the Kubeflow TrainJob, which is Kubeflow TrainingOperator v2.
The new TrainJob API relies on the JobSet API, but we should not enqueue the entire TrainJob or JobSet, since the TrainJob uses the JobSet DependsOn feature, which has some Kueue collaboration problems, as we mentioned there. Hence, in the short term, TrainJob queueing should rely on the Kueue batch/v1 Job integration to avoid those problems once the TrainJob is submitted. This concept is similar to the Deployment integration, which depends on the Pod integration.
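As a rough illustration of what queueing at the batch/v1 Job level could look like, the sketch below prepares a child Job manifest the way Kueue's existing Job integration expects it: the `kueue.x-k8s.io/queue-name` label names the LocalQueue, and the Job starts suspended until it is admitted. The helper `prepare_job_for_kueue`, the Job name `trainjob-initializer`, and the queue name `user-queue` are hypothetical and only for illustration; this is not code from Kueue or Kubeflow, and how a TrainJob controller would actually propagate the label is an open design question.

```python
import copy

# Label that Kueue's batch/v1 Job integration reads to pick the LocalQueue.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def prepare_job_for_kueue(job, queue_name):
    """Return a copy of a batch/v1 Job manifest set up for Kueue queueing.

    Hypothetical helper: adds the queue-name label and creates the Job
    suspended, so it only starts after Kueue admits it.
    """
    prepared = copy.deepcopy(job)
    labels = prepared.setdefault("metadata", {}).setdefault("labels", {})
    labels[QUEUE_LABEL] = queue_name
    prepared.setdefault("spec", {})["suspend"] = True
    return prepared


# A child Job as it might be generated from a TrainJob (names are assumptions).
initializer_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "trainjob-initializer"},
    "spec": {"template": {"spec": {"restartPolicy": "Never", "containers": []}}},
}

queued_job = prepare_job_for_kueue(initializer_job, "user-queue")
```

Because each child Job carries its own label and suspend flag, each one would be admitted separately instead of the whole JobSet being admitted at once.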
Additionally, we might be able to extend the JobSet integration so that a JobSet with dependsOn is enqueued through the batch/v1 Job integration, which would avoid locking unused computing resources for long periods.
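To make the resource-lock concern concrete, here is a simplified back-of-the-envelope model, not a description of Kueue's actual accounting. It assumes the steps of a JobSet run strictly one after another (as with dependsOn) and that admitting the whole JobSet reserves quota for every step from the start. The step names, GPU counts, and durations are made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    gpus: int      # GPUs reserved for this step
    minutes: int   # how long the step runs


def idle_gpu_minutes(steps):
    """GPU-minutes reserved but unused when all steps are admitted together.

    Each step's quota sits idle for as long as the steps it depends on
    (all earlier steps, in this simplified model) are still running.
    """
    idle = 0
    elapsed = 0
    for step in steps:
        idle += step.gpus * elapsed
        elapsed += step.minutes
    return idle


# Hypothetical TrainJob: a 30-minute initializer with no GPUs, then an 8-GPU trainer.
steps = [Step("initializer", gpus=0, minutes=30), Step("trainer", gpus=8, minutes=120)]

whole_jobset_idle = idle_gpu_minutes(steps)  # 8 GPUs * 30 minutes = 240 GPU-minutes
```

Under these assumptions, admitting the whole JobSet holds 8 GPUs idle for 30 minutes (4 GPU-hours), whereas admitting each Job only when it is ready to run would not reserve the trainer's GPUs during the initializer step.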
Why is this needed:
Currently, we support the Kubeflow v1 APIs, but support for the v1 APIs will end within the next year.
We will probably drop v1 support completely in the next 2 or 3 Training Operator releases.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.