Support Kubeflow TrainJob (v2) #3884

Open
2 of 3 tasks
tenzen-y opened this issue Dec 18, 2024 · 4 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@tenzen-y
Member

tenzen-y commented Dec 18, 2024

What would you like to be added:
I would like to support the Kubeflow TrainJob, which is Kubeflow TrainingOperator v2.

The new TrainJob API relies on the JobSet API, but we should not enqueue the entire TrainJob or JobSet, since the TrainJob uses the JobSet DependsOn feature, which has some problems when used together with Kueue, as we mentioned there.

Hence, in the short term, the TrainJob queueing should rely on the Kueue batch/v1 Job integration to mitigate and avoid the problems once the TrainJob is submitted. This concept is similar to Deployment integration, which depends on the Pod integration.
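As a rough illustration of the "parent integration depends on a child integration" idea, the sketch below copies a queue-name label from a parent object onto the batch/v1 Jobs created for it, so that the existing Job integration could pick them up. This is only a toy model on plain dictionaries, not Kueue or Kubeflow code: `propagate_queue_name`, the object names, and the queue name `team-a` are hypothetical, and whether the TrainJob integration would actually propagate the label this way is an assumption. The label key `kueue.x-k8s.io/queue-name` is the one Kueue uses to select a LocalQueue.

```python
import copy

# Label key that Kueue uses to select the LocalQueue for a workload.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def propagate_queue_name(parent, child_jobs):
    """Return copies of child_jobs that carry the parent's queue-name label.

    Hypothetical helper: it models a parent integration that only forwards
    the queue name and leaves queueing to the per-Job integration.
    """
    queue = parent.get("metadata", {}).get("labels", {}).get(QUEUE_LABEL)
    labeled = []
    for job in child_jobs:
        job = copy.deepcopy(job)  # never mutate the caller's objects
        if queue is not None:
            labels = job.setdefault("metadata", {}).setdefault("labels", {})
            labels[QUEUE_LABEL] = queue
        labeled.append(job)
    return labeled


# Illustrative objects only; real manifests carry many more fields.
train_job = {
    "kind": "TrainJob",
    "metadata": {"name": "demo", "labels": {QUEUE_LABEL: "team-a"}},
}
jobs = [
    {"kind": "Job", "metadata": {"name": "demo-job-a"}},
    {"kind": "Job", "metadata": {"name": "demo-job-b"}},
]
labeled_jobs = propagate_queue_name(train_job, jobs)
```

With this shape, each Job is queued on its own, and the parent object is never enqueued as a whole.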

Additionally, we might be able to extend the JobSet integration so that a JobSet with dependsOn is enqueued through the batch/v1 Job integration, which would avoid locking unused computing resources for a long time.
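To make the resource-lock concern concrete, here is a minimal sketch that compares how much quota sits reserved but unused when a sequential chain is admitted as one unit versus Job by Job. It is a toy model under stated assumptions, not Kueue's admission logic: it assumes whole-JobSet admission holds the quota of every step until the last step finishes, and the step names, GPU counts, and durations are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One ReplicatedJob in a strictly sequential chain (toy stand-in for DependsOn)."""
    name: str
    gpus: int      # quota the step needs while it runs
    duration: int  # time units the step runs for


def idle_reserved_gpu_time(steps, per_job_admission):
    """GPU-time that is reserved but not used while the chain runs in order."""
    total_duration = sum(s.duration for s in steps)
    used = sum(s.gpus * s.duration for s in steps)
    if per_job_admission:
        # Each Job reserves quota only while it is running.
        reserved = used
    else:
        # Assumption: the whole JobSet is admitted at once, so every step's
        # quota is held from admission until the last step finishes.
        reserved = sum(s.gpus for s in steps) * total_duration
    return reserved - used


# Hypothetical two-step chain: a small setup step followed by a large training step.
steps = [Step("initializer", gpus=1, duration=10), Step("trainer", gpus=8, duration=100)]
whole = idle_reserved_gpu_time(steps, per_job_admission=False)
per_job = idle_reserved_gpu_time(steps, per_job_admission=True)
```

In this example the whole-JobSet case leaves 180 GPU-time units reserved but idle, while per-Job admission leaves none; the gap grows with the length of the earlier steps.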

Why is this needed:
Currently, we support the Kubeflow v1 APIs, but support for the v1 APIs will end within the next year.
We will probably drop v1 support completely within the next 2 or 3 Training Operator releases.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@tenzen-y tenzen-y added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 18, 2024
@tenzen-y
Member Author

cc: @andreyvelich @mimowo @kannon92

@mimowo
Contributor

mimowo commented Dec 18, 2024

@mwielgus @mwysokin

@kannon92
Contributor

Hence, in the short term, the TrainJob queueing should rely on the Kueue batch/v1 Job integration to mitigate and avoid the problems once the TrainJob is submitted. This concept is similar to Deployment integration, which depends on the Pod integration.

TrainJob -> JobSet -> (Job A + JobB)

I think you'd want to explore integration with JobSet before starting with Jobs.

What exactly is the end state of integration? Do we want DependsOn or are we fine with support via JobSet?

If we want to support DependsOn I think Kubeflow community should focus on implementing that in JobSet. Whatever we add we will have to support. So "Short Term" features would end up being long term features as we can't really drop support once customers use it.

@tenzen-y
Member Author

tenzen-y commented Dec 21, 2024

If we want to support DependsOn I think Kubeflow community should focus on implementing that in JobSet. Whatever we add we will have to support. So "Short Term" features would end up being long term features as we can't really drop support once customers use it.

I did not mean that we aim to implement DependsOn in the Kubeflow repository; we will implement DependsOn on the JobSet side first.
The motivation for using the batch/v1 Job Kueue integration is described in https://github.com/andreyvelich/jobset/blob/ea622a6d6ba8cfd876a7d662da2e141c977ecddc/keps/672-serial-job-execution/README.md#risks-and-mitigations.

Even if Kubeflow uses the JobSet DependsOn, as a first step we should implement the TrainJob integration on top of the Kueue batch/v1 Job integration, since there is currently no way to enqueue a JobSet step by step.
In the long term, we need to consider how we can enqueue each ReplicatedJob separately, as we mentioned in the JobSet DependsOn KEP.

Once Kueue supports admitting each Job separately, the TrainJob Kueue integration will be implemented on top of the JobSet integration.
