JosefNagelschmidt
Contributor

Why are these changes needed?

There is currently a race condition when spec.jobId is set through an update call on the CR after it has been created. Additionally, ray job submit is started asynchronously in a goroutine, which can lead to an "exec: not started" error.

See issue #4070.

Related issue number

Closes #4070.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@JosefNagelschmidt JosefNagelschmidt changed the title from "avoid race in jodId" to "avoid race in jobId" on Sep 11, 2025
Collaborator

@win5923 win5923 left a comment

Thanks for fixing it, I have tested locally.

@Future-Outlier Future-Outlier self-assigned this Sep 13, 2025
Member

@Future-Outlier Future-Outlier left a comment

LGTM, thank you!

Hi, @JosefNagelschmidt
Would you like to connect on Ray Slack? I’d love to learn more about how I can help.
My name is Han-Ju Chen (ray team) (GitHub - Future-Outlier)

@Future-Outlier
Member

cc @rueian for review, I am confident that this can be merged.

@Future-Outlier Future-Outlier changed the title from "avoid race in jobId" to "[kubectl-plugin] avoid race in jobId" on Sep 19, 2025
if err != nil {
	return fmt.Errorf("Failed to get latest version of Ray job: %w", err)
}
options.RayJob.Spec.JobId = rayJobID
Collaborator

I think there is a reason why we set the job ID after creating the CR. Could you instead resolve the "the object has been modified" conflict by retrying the update?

Collaborator

@win5923 win5923 Sep 19, 2025

Here is my thought:

  • Before this PR, when a user specified a submissionID, the RayJob CR had already been created without it. To keep the CR aligned with the actual Ray submission, we had to do a post-create Get/Update to set spec.jobId.
  • In this PR, the submissionID is generated (or taken from the user) upfront and embedded into the RayJob before it is applied/created (e.g., via rayJobApplyConfig.Spec.JobId). This keeps the submissionID consistent between the CR and the Ray submission, and removes the need for a follow-up Get/Update.
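
The second approach can be sketched as follows. This is a simplified standard-library illustration, not the PR's code: rayJob, rayJobSpec, resolveSubmissionID, and buildRayJob are hypothetical names, and the "submission-" prefix with a random hex suffix is an assumed ID format chosen only for the example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// rayJobSpec and rayJob are simplified, hypothetical stand-ins for the RayJob CR.
type rayJobSpec struct {
	JobId string
}

type rayJob struct {
	Name string
	Spec rayJobSpec
}

// resolveSubmissionID keeps a user-supplied ID and otherwise generates one.
// The ID format here is illustrative only.
func resolveSubmissionID(userSupplied string) (string, error) {
	if userSupplied != "" {
		return userSupplied, nil
	}
	suffix := make([]byte, 8)
	if _, err := rand.Read(suffix); err != nil {
		return "", fmt.Errorf("failed to generate submission ID: %w", err)
	}
	return "submission-" + hex.EncodeToString(suffix), nil
}

// buildRayJob sets Spec.JobId before the object is ever sent to the API
// server, so a single create call is enough and no later update is needed.
func buildRayJob(name, userSubmissionID string) (rayJob, error) {
	id, err := resolveSubmissionID(userSubmissionID)
	if err != nil {
		return rayJob{}, err
	}
	return rayJob{Name: name, Spec: rayJobSpec{JobId: id}}, nil
}

func main() {
	job, err := buildRayJob("demo", "my-submission")
	if err != nil {
		panic(err)
	}
	fmt.Println(job.Spec.JobId) // my-submission
}
```

Because the ID is fixed before creation, the same value can be passed to the submit command without reading the CR back.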

Contributor Author

> I think there is a reason why we set the job id after creating the CR. Could you resolve the conflict of the object has been modified by retrying the update?

That is an option, but I wonder why it would be necessary. What is the exact reason for setting the job ID after creating the CR? Setting it before creation works seamlessly for me, and the suggested fix looks cleaner, to me at least.

Development

Successfully merging this pull request may close these issues.

[Bug] job submit fails due to post-create update of spec.jobId
4 participants