Use large instance for OOM test #52400

Open
dentiny wants to merge 1 commit into master

Conversation

@dentiny (Contributor) commented Apr 17, 2025

Why are these changes needed?

In #52250, a community contributor made a feature-wise no-op change that splits large Bazel targets into smaller ones, but CI keeps failing with OOM:

[2025-04-17T05:33:59Z] Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
[2025-04-17T05:33:59Z] gcc: fatal error: Killed signal terminated program cc1plus

Splitting targets uses more memory because Bazel parallelizes compilation at the target level, so smaller targets generally mean more parallelism and therefore higher peak memory consumption.
I also see people increasing resource allocation in another PR; my hunch is that we are already on the verge of OOM. :(
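
To make the memory pressure concrete, a rough sketch (the target name and the per-process numbers are illustrative assumptions, not measurements from this PR): Bazel's default --jobs setting is "auto", roughly one compile action per core, and heavy C++ translation units can each drive cc1plus to a few GB of peak RSS.

  # Illustrative only: with --jobs=auto (~one action per core), 16 concurrent
  # compile actions at ~2 GB each already need ~32 GB of RAM.
  bazel build --jobs=auto //hypothetical:target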

Related issue number

Closes #52399

@dentiny added the "go" label (add ONLY when ready to merge, run all tests) on Apr 17, 2025
@dentiny requested review from @aslonnie and @jjyao on April 17, 2025 08:34
@aslonnie (Collaborator) left a comment

if it is truly the case as described, then we would need to change the size of almost all the test jobs, because almost all of them compile the wheel. they compile with the bazel cache though, but still..

seems to me the right thing to do is:

  • either reduce parallelism when compiling, by configuring a smaller --jobs value (sketched below),
  • or compile the parts only once on a bigger machine and reuse the artifacts on smaller machines.

just increasing the machine size for everything has big implications on the ci cost profile, and does not really make sense to me.
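
A minimal sketch of the first option (the cap of 8 and the target name are placeholders, not recommendations from this thread): pass an explicit --jobs to the CI build invocation instead of relying on the per-core default.

  # Illustrative only: cap concurrent compile actions regardless of core count.
  bazel build --jobs=8 //hypothetical:target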

@edoakes (Collaborator) commented Apr 17, 2025

Agreed ^ for now let's reduce parallelism

@dentiny (Contributor, Author) commented Apr 18, 2025

either reduce parallelism when compiling, by configuring a smaller --jobs value

Sure, the --jobs value defaults to the CPU core count; do you have a recommended value?
I think it's also risky, since it could potentially slow down the build a lot;
upgrading to a larger instance is the brainless go-to option IMO.

Update:
I'm willing to update the --jobs number if you have a suggestion :)
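
For illustration, two knobs that could express "smaller than the core count" without hard-coding a number (both values below are placeholders, untested on this CI): --jobs accepts host-relative expressions, and --local_ram_resources makes Bazel schedule local actions against a RAM budget rather than just a core count.

  # Illustrative only: scale parallelism to half the cores, or shrink the RAM
  # budget Bazel assumes it can use for local actions.
  bazel build --jobs='HOST_CPUS*.5' //hypothetical:target
  bazel build --local_ram_resources='HOST_RAM*.5' //hypothetical:target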

@aslonnie (Collaborator)

what I do not understand is why only these two jobs hit OOM..

I looked at the other PR that is triggering the oom, there are other tests that are compiling the wheel on even smaller machines, for example:
https://buildkite.com/ray-project/premerge/builds/38454#0196422b-2244-4a66-a507-ad5d6d80fc88

but they are not hitting OOM.

@aslonnie (Collaborator)

#51673 is different btw. that is about compiling the debug wheel.

@kenchung285 (Contributor) commented Apr 20, 2025

Hi @aslonnie, I'm the developer working on #52250. I have tested the large instance applied to the CI in a049468, and it did work for the OOM issue (CI success: https://buildkite.com/ray-project/premerge/builds/38552).
Please also note that building the debug wheel also needs to compile the raylet, so we need to use the large instance there, too. I did this in b8773bd.
cc @dentiny

@aslonnie (Collaborator)

almost all our CI jobs need to compile the raylet today.. and if we allow https://github.com/ray-project/ray/pull/52250/files to merge, it might break a lot more things if we do not really understand what exactly is going on with the memory footprint of the compile process.

it seems to only affect compiling in the manylinux environment, so it might be a config issue somewhere rather than the machine not being large enough. or it might be related to the compiler used in manylinux.

could you either do more investigation, or give me some time to investigate this issue myself?
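
One way such an investigation could start (purely a sketch; the source file is hypothetical, and GNU time plus identical compiler flags in both environments are assumed): measure the peak memory of a single heavy compile under each toolchain and compare.

  # Illustrative only: GNU time prints the compile's peak RSS, which includes
  # the cc1plus child process.
  /usr/bin/time -v g++ -O2 -c src/ray/hypothetical_big_file.cc -o /dev/null
  # Compare the "Maximum resident set size" line from a run in the regular CI
  # environment with a run inside the manylinux build image.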

Successfully merging this pull request may close this issue: [core] Ray build failure due to noop refactoring (#52399)

6 participants