Use large instance for OOM test #52400

Open
dentiny wants to merge 1 commit into master

Conversation

@dentiny (Contributor) commented Apr 17, 2025

Why are these changes needed?

In #52250, a community contributor made a feature-wise no-op change that splits large Bazel targets into smaller ones, but CI keeps failing with OOM:

[2025-04-17T05:33:59Z] Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
[2025-04-17T05:33:59Z] gcc: fatal error: Killed signal terminated program cc1plus

Splitting targets uses more memory because Bazel parallelizes compilation at the target level, so smaller targets generally mean more parallelism and therefore higher peak memory consumption.
I also see people increasing resource allocation in another PR; my hunch is that we are already on the verge of OOM. :(
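
To make the memory pressure concrete, a rough sketch (the target name and the per-process numbers are illustrative assumptions, not measurements from this PR): Bazel's default --jobs setting is "auto", roughly one compile action per core, and heavy C++ translation units can each drive cc1plus to a few GB of peak RSS.

  # Illustrative only: with --jobs=auto (~one action per core), 16 concurrent
  # compile actions at ~2 GB each already need ~32 GB of RAM.
  bazel build --jobs=auto //hypothetical:target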

Related issue number

Closes #52399

@dentiny added the "go" label (add ONLY when ready to merge, run all tests) on Apr 17, 2025
@dentiny requested review from @aslonnie and @jjyao on April 17, 2025 08:34
@aslonnie (Collaborator) left a comment

if it is truly the case as described, then we would need to change the size of almost all the test jobs, because almost all of them compile the wheel. they compile with the bazel cache though, but still..

seems to me the right thing to do is:

  • either reduce parallelism when compiling, by configuring a smaller --jobs value (sketched below),
  • or compile the parts only once on a bigger machine and reuse the artifacts on smaller machines.

just increasing the machine size for everything has big implications on the ci cost profile, and does not really make sense to me.
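
A minimal sketch of the first option (the cap of 8 and the target name are placeholders, not recommendations from this thread): pass an explicit --jobs to the CI build invocation instead of relying on the per-core default.

  # Illustrative only: cap concurrent compile actions regardless of core count.
  bazel build --jobs=8 //hypothetical:target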

@edoakes (Collaborator) commented Apr 17, 2025

Agreed ^ for now let's reduce parallelism

@dentiny (Contributor, Author) commented Apr 18, 2025

either reduce parallelism when compiling, by configuring a smaller --jobs value

Sure, the --jobs value defaults to the CPU core count; do you have a recommended value?
I think it's also risky, since it could potentially slow down the build a lot;
upgrading to a larger instance is the brainless go-to option IMO.

Update:
I'm willing to update the --jobs number if you have a suggestion :)
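
For illustration, two knobs that could express "smaller than the core count" without hard-coding a number (both values below are placeholders, untested on this CI): --jobs accepts host-relative expressions, and --local_ram_resources makes Bazel schedule local actions against a RAM budget rather than just a core count.

  # Illustrative only: scale parallelism to half the cores, or shrink the RAM
  # budget Bazel assumes it can use for local actions.
  bazel build --jobs='HOST_CPUS*.5' //hypothetical:target
  bazel build --local_ram_resources='HOST_RAM*.5' //hypothetical:target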

@aslonnie (Collaborator)

what I do not understand is why only these two jobs hit OOM..

I looked at the other PR that is triggering the oom, there are other tests that are compiling the wheel on even smaller machines, for example:
https://buildkite.com/ray-project/premerge/builds/38454#0196422b-2244-4a66-a507-ad5d6d80fc88

but they are not hitting OOM.

@aslonnie (Collaborator)

#51673 is different btw. that is about compiling the debug wheel.

@kenchung285 (Contributor) commented Apr 20, 2025

Hi @aslonnie, I'm the developer working on #52250. I have tested the large instance applied to the CI in a049468, and it did work for the OOM issue (CI success: https://buildkite.com/ray-project/premerge/builds/38552).
Please also note that building the debug wheel also needs to compile the raylet, so we need to use the large instance there, too. I did this in b8773bd.
cc @dentiny

@aslonnie (Collaborator)

almost all our CI jobs need to compile the raylet today.. and if we allow https://github.com/ray-project/ray/pull/52250/files to merge, it might break a lot more things if we do not really understand what exactly is going on with the memory footprint of the compile process.

it seems to only affect compiling in the manylinux environment, so it might be a config issue somewhere rather than the machine not being large enough. or it might be related to the compiler used in manylinux.

could you either do more investigation, or give me some time to investigate this issue myself?
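
One way such an investigation could start (purely a sketch; the source file is hypothetical, and GNU time plus identical compiler flags in both environments are assumed): measure the peak memory of a single heavy compile under each toolchain and compare.

  # Illustrative only: GNU time prints the compile's peak RSS, which includes
  # the cc1plus child process.
  /usr/bin/time -v g++ -O2 -c src/ray/hypothetical_big_file.cc -o /dev/null
  # Compare the "Maximum resident set size" line from a run in the regular CI
  # environment with a run inside the manylinux build image.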

Successfully merging this pull request may close this issue: [core] Ray build failure due to noop refactoring (#52399)

6 participants