Use large instance for OOM test #52400
base: master
Conversation
If this is truly the case as described, then we would need to change the size of almost all the test jobs, because almost all of them compile the wheel. They do compile with the bazel cache, but still..
It seems to me the right thing to do is:
- either reduce parallelism when compiling, by configuring a smaller `--jobs` argument (sketched further down in this thread),
- or compile the parts only once on a bigger machine, and reuse them on the smaller machines.
Just increasing the machine size for everything has big implications on the CI cost profile, and does not really make sense to me.
Agreed ^. For now, let's reduce parallelism. |
Sure. The `--jobs` value defaults to the CPU core count; do you have a recommended value? Update: |
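For illustration only, a minimal sketch of what a reduced-parallelism setting could look like; the value 8 and the `//...` target pattern are assumptions for the sketch, not a recommendation from this thread:

```sh
# Cap concurrent build actions instead of defaulting to one per CPU core.
# The value 8 and the //... target pattern are placeholders for illustration.
bazel build //... --jobs=8

# Equivalently, the cap can be persisted in .bazelrc so every CI invocation
# picks it up without changing individual job scripts:
#   build --jobs=8
```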
What I do not understand is why only these two jobs run OOM. I looked at the other PR that is triggering the OOM; there are other tests that compile the wheel on even smaller machines, for example, but those are not running OOM. |
#51673 is different, btw; that one is about compiling the debug wheel. |
Hi @aslonnie, I'm the developer working on #52250. I have tested the large instance applied to the CI in a049468, and it does work for the OOM issue. (CI success: https://buildkite.com/ray-project/premerge/builds/38552) |
Almost all our CI jobs need to compile raylet today, and if we allow https://github.com/ray-project/ray/pull/52250/files to merge, it might break a lot more things if we do not really understand what exactly is going on with the memory footprint of the compiling process. It seems to only affect compiling in the manylinux environment, so it might be a config issue somewhere rather than the machine not being large enough, or it might be related to the compiler used in manylinux. Could you either do more investigation, or give me some time to investigate this issue myself? |
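As a hypothetical investigation aid (not something proposed in the thread), peak compiler memory inside the manylinux container could be sampled while the build runs in another shell; the process names are assumptions about what the toolchain spawns:

```sh
# Every 5 seconds, sum the resident memory of compiler/linker processes.
# cc1plus/clang/ld are assumed process names for the manylinux toolchain.
while sleep 5; do
  ps -eo rss,comm --no-headers \
    | awk '/cc1plus|clang|ld/ {sum += $1} END {printf "%.0f MB in compiler/linker processes\n", sum/1024}'
done
```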
Why are these changes needed?
In #52250, a community contributor makes a feature-wise no-op change that splits large targets into smaller ones, but CI keeps failing due to OOM.
The reason splitting targets uses more memory is that bazel parallelizes compilation at the target level, so smaller targets generally mean more parallelism, and thus more memory consumption.
I also see people increasing resource allocation in another PR; my hunch is we're already on the verge of OOM. :(
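A minimal sketch of memory-aware scheduling as a possible complement to the larger instance; whether it helps in this case is an assumption, and the 0.5 fraction is illustrative (bazel's default is HOST_RAM*.67):

```sh
# Ask bazel to schedule local actions against a smaller fraction of host RAM,
# which reduces how many memory-heavy compile/link actions run at once.
# The 0.5 fraction is a placeholder, not a tuned value.
bazel build //... "--local_ram_resources=HOST_RAM*.5"
```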
Related issue number
Closes #52399