
Help reproducing training run #8

Open
ctjlewis opened this issue Jan 26, 2025 · 8 comments
@ctjlewis

ctjlewis commented Jan 26, 2025

8x H100 80GB on Lambda Labs (gpu_8x_h100_sxm5).

  • If I send the job from inside examples/script, I get an error that train_ppo_qwen_base_math_lv35_new.sh does not exist.
  • If I set --working-dir=. from examples/script, it can't resolve the openrlhf module from inside /tmp (a possible workaround is sketched just after this list).
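
One workaround that might address both points (a sketch, untested; the runtime-env keys below are standard Ray options, not anything documented in this repo) is to submit from train/ and ship the repo root as the job's working_dir, so the openrlhf package is uploaded alongside the script and stays importable inside the /tmp runtime env:

# hypothetical: upload the repo root as working_dir and keep it on PYTHONPATH for the workers
cd ~/simpleRL-reason/train
ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{
        "working_dir": ".",
        "pip": ["ray==2.12.0", "latex2sympy2", "timeout_decorator"],
        "env_vars": {"PYTHONPATH": "."}
    }' \
    -- /bin/bash examples/script/train_ppo_qwen_base_math_lv35_new.sh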

If I adjust the script to point to the Qwen 2.5 Math 7B snapshot downloaded from Hugging Face:

python3 openrlhf/cli/train_ppo_ray_box.py \
  ...
  --pretrain ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-Math-7B/snapshots/b101308fe89651ea5ce025f25317fea6fc07e96e \
  ...

And send the job from train/ after starting the cluster:

(venv) ubuntu@192-222-52-101:~/simpleRL-reason/train$ ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"pip": ["ray==2.12.0", "latex2sympy2", "timeout_decorator"]}' \
    -- /bin/bash examples/script/train_ppo_qwen_base_math_lv35_new.sh
Job submission server address: http://127.0.0.1:8265

-------------------------------------------------------
Job 'raysubmit_tUKfTdfzCNf7hq93' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_tUKfTdfzCNf7hq93
  Query the status of the job:
    ray job status raysubmit_tUKfTdfzCNf7hq93
  Request the job to be stopped:
    ray job stop raysubmit_tUKfTdfzCNf7hq93

Tailing logs until the job exits (disable with --no-wait):
[2025-01-26 19:30:29,678] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2025-01-26 19:30:35,900 INFO worker.py:1429 -- Using address 0.0.0.0:6379 set in the environment variable RAY_ADDRESS
2025-01-26 19:30:35,900 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 0.0.0.0:6379...
2025-01-26 19:30:35,906 INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
(pid=294709) [2025-01-26 19:30:38,800] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=294913) [2025-01-26 19:30:45,749] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=294914) [2025-01-26 19:30:46,316] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=296122) [2025-01-26 19:30:52,813] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)

It doesn't error, and I can see the cluster in ray status:

Every 2.0s: ray status                           192-222-52-101: Sun Jan 26 19:34:37 2025

======== Autoscaler status: 2025-01-26 19:31:10.592241 ========
Node status
---------------------------------------------------------------
Active:
 1 node_40d13a9a1a337ac7e40fa07ece7e9653d1d8844b67d2f4ecec97b6e6
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 8.0/208.0 CPU (8.0 used of 8.0 reserved in placement groups)
 8.0/8.0 GPU (8.0 used of 8.0 reserved in placement groups)
 0B/1.54TiB memory
 0B/186.26GiB object_store_memory

Demands:
 {'CPU': 8.0, 'GPU': 8.0} * 1 (STRICT_SPREAD): 1+ pending placement groups

But nothing is ever loaded into GPU memory and training never appears to start; wandb receives no data even though I'm logged in and the API key is set.

Does anyone have an idea what might be going on? train_ppo_qwen_base_math_lv35_new.sh should call openrlhf and start training, but it never seems to.

Sun Jan 26 19:39:04 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:61:00.0 Off |                    0 |
| N/A   31C    P0             70W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:62:00.0 Off |                    0 |
| N/A   32C    P0             73W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:63:00.0 Off |                    0 |
| N/A   27C    P0             68W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:64:00.0 Off |                    0 |
| N/A   30C    P0             73W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:6A:00.0 Off |                    0 |
| N/A   32C    P0             69W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:6B:00.0 Off |                    0 |
| N/A   28C    P0             70W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:6C:00.0 Off |                    0 |
| N/A   30C    P0             71W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:6D:00.0 Off |                    0 |
| N/A   27C    P0             70W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
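
For reference, the stuck scheduling can be inspected in more detail with generic Ray commands (a sketch, assuming the Ray state CLI is available in this Ray version):

ray status -v               # verbose per-node resource usage and pending placement-group demands
ray list placement-groups   # lists placement groups and their states; stuck ones stay PENDING
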
@ctjlewis
Author

@jxhe, looking at the training script: does this expect 16 GPUs?

@ctjlewis
Author

ctjlewis commented Jan 26, 2025

It clearly provides --num-gpus 8, so I don't think so...

@Zeng-WH, @HYZ17, could you help here?

@ctjlewis
Author

OK, it is in fact expecting two nodes?

[Image]
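
If it really does expect a second node, my understanding (generic Ray usage, not taken from this repo's docs) is that another machine has to join the existing head before re-submitting, roughly:

# run on the second node; HEAD_NODE_IP is a placeholder for the head node's address
ray start --address='HEAD_NODE_IP:6379'
# back on the head node, the extra node and GPUs should then appear
ray status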

@HYZ17
Collaborator

HYZ17 commented Jan 27, 2025

The minimum hardware requirement is 6 H100/A100-80GB GPUs (we haven't tested this configuration yet).

For more details, please refer here. Thank you!

@ctjlewis
Author

Thank you very much, I will close.

@ctjlewis ctjlewis reopened this Jan 27, 2025
@ctjlewis
Author

ctjlewis commented Jan 27, 2025

@HYZ17 I will reopen this briefly. For the single node (8x A100 80GB), we get this result:

~/ds-utah/simpleRL-reason/train$ ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"pip": ["ray==2.12.0", "latex2sympy2", "timeout_decorator"]}' \
    -- /bin/bash ~/ds-utah/simpleRL-reason/train/examples/script/train_ppo_qwen_base_math_lv35_1_node.sh
Job submission server address: http://127.0.0.1:8265

-------------------------------------------------------
Job 'raysubmit_dVBCv4TiwMPym2Yt' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_dVBCv4TiwMPym2Yt
  Query the status of the job:
    ray job status raysubmit_dVBCv4TiwMPym2Yt
  Request the job to be stopped:
    ray job stop raysubmit_dVBCv4TiwMPym2Yt

Tailing logs until the job exits (disable with --no-wait):
[2025-01-27 06:56:35,785] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2025-01-27 06:56:44,696 INFO worker.py:1429 -- Using address 0.0.0.0:6379 set in the environment variable RAY_ADDRESS
2025-01-27 06:56:44,696 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 0.0.0.0:6379...
2025-01-27 06:56:44,704 INFO worker.py:1740 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
(pid=66198) [2025-01-27 06:56:48,969] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=66408) [2025-01-27 06:56:55,546] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=66409) [2025-01-27 06:56:55,607] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=66693) [2025-01-27 06:57:02,366] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=66694) [2025-01-27 06:57:02,300] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(ActorModelRayActorBOX pid=66198) [2025-01-27 06:57:05,553] [INFO] [comm.py:652:init_distributed] cdb=None
(ActorModelRayActorBOX pid=66198) [2025-01-27 06:57:05,553] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(ActorModelRayActorBOX pid=66198) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
(ActorModelRayActorBOX pid=66198) [2025-01-27 06:57:05,673] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
(ActorModelRayActorBOX pid=66408) *** SIGSEGV received at time=1737961028 on cpu 190 ***
(ActorModelRayActorBOX pid=66408) PC: @     0x7c4bb7092ce8  (unknown)  ncclTopoCheckNet()
(ActorModelRayActorBOX pid=66408)     @     0x7c7c0de42520       3392  (unknown)
(ActorModelRayActorBOX pid=66408)     @         0xffffffff  (unknown)  (unknown)
(ActorModelRayActorBOX pid=66408) [2025-01-27 06:57:08,317 E 66408 67044] logging.cc:365: *** SIGSEGV received at time=1737961028 on cpu 190 ***
(ActorModelRayActorBOX pid=66408) [2025-01-27 06:57:08,317 E 66408 67044] logging.cc:365: PC: @     0x7c4bb7092ce8  (unknown)  ncclTopoCheckNet()
(ActorModelRayActorBOX pid=66408) [2025-01-27 06:57:08,319 E 66408 67044] logging.cc:365:     @     0x7c7c0de42520       3392  (unknown)
(ActorModelRayActorBOX pid=66408) [2025-01-27 06:57:08,321 E 66408 67044] logging.cc:365:     @         0xffffffff  (unknown)  (unknown)
(ActorModelRayActorBOX pid=66408) Fatal Python error: Segmentation fault
(ActorModelRayActorBOX pid=66408) 
(ActorModelRayActorBOX pid=66408) 
(ActorModelRayActorBOX pid=66408) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, ray._raylet, numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, markupsafe._speedups, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, regex._regex (total: 100)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffd64cd47551eaad41428b1a6802000000 Worker ID: 2d351643de6ea3af66643b3952c6f84ad105d265d6241b25db742f1f Node ID: 1cc1783f3910ea44d820adb039cd7b37b4787bb614621e4dd2a5eb9c Worker IP address: 0.0.0.0 Worker port: 10393 Worker PID: 66408 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
  File "/home/ubuntu/ds-utah/simpleRL-reason/train/openrlhf/cli/train_ppo_ray_box.py", line 395, in <module>
    train(args)
  File "/home/ubuntu/ds-utah/simpleRL-reason/train/openrlhf/cli/train_ppo_ray_box.py", line 148, in train
    ray.get(refs)
  File "/tmp/ray/session_2025-01-27_06-56-05_347623_43287/runtime_resources/pip/40e785e8e6da2f735a7b98b5e6f5fc5a02c5eabe/virtualenv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/tmp/ray/session_2025-01-27_06-56-05_347623_43287/runtime_resources/pip/40e785e8e6da2f735a7b98b5e6f5fc5a02c5eabe/virtualenv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/tmp/ray/session_2025-01-27_06-56-05_347623_43287/runtime_resources/pip/40e785e8e6da2f735a7b98b5e6f5fc5a02c5eabe/virtualenv/lib/python3.10/site-packages/ray/_private/worker.py", line 2623, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/tmp/ray/session_2025-01-27_06-56-05_347623_43287/runtime_resources/pip/40e785e8e6da2f735a7b98b5e6f5fc5a02c5eabe/virtualenv/lib/python3.10/site-packages/ray/_private/worker.py", line 863, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
        class_name: ActorModelRayActorBOX
        actor_id: d64cd47551eaad41428b1a6802000000
        pid: 66408
        namespace: 98d0a8a0-3ece-45f2-8865-ce164241e38d
        ip: 0.0.0.0
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(ReferenceModelRayActor pid=66693) [2025-01-27 06:57:05,760] [INFO] [comm.py:652:init_distributed] cdb=None [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(ReferenceModelRayActor pid=66409) [2025-01-27 06:57:05,551] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(ReferenceModelRayActor pid=66693) [2025-01-27 06:57:05,804] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2 [repeated 3x across cluster]
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff884d2f387e586e190b1fffe802000000 Worker ID: 76f76200298edb8f2a59b07ebf09a96315cbefb3189492b88c225f0c Node ID: 1cc1783f3910ea44d820adb039cd7b37b4787bb614621e4dd2a5eb9c Worker IP address: 0.0.0.0 Worker port: 10391 Worker PID: 66198 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(ReferenceModelRayActor pid=66693) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) *** SIGSEGV received at time=1737961028 on cpu 77 *** [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) PC: @     0x7965bf092ce8  (unknown)  ncclTopoCheckNet() [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693)     @     0x799615242520       3392  (unknown) [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693)     @         0xffffffff  (unknown)  (unknown) [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) [2025-01-27 06:57:08,480 E 66693 67052] logging.cc:365: *** SIGSEGV received at time=1737961028 on cpu 77 *** [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) [2025-01-27 06:57:08,480 E 66693 67052] logging.cc:365: PC: @     0x7965bf092ce8  (unknown)  ncclTopoCheckNet() [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) [2025-01-27 06:57:08,482 E 66693 67052] logging.cc:365:     @     0x799615242520       3392  (unknown) [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) [2025-01-27 06:57:08,484 E 66693 67052] logging.cc:365:     @         0xffffffff  (unknown)  (unknown) [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) Fatal Python error: Segmentation fault [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693)  [repeated 6x across cluster]
(ReferenceModelRayActor pid=66693) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, ray._raylet, numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, markupsafe._speedups, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, regex._regex (total: 100) [repeated 3x across cluster]

---------------------------------------
Job 'raysubmit_dVBCv4TiwMPym2Yt' failed
---------------------------------------

Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
(ReferenceModelRayActor pid=66693) PC: @     0x7965bf092ce8  (unknown)  ncclTopoCheckNet() [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693)     @     0x799615242520       3392  (unknown) [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693)     @         0xffffffff  (unknown)  (unknown) [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) [2025-01-27 06:57:08,480 E 66693 67052] logging.cc:365: *** SIGSEGV received at time=1737961028 on cpu 77 *** [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) [2025-01-27 06:57:08,480 E 66693 67052] logging.cc:365: PC: @     0x7965bf092ce8  (unknown)  ncclTopoCheckNet() [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) [2025-01-27 06:57:08,482 E 66693 67052] logging.cc:365:     @     0x799615242520       3392  (unknown) [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) [2025-01-27 06:57:08,484 E 66693 67052] logging.cc:365:     @         0xffffffff  (unknown)  (unknown) [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693) Fatal Python error: Segmentation fault [repeated 3x across cluster]
(ReferenceModelRayActor pid=66693)  [repeated 6x across cluster]
(ReferenceModelRayActor pid=66693) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, ray._raylet, numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, markupsafe._speedups, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, regex._regex (total: 100) [repeated 3x across cluster]
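
Since the segfault lands inside ncclTopoCheckNet(), one next step would be to rerun with NCCL diagnostics enabled (a generic NCCL debugging sketch, not a confirmed fix; NCCL_IB_DISABLE=1 is only there to test whether network/IB detection is the crashing path). The variables are passed via the runtime env so the Ray workers actually inherit them:

ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{
        "pip": ["ray==2.12.0", "latex2sympy2", "timeout_decorator"],
        "env_vars": {
            "NCCL_DEBUG": "INFO",
            "NCCL_DEBUG_SUBSYS": "INIT,NET",
            "NCCL_IB_DISABLE": "1"
        }
    }' -- /bin/bash ~/ds-utah/simpleRL-reason/train/examples/script/train_ppo_qwen_base_math_lv35_1_node.sh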

@Zeng-WH
Collaborator

Zeng-WH commented Jan 27, 2025

Hi, we tested the 1-node script again and there were no issues. Maybe you can check Ray's status using 'ray status' to verify that it started correctly.
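
For completeness, a clean restart of the head on a single node looks roughly like this (generic Ray commands; adjust the flags to whatever setup you normally use):

ray stop --force
ray start --head --num-gpus=8 --dashboard-port=8265
ray status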

@eyuansu62

I also encountered this error, and Ray starts successfully. I think it is a hardware fault.
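
One way to check the hardware hypothesis independently of the training stack (a sketch, assuming NVIDIA's nccl-tests can be built on the node) is to run an all-reduce sweep across all 8 GPUs:

git clone https://github.com/NVIDIA/nccl-tests && cd nccl-tests && make
./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8   # a clean sweep suggests the intra-node fabric is fine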
