[distributed][checkpoint]test_utils.py::TestDistWrapper::test_barrier random RuntimeError #1978

@zxd1997066

🐛 Describe the bug

Please get the wheel from https://github.com/intel/torch-xpu-ops/actions/runs/16826215961, or download it with the gh CLI:

gh run download 16826215961 --repo intel/torch-xpu-ops --name Torch-XPU-Wheel-1826 --dir path --pattern "*.zip"

git clone -b distributed_2.9 https://github.com/daisyden/pytorch.git
cd pytorch
pip install pytest expecttest zstandard
pip install -r requirements.txt

pytest -v test/distributed/checkpoint/test_utils.py::TestDistWrapper::test_barrier

Traceback (most recent call last):
  File "/home/jenkins/.conda/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 716, in wrapper
    self._join_processes(fn)
  File "/home/jenkins/.conda/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 980, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/home/jenkins/.conda/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1025, in _check_return_codes
    raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 300.01091861724854 seconds

The failure is intermittent: the test passes in roughly 50% of runs.
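To put a number on the flakiness, the test can be run repeatedly and the passes counted. The following is only a rough sketch: the helper name `measure_pass_rate`, the run count, and the per-run timeout are assumptions made for illustration, not part of the PyTorch test harness.

```python
import subprocess


def measure_pass_rate(cmd, runs=10, timeout=600):
    """Run `cmd` `runs` times and return the fraction of runs that exit with code 0.

    `timeout` (seconds, hypothetical default) is a safety net per run; a run
    that exceeds it is counted as a failure.
    """
    passed = 0
    for _ in range(runs):
        try:
            result = subprocess.run(cmd, timeout=timeout)
        except subprocess.TimeoutExpired:
            # The child process was killed after `timeout` seconds.
            continue
        if result.returncode == 0:
            passed += 1
    return passed / runs
```

From the root of the pytorch checkout it could be called as `measure_pass_rate(["pytest", "-v", "test/distributed/checkpoint/test_utils.py::TestDistWrapper::test_barrier"])`. The per-run timeout is set above the roughly 300 seconds after which the harness itself reports the failure, so the harness error is what gets recorded rather than an external kill.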

Versions

PyTorch: https://github.com/daisyden/pytorch/tree/distributed_2.9

Labels: bug, module: distributed
