forked from pytorch/pytorch
[xpu] Support high stream for ProcessGroupXCCL #24
Open
Chao1Han wants to merge 46 commits into master from high_stream
Conversation
Chao1Han force-pushed the high_stream branch from 203f701 to b4f70bf
This is not an API we want to support. Pull Request resolved: pytorch#162539 Approved by: https://github.com/ezyang, https://github.com/tianyu-l
…torch#163223) It should work with the current CUDA/ROCm device_capability enumeration anyway. But it will help to avoid unexpected triggering in the future Pull Request resolved: pytorch#163223 Approved by: https://github.com/jeffdaily
Pull Request resolved: pytorch#160928 Approved by: https://github.com/eellison
…its for flops and bandwidth (pytorch#162942) In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.

Testing:
```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: pytorch#162942 Approved by: https://github.com/ngimel
…or operations (pytorch#162651) Also updates the error message to point to the guide. Pull Request resolved: pytorch#162651 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#162117, pytorch#162307
…eady padded (pytorch#163130) Resolves pytorch/torchtitan#1136. torchtitan uses a cached state dict for ft. reset_sharded_param should be idempotent if model.parameters() are padded already:
```
# pad DTensor._local_tensor
fully_shard(model)
sd = fsdp_model.state_dict()
# reset_sharded_param should be a no-op in lazy_init
loss = fsdp_model(inp).sum()
```
This PR makes `reset_sharded_param` idempotent by checking the storage data ptr and returning early.

Unit test:
```
pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict
```
Pull Request resolved: pytorch#163130 Approved by: https://github.com/tianyu-l
Some cleanup related to this RFC: pytorch#68742 Pull Request resolved: pytorch#163210 Approved by: https://github.com/ezyang
…h#162440) Per title Pull Request resolved: pytorch#162440 Approved by: https://github.com/Skylion007
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32` Fixes pytorch#157094 Fixes pytorch#157093 Fixes pytorch#157092 Fixes pytorch#157091 Fixes pytorch#157064 Fixes pytorch#157063 Fixes pytorch#157062 Fixes pytorch#157061 Fixes pytorch#157042 Fixes pytorch#157041 Fixes pytorch#157039 Fixes pytorch#157004 Pull Request resolved: pytorch#162998 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <[email protected]>
Pull Request resolved: pytorch#163050 Approved by: https://github.com/jeffdaily
…61430) Update Microsoft C++ Redistributable link to the latest version as one of the libraries used by AMD currently has a dependency on that. Pull Request resolved: pytorch#161430 Approved by: https://github.com/malfet
Fixes pytorch#158772 _replace works without graph break Pull Request resolved: pytorch#160139 Approved by: https://github.com/mlazos
### DDE
```
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*u0, 0) (unhinted: Eq(3*u0, 0)). (Size-like symbols: u0)
Caused by: (_decomp/decompositions.py:1185 in _softmax)
```
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq(u0, 0) (unhinted: Eq(u0, 0)). (Size-like symbols: u0)
Caused by: logsoft = torch.nn.functional.log_softmax(nz, dim=0)  # test/inductor/test_unbacked_symints.py:573 in fn (_decomp/decompositions.py:1212 in _log_softmax)
```
```
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(u0, 0) (unhinted: Ne(u0, 0)). (Size-like symbols: u0)
Caused by: (_refs/__init__.py:2218 in _reduction)
```

### Cannot convert symbols to int
```
  File "torch/_inductor/lowering.py", line 7160, in prepare_softmax_online
    and V.graph.sizevars.size_hint(rnumel) >= config.unroll_reductions_threshold
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "orch/_inductor/sizevars.py", line 591, in size_hint
    return int(out)
           ^^^^^^^^
  File "sympy/core/expr.py", line 342, in __int__
    raise TypeError("Cannot convert symbols to int")
```
Pull Request resolved: pytorch#162216 Approved by: https://github.com/laithsakka, https://github.com/eellison
) I am trying to give some test files better owner labels than `module: unknown`. I am not sure about them, but they seem pretty reasonable. Pull Request resolved: pytorch#163203 Approved by: https://github.com/jcaip
…ytorch#163214) In the .dist-info/METADATA file, the version was not being written with the new sha. On python <3.11 (I think), the glob `**` will only match directories, so change this to `*`, which I checked will match both files and directories on py3.9 and py3.13. There are probably also a bunch of mismatches in RECORD, but that's a problem for later. Pull Request resolved: pytorch#163214 Approved by: https://github.com/huydhn
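A minimal illustration of the pathlib glob difference the commit relies on; the directory name below is hypothetical and used only so the snippet runs standalone.

```python
# Sketch of the glob behavior described above: on older Python versions a
# trailing "**" pattern yields directories only, while "*" matches both files
# and directories one level down. "pkg.dist-info" is a hypothetical directory.
from pathlib import Path

d = Path("pkg.dist-info")
dirs_only = list(d.glob("**"))   # directories only on older pathlib versions
everything = list(d.glob("*"))   # files and directories alike
print(dirs_only, everything)
```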
These decompositions take precedence over CIA decomps in fake tensor prop; as a result, we would hit this implementation for all `where` overloads, which is wrong in some cases. For the overloads that can't be implemented by this decomp, we just run the default CIA impl. Previously this didn't matter because in post-dispatch IR aten.where would have decomposed, but when a user tries to preserve aten.where this issue surfaces because fake tensor starts seeing aten.where. Differential Revision: [D82604702](https://our.internmc.facebook.com/intern/diff/D82604702) Pull Request resolved: pytorch#163138 Approved by: https://github.com/henryoier, https://github.com/ezyang
…pytorch#160078) Note: This is a replica PR of pytorch#155901, which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.

### Summary
🚀 This PR moves the prioritized text linker optimization from setup.py to cmake (and enables it by default on Linux aarch64 systems). This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.

### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.

Note: Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets, so the targets must be manually defined. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.

Co-authored-by: Usamah Zaheer <[email protected]>
Pull Request resolved: pytorch#160078 Approved by: https://github.com/seemethere
…ware limits for flops and bandwidth (pytorch#162942)" This reverts commit 627482a. Reverted pytorch#162942 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it needs some fixes for CUDA 13 ([comment](pytorch#162942 (comment)))
…pgen (pytorch#163092) Pull Request resolved: pytorch#163092 Approved by: https://github.com/drisspg, https://github.com/Skylion007
…zation pattern (pytorch#161848) Summary:
What: Enables CUDA support for concat linear int8_mm woq optimization pattern by:
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA

Why: Extend WOQ to more device types

Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```
Rollback Plan:

Differential Revision: D80884518
Pull Request resolved: pytorch#161848 Approved by: https://github.com/jerryzh168
…torch#163238) See the Note for explanation. Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#163238 Approved by: https://github.com/albanD
The test used a wrong ptr to refer to remote address:
```
dst_ptr = out_hdl.buffer_ptrs[peer]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer` because NVSHMEM APIs accept local address as input and perform translation internally. Without correct signal address, the peer would be waiting, thus hang. Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer. Pull Request resolved: pytorch#163194 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#163025, pytorch#163152
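For clarity, a stand-in sketch of the corrected indexing described above; the handle objects are fakes so the fragment runs standalone, whereas real symmetric-memory handles come from the test's rendezvous setup.

```python
# Corrected indexing per the commit text: NVSHMEM-style APIs take the *local*
# buffer address and translate it to the peer internally, so every handle
# lookup must use `rank`, not `peer`. Handles below are illustrative fakes.
from types import SimpleNamespace

rank, peer = 0, 1
out_hdl = SimpleNamespace(buffer_ptrs=[0x1000, 0x2000], signal_pad_ptrs=[0x3000, 0x4000])
inp_hdl = SimpleNamespace(buffer_ptrs=[0x5000, 0x6000])

dst_ptr = out_hdl.buffer_ptrs[rank]      # was buffer_ptrs[peer] in the broken test
src_ptr = inp_hdl.buffer_ptrs[rank]      # already local in the original test
sig_ptr = out_hdl.signal_pad_ptrs[rank]  # was signal_pad_ptrs[peer]; the wrong address made the peer hang
```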
The input tensor shape does not match the weight tensor shape, which was detected by the validation logic implemented in my other PR (pytorch#160408). The input tensor should have a shape of (2, 2, 3), since dimension 1 of the input (representing input channels) must match dimension 0 of the weight tensor (representing input channels). Ref: https://docs.pytorch.org/docs/stable/generated/torch.nn.ConvTranspose1d.html Pull Request resolved: pytorch#163148 Approved by: https://github.com/eellison
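A small runnable illustration of the shape rule the commit relies on; the layer's out_channels and kernel_size below are illustrative, not the values from the affected test.

```python
import torch
import torch.nn as nn

# For ConvTranspose1d the weight has shape (in_channels, out_channels/groups, kernel_size),
# so dimension 1 of the input (its channel dim) must equal dimension 0 of the weight.
conv = nn.ConvTranspose1d(in_channels=2, out_channels=4, kernel_size=3)
print(conv.weight.shape)   # torch.Size([2, 4, 3])

x = torch.randn(2, 2, 3)   # (batch, in_channels, length) -- the corrected input shape
print(conv(x).shape)       # torch.Size([2, 4, 5])
```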
…60920)
## Summary
- include local file path in `inductor_output_code` structured trace metadata
- adjust structured trace tests for new `file_path` field

## Testing
- `python test/dynamo/test_structured_trace.py StructuredTraceTest.test_compile_id_serialization_deserialization`
- `lintrunner -a torch/_inductor/codecache.py torch/_inductor/graph.py test/dynamo/test_structured_trace.py` *(fails: MYPY failure)*

------
https://chatgpt.com/codex/tasks/task_e_68a2b02b54ec8323ae820120605a9f1c
Pull Request resolved: pytorch#160920 Approved by: https://github.com/oulgen
) Some tests that were already failing changed status to skipped. Some model entries were missing. Pull Request resolved: pytorch#163256 Approved by: https://github.com/malfet Co-authored-by: Jeff Daily <[email protected]>
## Summary
- add a test verifying that editing the local cache wrapper is picked up after Dynamo reset

## Testing
- `lintrunner -a` *(fails: FLAKE8 failure, TEST_HAS_MAIN failure, CODESPELL failure, PYFMT failure)*
- `PYTHONPATH=. python test/inductor/test_codecache.py TestPyCodeCache.test_editable_cached_wrapper -v`

------
https://chatgpt.com/codex/tasks/task_e_68a3aa3fcc9883239b17d1f4250d1e89
Pull Request resolved: pytorch#160943 Approved by: https://github.com/xmfan, https://github.com/albanD
Fixes pytorch#162692. When the input is unevenly sharded, redistribute it as Replicate. Pull Request resolved: pytorch#163241 Approved by: https://github.com/dcci
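A minimal sketch of the redistribution pattern the commit describes, using the public DTensor API; the single-process gloo setup below exists only so the snippet runs standalone and is not how the fix is exercised in practice.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard, Replicate

# Single-process gloo group so the sketch runs standalone (illustrative only).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

mesh = init_device_mesh("cpu", (1,))
# Shard(0) over ranks that do not evenly divide dim 0 produces an unevenly
# sharded DTensor; the fix falls back to a Replicate placement for such inputs.
x = distribute_tensor(torch.randn(5, 4), mesh, [Shard(0)])
x_replicated = x.redistribute(mesh, [Replicate()])
print(x_replicated.placements)

dist.destroy_process_group()
```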
Pull Request resolved: pytorch#163121 Approved by: https://github.com/laithsakka
…#163123) Pull Request resolved: pytorch#163123 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163121
Looking into resolving this: pytorch#159599 Test Plan: Wait for executorch CI Pull Request resolved: pytorch#159664 Approved by: https://github.com/malfet
Initial prototype for dynamic int inputs, allows users to run with `torch.compile(f)(DynamicInt(4))`, compiling dynamically and using the underlying hint at runtime.

Current behavior:
- Also works in eager (mostly by subclassing int), as scalar input to torch functions, or numpy/math/etc. For example, `x = DynamicInt(3); torch.randn(x); torch.add(y, z, alpha=x); np.arange(x)` all act as if x = 3.
- Behavior for arithmetic ops is to return new DynamicInts rather than static ints; `DynamicInt(3) * 2 = DynamicInt(6)`. This is via SymNode magic methods, but coverage might not be 100% - for example, I had to explicitly override floordiv to avoid int casting. This is not necessarily the case for non-magic method ops (e.g. `math.cos(x)`). The alternative here is to int cast on all operations, but I opted for this for dynamism propagation in non-compiled regions.
- Doesn't ban fullgraph=False; DynamicInt objects might be leaked back to the user, but I guess this is fine, because they can be casted to ints when needed?
- Dynamo only allocates one symbol per DynamicInt; specifying the same DynamicInt for multiple inputs leads to input deduplication, and a guard installed.
- We don't raise on int specialization (in allowlist/maybe_mark_dynamic style) - but an easy change if needed.
- DynamicInts as nn.Module attributes are handled.
- We don't guard on the DynamicInt id, e.g. users can do the following without recompiling (maybe we should guard?)
```python
x = DynamicInt(4)
f(x)
f(1)
f(DynamicInt(3))  # same as f(3)
```

Follow-up work:
- Specifying shape constraints, either at the int-level, e.g.
```python
DynamicInt(64, name="s0", constraints=["s0 % 32 == 0", "s0 <= 1024"])
```
or at the compilation level, e.g. something like
```python
s0 = DynamicInt(64, name="s0")
s1 = DynamicInt(128, name="s1")
with some_compiler_config.dynamic_int_constraints(["s1 == 2*s0", "s0 % 32 == 0"]):
    f(s0, s1)
```
This should subsume the need for specifying derived SymInts?
- SymFloat support - currently it seems backed floats are specialized by the tensorify float pass, and there's no handling in inductor.
- Propagating dynamism in tensor constructors, e.g. `x = DynamicInt(4); torch.randn(x)` could annotate `_dynamo_dynamic_indices`.

Differential Revision: D81698719
Pull Request resolved: pytorch#162194 Approved by: https://github.com/bobrenjc93
…62540) This information can be obtained during the dispatching. Pull Request resolved: pytorch#162540 Approved by: https://github.com/ezyang, https://github.com/tianyu-l, https://github.com/XilunWu ghstack dependencies: pytorch#162539
pytorch#162987) Trying to reduce the number of __torch_dispatch__ calls of FakeTensorMode in the AOT metadata collection pass. Pull Request resolved: pytorch#162987 Approved by: https://github.com/Lucaskabela, https://github.com/bdhirsh, https://github.com/zou3519
Summary: Point people lowering to the lite interpreter to the existence of ExecuTorch. Added the typing deprecation and a warnings deprecation. Test Plan: Try using it, see the deprecation warning. Reviewed By: lucylq Differential Revision: D82759566 Pull Request resolved: pytorch#163289 Approved by: https://github.com/larryliu0820
The result is too noisy with `record_torchfunction = True`. Change it to False to make the output clean. Pull Request resolved: pytorch#163293 Approved by: https://github.com/zpcore
What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes a lot of arguments but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
- With this change, the `setup-win` action will create an environment called `py_tmp` with the specific python version + intel-openmp (that is a hard runtime requirement, but for some reason not packaged into the wheel nor marked as such)
- Introduced a `CONDA_ROOT_DIR` env variable in `activate_miniconda3.bat` to avoid `%CONDA_PARENT_DIR%\Miniconda3` invocations throughout the codebase
- Copied test type dependencies from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 into `win-test.sh`, but made some adjustments to be compatible with the 3.10 runtime (scipy version update) and to make rerun-tests compatible with the rest of the deps

I think in the long run, one needs to update https://github.com/pytorch/test-infra/blob/4432e2cacd8a5bb7a46158e71d08c937c502e35a/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1, which currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time.

Pull Request resolved: pytorch#162862 Approved by: https://github.com/wdvr, https://github.com/huydhn
We are not doing ring attention but only using allgather to do CP for Flex. Pull Request resolved: pytorch#162541 Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/tianyu-l, https://github.com/XilunWu ghstack dependencies: pytorch#162539, pytorch#162540
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#162590 Approved by: https://github.com/jeffdaily
…h#162740) Summary: This diff does a few things:
- It refactors PrecompileContext to store DynamoCacheEntries directly on the context. This allows us at serialization time to check if the dynamo cache entry has all its backends ready for serialization, and if not, skip unnecessarily serializing it
- It also gives us the ability to print out a `debug` JSON, which contains a mapping for everything being serialized and deserialized. Here's an example of what that JSON looks like:
```
{
  "artifacts": {
    "precompile_aot_autograd": [
      "__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
    ],
    "precompile_dynamo": [
      {
        "backend_ids": [
          "__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
        ],
        "fn_name": "TorchBenchmarkRunner.forward_and_backward_pass",
        "num_codes": "10",
        "python_version": "3.12.11+meta",
        "torch_version": "2.10.0a0+fb"
      }
    ]
  },
  "num_entries": 1
}
```

Test Plan: Existing tests pass. NanoGPT tlparse showing the new debug: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpeIsL5G/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Note that there aren't compile ids since we're logging this in PrecompileContext.serialize() for now, where there isn't a compile yet. I think this is fine for now, as no compile ID makes sense here. If anything, these kind of belong in a "Global" compile ID, which I will not implement in this PR.

Rollback Plan:

Differential Revision: D82232574
Pull Request resolved: pytorch#162740 Approved by: https://github.com/zhxchen17
…63163) Pull Request resolved: pytorch#163163 Approved by: https://github.com/yf225
Update the torch-xpu-ops commit to intel/torch-xpu-ops@24fab67, which includes:
- Clean up getDeviceIndexOfCurrentQueue
- Fix hardswish gradients corner case
- Fix xccl contiguous check
- Move checks from nonzero kernel to operator
- Support high priority stream for xccl

Pull Request resolved: pytorch#163244 Approved by: https://github.com/EikanWang
Chao1Han force-pushed the high_stream branch from 9d94d38 to 0308253
Add high-priority stream support for ProcessGroupXCCL. Just like CUDA, XPU streams support execution at a higher priority than other streams. The implementation is in intel/torch-xpu-ops#1715; this PR adds the registration here.
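A minimal sketch of how a user might request the high-priority collective stream once this lands. The `Options` shape and the `is_high_priority_stream` field are assumptions that mirror ProcessGroupNCCL's API; the actual XCCL-side implementation lives in intel/torch-xpu-ops#1715.

```python
# Hedged sketch, not the confirmed API: option names mirror ProcessGroupNCCL and
# are assumptions here. Assumes an XPU build with the XCCL backend and a launcher
# (e.g. torchrun) providing rank/world-size environment variables.
import torch
import torch.distributed as dist

def init_xccl_with_high_priority_stream():
    opts = dist.ProcessGroupXCCL.Options()   # hypothetical, mirroring ProcessGroupNCCL.Options
    opts.is_high_priority_stream = True      # run collectives on a higher-priority XPU stream
    dist.init_process_group(backend="xccl", pg_options=opts)

if __name__ == "__main__":
    init_xccl_with_high_priority_stream()
    t = torch.ones(1024, device="xpu")
    dist.all_reduce(t)                       # issued on the high-priority collective stream
    dist.destroy_process_group()
```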