Conversation


@Chao1Han Chao1Han commented Sep 15, 2025

Add high-priority stream support for ProcessGroupXCCL. Just like CUDA, XPU streams also support execution at a higher priority than other streams. The implementation lives in intel/torch-xpu-ops#1715; this PR adds the registration here.
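For context, a minimal sketch of the CUDA-side analogue the description refers to; the XPU/XCCL registration is assumed to expose a similar knob, and the work enqueued here is purely illustrative:

```python
import torch

# CUDA streams take a `priority` argument; lower (more negative) values mean
# higher priority. ProcessGroupNCCL exposes the same idea via
# Options(is_high_priority_stream=True); the XCCL options are assumed to mirror it.
if torch.cuda.is_available():
    high = torch.cuda.Stream(priority=-1)   # higher-priority stream
    default = torch.cuda.Stream()           # default-priority stream
    with torch.cuda.stream(high):
        out = torch.mm(torch.randn(512, 512, device="cuda"),
                       torch.randn(512, 512, device="cuda"))
```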

@Chao1Han Chao1Han changed the title support high stream [xpu] Support high stream for ProcessGroupXCCL Sep 16, 2025
fegin and others added 28 commits September 18, 2025 06:20
…torch#163223)

It should work with the current CUDA/ROCm device_capability enumeration anyway, but it will help avoid unexpected triggering in the future.

Pull Request resolved: pytorch#163223
Approved by: https://github.com/jeffdaily
…its for flops and bandwidth (pytorch#162942)

In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could provide a more structured way to query device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in benchmarks and tests. The intent is to add more devices and more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.

Testing:

```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: pytorch#162942
Approved by: https://github.com/ngimel
…or operations (pytorch#162651)

Also updates the error message to point to the guide.

Pull Request resolved: pytorch#162651
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#162117, pytorch#162307
…eady padded (pytorch#163130)

resolves pytorch/torchtitan#1136

torchtitan uses a cached state dict for ft. `reset_sharded_param` should be idempotent if `model.parameters()` are already padded.

```
# pad DTensor._local_tensor
fsdp_model = fully_shard(model)
sd = fsdp_model.state_dict()
# reset_sharded_param should be a no-op in lazy_init
loss = fsdp_model(inp).sum()
```

This PR makes `reset_sharded_param` idempotent by checking the storage data pointer and returning early.
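A hypothetical sketch of that early-return check (attribute names are illustrative, not the actual FSDP2 internals):

```python
def reset_sharded_param(fsdp_param):
    # Illustrative only: if the local tensor already views into the padded
    # storage (same data pointer), a previous call has padded it, so return early.
    padded = fsdp_param._padded_sharded_param_data      # assumed attribute
    local = fsdp_param.sharded_param._local_tensor      # assumed attribute
    if local.untyped_storage().data_ptr() == padded.untyped_storage().data_ptr():
        return
    ...  # otherwise, perform the usual padding and re-registration
```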

unit test
```
pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict
```

Pull Request resolved: pytorch#163130
Approved by: https://github.com/tianyu-l
Some cleanup related to this RFC: pytorch#68742
Pull Request resolved: pytorch#163210
Approved by: https://github.com/ezyang
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes pytorch#157094
Fixes pytorch#157093
Fixes pytorch#157092
Fixes pytorch#157091
Fixes pytorch#157064
Fixes pytorch#157063
Fixes pytorch#157062
Fixes pytorch#157061
Fixes pytorch#157042
Fixes pytorch#157041
Fixes pytorch#157039
Fixes pytorch#157004

Pull Request resolved: pytorch#162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <[email protected]>
…61430)

Update the Microsoft C++ Redistributable link to the latest version, as one of the libraries used by AMD currently depends on it.

Pull Request resolved: pytorch#161430
Approved by: https://github.com/malfet
Fixes pytorch#158772
`_replace` now works without a graph break.
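A hedged repro sketch, assuming the `_replace` in question is `namedtuple._replace` (the actual repro in pytorch#158772 may differ):

```python
import collections
import torch

Point = collections.namedtuple("Point", ["x", "y"])

@torch.compile(fullgraph=True)  # previously `_replace` forced a graph break
def fn(p, t):
    q = p._replace(x=p.x + 1)
    return t * q.x + q.y

print(fn(Point(1, 2), torch.ones(3)))
```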

Pull Request resolved: pytorch#160139
Approved by: https://github.com/mlazos
### DDE

```
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*u0, 0) (unhinted: Eq(3*u0, 0)).  (Size-like symbols: u0)

Caused by: (_decomp/decompositions.py:1185 in _softmax)
```

```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq(u0, 0) (unhinted: Eq(u0, 0)).  (Size-like symbols: u0)

Caused by: logsoft = torch.nn.functional.log_softmax(nz, dim=0)  # test/inductor/test_unbacked_symints.py:573 in fn (_decomp/decompositions.py:1212 in _log_softmax)
```

```
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(u0, 0) (unhinted: Ne(u0, 0)).  (Size-like symbols: u0)

Caused by: (_refs/__init__.py:2218 in _reduction)
```

### Cannot convert symbols to int
```
  File "torch/_inductor/lowering.py", line 7160, in prepare_softmax_online
    and V.graph.sizevars.size_hint(rnumel) >= config.unroll_reductions_threshold
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "orch/_inductor/sizevars.py", line 591, in size_hint
    return int(out)
           ^^^^^^^^
  File "sympy/core/expr.py", line 342, in __int__
    raise TypeError("Cannot convert symbols to int")
```
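A minimal repro sketch for the log_softmax case above, assuming it mirrors the pattern in test/inductor/test_unbacked_symints.py (the real test may differ):

```python
import torch

# nonzero() produces a tensor whose size is an unbacked symint u0; before this
# PR the _log_softmax decomposition guarded on Eq(u0, 0).
torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def fn(x):
    nz = x.nonzero().float()
    return torch.nn.functional.log_softmax(nz, dim=0)

print(fn(torch.tensor([0.0, 1.0, 2.0])))
```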

Pull Request resolved: pytorch#162216
Approved by: https://github.com/laithsakka, https://github.com/eellison
)

I am trying to give some test files better owner labels than `module: unknown`. I am not sure about them, but they seem pretty reasonable.
Pull Request resolved: pytorch#163203
Approved by: https://github.com/jcaip
…ytorch#163214)

In the .dist-info/METADATA file, the version was not being written with the new sha.

On Python <3.11 (I think), the glob `**` will only match directories, so change this to `*`, which I checked will match both files and directories on py3.9 and py3.13.
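A small illustration of the difference (the `.dist-info` path is hypothetical):

```python
from pathlib import Path

# `*` matches both files and directories at one level on the Python versions
# tested above; the behavior of `**` varies across versions.
dist_info = Path("pkg-1.0.dist-info")
for entry in dist_info.glob("*"):
    print(entry, entry.is_dir())
```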

There are probably also a bunch of mismatches in RECORD, but that's a problem for later.
Pull Request resolved: pytorch#163214
Approved by: https://github.com/huydhn
These decompositions take precedence over CIA decomps in fake tensor prop; as a result, we would hit this implementation for all `where` overloads, which is wrong in some cases. For the overloads that can't be implemented by this decomp, we just run the default CIA impl. Previously this didn't matter because `aten.where` would have been decomposed in post-dispatch IR, but when a user tries to preserve `aten.where` the issue surfaces because fake tensor starts seeing `aten.where`.

Differential Revision: [D82604702](https://our.internmc.facebook.com/intern/diff/D82604702)
Pull Request resolved: pytorch#163138
Approved by: https://github.com/henryoier, https://github.com/ezyang
…pytorch#160078)

Note: This is a replica of PR pytorch#155901, which will be closed. I had to create a new PR in order to add it to my ghstack, as some later commits depend on it.

### Summary

🚀 This PR moves the prioritized text linker optimization from setup.py to cmake (and enables it by default on Linux aarch64 systems).

This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.

### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.

Note:

Due to ninja/cmake graph generation issues, we cannot apply the linker file globally to all targets, so the targets must be defined manually. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.

Co-authored-by: Usamah Zaheer <[email protected]>

Pull Request resolved: pytorch#160078
Approved by: https://github.com/seemethere
…ware limits for flops and bandwidth (pytorch#162942)"

This reverts commit 627482a.

Reverted pytorch#162942 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it needs some fixes for CUDA 13 ([comment](pytorch#162942 (comment)))
…zation pattern (pytorch#161848)

Summary:
What: Enables CUDA support for the concat linear int8_mm WOQ optimization pattern by:

- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA

Why: Extend WOQ to more device types

Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```

Rollback Plan:

Differential Revision: D80884518

Pull Request resolved: pytorch#161848
Approved by: https://github.com/jerryzh168
…torch#163238)

See the Note for explanation.

Signed-off-by: Edward Z. Yang <[email protected]>

Pull Request resolved: pytorch#163238
Approved by: https://github.com/albanD
The test used the wrong pointers to refer to the remote address:
```
            dst_ptr = out_hdl.buffer_ptrs[peer]
            src_ptr = inp_hdl.buffer_ptrs[rank]
            sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer`, because NVSHMEM APIs accept a local address as input and perform the translation internally. Without the correct signal address, the peer would keep waiting and thus hang.
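For reference, the corrected indexing implied by the description (same variable names as the snippet above):

```
            dst_ptr = out_hdl.buffer_ptrs[rank]
            src_ptr = inp_hdl.buffer_ptrs[rank]
            sig_ptr = out_hdl.signal_pad_ptrs[rank]
```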

Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer.

Pull Request resolved: pytorch#163194
Approved by: https://github.com/ngimel
ghstack dependencies: pytorch#163025, pytorch#163152
The input tensor shape does not match the weight tensor shape, which was detected by the validation logic implemented in my other PR (pytorch#160408). The input tensor should have a shape of (2, 2, 3), since dimension 1 of the input (representing input channels) must match dimension 0 of the weight tensor (representing input channels). Ref: https://docs.pytorch.org/docs/stable/generated/torch.nn.ConvTranspose1d.html
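A small illustration of that shape constraint (values chosen to match the description, not the actual test code):

```python
import torch
import torch.nn as nn

conv = nn.ConvTranspose1d(in_channels=2, out_channels=4, kernel_size=3)
print(conv.weight.shape)   # torch.Size([2, 4, 3]); dim 0 is in_channels
x = torch.randn(2, 2, 3)   # (batch, in_channels, length): dim 1 matches weight dim 0
print(conv(x).shape)       # torch.Size([2, 4, 5])
```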

Pull Request resolved: pytorch#163148
Approved by: https://github.com/eellison
…60920)

## Summary
- include local file path in `inductor_output_code` structured trace metadata
- adjust structured trace tests for new `file_path` field

## Testing
- `python test/dynamo/test_structured_trace.py StructuredTraceTest.test_compile_id_serialization_deserialization`
- `lintrunner -a torch/_inductor/codecache.py torch/_inductor/graph.py test/dynamo/test_structured_trace.py` *(fails: MYPY failure)*

------
https://chatgpt.com/codex/tasks/task_e_68a2b02b54ec8323ae820120605a9f1c

Pull Request resolved: pytorch#160920
Approved by: https://github.com/oulgen
)

Some tests that were already failing changed status to skipped.  Some model entries were missing.

Pull Request resolved: pytorch#163256
Approved by: https://github.com/malfet

Co-authored-by: Jeff Daily <[email protected]>
## Summary
- add a test verifying that editing the local cache wrapper is picked up after Dynamo reset
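A rough sketch of the reset-then-recompile flow the test exercises (the on-disk edit of the cached wrapper itself is omitted):

```python
import torch

@torch.compile
def f(x):
    return x + 1

f(torch.ones(4))        # first call populates the generated-code cache
torch._dynamo.reset()   # after a reset, an edited cached wrapper should be re-read
f(torch.ones(4))        # recompilation picks up the (possibly edited) wrapper
```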

## Testing
- `lintrunner -a` *(fails: FLAKE8 failure, TEST_HAS_MAIN failure, CODESPELL failure, PYFMT failure)*
- `PYTHONPATH=. python test/inductor/test_codecache.py TestPyCodeCache.test_editable_cached_wrapper -v`

------
https://chatgpt.com/codex/tasks/task_e_68a3aa3fcc9883239b17d1f4250d1e89

Pull Request resolved: pytorch#160943
Approved by: https://github.com/xmfan, https://github.com/albanD
Fixes pytorch#162692

When the input is unevenly sharded, redistribute it as Replicate.
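A hedged sketch of that fallback (mesh and shapes are illustrative, not taken from the PR; run under torchrun with 2 ranks):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard, Replicate

# 5 rows over 2 ranks is uneven; redistributing to Replicate gives every rank
# the full tensor, which is the fallback described above.
mesh = init_device_mesh("cpu", (2,))
x = distribute_tensor(torch.arange(10.0).reshape(5, 2), mesh, [Shard(0)])
x_repl = x.redistribute(mesh, [Replicate()])
```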

Pull Request resolved: pytorch#163241
Approved by: https://github.com/dcci
bobrenjc93 and others added 17 commits September 18, 2025 21:24
Looking into resolving this: pytorch#159599

Test Plan: Wait for executorch CI
Pull Request resolved: pytorch#159664
Approved by: https://github.com/malfet
Initial prototype for dynamic int inputs. It allows users to run `torch.compile(f)(DynamicInt(4))`, compiling dynamically and using the underlying hint at runtime.

Current behavior:
- Also works in eager (mostly by subclassing int), as scalar input to torch functions, or numpy/math/etc. For example, `x = DynamicInt(3); torch.randn(x); torch.add(y, z, alpha=x); np.arange(x)` all act as if x = 3.
- Behavior for arithmetic ops is to return new DynamicInts rather than static ints; `DynamicInt(3) * 2 = DynamicInt(6)`. This is via SymNode magic methods, but coverage might not be 100% - for example, I had to explicitly override floordiv to avoid int casting. This is not necessarily the case for non-magic method ops (e.g. `math.cos(x)`). The alternative here is to int cast on all operations, but I opted for this for dynamism propagation in non-compiled regions.
- Doesn't ban fullgraph=False; DynamicInt objects might be leaked back to the user, but I guess this is fine, because they can be cast to ints when needed?
- Dynamo only allocates one symbol per DynamicInt; specifying the same DynamicInt for multiple inputs leads to input deduplication, and a guard installed.
- We don't raise on int specialization (in allowlist/maybe_mark_dynamic style) - but an easy change if needed.
- DynamicInts as nn.Module attributes are handled.
- We don't guard on the DynamicInt id, e.g. users can do the following without recompiling (maybe we should guard?)
```python
x = DynamicInt(4)
f(x)
f(1)
f(DynamicInt(3))  # same as f(3)
```

Follow-up work:
- Specifying shape constraints, either at the int-level, e.g.
```python
DynamicInt(64, name="s0", constraints=["s0 % 32 == 0", "s0 <= 1024"])
```
or at the compilation level, e.g. something like
```python
s0 = DynamicInt(64, name="s0")
s1 = DynamicInt(128, name="s1")
with some_compiler_config.dynamic_int_constraints(["s1 == 2*s0", "s0 % 32 == 0"]):
    f(s0, s1)
```
This should subsume the need for specifying derived SymInts?
- SymFloat support - currently it seems backed floats are specialized by the tensorify float pass, and there's no handling in inductor.
- Propagating dynamism in tensor constructors, e.g. `x = DynamicInt(4); torch.randn(x)` could annotate `_dynamo_dynamic_indices`.

Differential Revision: D81698719

Pull Request resolved: pytorch#162194
Approved by: https://github.com/bobrenjc93
pytorch#162987)

Trying to reduce the number of __torch_dispatch__ calls of FakeTensorMode in the AOT metadata collection pass.

Pull Request resolved: pytorch#162987
Approved by: https://github.com/Lucaskabela, https://github.com/bdhirsh, https://github.com/zou3519
Summary:
Point people lowering to lite interpreter to the existence of ExecuTorch.

Added the typing deprecation and a warnings deprecation.

Test Plan: Try using it, see deprecation warning

Reviewed By: lucylq

Differential Revision: D82759566

Pull Request resolved: pytorch#163289
Approved by: https://github.com/larryliu0820
The result is too noisy with `record_torchfunction = True`. Change it to False to make it cleaner.

Pull Request resolved: pytorch#163293
Approved by: https://github.com/zpcore
What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes lots of arguments but later on ignores them, namely:
 - `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
 - With this change, the `setup-win` action will create an environment called `py_tmp` with the specific Python version + intel-openmp (which is a hard runtime requirement, but for some reason is neither packaged into the wheel nor marked as such)
 - Introduced `CONDA_ROOT_DIR` env variable in `activate_miniconda3.bat` to avoid `%CONDA_PARENT_DIR%\Miniconda3` invocations throughout the codebase
 - Copied test type dependencies from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 into `win-test.sh`, but made some adjustments to be compatible with 3.10 runtime (scipy version update) and just make rerun-tests compatible with the rest of the deps

I think in the long run, one needs to update https://github.com/pytorch/test-infra/blob/4432e2cacd8a5bb7a46158e71d08c937c502e35a/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1 that currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time
Pull Request resolved: pytorch#162862
Approved by: https://github.com/wdvr, https://github.com/huydhn
We are not doing ring attention but only using allgather to do CP for Flex.
Pull Request resolved: pytorch#162541
Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: pytorch#162539, pytorch#162540
…h#162740)

Summary:
This diff does a few things:
- It refactors PrecompileContext to store DynamoCacheEntries directly on the context. This allows us, at serialization time, to check whether the dynamo cache entry has all its backends ready for serialization and, if not, to skip serializing it unnecessarily.
- It also gives us the ability to print out a `debug` JSON, which contains a mapping for everything being serialized and deserialized.

Here's an example of what that JSON looks like:

```
{
  "artifacts": {
    "precompile_aot_autograd": [
      "__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
    ],
    "precompile_dynamo": [
      {
        "backend_ids": [
          "__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
        ],
        "fn_name": "TorchBenchmarkRunner.forward_and_backward_pass",
        "num_codes": "10",
        "python_version": "3.12.11+meta",
        "torch_version": "2.10.0a0+fb"
      }
    ]
  },
  "num_entries": 1
}
```

Test Plan:
Existing tests pass.

NanoGPT tlparse showing the new debug:

https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpeIsL5G/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Note that there aren't compile IDs since we're logging this in PrecompileContext.serialize() for now, where there isn't a compile yet. I think this is fine for now, as no compile ID makes sense here. If anything, these kinds of entries belong under a "Global" compile ID, which I will not implement in this PR.

Rollback Plan:

Differential Revision: D82232574

Pull Request resolved: pytorch#162740
Approved by: https://github.com/zhxchen17
Update the torch-xpu-ops commit pin to intel/torch-xpu-ops@24fab67, which includes:

- Clean up getDeviceIndexOfCurrentQueue
- Fix hardswish gradients corner case
- Fix xccl contiguous check
- Move checks from nonzero kernel to operator
- Support high priority stream for xccl

Pull Request resolved: pytorch#163244
Approved by: https://github.com/EikanWang