forked from pytorch/pytorch
[xpu] Support high stream for ProcessGroupXCCL #24
Open
Chao1Han wants to merge 46 commits into master from high_stream
Conversation
Chao1Han force-pushed the high_stream branch from 203f701 to b4f70bf
This is not an API we want to support. Pull Request resolved: pytorch#162539 Approved by: https://github.com/ezyang, https://github.com/tianyu-l
…torch#163223) It should work with the current CUDA/ROCm device_capability enumeration anyway. But it will help to avoid unexpected triggering in the future Pull Request resolved: pytorch#163223 Approved by: https://github.com/jeffdaily
Pull Request resolved: pytorch#160928 Approved by: https://github.com/eellison
…its for flops and bandwidth (pytorch#162942) In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.

Testing:
```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: pytorch#162942 Approved by: https://github.com/ngimel
…or operations (pytorch#162651) Also updates the error message to point to the guide. Pull Request resolved: pytorch#162651 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#162117, pytorch#162307
…eady padded (pytorch#163130) Resolves pytorch/torchtitan#1136. torchtitan uses a cached state dict for ft. reset_sharded_param should be idempotent if model.parameters() are padded already:
```
# pad DTensor._local_tensor
fully_shard(model)
sd = fsdp_model.state_dict()
# reset_sharded_param should be a no-op in lazy_init
loss = fsdp_model(inp).sum()
```
This PR makes `reset_sharded_param` idempotent by checking the storage data ptr and returning early.

Unit test:
```
pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict
```
Pull Request resolved: pytorch#163130 Approved by: https://github.com/tianyu-l
Some cleanup related to this RFC: pytorch#68742 Pull Request resolved: pytorch#163210 Approved by: https://github.com/ezyang
…h#162440) Per title Pull Request resolved: pytorch#162440 Approved by: https://github.com/Skylion007
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32` Fixes pytorch#157094 Fixes pytorch#157093 Fixes pytorch#157092 Fixes pytorch#157091 Fixes pytorch#157064 Fixes pytorch#157063 Fixes pytorch#157062 Fixes pytorch#157061 Fixes pytorch#157042 Fixes pytorch#157041 Fixes pytorch#157039 Fixes pytorch#157004 Pull Request resolved: pytorch#162998 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <[email protected]>
Pull Request resolved: pytorch#163050 Approved by: https://github.com/jeffdaily
…61430) Update Microsoft C++ Redistributable link to the latest version as one of the libraries used by AMD currently has a dependency on that. Pull Request resolved: pytorch#161430 Approved by: https://github.com/malfet
Fixes pytorch#158772 _replace works without graph break Pull Request resolved: pytorch#160139 Approved by: https://github.com/mlazos
### DDE
```
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*u0, 0) (unhinted: Eq(3*u0, 0)). (Size-like symbols: u0)
Caused by: (_decomp/decompositions.py:1185 in _softmax)
```
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq(u0, 0) (unhinted: Eq(u0, 0)). (Size-like symbols: u0)
Caused by: logsoft = torch.nn.functional.log_softmax(nz, dim=0)  # test/inductor/test_unbacked_symints.py:573 in fn (_decomp/decompositions.py:1212 in _log_softmax)
```
```
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(u0, 0) (unhinted: Ne(u0, 0)). (Size-like symbols: u0)
Caused by: (_refs/__init__.py:2218 in _reduction)
```

### Cannot convert symbols to int
```
  File "torch/_inductor/lowering.py", line 7160, in prepare_softmax_online
    and V.graph.sizevars.size_hint(rnumel) >= config.unroll_reductions_threshold
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "orch/_inductor/sizevars.py", line 591, in size_hint
    return int(out)
           ^^^^^^^^
  File "sympy/core/expr.py", line 342, in __int__
    raise TypeError("Cannot convert symbols to int")
```
Pull Request resolved: pytorch#162216 Approved by: https://github.com/laithsakka, https://github.com/eellison
) I am trying to give some test files better owner labels than `module: unknown`. I am not sure about them, but they seem pretty reasonable. Pull Request resolved: pytorch#163203 Approved by: https://github.com/jcaip
…ytorch#163214) In the .dist-info/METADATA file, the version was not being written with the new sha. On python <3.11 (I think), the glob `**` will only match directories, so change this to `*`, which I checked will match both files and directories on py3.9 and py3.13. There are probably also a bunch of mismatches in RECORD, but that's a problem for later. Pull Request resolved: pytorch#163214 Approved by: https://github.com/huydhn
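A minimal illustration of the pathlib glob difference the commit relies on; the directory name below is hypothetical and used only so the snippet runs standalone.

```python
# Sketch of the glob behavior described above: on older Python versions a
# trailing "**" pattern yields directories only, while "*" matches both files
# and directories one level down. "pkg.dist-info" is a hypothetical directory.
from pathlib import Path

d = Path("pkg.dist-info")
dirs_only = list(d.glob("**"))   # directories only on older pathlib versions
everything = list(d.glob("*"))   # files and directories alike
print(dirs_only, everything)
```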
These decompositions take precedence over CIA decomps in fake tensor prop; as a result, we would hit this implementation for all `where` overloads, which is wrong in some cases. For the overloads that can't be implemented by this decomp, we just run the default CIA impl. Previously this didn't matter because in post-dispatch IR aten.where would have decomposed, but when a user tries to preserve aten.where this issue surfaces because fake tensor starts seeing aten.where. Differential Revision: [D82604702](https://our.internmc.facebook.com/intern/diff/D82604702) Pull Request resolved: pytorch#163138 Approved by: https://github.com/henryoier, https://github.com/ezyang
…pytorch#160078) Note: This is a replica PR of pytorch#155901, which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.

### Summary
🚀 This PR moves the prioritized text linker optimization from setup.py to cmake (and enables it by default on Linux aarch64 systems). This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.

### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.

Note: Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets, so the targets must be manually defined. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.

Co-authored-by: Usamah Zaheer <[email protected]>
Pull Request resolved: pytorch#160078 Approved by: https://github.com/seemethere
…ware limits for flops and bandwidth (pytorch#162942)" This reverts commit 627482a. Reverted pytorch#162942 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it needs some fixes for CUDA 13 ([comment](pytorch#162942 (comment)))
…pgen (pytorch#163092) Pull Request resolved: pytorch#163092 Approved by: https://github.com/drisspg, https://github.com/Skylion007
…zation pattern (pytorch#161848) Summary:
What: Enables CUDA support for concat linear int8_mm woq optimization pattern by:
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA

Why: Extend WOQ to more device types

Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```
Rollback Plan:

Differential Revision: D80884518
Pull Request resolved: pytorch#161848 Approved by: https://github.com/jerryzh168
…torch#163238) See the Note for explanation. Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#163238 Approved by: https://github.com/albanD
The test used a wrong ptr to refer to remote address:
```
dst_ptr = out_hdl.buffer_ptrs[peer]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer` because NVSHMEM APIs accept local address as input and perform translation internally. Without correct signal address, the peer would be waiting, thus hang. Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensor instead of pointer. Pull Request resolved: pytorch#163194 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#163025, pytorch#163152
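For clarity, a stand-in sketch of the corrected indexing described above; the handle objects are fakes so the fragment runs standalone, whereas real symmetric-memory handles come from the test's rendezvous setup.

```python
# Corrected indexing per the commit text: NVSHMEM-style APIs take the *local*
# buffer address and translate it to the peer internally, so every handle
# lookup must use `rank`, not `peer`. Handles below are illustrative fakes.
from types import SimpleNamespace

rank, peer = 0, 1
out_hdl = SimpleNamespace(buffer_ptrs=[0x1000, 0x2000], signal_pad_ptrs=[0x3000, 0x4000])
inp_hdl = SimpleNamespace(buffer_ptrs=[0x5000, 0x6000])

dst_ptr = out_hdl.buffer_ptrs[rank]      # was buffer_ptrs[peer] in the broken test
src_ptr = inp_hdl.buffer_ptrs[rank]      # already local in the original test
sig_ptr = out_hdl.signal_pad_ptrs[rank]  # was signal_pad_ptrs[peer]; the wrong address made the peer hang
```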
The input tensor shape does not match the weight tensor shape, which was detected by the validation logic implemented in my other PR (pytorch#160408). The input tensor should have a shape of (2, 2, 3), since dimension 1 of the input (representing input channels) must match dimension 0 of the weight tensor (representing input channels). Ref: https://docs.pytorch.org/docs/stable/generated/torch.nn.ConvTranspose1d.html Pull Request resolved: pytorch#163148 Approved by: https://github.com/eellison
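A small runnable illustration of the shape rule the commit relies on; the layer's out_channels and kernel_size below are illustrative, not the values from the affected test.

```python
import torch
import torch.nn as nn

# For ConvTranspose1d the weight has shape (in_channels, out_channels/groups, kernel_size),
# so dimension 1 of the input (its channel dim) must equal dimension 0 of the weight.
conv = nn.ConvTranspose1d(in_channels=2, out_channels=4, kernel_size=3)
print(conv.weight.shape)   # torch.Size([2, 4, 3])

x = torch.randn(2, 2, 3)   # (batch, in_channels, length) -- the corrected input shape
print(conv(x).shape)       # torch.Size([2, 4, 5])
```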
…60920)
## Summary
- include local file path in `inductor_output_code` structured trace metadata
- adjust structured trace tests for new `file_path` field

## Testing
- `python test/dynamo/test_structured_trace.py StructuredTraceTest.test_compile_id_serialization_deserialization`
- `lintrunner -a torch/_inductor/codecache.py torch/_inductor/graph.py test/dynamo/test_structured_trace.py` *(fails: MYPY failure)*

------
https://chatgpt.com/codex/tasks/task_e_68a2b02b54ec8323ae820120605a9f1c
Pull Request resolved: pytorch#160920 Approved by: https://github.com/oulgen
) Some tests that were already failing changed status to skipped. Some model entries were missing. Pull Request resolved: pytorch#163256 Approved by: https://github.com/malfet Co-authored-by: Jeff Daily <[email protected]>
## Summary
- add a test verifying that editing the local cache wrapper is picked up after Dynamo reset

## Testing
- `lintrunner -a` *(fails: FLAKE8 failure, TEST_HAS_MAIN failure, CODESPELL failure, PYFMT failure)*
- `PYTHONPATH=. python test/inductor/test_codecache.py TestPyCodeCache.test_editable_cached_wrapper -v`

------
https://chatgpt.com/codex/tasks/task_e_68a3aa3fcc9883239b17d1f4250d1e89
Pull Request resolved: pytorch#160943 Approved by: https://github.com/xmfan, https://github.com/albanD
Fixes pytorch#162692. When the input is unevenly sharded, redistribute it as Replicate. Pull Request resolved: pytorch#163241 Approved by: https://github.com/dcci
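A minimal sketch of the redistribution pattern the commit describes, using the public DTensor API; the single-process gloo setup below exists only so the snippet runs standalone and is not how the fix is exercised in practice.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard, Replicate

# Single-process gloo group so the sketch runs standalone (illustrative only).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

mesh = init_device_mesh("cpu", (1,))
# Shard(0) over ranks that do not evenly divide dim 0 produces an unevenly
# sharded DTensor; the fix falls back to a Replicate placement for such inputs.
x = distribute_tensor(torch.randn(5, 4), mesh, [Shard(0)])
x_replicated = x.redistribute(mesh, [Replicate()])
print(x_replicated.placements)

dist.destroy_process_group()
```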
Pull Request resolved: pytorch#163121 Approved by: https://github.com/laithsakka
…#163123) Pull Request resolved: pytorch#163123 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#163121
Looking into resolving this: pytorch#159599 Test Plan: Wait for executorch CI Pull Request resolved: pytorch#159664 Approved by: https://github.com/malfet
Initial prototype for dynamic int inputs, allows users to run with `torch.compile(f)(DynamicInt(4))`, compiling dynamically and using the underlying hint at runtime.

Current behavior:
- Also works in eager (mostly by subclassing int), as scalar input to torch functions, or numpy/math/etc. For example, `x = DynamicInt(3); torch.randn(x); torch.add(y, z, alpha=x); np.arange(x)` all act as if x = 3.
- Behavior for arithmetic ops is to return new DynamicInts rather than static ints; `DynamicInt(3) * 2 = DynamicInt(6)`. This is via SymNode magic methods, but coverage might not be 100% - for example, I had to explicitly override floordiv to avoid int casting. This is not necessarily the case for non-magic method ops (e.g. `math.cos(x)`). The alternative here is to int cast on all operations, but I opted for this for dynamism propagation in non-compiled regions.
- Doesn't ban fullgraph=False; DynamicInt objects might be leaked back to the user, but I guess this is fine, because they can be casted to ints when needed?
- Dynamo only allocates one symbol per DynamicInt; specifying the same DynamicInt for multiple inputs leads to input deduplication, and a guard installed.
- We don't raise on int specialization (in allowlist/maybe_mark_dynamic style) - but an easy change if needed.
- DynamicInts as nn.Module attributes are handled.
- We don't guard on the DynamicInt id, e.g. users can do the following without recompiling (maybe we should guard?)
```python
x = DynamicInt(4)
f(x)
f(1)
f(DynamicInt(3))  # same as f(3)
```

Follow-up work:
- Specifying shape constraints, either at the int-level, e.g.
```python
DynamicInt(64, name="s0", constraints=["s0 % 32 == 0", "s0 <= 1024"])
```
or at the compilation level, e.g. something like
```python
s0 = DynamicInt(64, name="s0")
s1 = DynamicInt(128, name="s1")
with some_compiler_config.dynamic_int_constraints(["s1 == 2*s0", "s0 % 32 == 0"]):
    f(s0, s1)
```
This should subsume the need for specifying derived SymInts?
- SymFloat support - currently it seems backed floats are specialized by the tensorify float pass, and there's no handling in inductor.
- Propagating dynamism in tensor constructors, e.g. `x = DynamicInt(4); torch.randn(x)` could annotate `_dynamo_dynamic_indices`.

Differential Revision: D81698719
Pull Request resolved: pytorch#162194 Approved by: https://github.com/bobrenjc93
…62540) This information can be obtained during the dispatching. Pull Request resolved: pytorch#162540 Approved by: https://github.com/ezyang, https://github.com/tianyu-l, https://github.com/XilunWu ghstack dependencies: pytorch#162539
pytorch#162987) Trying to reduce the number of __torch_dispatch__ calls of FakeTensorMode in the AOT metadata collection pass. Pull Request resolved: pytorch#162987 Approved by: https://github.com/Lucaskabela, https://github.com/bdhirsh, https://github.com/zou3519
Summary: Point people lowering to the lite interpreter to the existence of ExecuTorch. Added the typing deprecation and a warnings deprecation. Test Plan: Try using it, see the deprecation warning. Reviewed By: lucylq Differential Revision: D82759566 Pull Request resolved: pytorch#163289 Approved by: https://github.com/larryliu0820
The result is too noisy with `record_torchfunction = True`. Change it to False to make the output clean. Pull Request resolved: pytorch#163293 Approved by: https://github.com/zpcore
What was supposed to be a very simple change ended up being quite involved, as the current Windows CI framework is quite inflexible, i.e. it takes a lot of arguments but later on ignores them, namely:
- `PYTHON_VERSION` used to be a no-op that is simply ignored by the scripts
- With this change, the `setup-win` action will create an environment called `py_tmp` with the specific python version + intel-openmp (that is a hard runtime requirement, but for some reason not packaged into the wheel nor marked as such)
- Introduced a `CONDA_ROOT_DIR` env variable in `activate_miniconda3.bat` to avoid `%CONDA_PARENT_DIR%\Miniconda3` invocations throughout the codebase
- Copied test type dependencies from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 into `win-test.sh`, but made some adjustments to be compatible with the 3.10 runtime (scipy version update) and to make rerun-tests compatible with the rest of the deps

I think in the long run, one needs to update https://github.com/pytorch/test-infra/blob/4432e2cacd8a5bb7a46158e71d08c937c502e35a/aws/ami/windows/scripts/Installers/Install-Miniconda3.ps1, which currently pins Miniconda python to 3.9, but also figure out how CI can still create a new environment without having to download all the dependencies all the time.

Pull Request resolved: pytorch#162862 Approved by: https://github.com/wdvr, https://github.com/huydhn
We are not doing ring attention but only using allgather to do CP for Flex. Pull Request resolved: pytorch#162541 Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/tianyu-l, https://github.com/XilunWu ghstack dependencies: pytorch#162539, pytorch#162540
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#162590 Approved by: https://github.com/jeffdaily
…h#162740) Summary: This diff does a few things:
- It refactors PrecompileContext to store DynamoCacheEntries directly on the context. This allows us at serialization time to check if the dynamo cache entry has all its backends ready for serialization, and if not, skip unnecessarily serializing it
- It also gives us the ability to print out a `debug` JSON, which contains a mapping for everything being serialized and deserialized. Here's an example of what that JSON looks like:
```
{
  "artifacts": {
    "precompile_aot_autograd": [
      "__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
    ],
    "precompile_dynamo": [
      {
        "backend_ids": [
          "__compiled_fn_8_306d538b_f7f8_4ab4_98a1_b5ff4493f99d"
        ],
        "fn_name": "TorchBenchmarkRunner.forward_and_backward_pass",
        "num_codes": "10",
        "python_version": "3.12.11+meta",
        "torch_version": "2.10.0a0+fb"
      }
    ]
  },
  "num_entries": 1
}
```

Test Plan: Existing tests pass. NanoGPT tlparse showing the new debug: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpeIsL5G/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Note that there aren't compile ids since we're logging this in PrecompileContext.serialize() for now, where there isn't a compile yet. I think this is fine for now, as no compile ID makes sense here. If anything, these kind of belong in a "Global" compile ID, which I will not implement in this PR.

Rollback Plan:

Differential Revision: D82232574
Pull Request resolved: pytorch#162740 Approved by: https://github.com/zhxchen17
…63163) Pull Request resolved: pytorch#163163 Approved by: https://github.com/yf225
Update the torch-xpu-ops commit to intel/torch-xpu-ops@24fab67, which includes:
- Clean up getDeviceIndexOfCurrentQueue
- Fix hardswish gradients corner case
- Fix xccl contiguous check
- Move checks from nonzero kernel to operator
- Support high priority stream for xccl

Pull Request resolved: pytorch#163244 Approved by: https://github.com/EikanWang
Chao1Han force-pushed the high_stream branch from 9d94d38 to 0308253
Add high-priority stream support for ProcessGroupXCCL. Just like CUDA, XPU streams support execution at a higher priority than other streams. The implementation is in intel/torch-xpu-ops#1715; this PR adds the registration here.
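A minimal sketch of how a user might request the high-priority collective stream once this lands. The `Options` shape and the `is_high_priority_stream` field are assumptions that mirror ProcessGroupNCCL's API; the actual XCCL-side implementation lives in intel/torch-xpu-ops#1715.

```python
# Hedged sketch, not the confirmed API: option names mirror ProcessGroupNCCL and
# are assumptions here. Assumes an XPU build with the XCCL backend and a launcher
# (e.g. torchrun) providing rank/world-size environment variables.
import torch
import torch.distributed as dist

def init_xccl_with_high_priority_stream():
    opts = dist.ProcessGroupXCCL.Options()   # hypothetical, mirroring ProcessGroupNCCL.Options
    opts.is_high_priority_stream = True      # run collectives on a higher-priority XPU stream
    dist.init_process_group(backend="xccl", pg_options=opts)

if __name__ == "__main__":
    init_xccl_with_high_priority_stream()
    t = torch.ones(1024, device="xpu")
    dist.all_reduce(t)                       # issued on the high-priority collective stream
    dist.destroy_process_group()
```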