

BLOrange-AMD commented Oct 6, 2025

Fixes SWDEV-551121
test_cuda.py::TestCudaOptimsCUDA::test_grad_scaling_autocast_fused_optimizers_AdamW_cuda_float32
test_cuda.py::TestCudaOptimsCUDA::test_grad_scaling_autocast_fused_optimizers_Adam_cuda_float32
test_cuda.py::TestCudaOptimsCUDA::test_grad_scaling_autocast_fused_optimizers_SGD_cuda_float32

…torch#164000)

Fixes pytorch#160598
Fixes pytorch#160551
Fixes pytorch#160507

This PR fixes a bug in the `test_garbage_collect_expandable` unit test: the `finally` block re-read the current per-process memory fraction instead of restoring the original value. Without the fix, other tests in the `test/test_cuda.py` suite were affected and failed with OOM errors on ROCm.

This ensures proper cleanup and isolation of test state, maintaining test correctness and avoiding side effects such as the OOM error shown below.
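The difference between the buggy and the fixed cleanup can be sketched without a GPU. The snippet below is only an illustration, not the actual test code: `FakeAllocator`, its `get_fraction`/`set_fraction` methods, and the `0.005` limit are hypothetical stand-ins for the per-process memory fraction state that the real test manipulates.

```python
class FakeAllocator:
    """Hypothetical stand-in for the allocator's per-process memory fraction."""

    def __init__(self):
        self.fraction = 1.0  # assumed default: the whole device is usable

    def get_fraction(self):
        return self.fraction

    def set_fraction(self, value):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fraction must be in [0, 1]")
        self.fraction = value


def buggy_test(alloc):
    """Mimics the bug: the finally block re-reads instead of restoring."""
    orig = alloc.get_fraction()
    try:
        alloc.set_fraction(0.005)  # illustrative tight limit for the test body
    finally:
        # Bug: this only overwrites the local variable, so the tight
        # limit stays in effect for every test that runs afterwards.
        orig = alloc.get_fraction()


def fixed_test(alloc):
    """Mimics the fix: the finally block writes the saved value back."""
    orig = alloc.get_fraction()
    try:
        alloc.set_fraction(0.005)
    finally:
        alloc.set_fraction(orig)  # restore the value saved before the test


leaked = FakeAllocator()
buggy_test(leaked)

clean = FakeAllocator()
fixed_test(clean)
```

After `buggy_test`, the stand-in allocator still carries the reduced fraction, which is the kind of leaked state that can make a later, unrelated test hit an out-of-memory error. After `fixed_test`, the fraction is back at its starting value.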

For example, `test_autocast_checkpointing` failed on ROCm with the error below (https://github.com/pytorch/pytorch/actions/runs/17982223758/job/51153974194):

`torch.OutOfMemoryError: HIP out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 255.69 GiB of which 252.97 GiB is free. 1.20 GiB allowed; Of the allocated memory 1.14 GiB is allocated by PyTorch, with 17.00 MiB allocated in private pools (e.g., HIP Graphs), and 18.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
Pull Request resolved: pytorch#164000
Approved by: https://github.com/jeffdaily

rocm-repo-management-api bot commented Oct 6, 2025

Jenkins build for 164858a49d06f63bd44b076f1bcf000c67f1a380 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts
