PyTorch C++ extension for GPU-accelerated neighbor list computation on AMD hardware (HIP/ROCm). Unlike a standalone HIP library, this extension shares PyTorch's HIP runtime, allocator, and stream, so the neighbor list can be composed with other PyTorch GPU ops in the same process without context conflicts.
- ROCm 5.0+ (tested with 6.0 and 7.0)
- PyTorch built with ROCm support
- A C++17 toolchain and hipcc on PATH (provided by ROCm)
# Make the ROCm runtime treat the GPU as gfx1100 (example: needed for gfx1102 / RX 7600)
export HSA_OVERRIDE_GFX_VERSION=11.0.0
pip install -e .

The build relies on PyTorch's CUDAExtension machinery, which routes through
hipcc when ROCm is detected. setup.py looks for /opt/rocm,
/opt/rocm-7.0.1, and /opt/rocm-6.0.0 and points CUDA_HOME at whichever
exists.
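
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual setup.py: the helper names `find_rocm_home` and `configure_cuda_home` are hypothetical, and leaving an already-set CUDA_HOME untouched is an assumed policy.

```python
import os

# Candidate install prefixes, taken from the paragraph above.
ROCM_CANDIDATES = ("/opt/rocm", "/opt/rocm-7.0.1", "/opt/rocm-6.0.0")


def find_rocm_home(candidates=ROCM_CANDIDATES):
    """Return the first candidate that is an existing directory, else None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


def configure_cuda_home(env=None, candidates=ROCM_CANDIDATES):
    """Point CUDA_HOME at a ROCm install (hypothetical helper).

    Assumption: a CUDA_HOME the user already exported is left alone.
    """
    env = os.environ if env is None else env
    if "CUDA_HOME" not in env:
        found = find_rocm_home(candidates)
        if found is not None:
            env["CUDA_HOME"] = found
    return env.get("CUDA_HOME")
```

If none of the candidates exists, the sketch returns `None`; in that case exporting CUDA_HOME by hand before running pip is the obvious fallback.
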
import torch
from hip_torch_nl import hip_torch_nl, HIP_TORCH_NL_AVAILABLE
assert HIP_TORCH_NL_AVAILABLE, "extension not built"
positions = torch.rand(1000, 3, device="cuda") * 10.0
cell = torch.eye(3, device="cuda") * 10.0
pbc = torch.tensor([True, True, True], device="cuda")
cutoff = torch.tensor(3.0)
mapping, shifts = hip_torch_nl(positions, cell, pbc, cutoff)
# mapping: (2, n_pairs) int64 — directed pairs (i, j) with i != j or shift != 0
# shifts: (n_pairs, 3) — integer cell shifts S such that
#         D = pos[j] - pos[i] + S @ cell

hip_torch_nl(
    positions, cell, pbc, cutoff,
    sort_id=False,
    compatible_mode=True,
    algorithm="auto",
)

- positions: (n_atoms, 3) float tensor on GPU.
- cell: (3, 3) row-vector cell matrix on GPU (row i is the i-th cell vector).
- pbc: (3,) bool tensor on GPU.
- cutoff: scalar tensor or float.
- sort_id: if True, sort the returned pairs by their first index.
- compatible_mode: if True (default), filter results to exactly match torch_sim.neighbors.standard_nl. Set to False for the raw kernel output (and to remove the torch_sim dependency at call time).
- algorithm: "auto" (default), "direct"/"v1", or "cell_list"/"v2". "auto" picks cell_list above 15 000 atoms.

The convenience wrappers hip_torch_nl_v1 and hip_torch_nl_v2 force the
respective algorithm.
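
To make the selection rule concrete, here is a minimal sketch of how the `algorithm` argument could be resolved. It is not the extension's dispatch code: `resolve_algorithm` and `AUTO_THRESHOLD` are hypothetical names, and treating exactly 15 000 atoms as still "direct" is an assumption based on the word "above".

```python
AUTO_THRESHOLD = 15_000  # "auto" switches to the cell list above this atom count

# Both spellings of each algorithm map to one canonical name.
_ALIASES = {
    "direct": "direct",
    "v1": "direct",
    "cell_list": "cell_list",
    "v2": "cell_list",
}


def resolve_algorithm(algorithm, n_atoms):
    """Map a user-facing algorithm string to "direct" or "cell_list"."""
    if algorithm == "auto":
        # Assumption: the threshold is exclusive (15 000 atoms stays "direct").
        return "cell_list" if n_atoms > AUTO_THRESHOLD else "direct"
    try:
        return _ALIASES[algorithm]
    except KeyError:
        raise ValueError(f"unknown algorithm: {algorithm!r}") from None
```

For example, `resolve_algorithm("auto", 1000)` resolves to the brute-force path, while `resolve_algorithm("v2", 1000)` forces the cell list regardless of size.
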
| Variant | Complexity | Memory | Practical limit (8 GB VRAM) |
|---|---|---|---|
| V1 (direct) | O(n²) brute force, MIC | high; pairs buffer scales with n² | ~16k atoms |
| V2 (cell_list) | O(n) cell list, MIC | int32 pair indices, density-based estimation | ~37k atoms |

Both algorithms apply the minimum image convention. They produce identical
pair sets when cutoff is smaller than half the smallest cell height; above
that, MIC is not appropriate for either implementation.
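
The half-height condition can be checked before calling the extension. The sketch below is dependency-free Python rather than part of the package; `cell_heights` and `mic_is_valid` are hypothetical helper names. It uses the row-vector cell convention from the API section: the height along cell vector a is the cell volume divided by the area of the face spanned by b and c.

```python
import math


def cross(u, v):
    """Cross product of two 3-vectors."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def dot(u, v):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(u, v))


def cell_heights(cell):
    """Perpendicular heights of a cell given as three row vectors (a, b, c)."""
    a, b, c = cell
    volume = abs(dot(a, cross(b, c)))
    # Face normals opposite to a, b, and c respectively.
    normals = (cross(b, c), cross(c, a), cross(a, b))
    return tuple(volume / math.sqrt(dot(n, n)) for n in normals)


def mic_is_valid(cell, cutoff):
    """True if cutoff is strictly below half the smallest cell height."""
    return cutoff < 0.5 * min(cell_heights(cell))
```

For the 10 x 10 x 10 cubic cell used in the usage example, all heights are 10, so a cutoff of 3.0 passes. A sheared cell can fail the same check even when its edge lengths are unchanged, because shearing reduces the perpendicular heights.
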
hip_torch_nl/
├── __init__.py # Python interface
├── csrc/
│ ├── hip_neighborlist.cpp # pybind11 bindings + algorithm dispatch
│ └── hip_neighborlist_kernel.cu # HIP kernels (V1 brute force, V2 cell list)
tests/ # pytest correctness suite
benchmarks/run_benchmarks.py # timing harness
pip install -e ".[test]"
pytest

The suite verifies output against a vectorized brute-force MIC reference
implemented in tests/conftest.py. It covers:
- random positions under full, partial, and zero PBC
- V1 vs V2 agreement
- FCC nearest-neighbor coordination number (12)
- pair-list symmetry, sort_id, and dtype preservation
- input validation (CPU tensors, bad shapes, missing extension)

Tests skip cleanly if the extension is not built or no HIP/CUDA device is available.
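
For readers without the repository at hand, the kind of check listed above can be sketched in plain Python. This is not the reference in tests/conftest.py (which is vectorized); it is an unvectorized stand-in that enumerates integer shifts in {-1, 0, 1} along periodic directions, which yields the same pairs as MIC as long as the atoms lie inside the cell and the cutoff is below half the smallest cell height. The names `brute_force_nl` and `fcc_supercell` are hypothetical.

```python
import itertools
import math


def brute_force_nl(positions, cell, pbc, cutoff):
    """Return directed pairs (i, j, S) with |pos[j] - pos[i] + S @ cell| < cutoff."""
    shift_ranges = [(-1, 0, 1) if periodic else (0,) for periodic in pbc]
    pairs = []
    for i, pi in enumerate(positions):
        for j, pj in enumerate(positions):
            for s in itertools.product(*shift_ranges):
                if i == j and s == (0, 0, 0):
                    continue  # an atom is not its own neighbor
                # Row-vector convention: (S @ cell)[k] = sum_m S[m] * cell[m][k]
                d = [
                    pj[k] - pi[k] + sum(s[m] * cell[m][k] for m in range(3))
                    for k in range(3)
                ]
                if math.sqrt(sum(x * x for x in d)) < cutoff:
                    pairs.append((i, j, s))
    return pairs


def fcc_supercell(reps=2, a=1.0):
    """Positions and cubic cell of an FCC lattice with reps^3 conventional cells."""
    basis = ((0.0, 0.0, 0.0), (0.0, 0.5, 0.5), (0.5, 0.0, 0.5), (0.5, 0.5, 0.0))
    positions = [
        ((ix + bx) * a, (iy + by) * a, (iz + bz) * a)
        for ix, iy, iz in itertools.product(range(reps), repeat=3)
        for bx, by, bz in basis
    ]
    length = reps * a
    cell = ((length, 0.0, 0.0), (0.0, length, 0.0), (0.0, 0.0, length))
    return positions, cell


# FCC coordination check: the cutoff sits between the first shell (a / sqrt(2))
# and the second shell (a), and below half the box height (1.0).
positions, cell = fcc_supercell(reps=2, a=1.0)
pairs = brute_force_nl(positions, cell, (True, True, True), cutoff=0.8)
neighbors_per_atom = [sum(1 for i, _, _ in pairs if i == n) for n in range(len(positions))]
```

A 2 x 2 x 2 supercell is used because a single conventional FCC cell is too small: its nearest-neighbor distance already exceeds half the cell height.
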
python -m benchmarks.run_benchmarks --sizes 1000 4000 16000
python -m benchmarks.run_benchmarks --include-reference  # also time standard_nl

The script sweeps system sizes, runs V1, V2, and the auto selector, and
reports median and best-of wall-clock time. Pass --include-reference to
also time torch_sim.neighbors.standard_nl on CPU as a baseline. Use
--cutoff and --density to control the test geometry; the cutoff must
stay below half the cubic box height (the script enforces this).
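
The measurement and the cutoff guard can be sketched as below. This is an illustration rather than the code in benchmarks/run_benchmarks.py: `time_call`, `cubic_box_length`, and `check_cutoff` are hypothetical names, and interpreting the density as atoms per unit volume in a cubic box is an assumption.

```python
import statistics
import time


def cubic_box_length(n_atoms, density):
    """Edge length of a cubic box holding n_atoms at the given number density."""
    return (n_atoms / density) ** (1.0 / 3.0)


def check_cutoff(n_atoms, density, cutoff):
    """Raise if the cutoff is not below half the cubic box height."""
    box = cubic_box_length(n_atoms, density)
    if cutoff >= 0.5 * box:
        raise ValueError(f"cutoff {cutoff} must be below half the box length {box:.3f}")
    return box


def time_call(fn, repeats=5, warmup=1, sync=None):
    """Return (median, best) wall-clock seconds over `repeats` calls of fn.

    `sync` is an optional callable run before and after each timed call.
    GPU work is asynchronous, so a device synchronization is needed there
    for the wall-clock time to cover the kernel itself.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), min(samples)
```

When timing the extension, passing `sync=torch.cuda.synchronize` would be the natural choice; without it the timer may stop before the queued kernels have finished.
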
BSD-3-Clause