Description
Hi @pengzhangzhi,
Thank you for the excellent work on PTM‑Mamba! I’ve been reproducing and using the model in a multi‑GPU setting and ran into an issue that may affect any multi‑GPU workflow.
When running inference on a non‑default GPU (e.g. cuda:1, cuda:2), I encounter the following Triton error:
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py", line 81, in kernel_call
self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages,
File "", line 78, in _layer_norm_fwd_1pass_kernel
ValueError: Pointer argument (at 7) cannot be accessed from Triton (cpu tensor?)
Triton JIT-compiles its kernels for the current default CUDA device (i.e. `torch.cuda.current_device()`) at the time of the first import or first call. If you later switch to another GPU (e.g. via `torch.cuda.set_device(1)` or `model.to('cuda:1')`), Triton will still try to launch the previously compiled kernel on the original device. The kernel's pointer arguments then reference memory on a different GPU, triggering the access error above.
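A minimal sketch of the failure mode, assuming the layer-norm kernels come from `mamba_ssm.ops.triton.layernorm` (the module path may differ across `mamba_ssm` versions):

```python
import torch
from mamba_ssm.ops.triton.layernorm import layer_norm_fn  # path may vary by mamba_ssm version

# First call: Triton JIT-compiles _layer_norm_fwd_1pass_kernel for the
# current default device (cuda:0).
x = torch.randn(2, 64, device="cuda:0")
w = torch.ones(64, device="cuda:0")
b = torch.zeros(64, device="cuda:0")
_ = layer_norm_fn(x, w, b)

# Second call on cuda:1 without updating the default device: the launch
# still targets cuda:0, so the cuda:1 pointers fail with
# "Pointer argument (at 7) cannot be accessed from Triton (cpu tensor?)".
_ = layer_norm_fn(x.to("cuda:1"), w.to("cuda:1"), b.to("cuda:1"))

# A device guard should make Triton compile and launch on the tensors'
# own device instead:
with torch.cuda.device(1):
    _ = layer_norm_fn(x.to("cuda:1"), w.to("cuda:1"), b.to("cuda:1"))
```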
Proposed fix
A short-term workaround is to delay or rebind the Triton compilation to the active device by adding a `reference_compile` flag to the `from_pretrained()` method. For example, in `ptm-mamba/protein_lm/modeling/models/mamba/lm.py` around line 418, change the signature to:
```python
def from_pretrained(cls, pretrained_model_name, device=None, dtype=None,
                    reference_compile=False,  # new flag
                    **kwargs):
```
Then, when calling from_pretrained(), pass reference_compile=False to force Triton to compile for the currently set device.
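A usage sketch of the proposed call, assuming the model class in `lm.py` is named `MambaLMHeadModel` and using a placeholder checkpoint path (both are illustrative, not taken from the repo):

```python
import torch
from protein_lm.modeling.models.mamba.lm import MambaLMHeadModel  # hypothetical class name

# Select the non-default GPU *before* the first Triton kernel is compiled,
# then load the model directly onto that device.
torch.cuda.set_device(1)
model = MambaLMHeadModel.from_pretrained(
    "path/to/checkpoint",     # placeholder checkpoint path
    device="cuda:1",
    reference_compile=False,  # proposed flag: compile for the active device
)
```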
Thanks again for this great library—hope this helps improve the multi‑GPU experience!