
Multi‑GPU Triton Kernel Device Binding Issue in ptm‑mamba #6

@Devin-jun

Description


Hi @pengzhangzhi,

Thank you for the excellent work on PTM‑Mamba! I’ve been reproducing and using the model in a multi‑GPU setting and ran into an issue that may affect any multi‑GPU workflow.
When running inference on a non‑default GPU (e.g. cuda:1, cuda:2), I encounter the following Triton error:

  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/autotuner.py", line 81, in kernel_call
    self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages,
  File "<string>", line 78, in _layer_norm_fwd_1pass_kernel
ValueError: Pointer argument (at 7) cannot be accessed from Triton (cpu tensor?)

Triton JIT‑compiles its kernels for the current default CUDA device (i.e. torch.cuda.current_device()) at the time of first import or first call. If you later switch to another GPU (e.g. via torch.cuda.set_device(1) or model.to('cuda:1')), Triton still tries to run the previously compiled kernel bound to the original device, so the pointer arguments now refer to memory on a different GPU, which triggers the access error above.
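
As a purely user-side illustration (not part of ptm-mamba itself), the sketch below assumes the Triton kernels are compiled lazily on the first forward pass and simply binds the target GPU as the default CUDA device before anything is loaded, so compilation and launch happen on the same device; the tensor and layer are placeholders for the actual model call.

    import torch

    # Minimal user-side sketch of the workaround (assumes the Triton kernels
    # are compiled lazily on the first forward pass).
    device = torch.device("cuda:1")

    # Make cuda:1 the default CUDA device *before* the model is created, so
    # Triton compiles and launches its kernels on the GPU that actually holds
    # the pointer arguments.
    torch.cuda.set_device(device)

    with torch.cuda.device(device):
        # Placeholder for loading PTM-Mamba and running inference; the key
        # point is that every tensor and module lives on `device`.
        x = torch.randn(2, 8, device=device)
        y = torch.nn.Linear(8, 8).to(device)(x)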

Proposed fix

A short-term workaround is to delay, or rebind, the Triton compilation to the active device by adding a reference_compile flag to the from_pretrained() method. For example, in ptm-mamba/protein_lm/modeling/models/mamba/lm.py around line 418, change the signature to:

    def from_pretrained(cls, pretrained_model_name, device=None, dtype=None,
                        reference_compile=False,  # new flag
                        **kwargs):

Then, when calling from_pretrained(), pass reference_compile=False to force Triton to compile for the currently set device.
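
For concreteness, here is a minimal sketch of how such a flag could be wired up. The class name, constructor, state-dict handling, and the use of torch.compile for the reference path are all assumptions for illustration, not the actual ptm-mamba implementation:

    import torch
    import torch.nn as nn

    class MambaLMHeadModel(nn.Module):  # name assumed; stands in for the real model class
        def __init__(self, d_model: int = 16):
            super().__init__()
            self.backbone = nn.Linear(d_model, d_model)  # placeholder for the real layers

        @classmethod
        def from_pretrained(cls, pretrained_model_name, device=None, dtype=None,
                            reference_compile=False,  # proposed new flag
                            **kwargs):
            # Bind the default CUDA device *before* any Triton kernel is compiled,
            # so that later kernel launches see pointer arguments on the same GPU.
            if device is not None and torch.device(device).type == "cuda":
                torch.cuda.set_device(device)

            model = cls(**kwargs)
            # (loading of the pretrained state_dict from `pretrained_model_name`
            #  is omitted here for brevity)
            model = model.to(device=device, dtype=dtype)

            if reference_compile:
                # Assumed semantics: optionally compile a reference PyTorch path.
                # With the flag left at False, the Triton kernels are compiled
                # lazily on the first forward pass, i.e. for the device bound above.
                model = torch.compile(model)
            return model

A caller would then load directly onto the target GPU, e.g. model = MambaLMHeadModel.from_pretrained("ptm-mamba", device="cuda:1", reference_compile=False).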
Thanks again for this great library—hope this helps improve the multi‑GPU experience!
