Issue with triton / cuda module loading #10

Hi PTM-Mamba team!

I was having fun playing around with your model on my local machine, but I'm having trouble getting it to work on our cluster. The issue seems to be that Triton cannot find the necessary CUDA libraries. Unfortunately, I understand too little of this to describe the issue accurately (maybe I should open an issue on the mamba GitHub instead?). I've attached the error I encounter as error.rtf.

A brief description of my setup:
I use the container as instructed and run it in interactive mode on a GPU node with the script I've attached (runContainer.rtf). As far as I can tell, GPU binding works, but maybe there's an issue with accessing the CUDA libraries (see cuda.rtf). Running the container with --nvccli throws the error in error2.rtf.
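For anyone trying to narrow this down, here is a minimal sketch of a check that could be run inside the container to see whether the NVIDIA driver library is visible to the Python process at all. It uses only the standard library. The function name `probe_cuda_driver` and the candidate file names are assumptions for illustration, and whether Triton searches for the library in the same way is not something this check confirms.

```python
import ctypes
import ctypes.util
import os


def probe_cuda_driver(candidates=("libcuda.so.1", "libcuda.so")):
    """Report whether the NVIDIA driver library can be located and loaded.

    `candidates` are assumed file names; adjust them for the cluster.
    """
    report = {
        # Search path the dynamic loader was given inside the container.
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        # What the system linker cache knows about libcuda (None if nothing).
        "find_library": ctypes.util.find_library("cuda"),
        "loaded": None,
        "errors": {},
    }
    for name in candidates:
        try:
            ctypes.CDLL(name)  # raises OSError if the library cannot be loaded
        except OSError as exc:
            report["errors"][name] = str(exc)
        else:
            report["loaded"] = name
            break
    return report


if __name__ == "__main__":
    for key, value in probe_cuda_driver().items():
        print(f"{key}: {value}")
```

If `loaded` stays `None` inside the container but not on the host, that would suggest the driver libraries are not being bound into the container, independent of Triton.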

What I've tried:
I assumed this was an issue with binding .triton, but explicitly setting the environment variables to mounted locations changes nothing.
I've also tried to disable Triton (a) by setting DEEPSPEED_DISABLE_TRITON=1 (this does not seem to work, since Triton is loaded first and only then disabled), and (b) by mocking the module before loading and then disabling it (this causes issues with loading the mamba modules, I think; either that or something else broke in the process).
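To illustrate why approach (b) may break later imports, here is a minimal sketch of mocking a module through `sys.modules`. It assumes, without having verified it against the installed versions, that the mamba kernels use decorators such as `triton.jit` and `triton.autotune` at import time. The module name `triton_stub_demo` and both helper functions are hypothetical and chosen so that a real Triton install is left untouched.

```python
import sys
import types


def install_stub(name):
    """Register an empty placeholder module under `name` and return it."""
    stub = types.ModuleType(name)
    sys.modules[name] = stub
    return stub


def missing_attributes(module, required):
    """Return the names in `required` that `module` does not provide."""
    return [attr for attr in required if not hasattr(module, attr)]


if __name__ == "__main__":
    # Hypothetical name; a real experiment would use "triton" instead.
    install_stub("triton_stub_demo")
    import triton_stub_demo  # resolves to the stub via sys.modules

    # Attributes that code decorated at import time is assumed to need.
    print(missing_attributes(triton_stub_demo, ["jit", "autotune"]))
```

An empty stub satisfies the `import` statement itself, but any code that accesses an attribute of the module while being imported fails with an `AttributeError`. That would be consistent with the mamba modules failing to load after mocking, although the attached logs would be needed to confirm it.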

Thanks in advance for your time!

[error.rtf](https://github.com/user-attachments/files/23546284/error.rtf)
[error2.rtf](https://github.com/user-attachments/files/23546288/error2.rtf)
[cuda.rtf](https://github.com/user-attachments/files/23546285/cuda.rtf)
[condaenv.rtf](https://github.com/user-attachments/files/23546286/condaenv.rtf)
[runContainer.rtf](https://github.com/user-attachments/files/23546287/runContainer.rtf)
