Issue with triton / cuda module loading #10

@Jhaquen

Hi PTM-Mamba team!

I was having fun playing around with your model on my local machine, but I'm having trouble getting it to work on our cluster. The issue seems to be that Triton cannot find the necessary CUDA libraries. Unfortunately, I understand too little of this to describe the issue accurately (maybe I should open an issue on the Mamba GitHub instead?). I've attached the error I encounter as error.rtf.

A brief description of my setup:
I use the container as instructed and run it in interactive mode on a GPU node with the attached script (runContainer.rtf). As far as I can tell, GPU binding works, but there may be an issue with accessing the CUDA libraries (see cuda.rtf). Running the container with --nvccli throws the error in error2.rtf.
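
To narrow down whether the CUDA libraries are visible inside the container at all, a check along these lines could be run from the container's Python. This is only a minimal sketch: `probe_cuda_environment` is a hypothetical helper, and the library names `libcuda.so.1` and `libcudart.so` are assumptions about what needs to be loadable, not something taken from the attached logs.

```python
import ctypes
import ctypes.util
import os
import shutil


def probe_cuda_environment(library_names=("libcuda.so.1", "libcudart.so")):
    """Collect basic facts about CUDA visibility in the current process.

    The default library names are assumptions; adjust them to whatever
    the actual error message complains about.
    """
    report = {
        # Is the driver's CLI tool reachable from inside the container?
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # Search path the dynamic loader will consult.
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH", ""),
        # What the system's library lookup reports for "cuda" (None if nothing).
        "find_library_cuda": ctypes.util.find_library("cuda"),
        "loadable": {},
    }
    for name in library_names:
        try:
            ctypes.CDLL(name)  # try to actually load the shared object
            report["loadable"][name] = True
        except OSError:
            report["loadable"][name] = False
    return report


if __name__ == "__main__":
    for key, value in probe_cuda_environment().items():
        print(f"{key}: {value}")
```

If `libcuda.so.1` turns out not to be loadable inside the container while it is on the host, that would point at the driver libraries not being bound into the container rather than at Triton itself.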

What I've tried:
I assumed this was an issue with binding .triton, but explicitly setting the environment variables to the mounted locations changes nothing.
I've also tried to disable Triton (a) by setting DEEPSPEED_DISABLE_TRITON=1 (this does not seem to work, since Triton is loaded first and only then disabled) and (b) by mocking the module before loading and then disabling it (this causes issues with loading the Mamba modules, I think; either that or something else broke in the process).
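
For context, approach (b) can be sketched roughly as below. This is an illustration under assumptions, not the exact code that was run: `install_stub_module` is a hypothetical helper, and the demo uses a made-up module name instead of `triton` so it can be executed anywhere.

```python
import importlib.machinery
import sys
import types


def install_stub_module(name):
    """Register an empty placeholder module under `name` unless it is already imported."""
    if name in sys.modules:
        return sys.modules[name]
    stub = types.ModuleType(name)
    # Give the stub a spec; importlib.util.find_spec() raises ValueError
    # for modules in sys.modules whose __spec__ is None.
    stub.__spec__ = importlib.machinery.ModuleSpec(name, loader=None)
    # An empty __path__ makes the stub look like a package with no submodules.
    stub.__path__ = []
    sys.modules[name] = stub
    return stub


if __name__ == "__main__":
    # "hypothetical_stub_pkg" stands in for the real module name.
    stub = install_stub_module("hypothetical_stub_pkg")
    import hypothetical_stub_pkg  # resolved from sys.modules, no real import happens

    print(hypothetical_stub_pkg is stub)
    print(hasattr(stub, "jit"))
```

A stub like this satisfies a plain `import`, but it has none of the real module's attributes or submodules. Any code that actually uses the mocked package (for example via decorators or submodule imports) will then fail at import time, which may be what happens with the Mamba modules in (b).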

Thanks in advance for your time!

error.rtf
error2.rtf
cuda.rtf
condaenv.rtf
runContainer.rtf
