
Cannot load model on the GPU with llama-cpp-python (Windows) #130

Open
silicode17 opened this issue Mar 9, 2024 · 3 comments

@silicode17

This is how I am loading the model in Python, but it only uses the CPU:

from llama_cpp import Llama

# Request the first 50 layers on the GPU; n_ctx is the context window size.
llm = Llama(model_path="./functionary-7b-v2.q4_0.gguf", n_ctx=4096, n_gpu_layers=50)

I have also tried reinstalling llama-cpp-python with the commands below, but that didn't help:

rem Force a from-source rebuild of llama-cpp-python with cuBLAS (CUDA) support.
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install --upgrade --verbose --force-reinstall llama-cpp-python --no-cache-dir

My GPU only has 8 GB of VRAM; could that be the reason? I saw in the readme that this model requires 24 GB of VRAM...
However, other models such as Mistral load on my GPU just fine, so I am assuming that my CUDA installation is correct.
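
As a sanity check, loading with verbose=True makes llama.cpp print how many layers were actually offloaded during loading. A minimal sketch (I'm assuming the log line mentioned in the comment appears in a CUDA-enabled build):

from llama_cpp import Llama

# A CUDA build prints something like "llm_load_tensors: offloaded 33/33 layers to GPU";
# a CPU-only build reports 0 offloaded layers (or no such line at all).
llm = Llama(
    model_path="./functionary-7b-v2.q4_0.gguf",
    n_ctx=4096,
    n_gpu_layers=50,
    verbose=True,
)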

@jeffrey-fong
Contributor

Hi, we have recently integrated our models into llama-cpp-python directly. Here's how you can use it. Can you try it and see if it works now?

I tested it on my end with the following code, and the model loads using 4.835 GB of GPU VRAM.

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

# Download the GGUF from the Hugging Face Hub and offload every layer to the GPU.
llm = Llama.from_pretrained(
    repo_id="meetkai/functionary-7b-v2-GGUF",
    filename="functionary-7b-v2.q4_0.gguf",
    chat_format="functionary-v2",
    tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-7b-v2-GGUF"),
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 = offload all layers
)
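
Once it loads, you can call it like any other llama-cpp-python chat model. A minimal sketch of a tool call using the llm from above; the get_current_weather tool here is just an illustrative example, not something shipped with the model:

# Ask a question that should make functionary emit a tool call.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the weather like in Istanbul?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
)
print(response["choices"][0]["message"])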

@silicode17
Author


Yes, it works.
Quick question: is there a way to load a local GGUF file instead of downloading it from the Hub?

@jeffreymeetkai
Collaborator

Sorry for the late reply, but yes, you can load a local GGUF file by initializing the Llama class directly. Here's a guide showing how.
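
In short, something like this should work (same chat format and tokenizer as the example above, with model_path pointing at your local file):

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

# Initialize Llama directly with a local GGUF file instead of from_pretrained().
llm = Llama(
    model_path="./functionary-7b-v2.q4_0.gguf",
    chat_format="functionary-v2",
    tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-7b-v2-GGUF"),
    n_ctx=4096,
    n_gpu_layers=-1,
)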
