llama.cpp

LLM inference in C/C++, with Nexa AI's support for audio language models and Swift bindings.

This repo is cloned from llama.cpp commit 74d73dc85cc2057446bf63cc37ff649ae7cebd80 and is compatible with llama-cpp-python commit 7ecdd944624cbd49e4af0a5ce1aa402607d58dcc.
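To use this fork through llama-cpp-python, one approach is to check out the pinned llama-cpp-python commit and point its vendored llama.cpp submodule at this repository. A minimal sketch (the remote name is arbitrary, and the vendor/llama.cpp submodule path follows the usual llama-cpp-python layout; adjust to your setup):

git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python
git checkout 7ecdd944624cbd49e4af0a5ce1aa402607d58dcc
git submodule update --init --recursive
cd vendor/llama.cpp
git remote add nexa https://github.com/NexaAI/llama.cpp   # this fork
git fetch nexa
git checkout 74d73dc85cc2057446bf63cc37ff649ae7cebd80
cd ../..
pip install -e .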

Customize the quantization group size at compile time (CPU inference only)

The only difference from upstream is the -DQK4_0 flag passed to cmake, which sets the group size used by Q4_0 quantization (the upstream default is 32). For example, to build with a group size of 128:

cmake -B build_cpu_g128 -DQK4_0=128
cmake --build build_cpu_g128
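The same pattern works for other group sizes; the build directory name is just a convention, as in this sketch:

cmake -B build_cpu_g64 -DQK4_0=64
cmake --build build_cpu_g64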

To quantize a model with the customized group size, run:

./build_cpu_g128/bin/llama-quantize <model_path.gguf> <quantization_type>
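For example, to produce a Q4_0 model with the group size baked into this build (the file names here are hypothetical; llama-quantize takes the input GGUF, an optional output path, and the quantization type):

./build_cpu_g128/bin/llama-quantize models/ggml-model-f16.gguf models/ggml-model-q4_0-g128.gguf Q4_0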

To run the quantized model:

./build_cpu_g128/bin/llama-cli -m <quantized_model_path.gguf>
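For instance, a quick smoke test with a short prompt (the model path is hypothetical; -p sets the prompt and -n the number of tokens to generate):

./build_cpu_g128/bin/llama-cli -m models/ggml-model-q4_0-g128.gguf -p "Hello, my name is" -n 64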

Note:

Make sure the model you run was quantized with the same group size the binary was compiled with; otherwise you will get a runtime error when loading the model.
