quantization

db7dee4 · Dec 16, 2024

Name	Last commit message	Last commit date
README.md	Update README.md	Dec 9, 2024
model_split.py	update	Dec 4, 2024
quantization.py	update	Dec 16, 2024
quantization_GPTQModel.py	update	Dec 16, 2024
quantization_HF.py	update	Dec 16, 2024
upload_quantized_model.py	update	Dec 4, 2024

README.md

Quantize a Large Language Model (LLM) Using GPTQ

First, from the project home directory, copy the following files out of the quantization folder:

cp -r ./quantization/quantization.py ./
cp -r ./quantization/quantization_GPTQModel.py ./
cp -r ./quantization/quantization_HF.py ./
cp -r ./quantization/upload_quantized_model.py ./

Then add your own Hugging Face READ and WRITE tokens to the config_quantization.yml file.

Conduct quantization based on GPTQ algorithm

quantization.py uses the AutoGPTQ Python package to run the quantization.

python quantization.py "meta-llama/Meta-Llama-3-70B-Instruct" "./gptq_model" --bits 4 --group_size 128 --desc_act 1 --dtype bfloat16 --seqlen 2048 --damp 0.01
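To make --bits 4 and --group_size 128 concrete, here is a minimal sketch of group-wise round-to-nearest quantization in plain Python. It is not GPTQ itself (GPTQ additionally compensates for rounding error using second-order information) and it is not taken from quantization.py. The function names quantize_groupwise and dequantize_groupwise are hypothetical and exist only for illustration.

```python
def quantize_groupwise(weights, bits=4, group_size=128):
    """Round-to-nearest quantization with one (scale, zero) pair per group."""
    qmax = 2 ** bits - 1  # 4 bits -> integer codes 0..15
    codes, params = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Include 0.0 in the range so that zero is always representable.
        lo = min(min(group), 0.0)
        hi = max(max(group), 0.0)
        scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero group
        zero = round(-lo / scale)        # integer code that maps back to 0.0
        for w in group:
            code = round(w / scale) + zero
            codes.append(min(qmax, max(0, code)))  # clamp into the code range
        params.append((scale, zero))
    return codes, params


def dequantize_groupwise(codes, params, group_size=128):
    """Map integer codes back to approximate floating-point weights."""
    restored = []
    for i, code in enumerate(codes):
        scale, zero = params[i // group_size]
        restored.append((code - zero) * scale)
    return restored
```

With group_size 128, every 128 consecutive weights share one scale and zero point, so a smaller group size tracks the weights more closely at the cost of storing more parameters.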

quantization_GPTQModel.py uses the GPTQModel Python package to run the quantization. Install the package first:

pip install -v gptqmodel --no-build-isolation

Then run the script:

python quantization_GPTQModel.py "meta-llama/Llama-3.3-70B-Instruct" "./gptq_model" --bits 4 --group_size 128 --seqlen 2048 --damp 0.01 --desc_act 1 --dtype bfloat16
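The --damp and --desc_act flags most likely correspond to two steps of the GPTQ procedure: damping the Hessian diagonal before inversion, and quantizing columns in order of decreasing activation magnitude. The sketch below shows both steps on a small nested-list matrix, assuming the flags map to the usual GPTQ settings (a damping fraction and an activation-order switch). The helper names are hypothetical and the code is not taken from the scripts in this folder.

```python
def dampen_hessian(hessian, damp=0.01):
    """Return a copy of H with damp * mean(diag(H)) added to each diagonal entry."""
    n = len(hessian)
    offset = damp * sum(hessian[i][i] for i in range(n)) / n
    return [
        [value + offset if i == j else value for j, value in enumerate(row)]
        for i, row in enumerate(hessian)
    ]


def activation_order(hessian):
    """Column indices sorted by descending diagonal entry of H."""
    return sorted(range(len(hessian)), key=lambda i: hessian[i][i], reverse=True)
```

For example, with a diagonal of 4.0, 2.0, 6.0 and damp 0.01, each diagonal entry grows by 0.04 and the activation order visits column 2 first, then 0, then 1. Damping keeps the matrix well conditioned for inversion, while the ordering lets the columns with the largest activations be quantized before error has accumulated.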

quantization_HF.py uses the Hugging Face transformers package to run the quantization.

python quantization_HF.py --repo "meta-llama/Meta-Llama-3-70B-Instruct" --bits 4 --group_size 128
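For orientation, GPTQ quantization through transformers is typically driven by a GPTQConfig passed to from_pretrained. The following is a rough sketch of that pattern, not the contents of quantization_HF.py. The function name quantize_with_transformers, the output_dir argument, and the "c4" calibration dataset are assumptions, and the call needs the optimum package plus a GPTQ backend installed alongside transformers.

```python
def quantize_with_transformers(repo, output_dir, bits=4, group_size=128):
    """Quantize a causal LM with GPTQ via transformers and save the result."""
    # Imported lazily because transformers is a heavy, optional dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    tokenizer = AutoTokenizer.from_pretrained(repo)
    gptq_config = GPTQConfig(
        bits=bits,
        group_size=group_size,
        dataset="c4",  # assumption: calibration data; the script may use another set
        tokenizer=tokenizer,
    )
    # Quantization happens while the model is loaded with this config.
    model = AutoModelForCausalLM.from_pretrained(
        repo, quantization_config=gptq_config, device_map="auto"
    )
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Calling quantize_with_transformers("meta-llama/Meta-Llama-3-70B-Instruct", "./gptq_model") would mirror the command above, given enough GPU memory and access to the gated model.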

Upload the quantized model to Hugging Face

python upload_quantized_model.py --repo "shuyuej/MedLLaMA3-70B-BASE-MODEL-QUANT" --folder_path "./gptq_model"
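An upload of this kind can be done with the huggingface_hub client. The sketch below is one possible shape for it, assuming a WRITE token is passed in explicitly. The function name upload_folder_to_hub is hypothetical, and upload_quantized_model.py may be implemented differently.

```python
def upload_folder_to_hub(repo_id, folder_path, token=None):
    """Create the model repo if needed and upload every file in folder_path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    # exist_ok=True makes the call a no-op when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=folder_path, repo_id=repo_id, repo_type="model")
```

If token is left as None, huggingface_hub falls back to the locally stored login or the HF_TOKEN environment variable.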

Change the model config files

In the config.json file, change "architectures" to LLaMAForCausalLM if the model is a LLaMA model.
The tokenizer files are not uploaded automatically.
Download them manually from the official Hugging Face repo and upload them to your repo.
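The config.json edit can also be scripted rather than done by hand. This is a small sketch using only the standard library. The helper name set_architectures is hypothetical, and the architecture name is passed in by the caller rather than hard-coded.

```python
import json
from pathlib import Path


def set_architectures(config_path, architecture):
    """Overwrite the "architectures" field of a config.json and return the config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Hugging Face configs store this field as a list of class names.
    config["architectures"] = [architecture]
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For the output folder used above, the call would be set_architectures("./gptq_model/config.json", "LLaMAForCausalLM"). All other fields in the file are left unchanged.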

Model Split

We also provide a script to split a large safetensors file into smaller shards. With --max_size_gb 5, the large file is saved as shards of up to 5 GB each, and a model.safetensors.index.json index file is saved as well.

python model_split.py --large_file "gptq_model/model.safetensors" --output_dir "split_model" --max_size_gb 5
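The core of such a split is deciding which tensors go into which shard and recording that mapping in the index file. The sketch below shows one greedy way to plan this from a name-to-size mapping. It is not the code of model_split.py. The helper names are hypothetical, 1 GB is assumed to mean 1024**3 bytes, and the shard file names follow the common model-0000X-of-0000Y.safetensors convention.

```python
def plan_shards(tensor_sizes, max_size_gb=5):
    """Greedily group tensors (name -> size in bytes) into shards under the limit."""
    max_bytes = int(max_size_gb * 1024 ** 3)  # assumption: binary gigabytes
    shards, current, current_bytes = [], [], 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would exceed the limit.
        if current and current_bytes + size > max_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        shards.append(current)
    return shards


def build_index(tensor_sizes, shards):
    """Build the dict that would be written to model.safetensors.index.json."""
    weight_map = {}
    for number, names in enumerate(shards, start=1):
        filename = f"model-{number:05d}-of-{len(shards):05d}.safetensors"
        for name in names:
            weight_map[name] = filename
    return {
        "metadata": {"total_size": sum(tensor_sizes.values())},
        "weight_map": weight_map,
    }
```

A single tensor larger than the limit still gets a shard of its own, since a tensor cannot be split across files in this scheme.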
