"Memory Efficient LM Compression using Fisher Information from Low-Rank Representations" Daniil Moskovskiy, Sergey Pletenev, Sergey Zagoruyko, Alexander Panchenko
This repository provides code to reproduce experiments on compressing Large Language Models (LLMs) such as Llama-2 and Llama-3.1 using Fisher-Weighted Singular Value Decomposition (FWSVD). The Fisher information is estimated efficiently from the gradients obtained during Low-Rank Adaptation (LoRA) fine-tuning.
Large Language Models (LLMs) have shown remarkable capabilities but often come with significant computational and memory costs. Model compression techniques are crucial for deploying these models in resource-constrained environments. This work explores Singular Value Decomposition (SVD) for compressing LLM weight matrices, enhanced by incorporating Fisher information.
We propose using Fisher Information, estimated from the gradients during LoRA fine-tuning, as a weighting mechanism for SVD (FWSVD). This approach aims to preserve the most critical parts of the weight matrices, leading to better performance retention after compression compared to standard SVD. This repository provides the necessary tools to:
- Estimate Fisher Information for specific layers of Llama models using LoRA.
- Apply both standard SVD and Fisher-Weighted SVD (FWSVD) to compress these models.
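For intuition, here is a minimal PyTorch sketch of one common FWSVD formulation: each row of the weight matrix is re-weighted by its accumulated Fisher importance before the SVD and un-weighted afterwards. This is an illustration of the idea, not the repository's `w_svd` implementation:

```python
import torch

def fwsvd(weight: torch.Tensor, fisher: torch.Tensor, rank: int):
    """Fisher-weighted low-rank factorization sketch.

    weight: (out, in) weight matrix to factor.
    fisher: accumulated squared gradients, same shape as `weight`.
    Returns (A, B) with weight ~= A @ B and inner dimension `rank`.
    """
    # Collapse the Fisher estimate to one importance score per row;
    # the square root turns it into a row re-weighting of the matrix.
    row_imp = fisher.sum(dim=1).clamp_min(1e-8).sqrt()  # (out,)
    # SVD of the importance-weighted matrix: truncation now minimizes
    # the Fisher-weighted (task-aware) reconstruction error.
    U, S, Vh = torch.linalg.svd(row_imp[:, None] * weight, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank]
    # Undo the row weighting on the left factor: W ~= A @ B.
    A = (U * S) / row_imp[:, None]
    B = Vh
    return A, B
```

With `weight=None` the same pipeline degenerates to standard truncated SVD, which is exactly the distinction the compression script draws between `'svd'` with and without Fisher weights.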
## Features

- Implementation of Fisher-Weighted SVD (FWSVD) for LLM compression.
- Efficient Fisher Information estimation using LoRA gradients via `trl` and `peft`.
- Script (`generate_fisher_llm.py`) to compute and save Fisher Information estimates.
- Script (`compress_llama.py`) to apply SVD or FWSVD compression to Llama models (e.g., Llama-2, Llama-3.1).
- Customizable compression ranks per layer.
- Integration with the Hugging Face `transformers`, `peft`, and `trl` libraries.
## Project Structure

```
.
├── LICENSE
├── README.md
├── compress_llama.py
├── generate_fisher_llm.py
└── llama_comp/
    ├── __init__.py
    ├── compression_utils.py
    └── utils_svd.py
```
## Requirements

- Python 3.8+
- PyTorch (>= 1.13)
- A CUDA-enabled GPU (recommended for reasonable performance)
- A Hugging Face Hub account and token (if using gated models like Llama-2/3)
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/s-nlp/lora_fwsvd.git
   cd lora_fwsvd
   ```

2. Set up a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install the dependencies:

   ```bash
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   pip install transformers datasets peft trl accelerate bitsandbytes numpy scikit-learn
   ```

   Note: `bitsandbytes` might be needed if you use quantization features from `transformers`/`peft`, although it is not explicitly used in the provided snippets.

4. Log in to the Hugging Face Hub (if needed):

   ```bash
   huggingface-cli login
   ```
## Usage

The process involves two main steps: generating the Fisher Information weights, and then applying the compression using these weights.
### Step 1: Generate Fisher Information

This step uses LoRA fine-tuning on a dataset to estimate the Fisher Information for the target model's weights.
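The underlying idea is simple: the diagonal Fisher estimate for a parameter is the running sum of its squared gradients over training batches. A minimal sketch of that accumulation (illustrative only; in the repository this happens inside the `CustomTrainer` during LoRA fine-tuning):

```python
import torch

fisher = {}

def accumulate_fisher(model: torch.nn.Module) -> None:
    """Add the squared gradients from the current backward pass."""
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        sq_grad = param.grad.detach().float() ** 2
        fisher[name] = fisher.get(name, 0.0) + sq_grad

# Call accumulate_fisher(model) after each loss.backward();
# pickle the `fisher` dict once training finishes.
```

With LoRA, gradients only exist for the low-rank adapter parameters, which is what keeps this estimation memory-efficient compared to full fine-tuning.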
1. Configure `generate_fisher_llm.py`:

   - Modify the `dataset` loading code if you want to use a different dataset (e.g., `wikitext`, `c4`, etc.). The current example uses `"robbiegwaldd/dclm-micro"`.
   - Adjust `peft_config` (e.g., `r`, `lora_alpha`, `target_modules`) and `training_args` (e.g., `max_seq_length`, `per_device_train_batch_size`, `num_train_epochs`, `output_dir`) as needed; an illustrative configuration is sketched below.
   - Set the base model name in the `CustomTrainer` initialization (e.g., `"meta-llama/Llama-2-7b-hf"`, `"meta-llama/Llama-3.1-8B"`).
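For reference, a hypothetical configuration along these lines (all values are illustrative, and depending on your `trl` version `max_seq_length` may be an `SFTConfig` field or an argument passed to the trainer directly):

```python
from peft import LoraConfig
from trl import SFTConfig

# Illustrative LoRA settings; 'all-linear' matches the output
# directory suffix used in the examples below.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules='all-linear',
    task_type='CAUSAL_LM',
)

training_args = SFTConfig(
    output_dir='./_tmp6_llama2_7b_all-linear',
    max_seq_length=1024,
    per_device_train_batch_size=1,
    num_train_epochs=1,
)
```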
2. Run the script:

   ```bash
   python generate_fisher_llm.py
   ```

3. Output: the script writes a pickle file (e.g., `./_tmp6_llama2_7b_all-linear/fisher_XXXX.pkl`) containing a dictionary whose keys are layer names and whose values are the squared gradients (Fisher estimates) accumulated during training.
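To sanity-check the output, you can load the pickle and inspect a few entries (keep the `XXXX` placeholder as whatever suffix your run produced):

```python
import pickle

with open('./_tmp6_llama2_7b_all-linear/fisher_XXXX.pkl', 'rb') as fp:
    fisher = pickle.load(fp)

# One tensor of accumulated squared gradients per tracked layer.
for name, sq_grads in list(fisher.items())[:5]:
    print(name, tuple(sq_grads.shape), float(sq_grads.sum()))
```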
### Step 2: Compress the Model

This step takes the original model and the generated Fisher Information and performs SVD or FWSVD compression.
1. Configure `compress_llama.py`:

   - Set the `pretrained` variable to the Hugging Face model identifier you want to compress (it must match the one used for Fisher generation, or be compatible with it).
   - Load the Fisher Information: add code to load the `.pkl` file generated in Step 1.

     ```python
     import pickle
     import torch  # needed to unpickle the tensor values

     # Path to the Fisher estimates produced in Step 1.
     fisher_path = './_tmp6_llama2_7b_all-linear/fisher_XXXX.pkl'
     with open(fisher_path, 'rb') as fp:
         fisher_raw = pickle.load(fp)

     # Move the estimates to CPU in full precision before weighting the SVD.
     fisher = {k: v.cpu().float() for k, v in fisher_raw.items()}
     # `model_` is the compression-wrapped model defined earlier in the script.
     model_.to_compression(compress=True, weight=fisher if fisher else None)
     ```

   - Define the compression rank (`rank`) for each layer. The example uses a dictionary `trunc_raw`; you can set a global rank or specify one per layer, and `rank=0` skips compression for that layer (see the sketch after this list).
   - Specify which layers to compress using `layer_mask` (a regex pattern).
   - Choose the `compression_type`: `'svd'` uses `w_svd` from `utils_svd.py`, which implements standard SVD if `weight=None` and FWSVD if `weight` is provided.
   - Set the `save_pretrained` path to where you want to save the compressed model and tokenizer.
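As an illustration, a per-layer rank table and layer mask might look like this (layer names and rank values are hypothetical; check `compress_llama.py` for the exact keys it expects):

```python
import re

# rank=0 skips compression for the matching layer.
trunc_raw = {
    'model.layers.0.self_attn.q_proj': 0,      # keep an early layer intact
    'model.layers.10.self_attn.q_proj': 512,   # moderate truncation
    'model.layers.10.mlp.down_proj': 1024,
}

# Compress only the attention and MLP projection layers.
layer_mask = r'.*(self_attn|mlp)\.\w+_proj$'
assert re.match(layer_mask, 'model.layers.10.mlp.down_proj')
```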
2. Run the script:

   ```bash
   python compress_llama.py
   ```

3. Output: the script prints the compression ratio and saves the compressed model artifacts (config, weights, tokenizer) to the specified directory (e.g., `./compressed_llamas/llama-Llama-3-8b-hf-09-svd`). The compressed model can then be loaded with `AutoModelForCausalLM.from_pretrained(...)`, as shown below.
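For example, using the illustrative output directory from above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Directory written by compress_llama.py via save_pretrained.
path = './compressed_llamas/llama-Llama-3-8b-hf-09-svd'
model = AutoModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)
```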
TBA