"Memory Efficient LM Compression using Fisher Information from Low-Rank Representations" Daniil Moskovskiy, Sergey Pletenev, Sergey Zagoruyko, Alexander Panchenko
This repository provides code to reproduce experiments on compressing Large Language Models (LLMs) such as Llama-2 and Llama-3.1 using Fisher-Weighted Singular Value Decomposition (FWSVD). The Fisher information is estimated efficiently from the gradients obtained during Low-Rank Adaptation (LoRA) fine-tuning.
Large Language Models (LLMs) have shown remarkable capabilities but often come with significant computational and memory costs. Model compression techniques are crucial for deploying these models in resource-constrained environments. This work explores Singular Value Decomposition (SVD) for compressing LLM weight matrices, enhanced by incorporating Fisher information.
We propose using Fisher Information, estimated from the gradients during LoRA fine-tuning, as a weighting mechanism for SVD (FWSVD). This approach aims to preserve the most critical parts of the weight matrices, leading to better performance retention after compression compared to standard SVD. This repository provides the necessary tools to:
- Estimate Fisher Information for specific layers of Llama models using LoRA.
- Apply both standard SVD and Fisher-Weighted SVD (FWSVD) to compress these models.
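For intuition, here is a minimal PyTorch sketch of one common FWSVD formulation: each row of the weight matrix is re-weighted by its accumulated Fisher importance before the SVD and un-weighted afterwards. This is an illustration of the idea, not the repository's `w_svd` implementation:

```python
import torch

def fwsvd(weight: torch.Tensor, fisher: torch.Tensor, rank: int):
    """Fisher-weighted low-rank factorization sketch.

    weight: (out, in) weight matrix to factor.
    fisher: accumulated squared gradients, same shape as `weight`.
    Returns (A, B) with weight ~= A @ B and inner dimension `rank`.
    """
    # Collapse the Fisher estimate to one importance score per row;
    # the square root turns it into a row re-weighting of the matrix.
    row_imp = fisher.sum(dim=1).clamp_min(1e-8).sqrt()  # (out,)
    # SVD of the importance-weighted matrix: truncation now minimizes
    # the Fisher-weighted (task-aware) reconstruction error.
    U, S, Vh = torch.linalg.svd(row_imp[:, None] * weight, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank]
    # Undo the row weighting on the left factor: W ~= A @ B.
    A = (U * S) / row_imp[:, None]
    B = Vh
    return A, B
```

With `weight=None` the same pipeline degenerates to standard truncated SVD, which is exactly the distinction the compression script draws between `'svd'` with and without Fisher weights.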
## Features

- Implementation of Fisher-Weighted SVD (FWSVD) for LLM compression.
- Efficient Fisher Information estimation using LoRA gradients via `trl` and `peft`.
- Script (`generate_fisher_llm.py`) to compute and save Fisher Information estimates.
- Script (`compress_llama.py`) to apply SVD or FWSVD compression to Llama models (e.g., Llama-2, Llama-3.1).
- Customizable compression ranks per layer.
- Integration with the Hugging Face `transformers`, `peft`, and `trl` libraries.
## Project Structure

```
.
├── LICENSE
├── README.md
├── compress_llama.py
├── generate_fisher_llm.py
└── llama_comp/
    ├── __init__.py
    ├── compression_utils.py
    └── utils_svd.py
```
## Requirements

- Python 3.8+
- PyTorch (>= 1.13)
- A CUDA-enabled GPU (recommended for reasonable performance)
- A Hugging Face Hub account and token (if using gated models like Llama-2/3)
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/s-nlp/lora_fwsvd.git
   cd lora_fwsvd
   ```

2. Set up a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install the dependencies:

   ```bash
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   pip install transformers datasets peft trl accelerate bitsandbytes numpy scikit-learn
   ```

   Note: `bitsandbytes` might be needed if you use quantization features from `transformers`/`peft`, although it is not explicitly used in the provided snippets.

4. Log in to the Hugging Face Hub (if needed):

   ```bash
   huggingface-cli login
   ```
## Usage

The process involves two main steps: generating the Fisher Information weights, and then applying the compression using these weights.
### Step 1: Generate Fisher Information

This step uses LoRA fine-tuning on a dataset to estimate the Fisher Information for the target model's weights.
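The underlying idea is simple: the diagonal Fisher estimate for a parameter is the running sum of its squared gradients over training batches. A minimal sketch of that accumulation (illustrative only; in the repository this happens inside the `CustomTrainer` during LoRA fine-tuning):

```python
import torch

fisher = {}

def accumulate_fisher(model: torch.nn.Module) -> None:
    """Add the squared gradients from the current backward pass."""
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        sq_grad = param.grad.detach().float() ** 2
        fisher[name] = fisher.get(name, 0.0) + sq_grad

# Call accumulate_fisher(model) after each loss.backward();
# pickle the `fisher` dict once training finishes.
```

With LoRA, gradients only exist for the low-rank adapter parameters, which is what keeps this estimation memory-efficient compared to full fine-tuning.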
1. Configure `generate_fisher_llm.py`:

   - Modify the `dataset` loading code if you want to use a different dataset (e.g., `wikitext`, `c4`, etc.). The current example uses `"robbiegwaldd/dclm-micro"`.
   - Adjust `peft_config` (e.g., `r`, `lora_alpha`, `target_modules`) and `training_args` (e.g., `max_seq_length`, `per_device_train_batch_size`, `num_train_epochs`, `output_dir`) as needed; an illustrative configuration is sketched below.
   - Set the base model name in the `CustomTrainer` initialization (e.g., `"meta-llama/Llama-2-7b-hf"`, `"meta-llama/Llama-3.1-8B"`).
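For reference, a hypothetical configuration along these lines (all values are illustrative, and depending on your `trl` version `max_seq_length` may be an `SFTConfig` field or an argument passed to the trainer directly):

```python
from peft import LoraConfig
from trl import SFTConfig

# Illustrative LoRA settings; 'all-linear' matches the output
# directory suffix used in the examples below.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules='all-linear',
    task_type='CAUSAL_LM',
)

training_args = SFTConfig(
    output_dir='./_tmp6_llama2_7b_all-linear',
    max_seq_length=1024,
    per_device_train_batch_size=1,
    num_train_epochs=1,
)
```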
2. Run the script:

   ```bash
   python generate_fisher_llm.py
   ```

3. Output: the script writes a pickle file (e.g., `./_tmp6_llama2_7b_all-linear/fisher_XXXX.pkl`) containing a dictionary whose keys are layer names and whose values are the squared gradients (Fisher estimates) accumulated during training.
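To sanity-check the output, you can load the pickle and inspect a few entries (keep the `XXXX` placeholder as whatever suffix your run produced):

```python
import pickle

with open('./_tmp6_llama2_7b_all-linear/fisher_XXXX.pkl', 'rb') as fp:
    fisher = pickle.load(fp)

# One tensor of accumulated squared gradients per tracked layer.
for name, sq_grads in list(fisher.items())[:5]:
    print(name, tuple(sq_grads.shape), float(sq_grads.sum()))
```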
### Step 2: Compress the Model

This step takes the original model and the generated Fisher Information and performs SVD or FWSVD compression.
1. Configure `compress_llama.py`:

   - Set the `pretrained` variable to the Hugging Face model identifier you want to compress (it must match the one used for Fisher generation, or be compatible with it).
   - Load the Fisher Information: add code to load the `.pkl` file generated in Step 1.

     ```python
     import pickle
     import torch  # needed to unpickle the tensor values

     # Path to the Fisher estimates produced in Step 1.
     fisher_path = './_tmp6_llama2_7b_all-linear/fisher_XXXX.pkl'
     with open(fisher_path, 'rb') as fp:
         fisher_raw = pickle.load(fp)

     # Move the estimates to CPU in full precision before weighting the SVD.
     fisher = {k: v.cpu().float() for k, v in fisher_raw.items()}
     # `model_` is the compression-wrapped model defined earlier in the script.
     model_.to_compression(compress=True, weight=fisher if fisher else None)
     ```

   - Define the compression rank (`rank`) for each layer. The example uses a dictionary `trunc_raw`; you can set a global rank or specify one per layer, and `rank=0` skips compression for that layer (see the sketch after this list).
   - Specify which layers to compress using `layer_mask` (a regex pattern).
   - Choose the `compression_type`: `'svd'` uses `w_svd` from `utils_svd.py`, which implements standard SVD if `weight=None` and FWSVD if `weight` is provided.
   - Set the `save_pretrained` path to where you want to save the compressed model and tokenizer.
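As an illustration, a per-layer rank table and layer mask might look like this (layer names and rank values are hypothetical; check `compress_llama.py` for the exact keys it expects):

```python
import re

# rank=0 skips compression for the matching layer.
trunc_raw = {
    'model.layers.0.self_attn.q_proj': 0,      # keep an early layer intact
    'model.layers.10.self_attn.q_proj': 512,   # moderate truncation
    'model.layers.10.mlp.down_proj': 1024,
}

# Compress only the attention and MLP projection layers.
layer_mask = r'.*(self_attn|mlp)\.\w+_proj$'
assert re.match(layer_mask, 'model.layers.10.mlp.down_proj')
```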
2. Run the script:

   ```bash
   python compress_llama.py
   ```

3. Output: the script prints the compression ratio and saves the compressed model artifacts (config, weights, tokenizer) to the specified directory (e.g., `./compressed_llamas/llama-Llama-3-8b-hf-09-svd`). The compressed model can then be loaded with `AutoModelForCausalLM.from_pretrained(...)`, as shown below.
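For example, using the illustrative output directory from above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Directory written by compress_llama.py via save_pretrained.
path = './compressed_llamas/llama-Llama-3-8b-hf-09-svd'
model = AutoModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)
```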
TBA