Weighted Importance Meets Iterative Pruning for Zero-Shot LLM Compression
WIP is a post-training pruning method designed to shrink large language models (LLMs) without retraining or fine-tuning. It enhances recent methods (e.g., Wanda, RIA, SparseGPT) by introducing:
- Weighted Importance Metric: A tunable metric balancing row-wise and column-wise contributions to avoid channel collapse.
- Iterative Multi-stage Pruning: Prunes in several stages and recomputes importance scores after each one, reducing the accuracy degradation common in one-shot pruning.
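The two ideas above can be illustrated with a small sketch. It is not the code in this repository: the scoring formula (a weight's magnitude relative to its row sum and its column sum, mixed by a hypothetical `alpha`), the linear sparsity schedule, and the function names `importance` and `prune_iteratively` are all assumptions made for illustration. Activation statistics, which methods such as Wanda and RIA also use, are left out, and plain Python lists are used so the example runs without dependencies.

```python
def importance(weights, alpha=0.5):
    """Score each weight by its magnitude relative to its row and column sums.

    alpha is a hypothetical knob: 1.0 uses only the row-wise term,
    0.0 only the column-wise term.
    """
    abs_w = [[abs(w) for w in row] for row in weights]
    # Fall back to 1.0 so an all-zero row or column does not divide by zero.
    row_sums = [sum(row) or 1.0 for row in abs_w]
    col_sums = [sum(col) or 1.0 for col in zip(*abs_w)]
    return [
        [alpha * a / row_sums[i] + (1 - alpha) * a / col_sums[j]
         for j, a in enumerate(row)]
        for i, row in enumerate(abs_w)
    ]


def prune_iteratively(weights, sparsity=0.5, stages=4, alpha=0.5):
    """Zero out the lowest-scoring weights in several stages (assumed linear schedule)."""
    pruned = [list(row) for row in weights]
    total = sum(len(row) for row in pruned)
    for stage in range(1, stages + 1):
        target_zeros = int(total * sparsity * stage / stages)
        scores = importance(pruned, alpha)  # recomputed on the current weights
        # Rank the surviving weights from least to most important.
        alive = sorted(
            (scores[i][j], i, j)
            for i, row in enumerate(pruned)
            for j, w in enumerate(row)
            if w != 0.0
        )
        already_zero = total - len(alive)
        for _, i, j in alive[: max(0, target_zeros - already_zero)]:
            pruned[i][j] = 0.0
    return pruned


weights = [
    [0.9, -0.1, 0.4, 0.05],
    [0.2, 0.8, -0.3, 0.6],
    [-0.7, 0.02, 0.5, -0.15],
    [0.03, 0.45, -0.6, 0.25],
]
pruned = prune_iteratively(weights, sparsity=0.5, stages=4)
zeros = sum(w == 0.0 for row in pruned for w in row)
print(f"{zeros} of 16 weights pruned")  # 8 of 16 weights pruned
```

Because the scores are recomputed after every stage, removing a weight changes the row and column sums seen by its neighbours, so later stages rank the remaining weights differently than a one-shot pass would.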
Create a new conda environment and install dependencies:
conda create -n wip python=3.10
conda activate wip
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
Optional for zero-shot evaluation:
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
Replace YOUR_MODEL_NAME (e.g., "huggingface/llama2-7b") in the commands below.
python main.py \
--model YOUR_MODEL_NAME \
--prune_method wip \
--sparsity_ratio 0.5 \
--sparsity_type unstructured \
--save
python main.py \
--model YOUR_MODEL_NAME \
--prune_method wip \
--sparsity_ratio 0.5 \
--sparsity_type 2:4 \
--save
We measure inference acceleration using GPUs that support semi-structured sparsity (e.g., NVIDIA Ampere and Hopper architectures). Specifically, we leverage TensorRT-LLM to run models with an N:M sparsity pattern. For more details, refer to this issue.
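For readers new to N:M sparsity, the sketch below shows what a 2:4 pattern means: in every group of four consecutive weights, only two are kept. The helper name `nm_mask` is hypothetical, and plain weight magnitude stands in for the WIP importance score; this is an illustration of the pattern, not the repository's implementation.

```python
def nm_mask(scores, n=2, m=4):
    """Return a 0/1 mask keeping the n highest scores in each group of m."""
    if len(scores) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(scores), m):
        group = scores[start:start + m]
        # Indices of the n largest scores within this group.
        keep = set(sorted(range(m), key=lambda k: group[k], reverse=True)[:n])
        mask.extend(1 if k in keep else 0 for k in range(m))
    return mask


row = [0.9, -0.1, 0.4, 0.05, 0.2, 0.8, -0.3, 0.6]
mask = nm_mask([abs(w) for w in row])  # magnitude as a stand-in score
sparse_row = [w if keep else 0.0 for w, keep in zip(row, mask)]
print(mask)  # [1, 0, 1, 0, 0, 1, 0, 1]
```

The fixed two-of-four layout is what lets supporting hardware skip the zeroed weights, which unstructured sparsity at the same ratio generally does not allow.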
This repository extends prior work from SparseGPT, Wanda, and RIA.
For questions or suggestions, feel free to contact us.