$D^2$-MoE: Delta Decompression for MoE-based LLMs Compression
Hao Gu1, Wei Li2, Lujun Li1, Qiyuan Zhu1, Mark Lee2, Shengjie Sun3, Wei Xue1, Yike Guo1
1Hong Kong University of Science and Technology, 2University of Birmingham, 3AISpeech Co
This repository contains the code implementation of D^2-MoE, a framework for compressing Mixture-of-Experts (MoE) based Large Language Models (LLMs) through delta decompression. D^2-MoE reduces the parameter count of MoE LLMs without any additional training.
D^2-MoE decomposes each MoE layer's expert weights into a single shared base weight plus expert-specific delta weights, which removes redundancy across experts and improves compression efficiency while maintaining model performance. The framework lets you trade off compression ratio against accuracy.
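For intuition, the sketch below shows the general shape of the decomposition: a Fisher-weighted average of the experts serves as the shared base weight, and each expert's delta from that base is compressed with a truncated SVD. This is an illustrative sketch only, not the repository's implementation; the function names and the rank-selection rule are assumptions.

# Illustrative sketch: shared Fisher-weighted base + low-rank per-expert deltas.
import torch

def decompose_experts(expert_weights, fisher_scores, delta_ratio):
    # Shared base weight: Fisher-weighted average of all expert weight matrices.
    scores = torch.tensor(fisher_scores, dtype=torch.float32)
    weights = scores / scores.sum()
    base = sum(w * W for w, W in zip(weights, expert_weights))

    # Expert-specific deltas, compressed with a truncated SVD.
    compressed_deltas = []
    for W in expert_weights:
        delta = (W - base).float()
        U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
        m, n = delta.shape
        # Keep enough singular values so the factors store ~delta_ratio of the original parameters.
        rank = max(1, int(delta_ratio * m * n / (m + n)))
        compressed_deltas.append((U[:, :rank] * S[:rank], Vh[:rank, :]))
    return base, compressed_deltas

def reconstruct_expert(base, compressed_delta):
    US, Vh = compressed_delta
    return base + US @ Vh

Under this view, each MoE layer stores one dense (and optionally pruned) base matrix plus small per-expert delta factors, and an expert is reconstructed as base + U S V^T when needed.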
pip install -r requirements.txt
pip install flash-attn==2.5.9.post1
cd lm-evaluation-harness
pip install -e .
bash mixtral.sh
Modify save_path and base_model_path to your own paths in the commands below.
python preprocess/get_expert_freq.py \
--base_model_path=your_model_path \
--save_path=your_save_path \
--model_type=mixtral \
--dataset_name=wikitext \
--split=train \
--seed=42 \
--max_samples=20000
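get_expert_freq.py collects expert activation frequencies on calibration text. A simplified, hypothetical version of that counting (using per-layer router logits exposed by the model) might look like this; it is only meant to convey the idea:

# Hypothetical sketch: count router top-k selections per MoE layer on calibration tokens.
import torch
from collections import Counter

@torch.no_grad()
def count_expert_selections(router_logits_per_layer, top_k=2):
    # router_logits_per_layer: list of [num_tokens, num_experts] tensors, one per MoE layer.
    freqs = []
    for logits in router_logits_per_layer:
        chosen = logits.topk(top_k, dim=-1).indices.flatten().tolist()
        counts = Counter(chosen)
        total = sum(counts.values())
        freqs.append({expert: c / total for expert, c in sorted(counts.items())})
    return freqs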
python preprocess/get_fisher.py \
--base_model_path=your_model_path \
--save_path=your_save_path \
--num_samples=1024 \
--scale_type fisher
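get_fisher.py estimates the Fisher information that is later used as merging weights (--merge_method=fisher). Conceptually, diagonal Fisher information is the expected squared gradient of the loss with respect to each parameter; the minimal sketch below illustrates that idea and is not the repository's implementation:

# Minimal sketch: accumulate squared gradients over calibration batches to estimate
# diagonal Fisher information per parameter (batches must include labels for the LM loss).
import torch

def estimate_fisher(model, calib_batches):
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for batch in calib_batches:
        model.zero_grad()
        loss = model(**batch).loss   # causal LM loss on the calibration batch
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    # Per-expert merging weights can then be obtained by reducing (e.g. summing)
    # the entries belonging to each expert.
    return {n: f / len(calib_batches) for n, f in fisher.items()}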
python preprocess/get_scale.py \
--base_model_path=your_model_path \
--save_path=your_save_path \
--model_type=mixtral \
--dataset_name=wikitext \
--split=train \
--seed=42 \
--max_samples=256
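get_scale.py produces the scaling statistics consumed via --svd_scale_path. These appear to be activation statistics gathered on calibration data, in the spirit of activation-aware SVD, so that the delta SVD can weight input directions by importance. The sketch below is a hypothetical illustration of collecting a Gram matrix of layer inputs with forward hooks, not the script's actual logic:

# Hypothetical sketch: collect per-linear-layer input statistics (X^T X) with forward hooks;
# such a matrix can later scale the delta before the truncated SVD.
import torch

def collect_scale_stats(model, calib_batches, target_linears):
    stats = {name: None for name in target_linears}
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach().reshape(-1, inputs[0].shape[-1]).float()
            gram = x.t() @ x                      # accumulate X^T X
            stats[name] = gram if stats[name] is None else stats[name] + gram
        return hook

    for name, module in model.named_modules():
        if name in target_linears:
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        for batch in calib_batches:
            model(**batch)
    for h in hooks:
        h.remove()
    return stats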
The total compression ratio can be calculated by cal_params.py from pp_ratio and delta_ratio; do not modify --control_name.
Use cal_params.py to get delta_ratio: target_compression_ratio is the final compression ratio you want, and pp_ratio is the pruning ratio applied to the shared base weight for that target. A higher pp_ratio gives faster inference with the compressed model, while a lower pp_ratio (with a correspondingly higher delta_ratio) retains more of the model's abilities.
python cal_params.py \
--model_type mixtral \
--model_path your_mixtral_model_path \
--target_compression_ratio 0.6 \
--pp_ratio 0.2
Run Mixtral:
python D2-mixtral.py \
--control_name=wikitext-2v1_llama-2-7b_clm_20_1024_0.1_ppwandasp_probe-default_sync_c4-2000_0.5+0.05-0.5+0.05-0.5+0.05-0.5+0.05-0.5+0.05-seqrank+bszrank_default \
--base_model_path=your_mixtral_model_path \
--expert_freq_path=your_expert_freq_path \
--fisher_path=your_fisher_path \
--svd_scale_path=your_svd_scale_path \
--result_path=your_result_path \
--pp_ratio=0.2 \
--delta_ratio=0.8 \
--share_ratio=1 \
--merge_method=fisher
Run DeepSeek:
python D2-deepseek.py \
--control_name=wikitext-2v1_llama-2-7b_clm_20_1024_0.1_ppwandasp_probe-default_sync_c4-2000_0.5+0.05-0.5+0.05-0.5+0.05-0.5+0.05-0.5+0.05-seqrank+bszrank_default \
--base_model_path=your_deepseek_model_path \
--expert_freq_path=your_expert_freq_path \
--fisher_path=your_fisher_path \
--svd_scale_path=your_svd_scale_path \
--result_path=your_result_path \
--pp_ratio=0.2 \
--delta_ratio=0.8 \
--share_ratio=1 \
--merge_method=fisher
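After compression, each expert's linear layer can be thought of as the shared (pruned) base weight plus its own low-rank delta factors. The module below is a hypothetical illustration of that layout, kept factored at inference time; the class name and storage choices are assumptions, not the repository's code:

# Hypothetical sketch: an expert linear layer that applies the shared base weight
# plus its own low-rank delta factors, i.e. y = x (base + U V)^T.
import torch
import torch.nn as nn

class DeltaCompressedLinear(nn.Module):
    def __init__(self, base_weight, delta_u, delta_v):
        super().__init__()
        # base_weight [out, in] is shared across experts; store it as a buffer.
        self.register_buffer("base_weight", base_weight)
        self.delta_u = nn.Parameter(delta_u, requires_grad=False)  # [out, r]
        self.delta_v = nn.Parameter(delta_v, requires_grad=False)  # [r, in]

    def forward(self, x):
        # x @ base^T plus the low-rank correction x V^T U^T.
        return x @ self.base_weight.t() + (x @ self.delta_v.t()) @ self.delta_u.t()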
If you find D^2-MoE useful in your research, please consider citing the following paper:
@inproceedings{gu2025,
title={D^2-MoE: Delta Decompression for MoE-based LLMs Compression},
author={Gu, Hao and Li, Wei and Li, Lujun and Zhu, Qiyuan and Lee, Mark and Sun, Shengjie and Xue, Wei and Guo, Yike},
year={2025}
}