Metadata-Version: 2.1
Name: llmtuner
Version: 0.5.2
Summary: Easy-to-use LLM fine-tuning framework
Home-page: https://github.com/hiyouga/LLaMA-Factory
Author: hiyouga
Author-email: hiyouga@buaa.edu.cn
License: Apache 2.0 License
Keywords: LLaMA,BLOOM,Falcon,LLM,ChatGPT,transformer,pytorch,deep learning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch==2.3.1
Requires-Dist: torchaudio==2.3.1
Requires-Dist: torchvision==0.18.1
Requires-Dist: transformers==4.38.1
Requires-Dist: accelerate==0.29.2
Requires-Dist: datasets==2.17.1
Requires-Dist: deepspeed==0.13.3
Requires-Dist: sentencepiece==0.2.0
Requires-Dist: trl==0.7.11
Requires-Dist: peft==0.8.2
Requires-Dist: ninja
Requires-Dist: fastchat

# Demystifying the Compression of Mixture-of-Experts Through a Unified Framework

**[Shwai He](https://shwai-he.github.io/)\*, [Daize Dong](https://daizedong.github.io/)\*, [Liang Ding](https://liamding.cc/), [Ang Li](https://www.ang-li.com/)**

> **This is the official implementation of the paper [Demystifying the Compression of Mixture-of-Experts Through a Unified Framework](https://arxiv.org/abs/2406.02500).** We provide a comprehensive framework for compressing Mixture-of-Experts models. 



## Introduction

The Mixture of Experts (MoE) approach dynamically selects and activates only a subset of experts, significantly reducing computational costs while maintaining high performance. However, MoE introduces potential redundancy (e.g., in parameters) and extra costs (e.g., communication overhead). Since the compression of MoE remains under-explored, we address this gap with a unified framework that integrates mainstream compression methods and helps to understand MoE compression systematically. This framework approaches compression from two perspectives: Expert Slimming, which compresses individual experts, and Expert Trimming, which removes structured modules. Within this framework, we explore the optimization space left unexplored by existing methods and introduce aggressive Expert Trimming techniques, such as Layer Drop and Block Drop, to eliminate redundancy on a larger scale. Based on these insights, we present a comprehensive recipe to guide practitioners in effectively compressing MoE.
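
As a rough illustration of "activating only a subset of experts", the sketch below shows one common formulation of top-k routing in plain Python. It is not taken from this repository: the names `top_k_route` and `moe_forward`, the toy experts, and the choice of `k=2` are all assumptions made for the example.

```python
import math


def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Indices of the k largest router logits (hypothetical helper, not repo code).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))
```

With four toy experts and `k=2`, only two expert functions run per input, which is where the compute saving comes from. All four experts still have to be stored, which is the parameter redundancy that compression targets.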

![unified-view.svg](unified-view.svg)

![unified-view-table.svg](unified-view-table.svg)



## Installation

#### Environment

Create a conda environment and install the pipeline for pruning and Expert Trimming (based on [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)).

```bash
conda create -n moe-compression python=3.10
conda activate moe-compression

git clone git@github.com:DaizeDong/Unified-MoE-Compression.git
cd ./Unified-MoE-Compression
pip install -e .
pip install flash-attn --no-build-isolation
```

Install the pipeline for quantization (based on [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) and [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ)). Make sure to install the packages that match your CUDA version. For more details, refer to the README files in the corresponding folders.

```bash
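# Start from the repository root (Unified-MoE-Compression).
# Install AutoAWQ.
cd ./AutoAWQ
pip install -e .

# Install the AutoAWQ kernels (located at AutoAWQ/AutoAWQ_kernels).
cd ./AutoAWQ_kernels
pip install -e .

# Return to the repository root and install AutoGPTQ.
cd ../../AutoGPTQ
pip install -vvv --no-build-isolation -e .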
```



#### Prepare Models

Download the [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) and [DeepSeek-MoE-16B](https://huggingface.co/deepseek-ai/deepseek-moe-16b-base) models from Hugging Face, and **delete** the following lines from the `config.json` of DeepSeek-MoE-16B.

```json
"auto_map": {
  "AutoConfig": "configuration_deepseek.DeepseekConfig",
  "AutoModel": "modeling_deepseek.DeepseekModel",
  "AutoModelForCausalLM": "modeling_deepseek.DeepseekForCausalLM"
},
```
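
If you prefer not to edit the file by hand, the deletion can be scripted. The following is a minimal sketch that assumes `config.json` is plain JSON with `auto_map` as a top-level key. The helper name `remove_auto_map` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def remove_auto_map(config_path):
    """Delete the top-level "auto_map" entry from a config.json in place.

    Returns True if the key was present and removed, False otherwise.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    removed = config.pop("auto_map", None) is not None
    if removed:
        # Rewrite the file only when something actually changed.
        path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return removed
```

For example, call `remove_auto_map("<your-model-dir>/config.json")`, where `<your-model-dir>` is wherever you downloaded DeepSeek-MoE-16B. Keep a backup of the original file if you also plan to load the model with its upstream remote code.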



## Running Compression

### Expert Slimming

#### Pruning

```bash
bash scripts/compression/pruning/mixtral_prune.sh
bash scripts/compression/pruning/deepseek_prune.sh
bash scripts/compression/pruning/deepseek_prune_noshared.sh
```

#### Quantization

```bash
bash scripts/compression/quantization/awq.sh
bash scripts/compression/quantization/gptq.sh
```



### Expert Trimming

#### Expert Drop

```bash
bash scripts/compression/expert_drop/mixtral_expert_drop.sh
bash scripts/compression/expert_drop/deepseek_expert_drop.sh
```

#### Layer Drop

```bash
bash scripts/compression/layer_drop/mixtral_layer_drop.sh
bash scripts/compression/layer_drop/deepseek_layer_drop.sh
```

#### Block Drop

```bash
bash scripts/compression/block_drop/mixtral_block_drop.sh
bash scripts/compression/block_drop/deepseek_block_drop.sh
```



## Running Evaluation

#### FLOPs & Speed

```bash
bash scripts/evaluation/speedup/measure_flops.sh
bash scripts/evaluation/speedup/measure_speed.sh
```

#### Loss & PPL

```bash
bash scripts/evaluation/loss/mixtral_evaluate.sh
bash scripts/evaluation/loss/deepseek_evaluate.sh
```
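
Loss and PPL are tied by a simple identity: perplexity is the exponential of the mean per-token cross-entropy. The sketch below illustrates that relation, assuming the loss is measured in nats (natural logarithm). The function name `perplexity` is made up for this example and is not part of the evaluation scripts, whose exact outputs are not described here.

```python
import math


def perplexity(token_losses):
    """Perplexity from per-token cross-entropy losses given in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)
```

A model that assigns probability 1/4 to every target token has a per-token loss of `log(4)`, so its perplexity is 4.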

#### Benchmarks

Coming soon. We are still cleaning up the code.

For now, please refer to [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

Remember to use the modeling files in `src/llmtuner/model` to load the [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) and [DeepSeek-MoE-16B](https://huggingface.co/deepseek-ai/deepseek-moe-16b-base) models.



## Citation

```latex
@article{he2024demystifying,
  title={Demystifying the Compression of Mixture-of-Experts Through a Unified Framework},
  author={He, Shwai and Dong, Daize and Ding, Liang and Li, Ang},
  journal={arXiv preprint arXiv:2406.02500},
  year={2024}
}
```



## Contact Us

If you have any questions, please contact:

- Shwai He: shwaihe@umd.edu

- Daize Dong: dzdong2019@gmail.com
