# MoESD: Unveil Speculative Decoding’s Potential for Accelerating Sparse MoE

This repository provides the original data and reproduction instructions for our paper.

## Environment & Platform

### Python Environment

We use `conda` to manage Python environments; the dependencies are listed in `requirements.txt`. You can create a conda environment with the following commands:

```bash
conda create -n moesd python=3.10
conda activate moesd
pip install -r requirements.txt
```

### Models/Datasets Preparation

We run SD (speculative decoding) experiments on both MoE and dense models, using the Qwen2 family as the representative for MoE models and the OPT family for dense models:
- [Qwen2-57B-A14B-Instruct](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct) as the **MoE** target model.
- [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the corresponding draft model.
- [opt-30b](https://huggingface.co/facebook/opt-30b) as the **dense** target model.
- [opt-350m](https://huggingface.co/facebook/opt-350m) as the corresponding draft model.

 
Please first download these models to the local machine, then create symbolic links to them in the `models` directory. The commands are as follows:

**Download Models**:
```bash
huggingface-cli download Qwen/Qwen2-57B-A14B-Instruct --local-dir your_download_path/Qwen2-57B-A14B-Instruct
huggingface-cli download Qwen/Qwen2-0.5B-Instruct --local-dir your_download_path/Qwen2-0.5B-Instruct
huggingface-cli download facebook/opt-30b --local-dir your_download_path/opt-30b
huggingface-cli download facebook/opt-350m --local-dir your_download_path/opt-350m
```

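**Create symbolic links for models**:
```bash
# Use an absolute path for your_download_path: a relative link target is
# resolved relative to ./models/, not the current directory.
ln -s your_download_path/Qwen2-57B-A14B-Instruct ./models/
ln -s your_download_path/Qwen2-0.5B-Instruct ./models/
ln -s your_download_path/opt-30b ./models/
ln -s your_download_path/opt-350m ./models/
```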

We conducted experiments on two datasets:
- [mtbench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts), a conversation dataset.
- [humaneval](https://huggingface.co/datasets/openai/openai_humaneval), a code generation dataset.

Please first download these datasets to the local machine, then create symbolic links in the `datasets` directory pointing to the downloaded paths. The commands are as follows:

**Download Datasets**:
```bash
huggingface-cli download HuggingFaceH4/mt_bench_prompts --repo-type dataset --local-dir your_download_path/mt_bench_prompts
huggingface-cli download openai/openai_humaneval --repo-type dataset --local-dir your_download_path/openai_humaneval
```

**Create symbolic links for datasets**:
```bash
ln -s your_download_path/mt_bench_prompts ./datasets/
ln -s your_download_path/openai_humaneval ./datasets/
```
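
Before launching any experiment, it can help to confirm that every link resolves. The following is a minimal sketch, not part of the repository: `check_links` is a hypothetical helper, and the entry names are assumed to match the download commands above.

```bash
# Hypothetical helper (not shipped with this repository): report whether each
# expected entry under models/ and datasets/ resolves to a real directory.
check_links() {
    root="${1:-.}"      # repository root, defaults to the current directory
    missing=0
    for entry in \
        models/Qwen2-57B-A14B-Instruct \
        models/Qwen2-0.5B-Instruct \
        models/opt-30b \
        models/opt-350m \
        datasets/mt_bench_prompts \
        datasets/openai_humaneval
    do
        # -d follows symbolic links, so a dangling link counts as missing.
        if [ -d "$root/$entry" ]; then
            echo "ok       $entry"
        else
            echo "MISSING  $entry"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_links` from the repository root prints one line per entry and returns a non-zero status if any entry is missing or dangling.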

### Hardware Platform

To run the 57B model without quantization (namely, FP16 precision), at least **2 GPUs with 80GB memory** or **4 GPUs with 40GB memory** are required to store the entire model.
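
As a rough sanity check of this requirement, the weight footprint can be estimated by hand. The sketch below assumes the nominal 57 billion total parameters suggested by the model name and 2 bytes per FP16 parameter; activations, the KV cache, and the draft model come on top of this figure.

```bash
# Back-of-the-envelope estimate only; both inputs are assumptions.
params_billion=57        # nominal total parameter count, in billions
bytes_per_param=2        # FP16 stores each parameter in 2 bytes
weights_gb=$((params_billion * bytes_per_param))
echo "FP16 weights: ~${weights_gb} GB"
echo "2 x 80 GB = $((2 * 80)) GB total, 4 x 40 GB = $((4 * 40)) GB total"
```

Both suggested configurations provide 160 GB in total, which leaves headroom above the estimated 114 GB of weights.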

The experiments in our paper were conducted on 2×A800 and 2×H800 GPUs and took approximately 3 to 5 days to complete, depending on whether the models were stored on an SSD or an HDD.


## Run Experiments

Since our paper does not propose new algorithms, but mainly runs experiments on existing frameworks to validate theoretical analysis, we provide two methods to verify the experimental results:

* Method 1: Run visualization and analysis using **our provided data obtained from our experiments**. Method 1 can run on machines with limited GPU resources.
  
* Method 2: Re-run the complete experiments. First **generate your own data**, and then perform visualization and validation of theoretical analysis based on your generated data. Method 2 requires machines with sufficient GPU resources.

**NOTE:** If the shell scripts (`*.sh`) are not marked executable, you may run `chmod +x ./*.sh` first.

## Method 1 (using our provided data)


### Step 1

Step 1 runs under the `./moe` directory. The results correspond to `Tab. 1` and `Fig. 2(a) & 2(b)` of the paper.

Step 1 shows SD (speculative decoding) results on MoE models.

The original data used by this script is stored in the `csv_results_example*` directories.

After running, the results are saved in the `plot_example` directory.

```bash
sh ./post_process_example.sh
```

### Step 2

Step 2 runs under the `./dense` directory. The results correspond to `Fig. 2(c)` of the paper.

Step 2 shows SD results on dense models.

After running, the results are saved in the `plot_example` directory.

```bash
sh ./post_process_example.sh
```

### Step 3

Step 3 runs under the `./sparsity` directory. The results correspond to `Fig. 3` of the paper.

Step 3 shows SD results on MoE models with different sparsity.

After running, the results are saved in the `plot_example` directory.

```bash
sh ./post_process_example.sh
```

### Step 4

Step 4 runs under the `./modeling` directory. The results correspond to `Alg. 1`, `Fig. 3`, and the results in `Appendix B` of the paper.

Step 4 shows the fitting results and the modeling of SD speedup.

After running, the results are saved in the `plot_example` directory.

```bash
sh ./run_example.sh
```
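
The four steps of Method 1 can also be chained. The following is a minimal sketch rather than a script shipped with the repository: `run_method1` is a hypothetical wrapper, and it assumes the directory and script names listed in the steps above.

```bash
# Hypothetical convenience wrapper (not part of the repository).
run_method1() {
    repo_root="${1:-.}"   # repository root, defaults to the current directory
    for step in \
        "moe post_process_example.sh" \
        "dense post_process_example.sh" \
        "sparsity post_process_example.sh" \
        "modeling run_example.sh"
    do
        dir=${step%% *}      # text before the first space
        script=${step#* }    # text after the first space
        echo "== $dir/$script =="
        # The subshell keeps the caller's working directory unchanged;
        # stop at the first step that fails.
        (cd "$repo_root/$dir" && sh "./$script") || return 1
    done
}
```

Running `run_method1` from the repository root executes each step in its own directory, as the individual steps require.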

## Method 2 (generate your own data)

### Step 1

Step 1 runs under the `./moe` directory.

First run `run.sh` to generate the data, then run `post_process.sh` to produce the visualizations and a summary report.

```bash
CUDA_VISIBLE_DEVICES=0,1 sh ./run.sh
sh ./post_process.sh
```

You may tweak `run.sh` to run with different parameters (e.g., the draft length or the tensor parallelism size). Please refer to the comments in `run.sh` for details.

Running `run.sh` generates the `csv_results` and `log_results` directories, each containing three subdirectories: `prefill`, `sd`, and `ar`.
- `csv_results` contains the performance data and configurations of each run.
- `log_results` records the script's terminal output.
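
A quick way to confirm that a run produced the layout described above is sketched below. `check_results_layout` is a hypothetical helper, not part of the repository, and it checks only that the directories exist, not what they contain.

```bash
# Hypothetical helper (not shipped with this repository): verify that both
# output directories contain the prefill, sd, and ar subdirectories.
check_results_layout() {
    step_dir="${1:-.}"    # e.g. ./moe, defaults to the current directory
    missing=0
    for top in csv_results log_results; do
        for sub in prefill sd ar; do
            if [ ! -d "$step_dir/$top/$sub" ]; then
                echo "missing: $top/$sub"
                missing=1
            fi
        done
    done
    return "$missing"
}
```

It prints nothing and returns zero when all six subdirectories are present.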

Running `post_process.sh` generates the `plot` directory in `./moe/` and the `summary` directory in `./moe/csv_results/`. `plot` contains figures like `Fig. 2(a) & 2(b)` in the paper.

**NOTE**: Logs contain absolute paths. To avoid leaking information during double-blind review, we do not provide an example `log_results` output.


### Step 2

Step 2 runs under the `./dense` directory.

Note: **Step 2 depends on step 1**. Please finish step 1 first.

Step 2 is similar to step 1, but for dense models. It uses the MoE results of step 1 for comparison, hence the dependency.

The instructions for running step 2 are the same as those for step 1. The results are saved in the `plot` directory, as in step 1.

```bash
CUDA_VISIBLE_DEVICES=0,1 sh ./run.sh
sh ./post_process.sh
```

### Step 3

Step 3 runs under the `./sparsity` directory.

Note: **Step 3 depends on step 1**. Please finish step 1 first.

Step 3 shows the speedup of SD on MoE models with different sparsity. It reuses the `activated_experts=8` results of step 1.

The instructions for running step 3 are the same as those for step 1. The results are saved in the `plot` directory, as in step 1.

```bash
CUDA_VISIBLE_DEVICES=0,1 sh ./run.sh
sh ./post_process.sh
```

### Step 4

Step 4 runs under the `./modeling` directory.

Note: **Step 4 depends on step 3**. Please finish step 3 first.

The results are saved in the `plot` directory.

```bash
sh ./run.sh
```
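
Because steps 2 and 3 depend on step 1 and step 4 depends on step 3, the order moe, dense, sparsity, modeling satisfies every dependency. The sketch below chains the steps in that order; `run_method2` is a hypothetical wrapper that is not part of the repository, and the default GPU list `0,1` is taken from the commands above.

```bash
# Hypothetical convenience wrapper (not part of the repository).
run_method2() {
    repo_root="${1:-.}"   # repository root, defaults to the current directory
    gpus="${2:-0,1}"      # value passed as CUDA_VISIBLE_DEVICES
    # Steps 1 to 3: generate data on the GPUs, then post-process.
    for dir in moe dense sparsity; do
        echo "== $dir =="
        (cd "$repo_root/$dir" \
            && CUDA_VISIBLE_DEVICES="$gpus" sh ./run.sh \
            && sh ./post_process.sh) || return 1
    done
    # Step 4: run only after step 3 has succeeded.
    echo "== modeling =="
    (cd "$repo_root/modeling" && sh ./run.sh) || return 1
}
```

The wrapper stops at the first failing step, so a later step never runs on incomplete results from an earlier one.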