# Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

This project aims to develop an improved Mixture of Experts (MoE) framework that:

- automatically determines and adjusts the total number of experts
- automatically determines and varies the number of experts activated for each input token
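
To make the second point concrete, here is a minimal pure-Python sketch of one way a gate can activate a different number of experts per token: score each expert, keep every expert whose score exceeds its threshold, and fall back to the single best expert if none qualifies. This is an illustration only, not the code in this repository. The function names, the cosine-plus-sigmoid scoring, and the fixed threshold values are all assumptions; see `Deepspeed/` and `EMoE/tutel/` for the actual gating logic.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_experts(token, expert_embeddings, thresholds):
    """Hypothetical gate: return indices of experts whose score beats its threshold."""
    scores = [1.0 / (1.0 + math.exp(-cosine(token, e))) for e in expert_embeddings]
    active = [i for i, (s, t) in enumerate(zip(scores, thresholds)) if s > t]
    if not active:
        # Assumed fallback: route to the single highest-scoring expert.
        active = [max(range(len(scores)), key=scores.__getitem__)]
    return active


# Illustrative values only (not taken from any experiment).
experts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
thresholds = [0.6, 0.6, 0.6]

print(select_experts([1.0, 0.0], experts, thresholds))  # one expert active
print(select_experts([1.0, 1.0], experts, thresholds))  # two experts active
```

In this sketch the number of active experts depends on the token itself rather than on a fixed `k`, which is the behavior the bullet above describes.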

## Directory Specification

### Experiment Code

- `EMoE/` contains experiments on language and vision tasks, which use the tutel-based DynMoE.
- `MoE-LLaVA/` contains experiments on language-vision tasks, which use the deepspeed-0.9.5-based DynMoE.

### DynMoE Implementations

- `Deepspeed/` provides DynMoE-Deepspeed implementation.
- `EMoE/tutel/` provides DynMoE-Tutel implementation.

### Miscellaneous

- `Vis/` contains visualization data and code.
- `ContainerConfig/` provides base conda configuration files.

## Environment Setup

Please refer to the instructions under `EMoE/` and `MoE-LLaVA/`.

## Usage

### Tutel Examples

Please refer to `EMoE/Language/README.md` and `EMoE/Language/Vision.md`.

### DeepSpeed Examples

Network configuration:

```python
import deepspeed

# `fc3` (the expert module), `n_e`, and `args` are defined elsewhere
# in the training script.
deepspeed.moe.layer.MoE(
    hidden_size=84,
    expert=fc3,
    num_experts=n_e // 2,
    ep_size=args.ep_world_size,
    use_residual=args.mlp_type == "residual",
    k=-1,  # -1 means using DynMoE
    min_capacity=args.min_capacity,
    noisy_gate_policy=args.noisy_gate_policy,
    max_expert_num=n_e,
)
```

In the training forward pass, you can control the adaptive process with the `if_begin_record_routing` and `if_end_record_routing` arguments:

```python
outputs = model_engine(inputs, if_begin_record_routing=True, if_end_record_routing=True)
```
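
If you want to toggle these flags on a schedule rather than pass them at every step, a small helper can compute them from the step counter. The sketch below is hypothetical: `routing_flags` and `window` are names made up for illustration, and it assumes that `if_begin_record_routing` opens a recording window and `if_end_record_routing` closes it. Check the implementation under `Deepspeed/` for the exact semantics before relying on this.

```python
def routing_flags(step, window):
    """Hypothetical helper: flag the first and last step of each recording window."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    position = step % window
    return {
        "if_begin_record_routing": position == 0,
        "if_end_record_routing": position == window - 1,
    }


for step in range(4):
    flags = routing_flags(step, window=2)
    # In a real training loop (assumed usage):
    # outputs = model_engine(inputs, **flags)
    print(step, flags)
```

With `window=1`, both flags are `True` at every step, which corresponds to the call shown above.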

## Acknowledgement

We are grateful for the following awesome projects:

- [tutel](https://github.com/microsoft/tutel)
- [DeepSpeed](https://github.com/microsoft/DeepSpeed)
- [GMoE](https://github.com/Luodian/Generalizable-Mixture-of-Experts)
- [EMoE](https://github.com/qiuzh20/EMoE)
- [MoE-LLaVA](https://github.com/PKU-YuanGroup/MoE-LLaVA)
- [GLUE-X](https://github.com/YangLinyi/GLUE-X)
