# Sparse MoE with Random Routing as the New Dropout: Training Bigger and Self-Scalable Models 






## Prerequisites

- Set up the environment by installing FastMoE: https://github.com/laekov/fastmoe


## Data Preparation

- Get the enwik8 dataset: `bash getdata.sh`
- Download the SST-2 dataset via: https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e

## Training

```
# Table 1:
# Pretraining Transformer-XL on enwik8:
bash script/table1/smoe_dropout.sh
bash script/table1/directly_dense_training.sh

# Table 2:
# Transferring the pretrained model to SST-2:
bash script/table2/sst2/dense_model.sh [pretrained-checkpoint]
bash script/table2/sst2/smoe_dropout.sh [pretrained-checkpoint]

# Ablation:
bash script/figure5/8layers_smoe_dropout.sh
bash script/figure5/12layers_smoe_dropout.sh
```
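The pretraining and transfer steps above can be chained in a small wrapper. The following is only a sketch: the `CKPT` and `DRY_RUN` variables and the `run` helper are hypothetical names introduced here, and the default checkpoint path is a placeholder, since the actual location depends on where the pretraining script writes its output. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder path (assumption): replace with the checkpoint that
# script/table1/smoe_dropout.sh actually produces in your setup.
CKPT="${CKPT:-path/to/pretrained-checkpoint}"

# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: pretrain Transformer-XL with SMoE-Dropout on enwik8.
run bash script/table1/smoe_dropout.sh

# Step 2: transfer the pretrained model to SST-2.
run bash script/table2/sst2/smoe_dropout.sh "$CKPT"
```

To actually launch both stages, set `DRY_RUN=0` and point `CKPT` at the real checkpoint, for example `DRY_RUN=0 CKPT=[pretrained-checkpoint] bash run_all.sh` (the file name `run_all.sh` is likewise just an example).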

