# Mixture of Parrots: Experts improve memorization more than reasoning

## About

This repository contains the synthetic experiments for the paper *Mixture of Parrots: Experts improve memorization more than reasoning*. It covers two tasks: finding the shortest path in a graph and closed phone-book lookup.


## Step 1: create the dataset

The first step is to create the dataset used to train the models. We first choose the number of training examples, the number of test examples, and the context length.

For the shortest path task, we also need to choose the number of nodes, the probability of placing an edge between two nodes, and whether the graph is directed. For instance, to generate a dataset with 1e6 training examples and 1e3 test examples for finding the shortest path in a directed graph with 50 nodes and p = 0.105, we need a context length of 1350 and can run the following command:

```
python create_datasets.py --num_nodes 50 \
                          --p_edge 0.105 \
                          --directed 1 \
                          --num_examples_train 1000000 \
                          --num_examples_test 1000 \
                          --sequence_length 1350 \
                          --task graph
```
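
To make the roles of `--num_nodes`, `--p_edge`, and `--directed` concrete, here is a minimal sketch of sampling a random graph and finding a shortest path with breadth-first search. It is an illustration only, not the code in `create_datasets.py`; the function names `sample_graph` and `shortest_path` are hypothetical, and the way the repository serializes graphs into token sequences is not shown here.

```python
import random
from collections import deque


def sample_graph(num_nodes, p_edge, directed, rng):
    """Sample a random graph where each candidate edge is kept with probability p_edge."""
    adj = {u: [] for u in range(num_nodes)}
    for u in range(num_nodes):
        for v in range(num_nodes):
            if u == v:
                continue
            if not directed and v < u:
                continue  # undirected: consider each unordered pair only once
            if rng.random() < p_edge:
                adj[u].append(v)
                if not directed:
                    adj[v].append(u)
    return adj


def shortest_path(adj, source, target):
    """Breadth-first search; returns a shortest path as a node list, or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    adj = sample_graph(num_nodes=50, p_edge=0.105, directed=True, rng=rng)
    num_edges = sum(len(neighbors) for neighbors in adj.values())
    # The expected number of directed edges is 50 * 49 * 0.105 = 257.25.
    print("edges:", num_edges)
    print("path 0 -> 49:", shortest_path(adj, 0, 49))
```

With 50 nodes and p = 0.105, a directed graph has about 257 edges on average, which gives a sense of why the graph task needs a much longer context than the phone-book task.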

For the phone-book task, we only need to choose the number of training examples. To generate a phone-book of size 1e6, we run:

```
python create_datasets.py --num_examples_train 1000000 \
                          --num_examples_test 1000 \
                          --sequence_length 16 \
                          --task phone
```
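
As a rough picture of what a phone-book dataset can look like, the sketch below builds a dictionary of unique random names mapped to random numbers and serializes one entry as a short character sequence. The entry format (5-letter names, 8-digit numbers, a `:` separator) and the helper names `make_phone_book` and `to_example` are assumptions made for illustration; the format actually produced by `create_datasets.py` may differ.

```python
import random
import string


def make_phone_book(num_entries, rng, name_len=5, number_len=8):
    """Build a dict of unique random names -> random digit strings (hypothetical format)."""
    book = {}
    while len(book) < num_entries:
        name = "".join(rng.choice(string.ascii_lowercase) for _ in range(name_len))
        if name not in book:
            book[name] = "".join(rng.choice(string.digits) for _ in range(number_len))
    return book


def to_example(name, number):
    """Serialize one entry as a character-level sequence: name, separator, number."""
    return list(name) + [":"] + list(number)


if __name__ == "__main__":
    rng = random.Random(0)
    book = make_phone_book(1000, rng)
    name, number = next(iter(book.items()))
    example = to_example(name, number)
    # 5 name characters + 1 separator + 8 digits = 14 tokens, which fits in 16.
    print(example, len(example))
```

Under this assumed format, every example fits within the sequence length of 16 used above, and answering a query requires recalling the number for a given name from memory rather than reading it from the context.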

These commands create a folder `./datasets` where the datasets are saved.


## Step 2: train a model

To train a dense transformer on the graph dataset generated above, run the following command:

```
python main.py --model dense \
               --hidden_size 1024 \
               --layers 12 \
               --heads 16 \
               --num_epochs 1 \
               --learning_rate 1e-3 \
               --weight_decay 0.1 \
               --train_batch_size 32 \
               --num_nodes 50 \
               --p_edge 0.105 \
               --directed 1 \
               --num_examples_train 1000000 \
               --task graph
```
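
For a sense of scale, the back-of-the-envelope sketch below estimates the number of optimizer steps in one epoch and the parameter count of the transformer blocks for this configuration. It assumes a standard block with four attention projections and a feed-forward layer with a 4x expansion, and it ignores embeddings, biases, and layer norms; the model defined in `main.py` may differ, and the helper names are hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def dense_block_params(hidden_size, layers, ffn_mult=4):
    """Rough weight count of the transformer blocks (assumed standard architecture)."""
    attn = 4 * hidden_size ** 2             # Q, K, V and output projections
    ffn = 2 * ffn_mult * hidden_size ** 2   # up and down projections
    return layers * (attn + ffn)


if __name__ == "__main__":
    print(steps_per_epoch(1_000_000, 32))   # 31250
    print(dense_block_params(1024, 12))     # 150994944, about 151M
```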

To train an MoE on the phone-book dataset generated above, run the following command:

```
python main.py --model sparse \
               --hidden_size 1024 \
               --layers 12 \
               --heads 16 \
               --num_experts 8 \
               --num_epochs 1 \
               --learning_rate 1e-3 \
               --weight_decay 0.1 \
               --train_batch_size 8 \
               --num_examples_train 1000000 \
               --task phone
```
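
An MoE has more total parameters than the parameters it uses for any single token, and the sketch below gives a rough estimate of both for this configuration. It assumes that every layer replaces its feed-forward block with 8 experts of 4x expansion, that a linear router picks 2 experts per token, and that embeddings, biases, and layer norms are ignored. The number of active experts is not specified in this README, so `active_experts=2` is purely an assumption, and `moe_block_params` is a hypothetical helper.

```python
def moe_block_params(hidden_size, layers, num_experts, active_experts, ffn_mult=4):
    """Rough (total, active) weight counts for the blocks of an MoE transformer."""
    attn = 4 * hidden_size ** 2                # Q, K, V and output projections
    expert = 2 * ffn_mult * hidden_size ** 2   # one expert's up and down projections
    router = hidden_size * num_experts         # linear routing layer
    total = layers * (attn + num_experts * expert + router)
    active = layers * (attn + active_experts * expert + router)
    return total, active


if __name__ == "__main__":
    # active_experts=2 is an assumed routing setting, not taken from the repository.
    total, active = moe_block_params(1024, 12, num_experts=8, active_experts=2)
    print(total)   # 855736320, about 856M
    print(active)  # 251756544, about 252M
```

Under these assumptions the model stores roughly 856M block parameters but uses only about 252M per token.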

