# Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory 

The code is derived primarily from three projects: [nanoGPT](https://github.com/karpathy/nanoGPT), [structural-grokking](https://github.com/MurtyShikhar/structural-grokking), and [OpenELM](https://machinelearning.apple.com/research/openelm).


It is designed to run on a standard GPU setup, such as a single NVIDIA V100 GPU.

## Requirements
- Conda 4.12.0
- Python 3.8.10
- PyTorch 2.0.1
- Matplotlib 3.4.3
- NumPy 1.20.3
- transformers 4.18.0
- datasets 2.6.1
- tiktoken 0.1.1
- wandb 0.12.3
- tqdm 4.62.3
- cudatoolkit 11.3.1
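
To compare an existing environment against the versions listed above, a small check along these lines may help. This is a minimal sketch, not part of the repository: the distribution names (for example `torch` for PyTorch) are assumed to match the PyPI names, the helper names `installed_version` and `find_mismatches` are hypothetical, and Conda and cudatoolkit are left out because they are not pip distributions.

```python
from importlib import metadata

# Versions pinned in the list above, keyed by assumed PyPI distribution name.
# Conda and cudatoolkit are omitted: they are not pip distributions.
PINNED = {
    "torch": "2.0.1",
    "matplotlib": "3.4.3",
    "numpy": "1.20.3",
    "transformers": "4.18.0",
    "datasets": "2.6.1",
    "tiktoken": "0.1.1",
    "wandb": "0.12.3",
    "tqdm": "4.62.3",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # PyTorch wheels often report a local suffix such as "2.0.1+cu117";
        # compare only the part before the "+".
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Running the script prints one line per package that is missing or at a different version, and prints nothing when the environment matches.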

## Data Preprocessing
To preprocess OpenWebText, run `python openwebtext.py`.


## Train OpenELM
To train an OpenELM model, run `train_80m.py`.


### Hyperparameters
`train_80m.py` accepts the following optional arguments for training with different hyperparameters:
```
  --batch_size
                  batch size
  --block_size
                  block size (context length)
  --weight_decay
                  weight decay
  --n_layer
                  number of layers
  --n_head
                  number of attention heads
  --n_embd
                  embedding dimension
  --min_lr
                  minimum learning rate
  --device
                  device to train on, e.g. 'cpu' or 'cuda'
```
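
As an illustration of how these flags fit together, the sketch below declares them with `argparse`. It is a hypothetical stand-in rather than the parser used in `train_80m.py`: the flag names come from the list above, while the types and every default value are placeholders chosen only so the example runs.

```python
import argparse


def build_parser():
    """Declare the flags listed above. All defaults are illustrative placeholders."""
    parser = argparse.ArgumentParser(
        description="Hypothetical stand-in for the train_80m.py command line"
    )
    parser.add_argument("--batch_size", type=int, default=12, help="batch size")
    parser.add_argument("--block_size", type=int, default=1024, help="block size")
    parser.add_argument("--weight_decay", type=float, default=0.1, help="weight decay")
    parser.add_argument("--n_layer", type=int, default=12, help="number of layers")
    parser.add_argument("--n_head", type=int, default=12, help="number of attention heads")
    parser.add_argument("--n_embd", type=int, default=768, help="embedding dimension")
    parser.add_argument("--min_lr", type=float, default=6e-5, help="minimum learning rate")
    parser.add_argument("--device", type=str, default="cuda", help="'cpu' or 'cuda'")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--n_layer", "8", "--n_embd", "512", "--device", "cpu"])
print(args.n_layer, args.n_embd, args.device)
```

Flags that are not passed fall back to their defaults, so a run only needs to name the hyperparameters it changes.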



## Train Vanilla Transformers
To train vanilla Transformers, run `train_transformers.py` from [structural-grokking](https://github.com/MurtyShikhar/structural-grokking):
```
# checkpoints saved under /path/to/save/dir
python train_transformers.py --dataset lm --save_dir /path/to/save/dir --encoder_n_layers 6
```  

### Hyperparameters
`train_transformers.py` accepts the following optional arguments for training with different settings:
```
  --dataset
                  training dataset
  --encoder_n_layers
                  number of layers
```
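
To train several depths in a row, the command shown earlier can be generated once per value of `--encoder_n_layers`. The following is a minimal sketch under stated assumptions: the helper `build_command`, the per-depth subdirectory naming, and the depths `2, 4, 6` are hypothetical, and only the flags already shown in this README are used. It prints the commands rather than launching them.

```python
import shlex


def build_command(n_layers, save_root, dataset="lm"):
    """Build the argument list for one train_transformers.py run.

    Each depth gets its own subdirectory under save_root (a naming choice
    made for this example) so checkpoints from different runs stay separate.
    """
    save_dir = f"{save_root}/layers_{n_layers}"
    return [
        "python", "train_transformers.py",
        "--dataset", dataset,
        "--save_dir", save_dir,
        "--encoder_n_layers", str(n_layers),
    ]


# Print one command per depth; the depths here are placeholders.
for depth in (2, 4, 6):
    print(shlex.join(build_command(depth, "/path/to/save/dir")))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` from inside the structural-grokking checkout to launch the runs one after another.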
