# Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer

This repository contains code for running language-modeling and algorithmic tasks to study the trainable components of the transformer.
    
The code is modular and supports multi-GPU distributed training (see the `n_gpu` setting in the examples below).

To launch an experiment, pass the settings as `key=value` assignments placed before the runner script, as in the following commands:
    
```
shaped_attention=mixing weight_frozen=0 llama3=True model_name_or_path=llama n_layer=2 eval_steps=1000 logging_steps=1000 max_steps=300 max_seq_len=256 n_embd=512 n_head=8 per_device_train_batch_size=256 learning_rate=5e-4 task=dyckstream_40 ./run_depth_array.sh
```
    
```
shaped_attention=vanilla master_port=29501 llama3=True model_name_or_path=llama n_layer=4 max_steps=10000 max_seq_len=256 n_embd=512 n_head=8 per_device_train_batch_size=128 job_hours=12 n_gpu=2 learning_rate=8e-4 postfix=alpha task=yelp_polarity activation_cminus=-1 weight_decay=0.07 scripts/run_train_classifier.sh
```
    
```
shaped_attention=mixing llama3=True model_name_or_path=llama n_layer=12 max_steps=80000 max_seq_len=256 n_embd=512 n_head=8 per_device_train_batch_size=128 job_hours=24 n_gpu=4 learning_rate=5e-4 task=wikitext scripts/run_train.sh
```
    
To run Frozen-QK or Frozen-MLP, add `freeze_attention=True` or `freeze_mlp=True`, respectively, to the launch command.
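
When comparing the baseline against Frozen-QK and Frozen-MLP, it can help to generate the launch commands from one shared set of settings so that only the freeze flag differs between runs. The following is a minimal sketch, not part of the repository: the helper `build_launch_command` and the variant names are hypothetical, the base settings are copied from the wikitext example above, and it assumes the freeze flags are passed in the same `key=value` style as the other settings.

```python
import shlex

# Baseline settings copied from the wikitext example in this README.
BASE_SETTINGS = {
    "shaped_attention": "mixing",
    "llama3": "True",
    "model_name_or_path": "llama",
    "n_layer": "12",
    "max_steps": "80000",
    "max_seq_len": "256",
    "n_embd": "512",
    "n_head": "8",
    "per_device_train_batch_size": "128",
    "job_hours": "24",
    "n_gpu": "4",
    "learning_rate": "5e-4",
    "task": "wikitext",
}

# Hypothetical variant names mapped to the flags named in this README.
VARIANTS = {
    "baseline": {},
    "frozen_qk": {"freeze_attention": "True"},
    "frozen_mlp": {"freeze_mlp": "True"},
}


def build_launch_command(variant, script="scripts/run_train.sh"):
    """Return a shell command string of the form `key=value ... script`."""
    # Variant-specific flags are appended after the shared settings.
    settings = {**BASE_SETTINGS, **VARIANTS[variant]}
    assignments = [f"{key}={shlex.quote(value)}" for key, value in settings.items()]
    return " ".join(assignments + [shlex.quote(script)])


if __name__ == "__main__":
    # Print one command per variant; paste into a shell to launch.
    for name in VARIANTS:
        print(build_launch_command(name))
```

Printing the commands rather than executing them keeps the sketch independent of the cluster setup; each printed line can be pasted into a shell as-is.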
        
Note that the code for the algorithmic tasks is based on previous work on the random transformer (Zhong and Andreas), with more systematic runner scripts and some fixes.

Below are examples of additional arguments that can be passed to the training scripts. The complete list can be found in `utils.py`.
    
```
  --shaped_attention SHAPED_ATTENTION, --shaped-attention SHAPED_ATTENTION
                        Can be shaped, mixing, or vanilla. (default: mixing)
  --weight_frozen [WEIGHT_FROZEN], --weight-frozen [WEIGHT_FROZEN]
                        ONLY for the algorithmic tasks. (default: False)
  --do_rope [DO_ROPE], --do-rope [DO_ROPE]
                        Whether to do rope position embedding. (default: True)
  --no_do_rope, --no-do-rope
                        Whether to do rope position embedding. (default: False)
  --freeze_attention [FREEZE_ATTENTION], --freeze-attention [FREEZE_ATTENTION]
  --freeze_mlp [FREEZE_MLP], --freeze-mlp [FREEZE_MLP]
  --plot_attention_maps [PLOT_ATTENTION_MAPS], --plot-attention-maps [PLOT_ATTENTION_MAPS]
```
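
The `[FREEZE_ATTENTION]`-style entries in the listing above indicate boolean flags whose value is optional. The sketch below reproduces that behavior with the standard-library `argparse` module only. It is an illustration under stated assumptions, not the repository's parser: the actual definitions live in `utils.py` and may differ, the helper names `str_to_bool` and `build_parser` are hypothetical, and the `False` defaults for `freeze_attention` and `freeze_mlp` are assumed because the listing does not show them.

```python
import argparse


def str_to_bool(value):
    """Convert common true/false spellings to a bool."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--shaped_attention", "--shaped-attention",
        type=str, default="mixing",
        help="Can be shaped, mixing, or vanilla.",
    )
    # nargs="?" makes the value optional: a bare flag is read as True.
    # The False defaults for the two freeze flags are an assumption.
    for name in ("freeze_attention", "freeze_mlp"):
        parser.add_argument(
            f"--{name}", f"--{name.replace('_', '-')}",
            type=str_to_bool, nargs="?", const=True, default=False,
        )
    parser.add_argument(
        "--do_rope", "--do-rope",
        type=str_to_bool, nargs="?", const=True, default=True,
        help="Whether to do rope position embedding.",
    )
    # Companion switch that turns off a flag whose default is True.
    parser.add_argument(
        "--no_do_rope", "--no-do-rope",
        dest="do_rope", action="store_false",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--freeze_attention", "--no_do_rope"])
    print(args.freeze_attention, args.do_rope)
```

In this sketch, `--freeze_mlp`, `--freeze_mlp True`, and `--freeze-mlp true` are all read as `True`, while `--no_do_rope` and `--do_rope False` both disable RoPE.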