<h1 align='center' style="text-align:center; font-weight:bold; font-size:2.0em;letter-spacing:2.0px;">
               When Attention Sink Emerges in Language Models: An Empirical View </h1>

## Setup
We run all our experiments on A100 GPUs with 40GB memory. To get started, follow these steps:

### Installation
We expect you have CUDA 11.8 installed.
#### Install PyTorch Nightly
```bash
pip install --index-url https://download.pytorch.org/whl/nightly/cu118 --pre 'torch>=2.1.0dev'
```
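
To confirm that the install picked up a new enough build, a quick version check can help. The following is a minimal sketch and not part of this repository: `torch_version_ok` is a hypothetical helper that only compares the major and minor version against 2.1, the minimum named in the pip command above.

```bash
# Hypothetical helper (not part of this repo): succeed if a torch version
# string is at least 2.1, the minimum requested by the pip command above.
torch_version_ok() {
    major=${1%%.*}      # text before the first dot, e.g. "2"
    rest=${1#*.}        # text after the first dot, e.g. "1.0.dev20230902+cu118"
    minor=${rest%%.*}   # text before the next dot, e.g. "1"
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 1 ]; }
}

# Example: check the version reported by the installed package.
# torch_version_ok "$(python -c 'import torch; print(torch.__version__)')" \
#     && echo "torch version OK"
```
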
#### Build XFormers from Source
Note: as of 2023/09/02, xformers does not provide pre-built binaries for torch 2.1. You have to build it from source.
```bash
pip uninstall ninja -y && pip install ninja -U
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
```


#### Install Flash-Attention 2 and Other Fused Operators
```bash
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../../.. && rm -rf flash-attention
```
#### Install Remaining Dependencies
```bash
pip install -r requirements.txt tokenizers sentencepiece
```
Building xformers and flash-attention can take 5 minutes or more. Do not worry if the process appears to stall or the terminal prints many warnings.


### Preprocess

Before training the model, you need to preprocess the data. We provide a script for this, which you can run with the following commands:

```shell
cd preprocess
bash run_preprocess.sh
```

By default, the script first downloads `regmix-data-sample` from Hugging Face and then preprocesses it. The JSONL data is saved in the `preprocess/sail/regmix-data-sample` directory, and the preprocessed data is saved in the `lit_dataset_regmix` directory.
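
After the script finishes, it can be worth confirming that both output directories were actually populated. A minimal sketch follows; `dir_nonempty` and `check_preprocess_outputs` are hypothetical helpers, not part of this repository, and the paths passed to them should be adjusted to wherever the two directories land relative to your working directory.

```bash
# Hypothetical helpers (not part of this repo).

# Succeed if $1 is a directory containing at least one entry.
dir_nonempty() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Check every directory given as an argument; report the ones that fail.
check_preprocess_outputs() {
    status=0
    for dir in "$@"; do
        if ! dir_nonempty "$dir"; then
            echo "missing or empty: $dir" >&2
            status=1
        fi
    done
    return "$status"
}

# Example, assuming both paths are relative to the current directory:
# check_preprocess_outputs preprocess/sail/regmix-data-sample lit_dataset_regmix
```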


## Wandb Integration

By default, we log to wandb to avoid saving a large number of small models and logs on the local machine. To use wandb, create an account on [wandb](https://wandb.ai/site) and get an API key. Then set the following environment variables in `run_default.sh`, `run_kv_bias.sh`, and `run_sigmoid.sh`:

```shell
# wandb project name, entity, and API key
export WANDB_PROJECT=YOUR_PROJECT_NAME
export WANDB_ENTITY=YOUR_WANDB_ENTITY
export WANDB_API_KEY=YOUR_WANDB_API_KEY
```
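
Forgetting to replace one of these placeholders is an easy mistake to make before a long run. As a hedged sketch, a guard like the one below could be added near the top of a launch script; `check_wandb_env` is a hypothetical helper, not part of this repository, and it assumes the placeholders keep the `YOUR_` prefix shown above.

```bash
# Hypothetical helper (not part of this repo): fail if any wandb variable
# is unset, empty, or still holds a YOUR_... placeholder.
check_wandb_env() {
    missing=0
    for name in WANDB_PROJECT WANDB_ENTITY WANDB_API_KEY; do
        eval "value=\${$name:-}"    # indirect lookup of the variable called $name
        case "$value" in
            ""|YOUR_*)
                echo "please set $name" >&2
                missing=1
                ;;
        esac
    done
    return "$missing"
}

# Example: abort a launch script early.
# check_wandb_env || exit 1
```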

## LM Pre-training



### Default setup
```shell
bash scripts/run_default.sh
```

The final checkpoint is located at `checkpoints/tinyllama_60M/iter-020000-ckpt.pth`. 

### KV biases setup
```shell
bash scripts/run_kv_bias.sh $BIAS
```

`$BIAS` can be `kv_head_bias`, `kv_bias`, `k_head_bias`, `k_bias`, `v_head_bias`, or `v_bias`, corresponding to the KV biases, K biases, and V biases (with / without head-sharing patterns) in our main paper.

The final checkpoint is located at `checkpoints/tinyllama_60M_${BIAS}/iter-020000-ckpt.pth`.
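
To train several variants back to back, the script can be wrapped in a loop. The sketch below is hypothetical and not part of this repository: `bias_ckpt_path` and `run_bias_variants` are illustrative names, and `DRY_RUN` is an assumed switch that defaults to printing the commands instead of running them. The checkpoint path follows the pattern stated above.

```bash
# Hypothetical helpers (not part of this repo).

# Print the checkpoint path expected for one bias variant.
bias_ckpt_path() {
    printf 'checkpoints/tinyllama_60M_%s/iter-020000-ckpt.pth\n' "$1"
}

# Train the given variants one after another.
# With DRY_RUN=1 (the default here) the commands are only printed.
run_bias_variants() {
    for bias in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "bash scripts/run_kv_bias.sh $bias"
        else
            bash scripts/run_kv_bias.sh "$bias" || return 1
            # Stop if the final checkpoint did not appear.
            [ -f "$(bias_ckpt_path "$bias")" ] || return 1
        fi
    done
}

# Example: print the commands for two variants without running them.
# run_bias_variants kv_bias v_bias
# Example: actually train them.
# DRY_RUN=0; run_bias_variants kv_bias v_bias
```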

### Pre-train LMs with sigmoid attention (without normalization)
```shell
bash scripts/run_sigmoid.sh
```

The final checkpoint is located at `checkpoints/tinyllama_60M_sigmoid/iter-020000-ckpt.pth`. 


## Attention sink and massive activations evaluation

```shell
python eval_attention_sink.py
```