# scale-invariant attention

## reproducing figures/tables

to reproduce figures/tables (using saved metrics from our runs), install

```
torch
numpy
matplotlib
scipy
```

and run

```
# all figures
cd /path/to/scale_invariant_attention
cd figure1; python ./plot_entropies.py
cd ../figure2_6; python ./plot.py
cd ../figure3_7_8; python ./plot_pretrain.py
cd ../figure4; python ./plot_med_pretrain.py
cd ../figure5; python ./plot.py
```

```
# tables
cd /path/to/scale_invariant_attention
cd llama2/; python table.py # table 4
cd ../table2/; python ./table.py
cd ../figure3_7_8/; python ./infini.py # necessary data for table 3
python ./plot_pretrain.py # same directory as above; produces table 1 as well as figures
```

## reproducing the training runs themselves

for the 80G A100 experiments (`train_gpt2.py`, `ft_nih_gpt2.py`), we used the
following library versions


```
torch==2.5.1
numpy==2.2.2
scipy==1.15.1
```
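
as a quick sanity check before launching a run, the pinned versions above can be compared against the current environment. this is a minimal sketch, not part of the repo: the helper name `check_versions` is hypothetical, and stripping local build tags (e.g. `+cu124`) before comparing is an assumption about how strict the check should be. it only uses `importlib.metadata` from the standard library.

```python
from importlib import metadata

# versions pinned above for the 80G A100 experiments
PINNED = {"torch": "2.5.1", "numpy": "2.2.2", "scipy": "1.15.1"}


def check_versions(pinned):
    """return {package: (wanted, found)} for every package that does not match.

    `found` is None when the package is not installed.
    (hypothetical helper, not part of this repo)
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # assumption: ignore local build tags such as "+cu124" when comparing
        base = found.split("+")[0] if found is not None else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_versions(PINNED).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

an empty result means the environment matches the pinned versions.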

also install the following for logging

```
wandb
```

to download the data, run the scripts in `./data/` with `python`
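
if you would rather not launch each download script by hand, something like the following may help. it is only a sketch under assumptions: the helper names are hypothetical, it assumes every `.py` file directly inside `./data/` is a download script that takes no arguments, that the scripts can run in any order, and that they are meant to be launched from the repository root. check the scripts themselves before relying on it.

```python
import subprocess
import sys
from pathlib import Path


def list_data_scripts(data_dir="./data"):
    """return the .py files directly inside data_dir, sorted by name."""
    return sorted(Path(data_dir).glob("*.py"))


def run_data_scripts(data_dir="./data"):
    """run each script in turn, stopping at the first failure."""
    # assumptions: no command-line arguments are needed, the order does
    # not matter, and the current working directory is the repo root
    for script in list_data_scripts(data_dir):
        print(f"running {script}")
        subprocess.run([sys.executable, str(script)], check=True)
```

calling `run_data_scripts()` from the repository root then runs the scripts one after another.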

and then to actually pretrain, we run (for example)

```
python ./train_gpt2.py --runname=rope_4k_s5 --rope --train_seq_length=4096 --seed=5
```
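
to launch several seeds of the same configuration, the command above can be generated programmatically. this is a minimal sketch: the function name `pretrain_command` and the `tag` parameter are hypothetical, and the `<tag>_s<seed>` run name simply mirrors the example above rather than any naming the training script requires. only flags that appear in the example are used.

```python
import shlex


def pretrain_command(seed, train_seq_length=4096, tag="rope_4k"):
    """build the argv for one pretraining run, mirroring the example above.

    (hypothetical helper; the run-name pattern is an assumption)
    """
    return [
        "python", "./train_gpt2.py",
        f"--runname={tag}_s{seed}",
        "--rope",
        f"--train_seq_length={train_seq_length}",
        f"--seed={seed}",
    ]


if __name__ == "__main__":
    for seed in (1, 2, 3):  # illustrative seeds only
        # print rather than execute, so the commands can be inspected first
        print(shlex.join(pretrain_command(seed)))
```

the printed lines can be pasted into a shell or a job script once they look right.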

the needle-in-a-haystack runs are executed similarly (but here we resume from an
existing checkpoint created by `train_gpt2.py`). for example,

```
python ./ft_nih_gpt2.py --resume_step=4578 --run_id=ntk_aware_4k_s5__4a6e --ctx_len=4096 --seed=50 --runname=ntk_aware_4k_train4k_s50
```

---

to execute the runs with the larger models, use `train_gpt2_dist.py`.
we ran our code in singularity containers, using this image from nvidia:
`nvcr.io/nvidia/pytorch:25.04-py3`.
we executed on a slurm computing cluster, with scripts like [dist_script.sh](gpt2/scripts/dist_script.sh).

---

to execute the llama2 experiments, see the NoPE example submission script: [sbatch_nope_example.sh](llama2/sbatch_nope_example.batch)