This code is modified and expanded from the GitHub repo for "Selective Attention: Enhancing Transformer through Principled Context Control" (https://github.com/umich-sota/selective_attention), which is itself modified from lit-gpt.

```
@misc{lit-gpt-2023,
  author       = {Lightning AI},
  title        = {Lit-GPT},
  howpublished = {\url{https://github.com/Lightning-AI/lit-gpt}},
  year         = {2023},
}

@inproceedings{zhang2024selective,
    title={Selective Attention: Enhancing Transformer through Principled Context Control},
    author={Xuechen Zhang and Xiangyu Chang and Mingchen Li and Amit Roy-Chowdhury and Jiasi Chen and Samet Oymak},
    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year={2024},
    url={https://openreview.net/forum?id=QbqLcwMXfF}
}
```

To run this code, you need a model downloaded into a checkpoints directory and data in a data directory. Download the model by running the download.py script followed by the convert_hf_checkpoint.py script, and prepare the data with the prepare_redpajama.py script.

Using the view_att argument with the model logs the full causal attention matrix to a log file. The logging argument logs the normalised Aitchison distance and the KL divergence to the uniform distribution for the causal attention rows whose lengths are specified within the model code.
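As a rough illustration of the two metrics named above, the following is a minimal sketch and not the code used in this repo. The function names (`causal_row`, `kl_to_uniform`, `aitchison_to_uniform`) and the `eps` clamp are hypothetical, and the normalisation applied by the model code is not reproduced here because it is not described in this README. The sketch assumes each causal row is a probability vector over the tokens visible at that position.

```
import math

def causal_row(att, t):
    """Return the causal part of row t: the first t + 1 entries (hypothetical helper)."""
    return att[t][: t + 1]

def kl_to_uniform(row):
    """KL(p || u), where u is the uniform distribution with the same length as p."""
    n = len(row)
    # p_i * log(p_i / (1/n)) = p_i * log(p_i * n); zero entries contribute 0 by convention.
    return sum(p * math.log(p * n) for p in row if p > 0)

def aitchison_to_uniform(row, eps=1e-12):
    """Unnormalised Aitchison distance between p and the uniform distribution.

    The centred log-ratio (clr) transform of a uniform vector is all zeros,
    so the distance reduces to the Euclidean norm of clr(p).
    eps is an assumed clamp that keeps log() finite for zero entries.
    """
    logs = [math.log(max(p, eps)) for p in row]
    mean_log = sum(logs) / len(logs)
    return math.sqrt(sum((v - mean_log) ** 2 for v in logs))

# Toy 3x3 causal attention matrix (made-up values, lower triangular, rows sum to 1).
att = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.7, 0.2, 0.1],
]
row = causal_row(att, 2)
print(kl_to_uniform(row), aitchison_to_uniform(row))
```

Both quantities are zero for a perfectly uniform row and grow as attention concentrates on fewer tokens.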

To compare the distribution shift between a length-extrapolated model and its base model, download the two models and run seeding_dist.py to log the causal attention matrices, then run ridge_plotting.py on the resulting directories.