

## Requirements

The code depends on Hugging Face `transformers` version 4.43.3.

```bash
transformers==4.43.3
flash-attn==2.6.3
```

## Installation

Install the [PyTorch](https://pytorch.org/) build that matches your CUDA version and platform.

```bash
conda create --name attncomp python=3.12
conda activate attncomp
pip install torch torchvision torchaudio 
pip install -r requirements.txt
python setup.py develop
```

## Quick Start

Add your `model_id` to the config files `eval/LongBench/config/model2path.json` and `eval/LongBench/config/model2maxlen.json`.
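
The two files can be edited by hand. As a convenience, the minimal sketch below adds one entry to each. It assumes both files are flat JSON objects keyed by model name (path in one, maximum context length in the other), as in upstream LongBench; `register_model` and its arguments are hypothetical names and are not part of this repository.

```python
import json
from pathlib import Path


def register_model(config_dir, model_id, model_path, max_len):
    """Add or update one model entry in both config files.

    Assumption: each file is a flat JSON object keyed by model name.
    """
    config_dir = Path(config_dir)
    entries = (
        ("model2path.json", model_path),   # model name -> checkpoint path or hub id
        ("model2maxlen.json", max_len),    # model name -> maximum input length
    )
    for filename, value in entries:
        path = config_dir / filename
        # Start from the existing mapping so other models are preserved.
        mapping = json.loads(path.read_text()) if path.exists() else {}
        mapping[model_id] = value
        path.write_text(json.dumps(mapping, indent=4) + "\n")
```

For example, `register_model("eval/LongBench/config", "my-model", "/path/to/my-model", 16000)` would register a model under the placeholder name `my-model`; the path and length here are illustrative values only.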



### Run the pilot experiments

```bash
python needle_probe.py \
  --model model_id \
  --modified gemfilter \
  --topk 320 \
  --ctx_len 16000
```
Then select the top layer and its top-k heads as the retrieval heads.
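
The selection step can be sketched as follows. This is an illustration under stated assumptions, not the repository's procedure: it assumes the probe results have already been collected into a hypothetical `scores[layer][head]` table (the output format of `needle_probe.py` is not described here), and it assumes layers are ranked by their mean head score. `select_retrieval_heads` is a hypothetical helper name.

```python
def select_retrieval_heads(scores, num_heads=8):
    """Pick the best layer and its highest-scoring heads.

    scores: nested list, scores[layer][head] -> float (hypothetical format).
    Assumption: a layer's quality is the mean of its head scores.
    """
    layer_means = [sum(row) / len(row) for row in scores]
    best_layer = max(range(len(scores)), key=layer_means.__getitem__)

    row = scores[best_layer]
    # Head indices of the chosen layer, highest score first.
    heads = sorted(range(len(row)), key=lambda h: row[h], reverse=True)
    return best_layer, heads[:num_heads]
```

The default of eight heads matches the number of heads reported per model in this README; a different ranking criterion for layers may be more appropriate for your model.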

### Run with our configs
Alternatively, skip the pilot experiments and run the evaluation with the layers, heads, and hyperparameters reported below.
```bash
python eval/LongBench/pred.py \
  --model Llama-3.1-8B-Instruct \
  --modified gemfilter
```

We conducted pilot experiments using the "Needle-in-a-Haystack" benchmark across three popular LLMs: **Llama-3.1-8B-Instruct**, **CodeLlama-7B**, and **Phi-3.5-mini-3.8B-Instruct**.

- For **Llama-3.1-8B-Instruct**, which has 32 layers and 32 heads, the selected layer is 13 and the selected heads are $[18,13,21,8,11,1,4,3]$.
- For **CodeLlama-7B**, also with 32 layers and 32 heads, the selected layer is 14 and the selected heads are $[24,3,18,7,29,2,9,1]$.
- For **Phi-3.5-mini-3.8B-Instruct**, which also has 32 layers and 32 heads, the selected layer is 17 and the selected heads are $[7,17,30,2,6,16,25,18]$.

The hyperparameters used when evaluating heads are the size of the observation window, the pooling operation, and the pooling kernel size. All experiments use average pooling, since the difference between average and max pooling was negligible in our experiments.
For **Llama-3.1-8B-Instruct**, the observation window size is 16 and the pooling kernel size is 32. For **Phi-3.5-mini-3.8B-Instruct**, the observation window size is 4 and the pooling kernel size is 32. A larger kernel typically yields a more contiguous compressed context, so we generally prefer larger kernel sizes.
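
To illustrate why pooling makes the selected context more contiguous, here is a small pure-Python sketch. It is not the repository's implementation: `avg_pool_1d`, `top_k_indices`, and `count_runs` are hypothetical helpers, and the pooling is a simplified same-length variant that truncates windows at the sequence edges.

```python
def avg_pool_1d(scores, kernel_size):
    """Same-length average pooling over per-token scores.

    Simplification: the window spans kernel_size // 2 positions on each
    side of the token and is truncated at the sequence boundaries.
    """
    half = kernel_size // 2
    pooled = []
    for i in range(len(scores)):
        window = scores[max(0, i - half):min(len(scores), i + half + 1)]
        pooled.append(sum(window) / len(window))
    return pooled


def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def count_runs(indices):
    """Number of maximal runs of consecutive indices in a sorted list."""
    return sum(
        1 for j, idx in enumerate(indices)
        if j == 0 or idx != indices[j - 1] + 1
    )
```

Comparing `count_runs(top_k_indices(scores, k))` with `count_runs(top_k_indices(avg_pool_1d(scores, kernel_size), k))` on a score vector shows the effect: an isolated high-scoring token loses weight after pooling, while a cluster of moderately high tokens keeps it, so the selected positions tend to form fewer, longer spans.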

## Acknowledgments

In this project, we utilized the following open-source code and resources:

- [GemFilter](https://github.com/SalesforceAIResearch/GemFilter): Special thanks to the authors for providing this excellent code, which has greatly contributed to our project.

We also appreciate the support and contributions from the community, particularly in troubleshooting and providing feedback.




