
# SALS: Sparse Attention in Latent Space for KV Cache Compression

This repository is the official implementation of [SALS: Sparse Attention in Latent Space for KV Cache Compression]. 



## Requirements

1. Create a new environment and install the dependencies.
```setup
conda create -n SALS python=3.9 -y
conda activate SALS
pip install -r requirements.txt
```

2. Set up the third-party environments.
```
bash set_env.sh
```

## Accuracy Evaluation

Several accuracy evaluation examples are provided in `./example`.

## Speed Evaluation

To evaluate the latency of the attention operator:
```
python benchmark.py
```
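
This README does not describe what `benchmark.py` does internally. As a rough illustration of what an operator latency measurement usually involves (discarded warm-up runs, repeated timed runs, a robust summary statistic), here is a minimal sketch. It is not the repository's benchmark: `measure_latency_ms` and `toy_workload` are hypothetical names, and the workload is a CPU stand-in for an attention call. For GPU kernels, which launch asynchronously, the device would also need to be synchronized before each clock read.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, repeats=10):
    """Return the median wall-clock latency of fn() in milliseconds.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Warm-up runs are discarded so one-time costs (caches, lazy
    # initialization) do not skew the measured samples.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to occasional slow outliers than the mean.
    return statistics.median(samples)


def toy_workload():
    # Assumed placeholder for the operator under test.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"median latency: {measure_latency_ms(toy_workload):.3f} ms")
```

Reporting the median over several runs keeps a single slow iteration from dominating the result.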

To evaluate end-to-end throughput at a 4k length (4096 new tokens):
```
python gpt_fast/generate.py --max_new_tokens 4096
```


## Acknowledgement

We appreciate the following works for their valuable code and data:

https://github.com/THUDM/LongBench

https://github.com/EleutherAI/lm-evaluation-harness

https://github.com/jy-yuan/KIVI

https://github.com/wuhuaijin/HShare