# NSNQuant
This is an official PyTorch implementation of the paper **NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache**.

Here, we provide some sample commands to reproduce the results from our paper.
For more detailed options, please refer to the config files located in `src/conf`.
Some parts of the code are adapted from [Palu](https://github.com/shadowpa0327/Palu), [KIVI](https://github.com/jy-yuan/KIVI), and [KVQuant](https://github.com/SqueezeAILab/KVQuant).

The centroids for CQ and the nuq tables for KVQuant are not included in this repository because of their size.
To reproduce the corresponding results, download the [centroids](https://drive.google.com/file/d/1H1QUcMKwzNoXf9MFWwovC2sDjsJHOGvv/view?usp=sharing) and [nuq tables](https://drive.google.com/file/d/10p9r98O92A3lmPof7yCUNYvSNrM0-VtR/view?usp=sharing) separately.


## Setup
Two separate environments are required, one for `lm_eval` and one for all other experiments, because of compatibility issues with `hydra-core`.
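```shell
# General setup (run from the repository root)
conda create -n NSNQuant python=3.10
conda activate NSNQuant
pip install -r requirements.txt
python setup.py install
cd 3rdparty
cd fast-hadamard-transform
pip install -e .
# Return to the repository root before setting up the second environment
cd ../..

# Setup for lm_eval (run from the repository root)
conda create -n lm_eval python=3.10
conda activate lm_eval
pip install -r requirements-lm-eval.txt
python setup.py install
cd 3rdparty
cd fast-hadamard-transform
pip install -e .
cd ../lm-evaluation-harness
pip install -e .
# Quote the extras specifier so shells such as zsh do not expand the brackets
pip install -e ".[math]"
# Return to the repository root
cd ../..
```
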
## PPL evaluation
```shell
# Evaluate PPL when applying quantization in the forward pass.
python run_eval_ppl.py model_name_or_path=your_path_to_model quantizer=your_quantizer

# Evaluate PPL in the generation scenario
python run_eval_ppl_generative.py model_name_or_path=your_path_to_model quantizer=your_quantizer
```
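
For readers unfamiliar with the metric, the following is a minimal sketch of how perplexity is conventionally defined: the exponential of the mean negative log-likelihood per token. It is not taken from `run_eval_ppl.py`; the function name `perplexity` and the toy inputs are assumptions made for illustration only.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities.

    `perplexity` is a hypothetical helper, not a function from this repository.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check with toy values: a model that assigns uniform probability
# over a vocabulary of size V has perplexity V.
vocab_size = 8
uniform_log_probs = [math.log(1.0 / vocab_size)] * 5
ppl_uniform = perplexity(uniform_log_probs)
```

Lower perplexity means the model assigns higher probability to the reference tokens, so a quantizer that preserves model quality should keep perplexity close to that of the unquantized model.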

## LongBench evaluation
```shell
# Generate predictions
python run_longbench_pred.py model=your_model_name quantizer=your_quantizer

# Run evaluation script to compute metrics based on predictions
python run_longbench_eval.py --model {your_model_name}_{quantizer_postfix}
```

## LM-Evaluation-Harness
```shell
python run_lm_eval.py --task your_task --model_name_or_path your_path_to_model --quantizer your_quantizer
```

## Codebook tuning
```shell
# 2-bit codebook
python generate_codebook.py --method learned --output_path /path/to/your/custom/codebook --abs --save

# 1-bit codebook
python generate_codebook.py --method learned --output_path /path/to/your/custom/codebook --save
```
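
As background on what a codebook is used for, here is a generic sketch of vector quantization: each vector is replaced by the index of its nearest codebook entry and reconstructed by looking that index up. This is a textbook illustration under assumed toy values, not the procedure implemented in `generate_codebook.py`; the helpers `encode` and `decode` and the example codebook are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def encode(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (hypothetical helper)."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(vec, codebook[i]))
        for vec in vectors
    ]


def decode(indices, codebook):
    """Reconstruct vectors by looking up codebook entries (hypothetical helper)."""
    return [codebook[i] for i in indices]


# Toy setting (assumed, not NSNQuant's configuration): 2-dimensional vectors
# at 1 bit per element give 2 bits per vector, hence 2**2 = 4 codebook entries.
codebook = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
vectors = [(0.9, -1.2), (-0.3, 0.4)]

indices = encode(vectors, codebook)
reconstructed = decode(indices, codebook)
```

In general, quantizing vectors of dimension `d` at `b` bits per element leaves `b * d` bits for each index, so the codebook holds `2 ** (b * d)` entries.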

