# Prune-then-Quantize or Quantize-then-Prune?

This project is a PyTorch implementation of **"Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression"**.
The paper proposes the Progressive Intensity Hypothesis, which posits that neural networks compressed by multiple methods perform better when weaker perturbations are applied first and stronger ones later.

![The Progressive Intensity Hypothesis](./images/hypothesis.jpg)
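
As a rough illustration of what "compression order" means, the sketch below applies magnitude pruning and round-to-nearest quantization to a random weight vector in both orders. It is a toy example under stated assumptions, not the SparseGPT/QuaRot pipeline used in this repository: the helper names (`magnitude_prune`, `quantize`, `mse`), the 50% sparsity, and the 4-bit width are all hypothetical choices made for illustration.

```python
import random


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    # Indices of the k smallest-magnitude weights
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize(weights, bits):
    """Symmetric round-to-nearest uniform quantization (assumes bits >= 2)."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(1024)]

# The same two compressors, composed in opposite orders
prune_then_quant = quantize(magnitude_prune(w, 0.5), 4)
quant_then_prune = magnitude_prune(quantize(w, 4), 0.5)

print("prune -> quantize MSE:", mse(w, prune_then_quant))
print("quantize -> prune MSE:", mse(w, quant_then_prune))
```

Weight reconstruction error on random data is not model accuracy, so this snippet does not test the hypothesis. It only shows that the two orders are different compositions of the same operations.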


## Prerequisites

Our implementation is based on the PyTorch, accelerate, and transformers libraries.

- Python 3.9.12
- PyTorch 2.2.1
- accelerate 0.27.2
- transformers 4.56.2

We include `requirements.txt`, which lists all the packages used in our experiments.
We verified the dependencies on a workstation with an Intel Xeon Gold 6338 CPU, an NVIDIA A100 80GB GPU, and CUDA 12.1.
Install the required packages with the following commands:

```shell
pip install torch==2.2.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
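
After installation, a quick check can confirm that the installed versions match the ones listed above. The following is a minimal sketch using only the standard library; the script and the `find_mismatches` helper are hypothetical and not part of this repository.

```python
from importlib import metadata

# Versions taken from the Prerequisites list above
EXPECTED = {
    "torch": "2.2.1",
    "accelerate": "0.27.2",
    "transformers": "4.56.2",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for packages that differ.

    `installed` is None when the package is not installed.
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            # Drop local build tags such as "+cu121" before comparing
            have = get_version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    for name, (want, have) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {want}, found {have}")
```

If the script prints nothing, the three packages match the versions we tested with.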

## Usage
To run the code, execute `src/main.py`:

```shell
cd src
python main.py
```

We also include `src/scripts/run.sh`, an example script that runs SparseGPT and QuaRot:

```shell
cd src/scripts
bash run.sh
```

To run with different settings, modify the arguments passed into `src/main.py`.

## Code Description

This repository builds on the code from **OPTQ** (ICLR '23) \[[GitHub](https://github.com/IST-DASLab/gptq)\], **SLEB** (ICML '24) \[[GitHub](https://github.com/jiwonsong-dev/SLEB)\], and **QuaRot** (NeurIPS '24) \[[GitHub](https://github.com/spcl/QuaRot)\].