# DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs (NeurIPS 2025)

We propose DuoGPT, a training-free pruning framework that integrates activation sparsity into the Optimal Brain Compression (OBC) framework to enable efficient dual-sparse LLM inference with state-of-the-art accuracy–efficiency trade-offs and scalability.

<h1 align="center">   
    <img src="./img/intro.png" width="1000">  
</h1>  

Code implementation of DuoGPT (NeurIPS 2025)

Paper link: https://arxiv.org/abs/2506.20194

### News
- [09/2025] DuoGPT is accepted to NeurIPS 2025.

### Abstract
Large language models (LLMs) deliver strong performance but are difficult to deploy due to high memory and compute costs. While pruning reduces these demands, most methods ignore activation sparsity observed at runtime. We reinterpret activation sparsity as dynamic structured weight sparsity and propose **DuoGPT**, a unified framework that constructs dual-sparse (spMspV) workloads by combining unstructured weight pruning with activation sparsity. To preserve accuracy, we extend the Optimal Brain Compression (OBC) framework with activation-aware calibration and introduce output residuals from the dense model as correction terms. We further optimize the solution for efficient GPU execution, enabling scalability to billion-parameter LLMs. Evaluations on LLaMA-2 and LLaMA-3 show that **DuoGPT** outperforms state-of-the-art structured pruning methods by up to 9.17\% accuracy at an iso-speedup of 1.39x compared to the baseline dense model.
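To make the dual-sparse (spMspV) idea concrete, here is a minimal pure-Python sketch. It shows that a zero activation lets the whole matching weight column be skipped for that input, which is the "dynamic structured weight sparsity" view, and that the remaining columns still benefit from unstructured weight sparsity. Note that `magnitude_sparsify` and `dual_sparse_matvec` are hypothetical helpers written for this illustration, and magnitude-based thresholding is only a stand-in: it is not DuoGPT's OBC-based pruning or calibration procedure.

```python
import random

def matvec(W, x):
    """Dense reference: y = W @ x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def magnitude_sparsify(values, sparsity):
    """Zero the smallest-magnitude entries (illustrative stand-in, not DuoGPT's method)."""
    k = int(len(values) * sparsity)
    drop = set(sorted(range(len(values)), key=lambda i: abs(values[i]))[:k])
    return [0.0 if i in drop else v for i, v in enumerate(values)]

def dual_sparse_matvec(W, x):
    """Sparse-matrix x sparse-vector product that counts the multiplies actually done."""
    # A zero activation x[j] removes column j of W for this input.
    active_cols = [j for j, xj in enumerate(x) if xj != 0.0]
    out, macs = [], 0
    for row in W:
        acc = 0.0
        for j in active_cols:
            if row[j] != 0.0:  # unstructured weight sparsity inside active columns
                acc += row[j] * x[j]
                macs += 1
        out.append(acc)
    return out, macs

random.seed(0)
rows, cols = 8, 16
W = [magnitude_sparsify([random.gauss(0, 1) for _ in range(cols)], 0.5) for _ in range(rows)]
x = magnitude_sparsify([random.gauss(0, 1) for _ in range(cols)], 0.5)

dense_out = matvec(W, x)
sparse_out, macs = dual_sparse_matvec(W, x)
print(f"multiplies: {macs} of {rows * cols} dense")
```

The sparse result matches the dense one because only multiplications by zero are skipped. With 50% weight sparsity and 50% activation sparsity, at most half of the columns are visited, and within those only the surviving weights contribute.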

### Environment / Dependencies
Below are the dependencies for running the DuoGPT algorithm (under the `/fake_prune` directory).

- lm-evaluation-harness version: **0.4.0** (commit `c9bbec6`)  
- Python version: **3.9.19**
- PyTorch version: **2.4.0**  
- transformers version: **4.49.0**

All other required packages are listed in ``/fake_prune/requirements.txt``.
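If you want to confirm that your environment matches the pinned versions above, a small check like the following may help. This is a sketch rather than part of the repo: `check_versions` is a hypothetical helper, and the distribution name `lm_eval` for lm-evaluation-harness is an assumption that may differ depending on how the package was installed.

```python
import sys
from importlib import metadata

# Versions pinned in this README; "lm_eval" as the distribution name is an assumption.
PINNED = {"torch": "2.4.0", "transformers": "4.49.0", "lm_eval": "0.4.0"}

def check_versions(pinned):
    """Return a {package: status} report comparing installed versions to the pins."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore local build suffixes such as "+cu121" when comparing.
        base = found.split("+")[0]
        report[name] = "ok" if base == wanted else f"mismatch (found {found})"
    return report

if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name, status in check_versions(PINNED).items():
        print(name, status)
```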

### Install
1. Clone the repo and navigate to DuoGPT:
```
git clone https://github.com/RuokaiYin/DuoGPT.git
cd DuoGPT
```

2. Set up the environment:

```
conda create -n duogpt python=3.9.19
conda activate duogpt
cd fake_prune
pip install -r requirements.txt
```

### Basic Usage

For a quick start, we provide an example script that prunes LLaMA-2-7B to 50\% dual sparsity and evaluates both perplexity (PPL) and standard zero-shot downstream tasks.

1. Navigate to the `fake_prune` directory if you are not already there: ``cd fake_prune``

2. (Optional) Set the cache path if you have limited storage space in your local directory. The relevant locations are `utils.py: line 126`, `data_utils.py: line 5`, and `model_utils.py: line 17`.

3. Run the example script: ``./run_example.sh``. Grant execute permission to the script if required.

4. (Optional) Pass the ``--save_ckpt`` flag if you want to store the checkpoint. This flag requires ``--wandb_name``, which is used as the checkpoint's name. We recommend setting the checkpoint path as well (``utils.py: line 120``). Using Weights & Biases to track experiments is also highly recommended.
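For readers unfamiliar with the PPL metric mentioned above, the sketch below shows the standard definition: perplexity is the exponential of the mean per-token negative log-likelihood. The `perplexity` function and the sample values are hypothetical and for illustration only; they are not taken from the repo's evaluation code.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), with NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical example: a model that assigns probability 1/4 to every token
# has a perplexity of about 4 (lower is better).
uniform_over_4 = [math.log(4)] * 10
print(perplexity(uniform_over_4))
```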


We re-tested the code before release to ensure that the main results are reproducible. Small deviations can still occur due to numerical variation, hardware precision, and dataset version differences, so please treat the current repo as a proof of concept for dual-sparse LLM workloads.

We will keep adding more running examples.

### Acknowledgements

DuoGPT's method of using asymmetric calibration error to compensate for activation sparsity is inspired by its cousin quantization project:

GPTAQ: Efficient Finetuning-Free Quantization with Asymmetric Calibration (ICML 2025)

Repo link: https://github.com/Intelligent-Computing-Lab-Panda/GPTAQ

Paper link: https://arxiv.org/abs/2504.02692

Huge thanks to my coauthor Yuhang Li (@yhhhli) for his contribution to the DuoGPT project.


Our repository contains code adapted from several other great repositories:

https://github.com/IST-DASLab/sparsegpt

https://github.com/locuslab/wanda

https://github.com/FasterDecoding/TEAL



### Contact and Citations
Ruokai Yin (ruokai.yin@yale.edu)

If you find our work useful, please consider starring the repo and citing our paper:

```bibtex 
@article{yin2025duogpt,
  title={DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs},
  author={Yin, Ruokai and Li, Yuhang and Lee, Donghyun and Panda, Priyadarshini},
  year={2025},
  journal={arXiv preprint arXiv:2506.20194}
}
```
