RECAP: Training-Free Compensation for Coarse Activation Channel Pruning in Compressed LLMs

Published: 21 May 2025, Last Modified: 17 Jun 2025
Venue: MLArchSys 2025 Oral
License: CC BY 4.0
Presentation: In-Person
Keywords: Sparsity, Error Compensation, LLM, Efficient AI, Channel Pruning
Presenter Full Name: Mingyu Lee
TL;DR: RECAP is a training-free method that recovers up to 34% accuracy in channel-pruned LLMs by reusing pruning statistics for lightweight error compensation, enabling hardware-friendly compression without sacrificing model quality.
Presenter Email: mlee864@gatech.edu
Abstract: Sparsity is a key enabler for efficient inference in large language models (LLMs). While a wide spectrum of sparsification techniques, from unstructured to highly structured, has been explored to reduce computational overhead, these techniques often involve trade-offs between hardware efficiency and model accuracy. Channel sparsity, in particular, is appealing due to its hardware-friendly structure compared to alternatives like structured N:M sparsity, but it suffers from notable accuracy degradation, especially when applied to activations. To bridge this gap, we propose RECAP, a lightweight, training-free compensation method that mitigates the errors induced by channel pruning. RECAP uses the statistics of each pruned channel as a representation of the sparsity-induced error and transfers them to the corresponding weights to compensate for the channel's removal. Extensive experiments across diverse LLM families and benchmarks demonstrate that RECAP outperforms existing alternatives at all sparsity levels. On LLaMA3-8B, RECAP achieves approximately a 34% improvement in 0-shot BoolQ benchmark accuracy at a target sparsity ratio of 70%.
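To illustrate the general idea of statistics-based compensation described in the abstract, the sketch below prunes activation channels of a linear layer and folds the mean of each pruned channel into the layer bias. The function name `prune_and_compensate`, the mean-absolute-magnitude pruning criterion, and the bias-folding step are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def prune_and_compensate(weight, bias, calib_acts, sparsity=0.7):
    """Illustrative sketch (not the authors' exact method): prune activation
    channels of a linear layer y = x @ weight.T + bias, then reuse a pruning
    statistic (the mean of each pruned channel over calibration data) to
    adjust the bias so the expected output is preserved.

    weight:     (out_features, in_features)
    bias:       (out_features,)
    calib_acts: (num_tokens, in_features) calibration activations
    """
    in_features = weight.shape[1]
    num_prune = int(sparsity * in_features)

    # Rank input channels by mean absolute activation (one possible criterion).
    importance = calib_acts.abs().mean(dim=0)
    pruned = torch.argsort(importance)[:num_prune]      # indices to drop

    # Statistics of the pruned channels, reused for compensation.
    mean_pruned = calib_acts[:, pruned].mean(dim=0)     # (num_prune,)

    # Dropping channel j removes W[:, j] * x_j from the output; transferring
    # E[x_j] into the bias cancels this error in expectation.
    bias = bias + weight[:, pruned] @ mean_pruned

    # Zero the corresponding weight columns; the pruned activations are skipped.
    weight = weight.clone()
    weight[:, pruned] = 0.0
    return weight, bias, pruned
```

Under this sketch, the compensation is exact for the mean of the calibration distribution and requires no retraining, which is the hardware-friendly property the abstract highlights.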
Presenter Bio: Mingyu Lee is an undergraduate researcher at the Synergy Lab, Georgia Institute of Technology, advised by Prof. Tushar Krishna. His research focuses on accelerating AI applications through hardware/software co-design for efficient and practical real-world deployment.
Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.
YouTube Link: https://youtu.be/7xnN1NuxD7Q
YouTube Link Poster: https://youtu.be/cdB1QR_CbxM
Google Slides: https://docs.google.com/presentation/d/1iNCQBLedMH5OKLX9gA8nPRx3iDFZOhTbyKrlZR11G9E/edit?usp=sharing
Poster: Yes
Workshop Registration: Yes, the presenter has registered for the workshop.
YouTube Link Short: TBA
Submission Number: 3