MoDeGPT: Modular Decomposition for Large Language Model Compression

Published: 22 Jan 2025, Last Modified: 11 Feb 2025 · ICLR 2025 Oral · CC BY 4.0
Keywords: LLM, model compression, matrix decomposition
TL;DR: A framework that expands matrix decomposition for LLM compression beyond SVD
Abstract: Large Language Models (LLMs) have significantly advanced AI with their exceptional performance across a wide range of tasks. However, their extensive computational requirements restrict their use on devices with limited resources. While recent compression methods based on low-rank matrices offer potential solutions, they often suffer from significant loss of accuracy or introduce substantial overhead in parameters and inference time. In this paper, we introduce Modular Decomposition (MoDeGPT), a new, efficient, and structured compression framework that overcomes these limitations. MoDeGPT jointly decomposes pairs of consecutive subcomponents within Transformer blocks, reduces hidden dimensions through output reconstruction on a larger structural scale than conventional low-rank methods, and repurposes three classical matrix decomposition algorithms—Nyström approximation, CR decomposition, and SVD—to ensure bounded errors in our novel decomposition approach. Our experiments show that MoDeGPT, without relying on backward propagation, consistently matches or surpasses the performance of prior techniques that depend on gradient information, while achieving a 98% reduction in compute costs when compressing a 13B-parameter model. On LLaMA-2/3 and OPT models, MoDeGPT retains 90-95% of zero-shot performance with compression rates of 25-30%. The compression process can be completed on a single GPU in a few hours, boosting inference throughput by up to 46%.
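To make the joint-decomposition idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of compressing one pair of consecutive weight matrices by truncating the SVD of their composite map—one of the three decompositions the abstract names. All function names, shapes, and the simplification of ignoring the nonlinearity between the two matrices are illustrative assumptions; the actual method performs output reconstruction at the module level.

```python
import numpy as np

def joint_svd_compress(W_in, W_out, rank):
    """Compress a consecutive pair (W_out @ W_in) to a smaller shared width.

    W_in  : (m, d) first matrix of the pair (e.g. an MLP up-projection)
    W_out : (d, m) second matrix of the pair (e.g. the matching down-projection)
    rank  : target intermediate width (< m)
    """
    # Composite linear map the pair implements; the real method reconstructs
    # module outputs instead, which also accounts for the nonlinearity in between.
    M = W_out @ W_in                                    # (d, d)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)    # SVD of the composite map
    U_k, S_k, Vt_k = U[:, :rank], S[:rank], Vt[:rank, :]
    # Split the rank-k factorization back into two consecutive matrices whose
    # shared hidden dimension is now `rank` instead of the original width m.
    return Vt_k, U_k * S_k                              # shapes (rank, d), (d, rank)

# Toy usage: shrink a 256 -> 1024 -> 256 pair down to an intermediate width of 64.
d, m, k = 256, 1024, 64
rng = np.random.default_rng(0)
W_in = rng.standard_normal((m, d)) * 0.02
W_out = rng.standard_normal((d, m)) * 0.02
W_in_c, W_out_c = joint_svd_compress(W_in, W_out, k)
rel_err = np.linalg.norm(W_out @ W_in - W_out_c @ W_in_c) / np.linalg.norm(W_out @ W_in)
print(f"relative error of the compressed pair: {rel_err:.3f}")
```

One design point this sketch illustrates: because only the shared dimension between the two matrices shrinks, the block's input and output interfaces are unchanged, which is consistent with the abstract's claim that MoDeGPT avoids the extra parameters and inference-time overhead that per-matrix low-rank factorization can introduce.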
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9191