Keywords: compression, llm
TL;DR: Projected Compression, a method that compresses Transformers using trainable projection modules over frozen base weights.
Abstract: Large language models have steadily increased in size to achieve improved performance; however, this growth has also led to greater inference time and computational demands. Consequently, there is rising interest in model size reduction methods. To address this issue, we propose \textbf{Projected Compression}, a novel model compression technique that reduces the size of model weights by utilizing projection modules. Specifically, we first train additional projection weights while preserving access to all the original model parameters. Subsequently, these projections are combined into a lower-dimensional product matrix, resulting in a reduced-size standard Transformer-based model. Unlike alternative approaches that require additional computational overhead, our method matches the per-token computation cost of training a compressed model. Experimental results show that, compared to other compression methods, Projected Compression performs especially well as the compression rate increases, up to rates as high as 90\%.
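To make the abstract's mechanism concrete, here is a minimal sketch of the projection idea described above: a frozen base weight is flanked by small trainable projections whose product yields a lower-dimensional weight that can later be folded into a standard smaller layer. The class name `ProjectedLinear`, the parameter names `P_in`/`P_out`, the `fold` helper, and the exact placement of the projections are illustrative assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn as nn

class ProjectedLinear(nn.Module):
    """Hypothetical sketch of Projected Compression for one linear layer:
    the original weight W stays frozen, only the projection modules train,
    and their product is a lower-dimensional replacement weight."""

    def __init__(self, base_weight: torch.Tensor, d_in_small: int, d_out_small: int):
        super().__init__()
        d_out, d_in = base_weight.shape
        # Original model weights remain accessible but are never updated.
        self.register_buffer("W", base_weight)
        # Trainable projection modules (assumed to act on both dimensions).
        self.P_in = nn.Parameter(torch.randn(d_in, d_in_small) / d_in ** 0.5)
        self.P_out = nn.Parameter(torch.randn(d_out_small, d_out) / d_out ** 0.5)

    def product_matrix(self) -> torch.Tensor:
        # Lower-dimensional product matrix of shape (d_out_small, d_in_small).
        return self.P_out @ self.W @ self.P_in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_in_small) -> (..., d_out_small)
        return x @ self.product_matrix().t()

    def fold(self) -> nn.Linear:
        # After training, bake the product into a standard smaller Linear,
        # yielding a reduced-size layer with no extra modules at inference.
        W_small = self.product_matrix().detach()
        layer = nn.Linear(W_small.shape[1], W_small.shape[0], bias=False)
        layer.weight.data.copy_(W_small)
        return layer
```

In this reading, the projection product is computed once per layer per forward pass, so the per-token cost during training is close to that of a directly trained compressed layer, consistent with the abstract's claim; how the paper actually instantiates the projections may differ.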
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5914