TensorGPT: Efficient Compression of Large Language Models based on Tensor-Train Decomposition

25 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: model compression, low-rank factorization, tensor decomposition
TL;DR: Language model compression based on low-rank factorization for low-end devices.
Abstract: Small Language Models (SLMs, or on-device LMs) are the counterpart of Large Language Models (LLMs): they have significantly fewer parameters and are typically deployed on low-end devices such as mobile phones and single-board computers (e.g. the Raspberry Pi). Unlike LLMs, which exploit increasing model size for better generalization, SLMs are expected to adapt to changes in their exact deployment environment. Furthermore, most edge applications have battery-life concerns, which are not a consideration for GPU servers in data centres. Targeting these two issues, this paper focuses on token embedding compression for adaptivity and low energy requirements in edge applications. We propose a training-free model compression approach based on the Tensor-Train Decomposition (TTD), whereby each pre-trained token embedding vector is converted into a lower-dimensional Matrix Product State (MPS). We then comprehensively investigate the low-rank structures extracted by this approach in terms of compression ratio, language task performance, latency and energy consumption on a typical low-end device (i.e. a Raspberry Pi). Taking the sub-billion-parameter versions of GPT-2/Cerebras-GPT and OPT as examples, the model compressed with our approach achieves language task performance comparable to the original model at around $2.0\times$ embedding layer compression, while the energy consumption of a single query drops by half.
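The core idea in the abstract is to tensorize each pre-trained embedding vector and factorize it into an MPS via Tensor-Train decomposition. The sketch below is a minimal, hypothetical illustration in NumPy (not the authors' released code): the tensorization shape (8, 12, 8) for a 768-dimensional GPT-2-style embedding and the maximum TT-rank of 4 are illustrative assumptions, and the TT-SVD routine shown is the standard sequential truncated-SVD construction rather than the paper's exact procedure.

```python
import numpy as np

def tt_decompose(vector, shape, max_rank):
    """Decompose a 1-D embedding vector into Tensor-Train (MPS) cores.

    Minimal TT-SVD sketch: reshape the vector into a higher-order tensor of
    the given `shape`, then peel off one core per mode with truncated SVDs.
    `shape` and `max_rank` here are illustrative, not the paper's settings.
    """
    assert np.prod(shape) == vector.size
    cores = []
    unfolding = vector.reshape(shape)
    rank_prev = 1
    for k in range(len(shape) - 1):
        # Unfold as (rank_prev * n_k) x (remaining modes), then truncate.
        unfolding = unfolding.reshape(rank_prev * shape[k], -1)
        u, s, vt = np.linalg.svd(unfolding, full_matrices=False)
        rank = min(max_rank, s.size)
        cores.append(u[:, :rank].reshape(rank_prev, shape[k], rank))
        unfolding = s[:rank, None] * vt[:rank]  # carry the remainder forward
        rank_prev = rank
    cores.append(unfolding.reshape(rank_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract TT cores back into the full vector (sanity check only)."""
    result = cores[0]
    for core in cores[1:]:
        result = np.tensordot(result, core, axes=([-1], [0]))
    return result.reshape(-1)

# Hypothetical example: one 768-dim embedding row, factorized as 8 x 12 x 8.
emb = np.random.randn(768).astype(np.float32)
cores = tt_decompose(emb, shape=(8, 12, 8), max_rank=4)
approx = tt_reconstruct(cores)
compression = emb.size / sum(c.size for c in cores)
rel_err = np.linalg.norm(emb - approx) / np.linalg.norm(emb)
print(f"compression: {compression:.2f}x, relative error: {rel_err:.3f}")
```

In this toy setting the three cores hold 256 parameters instead of 768, i.e. a 3x reduction for that single vector; in practice the tensorization shape and TT-ranks would be chosen to trade reconstruction error against the compression ratio, which is the trade-off the paper studies.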
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4925