TensorSLM: Energy-efficient Embedding Compression of Sub-billion Parameter Language Models on Low-end Devices

Published: 10 Jun 2025, Last Modified: 01 Jul 2025 · TTODLer-FM @ ICML 2025 Poster · CC BY 4.0
Keywords: model compression, natural language processing, low-end device, energy efficiency, tensor decomposition
Abstract: Small Language Models (SLMs, or on-device LMs) have significantly fewer parameters than Large Language Models (LLMs). They are typically deployed on low-end devices, such as mobile phones and single-board computers. Unlike LLMs, which rely on increasing model size for better generalisation, SLMs designed for edge applications are expected to offer **adaptivity** to the deployment environment and **energy efficiency** under device battery-life constraints, requirements that datacenter-deployed LLMs do not address. This paper addresses these two requirements by proposing a training-free token embedding compression approach using Tensor-Train Decomposition (TTD). Each pre-trained token embedding vector is converted into a lower-dimensional Matrix Product State (MPS). We comprehensively evaluate the extracted low-rank structures in terms of compression ratio, language task performance, latency, and energy consumption on a typical low-end device, i.e., a Raspberry Pi. Taking the sub-billion-parameter versions of the GPT-2/Cerebras-GPT and OPT models as examples, our approach achieves language task performance comparable to the original models with around $2.0\times$ embedding-layer compression, while the energy consumption of a single query drops by half.
Submission Number: 20
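For intuition about the approach described in the abstract, here is a minimal sketch (not the authors' implementation) of converting a single pre-trained embedding vector into Tensor-Train / MPS cores via sequential truncated SVDs. The factorisation shape `(8, 8, 12)` for a 768-dimensional GPT-2-style embedding row, the rank cap, and the NumPy-based helpers are illustrative assumptions.

```python
import numpy as np

def tt_decompose(vec, shape, max_rank):
    """Convert a 1-D embedding vector into Tensor-Train (MPS) cores.

    vec:      1-D array whose length equals prod(shape)
    shape:    factorisation of the embedding dimension, e.g. (8, 8, 12) for 768
    max_rank: cap on every TT rank (controls the compression ratio)
    """
    assert vec.size == np.prod(shape)
    cores = []
    rank_prev = 1
    remainder = vec.reshape(shape)              # k-way tensor view of the vector
    for mode in shape[:-1]:
        # Matricise: (previous rank * current mode) x (remaining modes)
        mat = remainder.reshape(rank_prev * mode, -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        rank = min(max_rank, len(s))            # truncate to the target TT rank
        cores.append(u[:, :rank].reshape(rank_prev, mode, rank))
        remainder = np.diag(s[:rank]) @ vt[:rank]  # carry the residual forward
        rank_prev = rank
    cores.append(remainder.reshape(rank_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a dense vector (fidelity check)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape(-1)

# Example: compress one 768-dimensional embedding row (random stand-in data).
rng = np.random.default_rng(0)
embedding_row = rng.standard_normal(768)
cores = tt_decompose(embedding_row, shape=(8, 8, 12), max_rank=4)
approx = tt_reconstruct(cores)
original_params = embedding_row.size
tt_params = sum(c.size for c in cores)
print(f"compression: {original_params / tt_params:.1f}x, "
      f"rel. error: {np.linalg.norm(approx - embedding_row) / np.linalg.norm(embedding_row):.3f}")
```

In this sketch the choice of factorisation shape and `max_rank` trades storage (and hence compression ratio) against reconstruction error, which is the knob the paper evaluates against language task performance, latency, and energy on device.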