Keywords: Low Rank Decomposition, Optimisation, Compression, Code LLMs
TL;DR: This paper explores Low Rank Decomposition (LoRD) as a promising new paradigm for compressing large language models for code, and discusses its compatibility with other compression techniques such as pruning and quantization.
Abstract: We propose using low-rank matrix decomposition (LoRD), which splits a large matrix into a product of two smaller matrices, to compress neural network models and thereby enhance inference speed. Unlike quantization, LoRD maintains fully differentiable, trainable parameters and leverages efficient floating-point operations. We investigate its advantages for compressing Large Language Models (LLMs) for monolingual code generation, demonstrating that linear layer ranks can be reduced by up to 39.58% with less than a 1% increase in perplexity. Specifically, we use LoRD to compress the StarCoder 16B model to 13.2B parameters with no performance drop, and to 12.3B parameters with a minimal drop in the HumanEval Pass@1 score, all within 10 minutes on a single A100 GPU. The compressed models achieve up to a 22.35% inference speedup with a single-line code change in HuggingFace's implementation with the PyTorch backend.
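For illustration, the sketch below shows one generic way a dense linear layer can be replaced by a rank-r product of two smaller layers via truncated SVD, in the spirit of the low-rank decomposition described above. The function name, the example rank, and the module path are assumptions for illustration only, not the authors' released code.

```python
import torch
import torch.nn as nn

def lord_decompose_linear(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense nn.Linear with a product of two smaller linear layers
    obtained from a truncated SVD of its weight (a generic sketch of low-rank
    decomposition; not the authors' implementation)."""
    W = linear.weight.data                       # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                 # (out_features, rank), singular values folded in
    V_r = Vh[:rank, :]                           # (rank, in_features)

    # Two smaller layers whose composed weight U_r @ V_r approximates W.
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)

# Hypothetical usage on one projection of a HuggingFace transformer block
# (the module path and rank are illustrative, not taken from the paper):
# block = model.transformer.h[0].mlp
# block.c_proj = lord_decompose_linear(block.c_proj, rank=2048)
```

Because both factors remain ordinary floating-point linear layers, the decomposed model stays fully differentiable and can be fine-tuned or combined with other compression techniques such as pruning and quantization.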
Submission Number: 30