LLM Vocabulary Compression for Low-Compute Environments

Published: 09 Oct 2024, Last Modified: 19 Nov 2024. Compression Workshop @ NeurIPS 2024. License: CC BY 4.0
Keywords: vocabulary compression, low compute environments, language modelling
TL;DR: We present a simple method that compresses the vocabulary layer of language models, significantly reducing memory usage.
Abstract: We present a method to compress the final linear layer of language models, reducing memory usage by up to 3.4x without significant performance loss. By grouping tokens based on Byte Pair Encoding (BPE) merges, we prevent materialization of the memory-intensive logits tensor. Evaluations on the TinyStories dataset show that our method performs on par with GPT-Neo while significantly improving throughput by up to 3x, making it suitable for low-resource environments.
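The page gives no implementation details, so the following is only a minimal sketch of the general idea described in the abstract: factoring the output distribution as p(token) = p(group) · p(token | group) so that the full [positions × vocab] logits tensor is never materialized. The two-level decomposition, the contiguous-ID grouping (standing in for BPE-merge-based groups), and every name below (FactoredVocabHead, num_groups, intra_proj, ...) are illustrative assumptions, not the authors' released method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactoredVocabHead(nn.Module):
    """Hypothetical factored vocabulary head: predicts a token group and a
    position within the group instead of a logit per vocabulary entry."""

    def __init__(self, hidden_dim: int, vocab_size: int, num_groups: int):
        super().__init__()
        assert vocab_size % num_groups == 0, "illustrative: equal-size groups"
        self.group_size = vocab_size // num_groups
        # Logits over groups: hidden_dim -> num_groups
        self.group_proj = nn.Linear(hidden_dim, num_groups)
        # Logits over positions within a group: hidden_dim -> group_size
        self.intra_proj = nn.Linear(hidden_dim, self.group_size)

    def loss(self, hidden: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        """hidden: [N, hidden_dim], targets: [N] token ids.
        Computes the factored cross-entropy without ever building an
        [N, vocab_size] logits tensor."""
        group_ids = targets // self.group_size   # which group the target falls in
        intra_ids = targets % self.group_size    # its slot inside that group
        group_loss = F.cross_entropy(self.group_proj(hidden), group_ids)
        intra_loss = F.cross_entropy(self.intra_proj(hidden), intra_ids)
        return group_loss + intra_loss


# Illustrative usage with dummy data:
head = FactoredVocabHead(hidden_dim=256, vocab_size=4096, num_groups=64)
hidden = torch.randn(8, 256)            # 8 positions from the model
targets = torch.randint(0, 4096, (8,))  # next-token ids
print(head.loss(hidden, targets))
```

Under this sketch's simplifying assumption of a single shared within-group projection, the output layer shrinks from hidden_dim × vocab_size to hidden_dim × (num_groups + vocab_size / num_groups) parameters, and the per-step logits shrink accordingly. The paper's actual grouping follows BPE merges and its parameterization may differ.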
Submission Number: 105