Abstract: Foundation models are strong data compressors, but when accounting for their parameter size, their compression ratios are inferior to those of standard compression algorithms. Naively reducing the parameter count does not necessarily help, since it degrades predictions and, in turn, compression. We conduct a large-scale empirical study to find a sweet spot where pre-trained vanilla transformers can achieve competitive compression ratios. To this end, we train models on 165GB of raw byte sequences of text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG-XL, FLAC), even when accounting for parameter size. For example, on OOD audio data we achieve the lowest compression ratio, 0.49, versus 0.54 for FLAC. We conduct extensive ablations and hyperparameter sweeps to study the impact of model and dataset scale, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but, unlike with large-scale foundation models, their transfer to unseen modalities is generally weak.
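To make the parameter-size accounting concrete, here is a minimal sketch (not the paper's code; the function name and the byte counts below are purely illustrative assumptions) of a compression ratio that charges the serialized model to the output:

```python
def adjusted_compression_ratio(coded_bytes: int,
                               model_param_bytes: int,
                               raw_bytes: int) -> float:
    """Compression ratio that counts the model parameters as part of the output.

    A neural compressor must ship its weights alongside the coded data, so the
    effective compressed size is the coded stream plus the serialized model.
    """
    return (coded_bytes + model_param_bytes) / raw_bytes


# Purely illustrative numbers (not taken from the paper):
raw = 1_000_000_000      # 1 GB of out-of-distribution data
coded = 440_000_000      # bytes produced by coding the data with the model
params = 50_000_000      # serialized size of a small transformer
print(adjusted_compression_ratio(coded, params, raw))  # 0.49
```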
Lay Summary: It is well established that neural networks can be used to compress data, including text, images, and audio, much as zip compresses files. Prior work has shown that current large language models are very effective compressors, even for data they have not encountered during training. However, they are too large to be practically useful as compressors, since one would also need to store the model parameters in the compressed output. We therefore investigate whether much smaller neural networks of the same type can bridge this gap, i.e., whether they can serve as practical compressors and whether they also generalize to data from unseen domains. We trained small networks on large amounts of data from various sources (text, audio, images) and demonstrated that they can outperform popular compression tools such as gzip, JPEG-XL, and FLAC. However, while these small networks can learn to compress multiple types of data, unlike large language models they are generally incapable of compressing types of data they were not trained on.
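For intuition on why a good next-byte predictor doubles as a compressor, here is a minimal sketch (an illustration under stated assumptions, not the authors' implementation; `next_byte_probs` is a hypothetical stand-in for a trained model): with arithmetic coding, a byte sequence can be stored in roughly as many bits as the model's cumulative negative log-likelihood, so sharper predictions translate directly into smaller files.

```python
import math
from typing import Callable, Sequence

def ideal_code_length_bits(data: Sequence[int],
                           next_byte_probs: Callable[[Sequence[int]], Sequence[float]]) -> float:
    """Shannon code length of `data` under an autoregressive byte model.

    Arithmetic coding achieves this total up to a few bits of overhead, so the
    model's cumulative log-loss is, in effect, the compressed size.
    """
    total_bits = 0.0
    for i, byte in enumerate(data):
        probs = next_byte_probs(data[:i])   # distribution over the 256 byte values
        total_bits += -math.log2(probs[byte])
    return total_bits

# A uniform "model" needs 8 bits per byte, i.e., it achieves no compression:
uniform = lambda context: [1.0 / 256] * 256
assert round(ideal_code_length_bits(b"hello", uniform)) == 40
```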
Primary Area: Deep Learning
Keywords: lossless compression, transformers, multimodal
Submission Number: 7034