Keywords: Transform Coding, Multimodal Data Compression, Entropy Model, Large Language Models
TL;DR: We propose a simple yet effective fine-tuning strategy that introduces an LLM-based entropy model into transform coding, tested across various codecs for both images and speech.
Abstract: Large language models (LLMs) have shown promising advancements in lossless compression due to their excellent next-token prediction capabilities. However, there is a gap between LLM-based compressors and classical transform-based codecs. Existing LLM-based compressors function solely as entropy coders, compressing redundant data in the raw domain, whereas classical codecs typically transform raw data into more compact features in the latent domain before applying entropy coding; this latent-domain setting has not been explored with LLM-based compressors. To our knowledge, this is the first work to introduce an LLM-based entropy model for transform coding. Specifically, we propose a simple yet effective fine-tuning strategy, tested across various codecs for both images and speech. With less than 2% of the parameters fine-tuned, the LLMs serve as highly effective entropy models for well-established transform-based compression codecs. For instance, after fine-tuning, LLaMA3-8B paired with arithmetic coding compresses latent image codes on Kodak to 4.62% and speech codes on LibriTTS to 42.53% of their transformed sizes. Our proposed methods achieve notable BD-rate improvements of 54.07% over JPEG, 17.61% over VQGAN, and 34.61% over SpeechTokenizer. These findings highlight the great potential of integrating LLMs into codecs to significantly improve coding efficiency. Source code will be released upon acceptance.
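Illustration (not from the submission): a minimal sketch of the core idea described in the abstract, using a causal LLM's next-token probabilities as the entropy model for a sequence of latent codes. Since an arithmetic coder's output length is essentially the negative log-likelihood of the sequence under the probability model, the sketch estimates the compressed size in bits rather than implementing the coder itself. The checkpoint name and the assumption that codebook indices are already mapped into the LLM vocabulary are illustrative, not taken from the paper.

    # Sketch only: estimate the bitrate an arithmetic coder would achieve when an
    # LLM supplies next-token probabilities for latent codes (assumptions noted above).
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # assumed checkpoint
    model.eval()

    def estimated_bits(latent_ids: torch.Tensor) -> float:
        """latent_ids: 1-D tensor of codebook indices already mapped into the LLM vocabulary."""
        ids = latent_ids.unsqueeze(0)                       # add batch dimension: [1, T]
        with torch.no_grad():
            logits = model(ids).logits                      # [1, T, vocab]
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
        targets = ids[:, 1:]
        nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # per-token NLL in nats
        return (nll.sum() / torch.log(torch.tensor(2.0))).item()        # convert to bits

In practice the first token would be coded under a fixed prior (or preceded by a BOS token), and the reported method fine-tunes a small fraction of the LLM's parameters on latent codes before using it this way.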
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9792