LLMZip: Lossless Text Compression using Large Language Models

24 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Models, Transformers, Compression, Arithmetic Coding, Zip, Lossless Text Compression
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: We design a lossless compression algorithm for English text that uses the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. Specifically, the proposed LLMZip algorithm uses the conditional probabilities at the output of the large language model in conjunction with Arithmetic Coding. We show that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h. We show that it is possible to marginally improve the compression performance further by first extracting a summary of the document and then compressing the text conditioned on that summary. Finally, we investigate the compression performance of LLMZip when the summary (side information) is available at both the encoder and the decoder. We show that the LLM is able to exploit the available side information to significantly improve the compression performance. As an important byproduct, we provide new estimates of an asymptotic upper bound on the entropy of English that are significantly smaller than the currently available estimates in \cite{cover1978convergent}, \cite{lutati2023focus}.
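To make the mechanism described in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of the codelength accounting behind the scheme: an ideal arithmetic coder driven by a predictor spends about -log2 p(token | past window) bits per token, so summing these quantities over a token sequence gives the achievable compressed size to within a couple of bits. The function `next_token_probs` is a toy stand-in for the LLM's softmax output (in LLMZip this role is played by LLaMA-7B); the window length and vocabulary size here are arbitrary illustration choices.

```python
import math

# Hypothetical stand-in for the LLM predictor: given the token history,
# return a probability distribution over the vocabulary. In LLMZip this
# would be the softmax output of LLaMA-7B over its token vocabulary.
def next_token_probs(context, vocab_size=4):
    # Toy model: slightly favors repeating the most recent token.
    probs = [1.0 / vocab_size] * vocab_size
    if context:
        probs[context[-1]] += 0.5
        total = sum(probs)
        probs = [p / total for p in probs]
    return probs

def ideal_codelength_bits(tokens, window=64):
    """Sum of -log2 p(token | past window) over the sequence.

    An arithmetic coder driven by the same conditional probabilities
    compresses the sequence to within about 2 bits of this total.
    The window length here is an arbitrary choice for illustration.
    """
    total_bits = 0.0
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - window):i]
        p = next_token_probs(context)[tok]
        total_bits += -math.log2(p)
    return total_bits

if __name__ == "__main__":
    tokens = [0, 0, 1, 1, 1, 2, 3, 3]  # toy token-id sequence
    bits = ideal_codelength_bits(tokens)
    print(f"{bits:.2f} bits total, {bits / len(tokens):.2f} bits/token")
```

Dividing the resulting bit count by the number of characters in the original text yields a bits-per-character figure, which is the sense in which such a scheme also provides an upper-bound estimate on the entropy of English.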
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8834