Efficient Lossless Text Compression with Large Language Models: Enhancing Cross-Lingual and Cross-Domain Applications
Abstract: In the era of information explosion, the rapid growth of multilingual and multi-domain textual data poses unprecedented challenges for efficient storage and transmission. Traditional lossless compression methods such as Huffman coding, LZ77, and zlib perform well in certain scenarios but rely on fixed statistical rules, which limits their ability to capture deeper linguistic structure, especially in complex or domain-specific texts. To address these limitations, we propose two large language model-based lossless text compression methods, DeepSeekZip and LlamaZip, which integrate DeepSeek-8B and Llama3-8B, respectively, as predictive models alongside conventional zlib compression. By leveraging the models’ capacity to model complex language patterns, our approach significantly enhances compression performance. Extensive experiments across multiple languages and text domains demonstrate that DeepSeekZip and LlamaZip consistently achieve compression rates more than 10% higher than zlib alone. Notably, DeepSeekZip performs better on Chinese text, while both models show comparable results on English. Compression effectiveness also varies across domains: news and medical texts compress more efficiently than legal and technical ones, highlighting the impact of structure, terminology, and contextual dependencies on compression outcomes.
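The abstract only sketches the pipeline at a high level (an LLM as next-token predictor, combined with zlib). The snippet below is a minimal illustrative sketch of one common way such a combination can work, not the authors' implementation: each token is replaced by its rank in the language model's predicted distribution, and the resulting rank stream (dominated by small values) is compressed with zlib. The model name, the fixed 4-byte rank encoding, and the helper names are assumptions made for illustration.

```python
# Hypothetical sketch of an LLM+zlib lossless text compressor (rank coding).
# Assumptions: a Hugging Face causal LM as the predictor, 4-byte rank encoding,
# and bitwise-identical logits on the compression and decompression sides.
import zlib
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def _chunks(data: bytes, size: int):
    """Split a byte string into fixed-size chunks."""
    return (data[i : i + size] for i in range(0, len(data), size))


def compress(text: str) -> bytes:
    """Encode text as next-token ranks under the LM, then zlib the rank stream."""
    ids = tokenizer(text, add_special_tokens=False, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]  # (seq_len, vocab)
    ranks = [int(ids[0])]  # first token has no left context; store its id directly
    for pos in range(1, len(ids)):
        order = torch.argsort(logits[pos - 1], descending=True)
        ranks.append(int((order == ids[pos]).nonzero().item()))
    # Fixed 4-byte big-endian encoding; zlib removes the redundancy of the
    # many near-zero ranks produced by a good predictor.
    return zlib.compress(b"".join(r.to_bytes(4, "big") for r in ranks))


def decompress(blob: bytes) -> str:
    """Invert compress(): recover ranks, then replay the LM to recover tokens."""
    ranks = [int.from_bytes(c, "big") for c in _chunks(zlib.decompress(blob), 4)]
    ids = [ranks[0]]
    for r in ranks[1:]:
        with torch.no_grad():
            logits = model(torch.tensor([ids])).logits[0, -1]
        order = torch.argsort(logits, descending=True)
        ids.append(int(order[r]))
    return tokenizer.decode(ids)
```

For text where the model predicts well, most ranks are 0 or small, so the zlib-compressed rank stream is much shorter than zlib applied to the raw bytes; a production system would additionally need deterministic inference so both sides sort the vocabulary identically.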
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Text Compression, Language Modeling, Lossless Compression, Large Language Models, Contextual Modeling
Contribution Types: Model analysis & interpretability, Approaches to low-compute settings (efficiency), Data resources
Languages Studied: Chinese, English
Keywords: Text Compression, Language Modeling, Lossless Compression, Large Language Models, Contextual Modeling
Submission Number: 2828