Efficient Lossless Text Compression with Large Language Models: Enhancing Cross-Lingual and Cross-Domain Applications
Abstract: In the era of information explosion, the rapid growth of multilingual and multi-domain textual data poses unprecedented challenges for efficient storage and transmission. Traditional lossless compression methods such as Huffman coding, LZ77, and zlib perform well in certain scenarios but rely on fixed statistical rules, which limits their ability to capture deeper linguistic structure, especially in complex or domain-specific texts. To address these limitations, we propose two large language model-based lossless text compression methods, DeepSeekZip and LlamaZip, which integrate DeepSeek-8B and Llama3-8B, respectively, as predictive models alongside conventional zlib compression. By leveraging the models’ capacity to model complex language patterns, our approach significantly enhances compression performance. Extensive experiments across multiple languages and text domains demonstrate that DeepSeekZip and LlamaZip consistently achieve compression rates more than 10% higher than zlib alone. Notably, DeepSeekZip performs better on Chinese text, while both models show comparable results on English. Compression effectiveness also varies across domains: news and medical texts compress more efficiently than legal and technical ones, highlighting the impact of structure, terminology, and contextual dependencies on compression outcomes.
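The abstract only sketches the pipeline at a high level (an LLM as next-token predictor, combined with zlib). The snippet below is a minimal illustrative sketch of one common way such a combination can work, not the authors' implementation: each token is replaced by its rank in the language model's predicted distribution, and the resulting rank stream (dominated by small values) is compressed with zlib. The model name, the fixed 4-byte rank encoding, and the helper names are assumptions made for illustration.

```python
# Hypothetical sketch of an LLM+zlib lossless text compressor (rank coding).
# Assumptions: a Hugging Face causal LM as the predictor, 4-byte rank encoding,
# and bitwise-identical logits on the compression and decompression sides.
import zlib
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def _chunks(data: bytes, size: int):
    """Split a byte string into fixed-size chunks."""
    return (data[i : i + size] for i in range(0, len(data), size))


def compress(text: str) -> bytes:
    """Encode text as next-token ranks under the LM, then zlib the rank stream."""
    ids = tokenizer(text, add_special_tokens=False, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]  # (seq_len, vocab)
    ranks = [int(ids[0])]  # first token has no left context; store its id directly
    for pos in range(1, len(ids)):
        order = torch.argsort(logits[pos - 1], descending=True)
        ranks.append(int((order == ids[pos]).nonzero().item()))
    # Fixed 4-byte big-endian encoding; zlib removes the redundancy of the
    # many near-zero ranks produced by a good predictor.
    return zlib.compress(b"".join(r.to_bytes(4, "big") for r in ranks))


def decompress(blob: bytes) -> str:
    """Invert compress(): recover ranks, then replay the LM to recover tokens."""
    ranks = [int.from_bytes(c, "big") for c in _chunks(zlib.decompress(blob), 4)]
    ids = [ranks[0]]
    for r in ranks[1:]:
        with torch.no_grad():
            logits = model(torch.tensor([ids])).logits[0, -1]
        order = torch.argsort(logits, descending=True)
        ids.append(int(order[r]))
    return tokenizer.decode(ids)
```

For text where the model predicts well, most ranks are 0 or small, so the zlib-compressed rank stream is much shorter than zlib applied to the raw bytes; a production system would additionally need deterministic inference so both sides sort the vocabulary identically.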
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Text Compression, Language Modeling, Lossless Compression, Large Language Models, Contextual Modeling
Contribution Types: Model analysis & interpretability, Approaches to low-compute settings (efficiency), Data resources
Languages Studied: Chinese, English
Keywords: Text Compression, Language Modeling, Lossless Compression, Large Language Models, Contextual Modeling
Submission Number: 2828