Compressed-Language Models for Understanding Compressed File Formats: a JPEG Exploration

Juan Camilo Perez; Alejandro Pardo; Mattia Soldan; Hani Itani; Juan C Leon Alcazar; Bernard Ghanem

26 Sept 2024 (modified: 24 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: Compressed File Formats, JPEG, Autoregressive Transformers

TL;DR: This study shows that Compressed-Language Models (CLMs) can understand and operate on compressed file formats, like JPEG, by recognizing properties, handling anomalies, and generating files directly from byte streams.

Abstract: This study investigates whether Compressed-Language Models (CLMs), i.e., language models operating on raw byte streams from Compressed File Formats (CFFs), can understand files compressed by CFFs. We focus on the JPEG format as a representative CFF, given its widespread use and the way it embodies key concepts in compression, such as entropy coding and run-length encoding. We test whether CLMs understand the JPEG format by probing their capabilities along three axes: recognition of inherent file properties, handling of files with anomalies, and generation of new files. Our findings demonstrate that CLMs can effectively perform these tasks. These results suggest that CLMs can understand the semantics of compressed data when operating directly on the byte streams of files produced by CFFs. The ability to operate directly on raw compressed files offers the promise of leveraging the ubiquitous and multi-modal nature of CFFs.
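To make "operating on raw byte streams" concrete, the following is a minimal sketch of how a file's bytes could be turned into a token sequence for an autoregressive model. It assumes a vocabulary of 256 tokens with one token per byte value; the abstract does not state the tokenization actually used, so this is an illustration only. The helper names (`bytes_to_tokens`, `tokens_to_bytes`, `has_jpeg_framing`) are hypothetical, and the toy stream carries only the standard JPEG start and end markers, so it is not a decodable image.

```python
from typing import List

SOI = b"\xff\xd8"  # JPEG Start Of Image marker
EOI = b"\xff\xd9"  # JPEG End Of Image marker


def bytes_to_tokens(data: bytes) -> List[int]:
    """Map each byte to an integer token in [0, 255] (assumed vocabulary)."""
    return list(data)


def tokens_to_bytes(tokens: List[int]) -> bytes:
    """Inverse mapping: turn a token sequence back into a byte stream."""
    return bytes(tokens)


def has_jpeg_framing(data: bytes) -> bool:
    """Check only the outer markers; this does not validate the full format."""
    return data.startswith(SOI) and data.endswith(EOI)


# Toy stream: SOI, four arbitrary payload bytes, EOI (not a real image).
toy_stream = SOI + bytes([0x00, 0x7F, 0xFF, 0x00]) + EOI
tokens = bytes_to_tokens(toy_stream)
```

Because the mapping is lossless, any token sequence a model emits can be written back to disk as a file, which is what makes generation "directly from byte streams" possible in principle; whether the result decodes as a valid JPEG depends on the model.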

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 6183
