Perceptual Image Compression with Text-Guided Multi-level Fusion

Published: 01 Jan 2024, Last Modified: 14 May 2025. Venue: PRCV (5) 2024. License: CC BY-SA 4.0
Abstract: Learned Image Compression (LIC) methods have advanced remarkably and excel in rate-distortion performance, but they may not fully account for the subjective perception of viewers. In contrast, Perceptual Image Compression (PIC) techniques have gained attention for their focus on preserving visual quality as perceived by humans. This paper presents a novel perceptual image compression network that leverages a Text-guided Multi-level Fusion (TMF) module to achieve a better rate-perception-distortion trade-off. More precisely, we employ the CLIP text encoder to extract textual features, which we then use as guidance in multiple rounds of fusion with image features. Furthermore, we conduct extensive experiments to quantitatively validate the underlying principles of our algorithm, shed light on the theoretical underpinnings of text guidance, and demonstrate the effectiveness of our approach from diverse perspectives. Finally, we evaluate our algorithm using state-of-the-art perceptual quality indicators, including CLIP-IQA, NIQE, and FID, and show that our PIC-TMF network outperforms existing algorithms on each metric.
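The abstract does not specify how the TMF module combines text and image features. As a rough illustration only, the sketch below assumes that each fusion round is a plain cross-attention step (image tokens attend over text tokens, with a residual add) and that rounds are interleaved with downsampling. The names `fuse_level`, `average_pool_pairs`, and `multi_level_fusion` are hypothetical, the text tokens stand in for CLIP text-encoder outputs, and none of this should be read as the authors' architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_level(image_tokens, text_tokens):
    """One hypothetical fusion round (assumed, not from the paper):
    each image token attends over the text tokens and the attended
    text vector is added back to the image token as a residual."""
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for token in image_tokens:
        weights = softmax([dot(token, t) * scale for t in text_tokens])
        context = [
            sum(w * t[i] for w, t in zip(weights, text_tokens))
            for i in range(dim)
        ]
        fused.append([x + c for x, c in zip(token, context)])
    return fused


def average_pool_pairs(tokens):
    """Stand-in for a downsampling stage: average adjacent token pairs.
    A trailing unpaired token is dropped."""
    return [
        [(a + b) / 2 for a, b in zip(tokens[i], tokens[i + 1])]
        for i in range(0, len(tokens) - 1, 2)
    ]


def multi_level_fusion(image_tokens, text_tokens, levels=3):
    """Apply the fusion round at several scales, downsampling between rounds."""
    feats = image_tokens
    for level in range(levels):
        feats = fuse_level(feats, text_tokens)
        if level < levels - 1 and len(feats) >= 2:
            feats = average_pool_pairs(feats)
    return feats


# Toy inputs: 8 image tokens and 2 text tokens, all of dimension 4.
image_tokens = [[float(i + j) for j in range(4)] for i in range(8)]
text_tokens = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
fused = multi_level_fusion(image_tokens, text_tokens, levels=3)
```

With these toy inputs the same text tokens guide every level, while the image representation shrinks from 8 tokens to 4 and then to 2. A real system would use learned projections for queries, keys, and values and convolutional down/upsampling in place of the pair averaging shown here.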