Abstract: Recent advances in Artificial Intelligence Generated Content (AIGC) have triggered a growing need to transmit and compress the vast number of AI-generated images (AIGIs). However, research on compression methods tailored to AIGIs remains scarce. To address this critical gap, we advocate that Stable Diffusion serves as a natural cross-modal decoder by leveraging its rich and scalable priors, and we introduce a scalable cross-modal compression framework that incorporates multiple human-comprehensible modalities. As illustrated in Fig. 1(a), the proposed framework encodes images into a layered bitstream: a semantic prior that delivers high-level semantic information through text prompts; a structural prior that captures spatial details using edge or skeleton maps; and a texture prior that preserves local textures via a colormap. Using Stable Diffusion as the backend, the decoder leverages these multi-modal scalable priors to generate images at different levels of fidelity. Experiments show that our method preserves realistic details and semantic fidelity at an extremely low bitrate (< 0.02 bpp), comparable to recent perceptual coding approaches and outperforming VVC. The R-D performance also demonstrates the scalability of the proposed multi-layered bitstream, since image fidelity improves incrementally as structure and texture priors are supplied during decoding. Additionally, as illustrated in Fig. 1(b), our framework facilitates downstream editing applications such as Structure Manipulation, Texture Synthesis, and Object Erasing without requiring full decoding, thereby paving a new direction for future research in AIGI compression.
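To make the layered design concrete, below is a minimal decoder-side sketch assembled from off-the-shelf components (Hugging Face diffusers with ControlNet edge conditioning). The model names, Canny thresholds, and file paths are illustrative assumptions, not the paper's implementation; the actual codec also transmits a colormap texture layer, which is omitted here.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Semantic prior: a short text prompt (tens of bytes in the bitstream).
prompt = "a red fox sitting in a snowy forest"  # hypothetical content

# Structural prior: a sparse edge map extracted at encode time.
gray = cv2.cvtColor(cv2.imread("input.png"), cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)  # thresholds are illustrative
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Decoder backend: Stable Diffusion conditioned on the structural prior
# via a ControlNet (an assumption; the paper's conditioning may differ).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Fidelity scales with the priors supplied: the prompt alone yields a
# semantically faithful image; adding the edge map recovers structure.
reconstruction = pipe(prompt, image=edge_image, num_inference_steps=30).images[0]
reconstruction.save("decoded.png")
```

Dropping the edge map degrades gracefully to text-only reconstruction, which is the sense in which the bitstream is scalable: each additional prior layer refines the decoded image rather than being required for it.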
External IDs: dblp:conf/dcc/Chen0C25