CLIP-Flow: Decoding images encoded in CLIP space

Published: 01 Jan 2024, Last Modified: 13 Nov 2024Comput. Vis. Media 2024EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: This study introduces CLIP-Flow, a novel network for generating images from a given image or text. To effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image synthesis. In particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from such information. Moreover, to bridge the embedding space of CLIP and latent space of StyleGAN, real NVP is employed and modified with activation normalization and invertible convolution. As the images and text in CLIP share the same representation space, text prompts can be fed directly into CLIP-Flow to achieve text-to-image synthesis. We conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis method. In addition, we tested on the public dataset Multi-Modal CelebA-HQ, for text-to-image synthesis. Experiments validated that our approach can generate high-quality text-matching images, and is comparable with state-of-the-art methods, both qualitatively and quantitatively.
Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview