Keywords: Model Inversion, Text-to-Image, CLIP, Implicit Neural Representations
TL;DR: We turn the discriminative CLIP model into a generator capable of addressing multiple tasks—such as text-to-image synthesis, style transfer, and image editing—without fine-tuning
Abstract: CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. We show that image synthesis is nevertheless possible using CLIP alone—without a pre-trained generative decoder or CLIP tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. With CLIP frozen, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. Our findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
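To make the core idea concrete, here is a minimal sketch of inverting a frozen CLIP model by gradient descent on the image itself. This is an illustration only: the paper optimizes a frequency-aware implicit neural representation with robust initialization, a Procrustes alignment step, and a blending loss, whereas this sketch optimizes raw pixels against the basic CLIP similarity objective and omits CLIP's input normalization for brevity. The prompt and hyperparameters are placeholders.

```python
# Sketch: text-to-image via frozen-CLIP inversion (pixel parameterization as a
# stand-in for the paper's implicit neural representation).
import torch
import clip  # OpenAI CLIP (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # avoid fp16/fp32 mismatches when backpropagating to the image
model.eval()           # CLIP stays frozen; only the image is optimized

# Encode the target text once; gradients never flow into CLIP's weights.
text = clip.tokenize(["a watercolor painting of a fox"]).to(device)  # placeholder prompt
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Image parameterization: raw 224x224 RGB pixels. The paper instead optimizes
# an INR whose layers are stratified by frequency for coarse-to-fine synthesis.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    img_feat = model.encode_image(image.clamp(0, 1))  # CLIP normalization omitted for brevity
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_feat * text_feat).sum()  # maximize cosine similarity to the text
    loss.backward()
    optimizer.step()
```

Plain pixel optimization of this objective is known to produce adversarial-looking artifacts; the paper's INR parameterization, robust initialization, and natural-image priors are what steer the inversion toward coherent images.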
Primary Area: optimization
Submission Number: 18213