TL;DR: Direct likelihood approximation of images and captions based on CLIP's learned distribution.
Abstract: Likelihood approximations for images are nontrivial to compute yet useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with any other feature, resulting in an identity covariance matrix. We show that the whitened embedding statistics are well approximated by a standard normal distribution, allowing log-likelihood to be estimated from the squared Euclidean norm in the whitened space. The whitening procedure is entirely training-free and relies on a precomputed whitening matrix, making it extremely fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions. Our code is available at github.com/rbetser/W_CLIP/tree/main.
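To illustrate the idea described in the abstract, here is a minimal NumPy sketch of whitening a set of CLIP embeddings and scoring a new embedding by its squared Euclidean norm in the whitened space. This is not the authors' released implementation (see the linked repository for that); the function names `fit_whitening` and `log_likelihood`, the ZCA-style construction of the whitening matrix, and the small `eps` regularizer are illustrative assumptions.

```python
import numpy as np

def fit_whitening(embeddings: np.ndarray, eps: float = 1e-6):
    """Estimate mean and a ZCA-style whitening matrix from CLIP embeddings.

    embeddings: (N, d) array of CLIP image or text features.
    Returns (mu, W) such that (x - mu) @ W has ~zero mean and ~identity covariance.
    """
    mu = embeddings.mean(axis=0)
    centered = embeddings - mu
    cov = centered.T @ centered / (len(embeddings) - 1)
    # Eigendecomposition of the covariance yields an invertible linear whitening transform.
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mu, W

def log_likelihood(x: np.ndarray, mu: np.ndarray, W: np.ndarray) -> float:
    """Approximate log-likelihood under a standard normal in the whitened space.

    Up to an additive constant, log p(x) is minus half the squared Euclidean
    norm of the whitened embedding.
    """
    z = (x - mu) @ W
    d = z.shape[-1]
    return -0.5 * float(z @ z) - 0.5 * d * np.log(2.0 * np.pi)
```

In this reading, the whitening statistics (mu, W) are precomputed once from a corpus of CLIP embeddings and then reused, which is why the procedure is training-free and fast at inference time.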
Lay Summary: CLIP is a powerful AI model that connects images and text, but it doesn’t indicate how typical or unusual an image or caption is. In this work, we introduce a simple, training-free method that enables CLIP to estimate the likelihood of an image or caption based on its internal representation. This allows us to detect AI-generated or suspicious images more effectively, assess whether a caption is simple or complex, and identify image domain shifts. Our approach is fast, general, and requires no labeled data, making it valuable for tasks like fake image detection and enhancing AI safety.
Link To Code: https://github.com/rbetser/W_CLIP/tree/main
Primary Area: Deep Learning->Foundation Models
Keywords: CLIP, latent space, likelihood approximation, multi-modal representation
Submission Number: 2094