Keywords: representation, sparsity, unsupervised, semantic, denoising
TL;DR: The innermost channels of a convolutional U-Net, trained unsupervised for denoising, provide a sparse representation of semantically meaningful image patterns.
Abstract: Generative diffusion models learn probability densities over diverse image datasets by estimating the score with a neural network trained to remove noise. Despite their remarkable success in generating high-quality images, the internal mechanisms of the underlying score networks are not well understood. Here, we examine the image representation that arises from score estimation in a fully-convolutional unconditional U-Net. We show that the middle block of the U-Net decomposes individual images into sparse subsets of active channels, and that the vector of spatial averages of these channels can provide a nonlinear representation of the underlying clean images. Euclidean distances in this representation space are semantically meaningful, even though no conditioning information is provided during training. We develop a novel algorithm for stochastic reconstruction of images conditioned on this representation: The synthesis using the unconditional model is "self-guided" by the representation extracted from that very same model. For a given representation, the common patterns in the set of reconstructed samples reveal the features captured in the middle block of the U-Net. Together, these results show, for the first time, that a measure of semantic similarity emerges, *unsupervised*, solely from the denoising objective.
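The representation described in the abstract (the vector of spatial averages of the middle-block channels, compared by Euclidean distance) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the channel count, spatial size, and sparsity pattern below are hypothetical placeholders.

```python
import numpy as np

def middle_block_representation(features):
    """Spatially average each channel of a middle-block activation.

    `features` is assumed to have shape (channels, height, width);
    the result is a (channels,) vector, per the abstract's description.
    """
    return features.mean(axis=(1, 2))

def representation_distance(rep_a, rep_b):
    """Euclidean distance in the representation space."""
    return float(np.linalg.norm(rep_a - rep_b))

# Hypothetical activations for two images; most channels are zero,
# mimicking the sparse subsets of active channels described above.
rng = np.random.default_rng(0)
feats_a = np.zeros((512, 8, 8))
feats_a[:16] = rng.standard_normal((16, 8, 8))
feats_b = np.zeros((512, 8, 8))
feats_b[:16] = rng.standard_normal((16, 8, 8))

rep_a = middle_block_representation(feats_a)
rep_b = middle_block_representation(feats_b)
print(rep_a.shape, representation_distance(rep_a, rep_b))
```

In the paper's setting these activations would come from the innermost block of a denoising U-Net; here they are random stand-ins used only to show the shape of the computation.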
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 22654