Keywords: Representation learning, flow-matching
TL;DR: We propose an algorithm to train both an encoder that maps images into (structured) latents, and a flow-matching decoder that synthesizes images from such latents.
Abstract: We examine representation learning in the context of continuous-time generative models. When tasked with learning to sample from a distribution of images, flow-matching (and previously denoising diffusion) has been the standard approach, due to the simplicity and stability of training and the diverse, high-quality results. However, unlike earlier generative model families such as VAEs, vanilla flow-matching models do not learn reusable representations of the data, i.e., latents that can be easily manipulated, combined, or generally utilized in other tasks. In this work, we propose an algorithm to train both an encoder that embeds images into latents and a flow-matching ``decoder'' model that synthesizes images conditioned on these latents. We find that we can train the encoder with a reinforcement learning objective, utilizing the flow-matching regression loss as a stochastic reward. We modify the RL objective to condition the expected reward on the noise level, allowing the encoder to effectively learn from the intermediate signals obtained by comparing the flow-matching model outputs to a noisy target. Our approach enables unsupervised representation learning of unstructured and \textbf{structured latents}, while also retaining the unmatched sample quality of flow-matching models.
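The abstract's core idea — a stochastic encoder trained by REINFORCE, with the negative flow-matching regression loss as the reward and a baseline conditioned on the noise level — can be illustrated with a toy sketch. Everything below is an assumption for illustration: the linear encoder/decoder, the Gaussian latent policy, the per-noise-level running-mean baseline, and all dimensions are hypothetical stand-ins, not the paper's actual architecture or algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, for illustration only).
D, Z = 8, 4

# Hypothetical linear "encoder": outputs the mean of a Gaussian over latents.
W_enc = rng.normal(scale=0.1, size=(Z, D))
SIGMA = 0.1  # fixed policy std-dev (an assumption)

# Hypothetical linear "decoder": predicts the flow-matching velocity
# from (noisy sample, noise level, latent), concatenated into one input.
W_dec = rng.normal(scale=0.1, size=(D, D + 1 + Z))

def decoder_velocity(x_t, t, z):
    return W_dec @ np.concatenate([x_t, [t], z])

# Per-noise-level baselines: a running mean of the reward in each t-bin,
# so the REINFORCE advantage is conditioned on the noise level.
N_BINS = 10
baseline = np.zeros(N_BINS)
baseline_count = np.zeros(N_BINS)

def train_step(x1, lr=1e-2):
    """One illustrative encoder update on a single image x1."""
    global W_enc
    # Sample a flow-matching pair: x_t interpolates noise x0 and data x1.
    x0 = rng.normal(size=D)
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0  # conditional flow-matching velocity target

    # Stochastic encoder: z ~ N(mu, SIGMA^2 I).
    mu = W_enc @ x1
    z = mu + SIGMA * rng.normal(size=Z)

    # Negative regression loss serves as a stochastic reward.
    v = decoder_velocity(x_t, t, z)
    reward = -np.sum((v - target) ** 2)

    # Noise-level-conditioned baseline (running mean per t-bin).
    b = min(int(t * N_BINS), N_BINS - 1)
    baseline_count[b] += 1
    baseline[b] += (reward - baseline[b]) / baseline_count[b]
    advantage = reward - baseline[b]

    # REINFORCE gradient for a Gaussian policy:
    # grad_mu log N(z; mu, SIGMA^2) = (z - mu) / SIGMA^2.
    grad_mu = (z - mu) / SIGMA**2
    W_enc += lr * advantage * np.outer(grad_mu, x1)
    return reward
```

This is only a sketch of the stated training signal: the decoder here is frozen, whereas the paper trains both networks; the point is to show how a noisy regression loss can act as a reward whose baseline depends on the noise level t.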
Submission Number: 97