TL;DR: Improving Consistency Training with a learned data-noise coupling.
Abstract: Consistency Training (CT) has recently emerged as a strong alternative to diffusion models for image generation. However, non-distillation CT often suffers from high variance and instability, motivating ongoing research into its training dynamics. We propose Variational Consistency Training (VCT), a flexible and effective framework compatible with various forward kernels, including those in flow matching. Its key innovation is a learned noise-data coupling scheme inspired by Variational Autoencoders, where a data-dependent encoder models noise emission. This enables VCT to adaptively learn noise-to-data pairings, reducing training variance relative to the fixed, unsorted pairings in classical CT.
Experiments on multiple image datasets demonstrate significant improvements: our method surpasses baselines, achieves state-of-the-art FID among non-distillation CT approaches on CIFAR-10, and matches state-of-the-art performance on ImageNet 64×64 with only two sampling steps. Code is available at https://github.com/sony/vct.
Lay Summary: Consistency models learn to generate data in one or a few sampling steps, but training them from scratch can be both unstable and slow. In traditional approaches, each data point is paired with randomly sampled noise during training. This strategy, while principled, contributes to high training variance and can lead to suboptimal results.
To address this, we introduce Variational Consistency Training (VCT), which shares similarities with Variational Autoencoders. Instead of drawing noise from a fixed Gaussian distribution, VCT adds a small encoder that learns a data-dependent distribution over the noise. By letting the model itself learn which noise to inject, training becomes smoother and more stable, because the model receives a better learning signal.
In practice, VCT improved results over equivalent baselines with minimal extra cost, achieving state-of-the-art FID on 2-step CIFAR-10 and competitive performance on class-conditional ImageNet at 64×64. Crucially, the encoder adds only a small overhead to training time and leaves one-step sampling speed unchanged, making VCT a simple yet powerful upgrade for fast, high-quality generation.
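To make the learned data-noise coupling more concrete, below is a minimal sketch in PyTorch. Everything here is an illustrative assumption rather than the paper's actual implementation: the module and function names (NoiseEncoder, vct_step), the interpolation-style forward kernel, the choice of adjacent time levels, and the unweighted KL term are all placeholders; the real code is in the linked repository.

```python
# Hedged sketch of Variational Consistency Training: a small encoder predicts a
# data-dependent Gaussian over the noise, and training combines a consistency
# loss with a KL regularizer toward the standard Gaussian prior.
# All names and design details are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoiseEncoder(nn.Module):
    """Small encoder q(z | x) predicting a per-sample Gaussian over the noise."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        return mu, logvar


def vct_step(model, encoder, x, t, s):
    """One illustrative training step with a learned data-noise coupling.

    model:   consistency model f(x_t, t) mapping a noisy input back toward data (assumed)
    encoder: NoiseEncoder instance
    x:       clean data batch, flattened to shape (B, dim) for this toy example
    t, s:    adjacent scalar noise levels with 0 <= s < t <= 1
    """
    mu, logvar = encoder(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized noise

    # Flow-matching-style forward kernel: linear interpolation between data and noise.
    x_t = (1 - t) * x + t * z
    x_s = (1 - s) * x + s * z

    # Consistency loss: predictions at adjacent noise levels should agree
    # (stop-gradient on the lower-noise target, as in standard CT).
    pred_t = model(x_t, t)
    with torch.no_grad():
        target = model(x_s, s)
    consistency = F.mse_loss(pred_t, target)

    # KL(q(z|x) || N(0, I)) keeps the learned coupling close to the Gaussian prior,
    # so sampling still starts from standard Gaussian noise at inference time.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
    return consistency + kl
```

Because the encoder is only used to form training pairs, sampling is unchanged: inference still starts from standard Gaussian noise and calls the consistency model once (or twice), so the encoder adds no cost at generation time.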
Link To Code: https://github.com/sony/vct
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Consistency Models, Generative Models
Submission Number: 9882