CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow-Map Models

Published: 26 Jan 2026, Last Modified: 26 Feb 2026
Venue: ICLR 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Flow Map Models, Consistency Models, Mean Flow, Mid-Training, Diffusion Model, Generative Models
TL;DR: We introduce Consistency Mid-Training (CMT), a lightweight prior stage that stabilizes flow-map training, reduces training cost by up to 98%, and achieves state-of-the-art FIDs in few-step generation.
Abstract: Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning long jumps along the ODE solution trajectories of diffusion models, yet their training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but the model must still convert infinitesimal steps into a long-jump map, so the instability remains unresolved. We introduce *mid-training*, the first concept and practical method that inserts a lightweight intermediate stage between (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, *Consistency Mid-Training* (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory of a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. This yields a trajectory-consistent and stable initialization that outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state-of-the-art two-step FIDs of 1.97 (CIFAR-10), 1.32 (ImageNet $64\times64$), and 1.84 (ImageNet $512\times512$), using up to $98$\% less training data and GPU time than CMs. On ImageNet $256\times256$, it attains a 1-step FID of 3.34 with $\sim50$\% less training than MF trained from scratch (FID 3.43). On MSCOCO T2I, CMT reaches the best FID with $\sim47$\% less training. These results establish CMT as a principled, efficient, and general framework for training flow map models. Code and models are available at https://github.com/sony/cmt.
Primary Area: generative models
Submission Number: 8311
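
The mid-training objective described in the abstract (regress every point on a solver trajectory onto that trajectory's clean endpoint) can be illustrated with a small sketch. This is not the authors' implementation, which lives in the linked repository. It is a toy under stated assumptions: `toy_velocity` is a hypothetical stand-in for the pre-trained model's ODE drift, the solver is plain Euler on a uniform time grid from $t=1$ (prior) to $t=0$ (clean), and the loss is a simple squared error, which may differ from the metric used in the paper.

```python
import numpy as np

def toy_velocity(x, t):
    # Hypothetical stand-in for the drift of a pre-trained diffusion model's ODE.
    # With dx/dt = x, integrating from t=1 down to t=0 shrinks x.
    return x

def solver_trajectory(x_prior, n_steps=8):
    # Euler solver (an assumption) from t=1 (prior sample) to t=0 (clean sample).
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    xs = [x_prior]
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = xs[-1]
        xs.append(x + (t_next - t_cur) * toy_velocity(x, t_cur))
    return ts, np.stack(xs)

def cmt_pairs(x_prior, n_steps=8):
    # Pair every non-final trajectory point (x_t, t) with the
    # solver-generated clean sample at the end of the same trajectory.
    ts, xs = solver_trajectory(x_prior, n_steps)
    inputs, times = xs[:-1], ts[:-1]
    targets = np.broadcast_to(xs[-1], inputs.shape)
    return inputs, times, targets

def regression_loss(model, inputs, times, targets):
    # Squared-error regression onto the clean endpoint (metric is an assumption).
    preds = np.stack([model(x, t) for x, t in zip(inputs, times)])
    return float(np.mean((preds - targets) ** 2))

rng = np.random.default_rng(0)
x_prior = rng.standard_normal((4, 2))   # batch of 4 prior samples in 2-D
inputs, times, targets = cmt_pairs(x_prior)
identity_model = lambda x, t: x          # untrained placeholder "model"
loss = regression_loss(identity_model, inputs, times, targets)
```

In this sketch the target is fixed per trajectory, so the objective is an ordinary supervised regression; a network trained this way would then serve as the initializer for the subsequent flow map post-training stage that the abstract describes.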