On Fitting Flow Models with Large Sinkhorn Couplings

11 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 · CC BY 4.0
Keywords: flow matching
TL;DR: A new, effective way to train flow matching models.
Abstract: Flow models transform data gradually from one modality (e.g., noise) to another (e.g., images). Such models are parameterized by a time-dependent velocity field, trained to fit segments connecting pairs of source and target points. When a pairing between source and target points is known, training boils down to a supervised regression problem. When no such pairing exists, as is the case when generating data from noise, training flows is much harder. A popular approach in that case is to pick source and target points independently. This can, however, lead to velocity fields with high variance that are difficult to integrate. In theory, one would greatly benefit from training flow models by sampling pairs from an optimal transport (OT) measure coupling source and target, since this would lead to a highly efficient flow solving the Benamou-Brenier dynamical OT problem. In practice, recent works have proposed to sample mini-batches of $n$ source and $n$ target points and reorder them using an OT solver to form better pairs. These works have advocated using batches of size $n\approx 256$, and considered couplings that are either ``hard'' (permutations obtained with the Hungarian algorithm) or ``soft'' (computed with the Sinkhorn algorithm). We follow in the footsteps of these works by exploring the benefits of increasing this mini-batch size $n$ by several orders of magnitude, and look more carefully at the effect of the entropic regularization $\varepsilon$ used in Sinkhorn. Our analysis and computations are facilitated by new scale-invariant quantities used to present results, and by sharded computations parallelized over multiple GPU nodes. We uncover a markedly different regime in which flow matching does benefit from OT guidance, as long as it is properly scaled to large $n$ with suitable entropic regularization $\varepsilon$.
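To make the coupling step described above concrete, here is a minimal, self-contained Python sketch of mini-batch flow matching with a Sinkhorn coupling. It is not the paper's implementation: the function names (`sinkhorn_coupling`, `resample_pairs`, `flow_matching_loss`), the mean-cost rescaling of the cost matrix, and the fixed iteration count are all illustrative assumptions.

```python
import numpy as np

def sinkhorn_coupling(x, y, eps, n_iters=100):
    """Entropic OT coupling between two mini-batches of equal size n.

    x: (n, d) source points; y: (n, d) target points.
    Returns an (n, n) coupling P with (approximately) uniform marginals 1/n.
    Note: this plain-scaling form can underflow for very small eps; a
    log-domain implementation is the usual remedy (omitted for brevity).
    """
    n = x.shape[0]
    # Squared Euclidean cost, rescaled by its mean so that eps is
    # comparable across batches (an illustrative scale-invariant choice).
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    C = C / C.mean()
    K = np.exp(-C / eps)                  # Gibbs kernel
    a = b = np.full(n, 1.0 / n)           # uniform marginals
    u = np.ones(n)
    for _ in range(n_iters):              # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]    # P = diag(u) K diag(v)

def resample_pairs(rng, x, y, eps):
    """Draw (source, target) training pairs from the soft Sinkhorn coupling,
    instead of pairing the mini-batch points independently."""
    n = x.shape[0]
    P = sinkhorn_coupling(x, y, eps)
    flat = P.ravel() / P.sum()
    idx = rng.choice(n * n, size=n, p=flat)
    i, j = np.unravel_index(idx, (n, n))
    return x[i], y[j]

def flow_matching_loss(model, rng, x0, x1):
    """Regression on straight segments between paired points:
    x_t = (1 - t) x0 + t x1, with velocity label x1 - x0."""
    n = x0.shape[0]
    t = rng.uniform(size=(n, 1))
    xt = (1.0 - t) * x0 + t * x1
    v_pred = model(xt, t)                 # model: (x_t, t) -> velocity
    return np.mean((v_pred - (x1 - x0)) ** 2)
```

In this sketch, sampling indices from the soft coupling $P$ replaces the hard permutation an exact OT solver would return; as $\varepsilon \to 0$ the coupling concentrates on such a permutation, while larger $\varepsilon$ spreads mass and interpolates toward independent pairing.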
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 18452