Transition Matching: Scalable and Flexible Generative Modeling

Neta Shaul; Uriel Singer; Itai Gat; Yaron Lipman

Transition Matching: Scalable and Flexible Generative Modeling

Neta Shaul, Uriel Singer, Itai Gat, Yaron Lipman

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 posterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: generative models, flow matching, diffusion, large scale, multimodal

TL;DR: Transition matching (TM) is a discrete-time continuous-state generative modeling that advances both flow/diffusion and autoregressive models. TM variants achieve state-of-the-art text-to-image generation.

Abstract: Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation, showing improved performance at scale. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new flexible design avenues. We explore these choices through three TM variants: (i) Difference Transition Matching (DTM), which generalizes flow matching to discrete-time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.

Supplementary Material: zip

Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 12964

Loading