Keywords: Multimodality, Generation, Flow Matching, Diffusion, Classifier-Free Guidance, Guidance
TL;DR: This paper presents CrossFlows, a new paradigm for multimodal generation that learns a flow on a joint continuous and discrete space.
Abstract: Flow Matching and Diffusion models have achieved impressive feats as generative paradigms for continuous data, such as images and videos, and more recently for high-dimensional discrete data. Despite this, multimodal generation combining discrete and continuous modalities remains dominated by autoregressive models that operate on discrete tokenized inputs or continuous projected embeddings. In this work, we present CrossFlows, a new paradigm for multimodal generation that learns a flow on a joint discrete and continuous space. We show that CrossFlows are capable models for multimodal generation, as well as for text-to-image generation, image-to-text generation, and other single-modality and multimodal downstream tasks.
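To make the flow-matching setting concrete, the sketch below shows the standard training pair for the continuous side: a linear (rectified-flow) interpolation between noise and data, with the velocity regression target. This is a generic illustration of flow matching, assuming the common linear path; the function name and setup are hypothetical and not taken from the CrossFlows paper, which additionally handles the discrete modality.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    """Interpolate noise x0 toward data x1 at time t in [0, 1].

    Returns the interpolant x_t and the regression target for the
    velocity field, which for the linear path is simply x1 - x0.
    (Illustrative sketch; not the paper's actual formulation.)
    """
    # Reshape t so it broadcasts over the batch dimension of x0/x1.
    t = np.asarray(t).reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))   # noise samples
x1 = rng.standard_normal((4, 8))   # data samples
t = rng.uniform(size=4)            # per-example times in [0, 1]
x_t, v = fm_training_pair(x0, x1, t)
print(x_t.shape, v.shape)          # (4, 8) (4, 8)
```

In training, a network would regress v(x_t, t) onto `v_target`; sampling then integrates the learned velocity field from noise to data.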
Supplementary Material: zip
Primary Area: generative models
Submission Number: 17614