STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

Published: 30 May 2026, Last Modified: 01 Jun 2026, Venue: SPIGM @ ICML Poster, Readers: Everyone, License: CC BY 4.0
Keywords: Multimodal Generation, Normalizing Flows, Vision-Language Models
TL;DR: We present STARFlow2, a unified multimodal model that bridges language models and normalizing flows under the same causal Transformer mechanism.
Abstract: Unified multimodal models that understand, reason over, and generate interleaved text–image sequences remain structurally fragmented: existing approaches either sacrifice visual fidelity through discrete tokenization, impose structural asymmetry by combining causal text generation with iterative diffusion-based denoising, or degrade pretrained understanding when adapting vision-language models for generation. We observe that autoregressive normalizing flows are autoregressive Transformers—sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs—making them the most natural paradigm for truly unified multimodal generation that is continuous, single-pass, and purely causal. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a frozen pretrained VLM stream with a TARFlow stream via residual skip connections, both operating under the same causal mask. This design simultaneously preserves pretrained multimodal understanding, enables high-fidelity continuous image generation, and achieves structural unification under a single causal mechanism. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 supports cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
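The structural claim in the abstract, that an autoregressive normalizing flow shares the left-to-right causal structure of a language model, can be illustrated with a toy affine autoregressive flow. This is a minimal sketch and not the STARFlow2 or TARFlow implementation: the `conditioner` function is a hypothetical stand-in for a causal Transformer, the scalar weights `a` and `b` are arbitrary, and tokens are plain floats rather than image latents.

```python
import math

def conditioner(prefix, a=0.5, b=0.3):
    """Toy causal conditioner (hypothetical): sees only tokens before position t."""
    ctx = sum(prefix) / len(prefix) if prefix else 0.0
    mu = a * ctx
    log_sigma = math.tanh(b * ctx)  # bounded scale keeps the map well-conditioned
    return mu, log_sigma

def forward(x):
    """Data -> noise. Each z[t] depends on x[t] and x[:t] only."""
    z, log_det = [], 0.0
    for t in range(len(x)):
        mu, log_sigma = conditioner(x[:t])
        z.append((x[t] - mu) * math.exp(-log_sigma))
        log_det -= log_sigma  # log|dz_t/dx_t| = -log_sigma
    return z, log_det

def inverse(z):
    """Noise -> data, generated left to right like text decoding."""
    x = []
    for t in range(len(z)):
        mu, log_sigma = conditioner(x)  # x holds exactly the first t tokens
        x.append(z[t] * math.exp(log_sigma) + mu)
    return x

x = [0.2, -1.0, 0.7, 1.5]
z, log_det = forward(x)
x_rec = inverse(z)
```

Because every `z[t]` is computed from the prefix `x[:t]`, the forward pass respects a causal mask, and the inverse pass appends one token at a time, which is the access pattern that a KV-cache accelerates in a Transformer-based conditioner.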
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 201