STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
Keywords: Multimodal Generation, Normalizing Flows, Vision-Language Models
TL;DR: We present STARFlow2, a unified multimodal model that bridges language models and normalizing flows under the same causal Transformer mechanism.
Abstract: Unified multimodal models that understand, reason over, and generate interleaved text–image sequences remain structurally fragmented: existing approaches either sacrifice visual fidelity through discrete tokenization, impose structural asymmetry by combining causal text generation with iterative diffusion-based denoising, or degrade pretrained understanding when adapting vision-language models for generation. We observe that autoregressive normalizing flows are autoregressive Transformers, sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs, which makes them the most natural paradigm for truly unified multimodal generation that is continuous, single-pass, and purely causal. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a frozen pretrained VLM stream with a TARFlow stream via residual skip connections, both operating under the same causal mask. This design simultaneously preserves pretrained multimodal understanding, enables high-fidelity continuous image generation, and achieves structural unification under a single causal mechanism. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 supports cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
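
The claim that an autoregressive normalizing flow shares the left-to-right structure of an LLM can be illustrated with a minimal sketch of an affine autoregressive flow in plain Python. This is not the STARFlow2 or TARFlow implementation: the function names (`conditioner`, `forward`, `inverse`, `log_likelihood`) are hypothetical, and the toy `conditioner` is an assumed stand-in for a causal Transformer. It shows only the general technique: each position is transformed using parameters computed from earlier positions, so the Jacobian is triangular, the likelihood is exact, and sampling proceeds one position at a time like autoregressive decoding.

```python
import math


def conditioner(prefix):
    """Toy causal conditioner (hypothetical stand-in for a causal Transformer).

    Returns (shift, log_scale) computed only from the elements that come
    before the current position, which is what a causal mask enforces.
    """
    total = sum(prefix)
    mean = total / len(prefix) if prefix else 0.0
    shift = 0.5 * mean
    log_scale = math.tanh(0.1 * total)  # bounded, so the scale stays well-behaved
    return shift, log_scale


def forward(x):
    """Map data to latents. z[i] depends only on x[:i + 1].

    Also returns log|det J|; the Jacobian is lower triangular, so the
    log-determinant is the sum of the per-position log scales (negated).
    """
    z, log_det = [], 0.0
    for i, xi in enumerate(x):
        shift, log_scale = conditioner(x[:i])
        z.append((xi - shift) * math.exp(-log_scale))
        log_det -= log_scale
    return z, log_det


def inverse(z):
    """Map latents back to data one position at a time, left to right."""
    x = []
    for zi in z:
        shift, log_scale = conditioner(x)  # uses only already-generated values
        x.append(zi * math.exp(log_scale) + shift)
    return x


def log_likelihood(x):
    """Exact log-density under a standard normal prior on the latents."""
    z, log_det = forward(x)
    log_prior = sum(-0.5 * (zi * zi + math.log(2 * math.pi)) for zi in z)
    return log_prior + log_det


if __name__ == "__main__":
    x = [0.3, -1.2, 0.8, 2.0]
    z, _ = forward(x)
    print("latents:", z)
    print("reconstruction:", inverse(z))
    print("log-likelihood:", log_likelihood(x))
```

In this sketch the inverse pass reuses the values it has already produced, which is the role a KV-cache plays when the conditioner is a causal Transformer.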
Submission Number: 201