Keywords: normalizing flow, video generation, generative model, autoregressive model
TL;DR: We show for the first time that normalizing flows can be scaled to high-quality video synthesis.
Abstract: High-quality video generation at scale requires models that are strictly causal, robust over long horizons, and fast at inference. We present STARFlow-V, a flow-based autoregressive video generator that operates in compressed spatiotemporal latents and is trained end-to-end with exact likelihood. Two design choices ensure causality for autoregressive prediction while mitigating error propagation and enabling end-to-end training: (i) a Global–Local architecture, which constrains each token to depend only on the past along the time axis while preserving rich within-frame interactions; and (ii) noise-augmented training combined with \emph{flow-score matching}, which trains a lightweight causal denoiser to recover clean samples from noisy generations. To improve efficiency, STARFlow-V employs a video-aware fixed-point iteration scheme that reformulates the inner updates as parallelizable iterations without violating the causal structure, yielding substantially faster inference. A deep–shallow autoregressive-flow hierarchy further balances capacity and stability over long videos. The same model natively supports both text-to-video (T2V) and text-/image-to-video (TI2V) generation via unified conditioning, avoiding separate pipelines. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency at markedly lower sampling cost than diffusion-only or discrete autoregressive baselines. By marrying causality, likelihood, and efficiency in a single architecture, STARFlow-V helps pave the way toward a flow-based, scalable paradigm for world modeling.
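To make the Global–Local causality constraint concrete, below is a minimal sketch (our own illustration with assumed names and shapes, not the authors' code) of the block-causal attention mask it implies: tokens attend bidirectionally within their own frame but only causally across frames.

```python
# Block-causal mask for a Global-Local design: every latent token may attend
# to all tokens within its own frame, but only to past frames otherwise.
# (Illustrative sketch; shapes and naming are assumptions, not the paper's code.)
import numpy as np

def global_local_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean attention mask of shape (T*P, T*P); True = attention allowed."""
    n = num_frames * tokens_per_frame
    frame_idx = np.arange(n) // tokens_per_frame   # frame index of each token
    # Token i may attend to token j iff j's frame is not in i's future.
    return frame_idx[:, None] >= frame_idx[None, :]

if __name__ == "__main__":
    m = global_local_mask(num_frames=3, tokens_per_frame=2)
    print(m.astype(int))
    # Within-frame blocks are fully visible; future frames are masked out,
    # so generation stays strictly causal along the time axis.
```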
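The abstract does not spell out the flow-score matching objective; for orientation, a standard denoising score-matching loss (our assumption of the general form, not the paper's exact formulation) reads

$$\mathcal{L}_{\mathrm{DSM}}(\theta) \;=\; \mathbb{E}_{x \sim p_{\mathrm{data}},\ \epsilon \sim \mathcal{N}(0, I)}\Big[\big\lVert s_\theta(x + \sigma\epsilon) + \tfrac{\epsilon}{\sigma}\big\rVert_2^2\Big],$$

where $s_\theta$ estimates the score of the noise-perturbed distribution at noise level $\sigma$; a causal denoiser of the kind described would condition $s_\theta$ only on past frames, so that cleanup never peeks ahead in time.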
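The fixed-point inference scheme can be pictured as a Jacobi-style parallel decode of a causal map: instead of solving positions one by one, all positions are refreshed in parallel until the iterates stop changing, which preserves the causal dependency structure at the fixed point. The sketch below is a toy stand-in (the update `f` and all names are hypothetical, not the paper's video-aware variant).

```python
# Jacobi-style fixed-point sampling for an autoregressive (causal) map.
# Hypothetical illustration; `f` stands in for the flow's inner update.
import numpy as np

def fixed_point_sample(f, z, num_iters=20, tol=1e-6):
    """f(z, x) must compute each position's output from z and *past* x only."""
    x = np.zeros_like(z)                     # arbitrary initialization
    for _ in range(num_iters):
        x_new = f(z, x)                      # one parallel sweep over all positions
        if np.max(np.abs(x_new - x)) < tol:  # converged to the causal solution
            return x_new
        x = x_new
    return x

# Toy causal map: x[t] = z[t] + 0.5 * x[t-1]. A sequential decode needs T steps;
# the parallel iteration reaches the same fixed point in a handful of sweeps.
def f(z, x):
    prev = np.concatenate([[0.0], x[:-1]])
    return z + 0.5 * prev

print(fixed_point_sample(f, np.ones(8)))
```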
Primary Area: generative models
Submission Number: 15077