Keywords: normalizing flow, video generation, generative model, autoregressive model
TL;DR: We show for the first time that normalizing flows can be scaled to high-quality video synthesis.
Abstract: High-quality video generation at scale requires models that are strictly causal, robust over long horizons, and fast at inference. We present STARFlow-V, a flow-based autoregressive video generator that operates in compressed spatiotemporal latents and is trained end-to-end with exact likelihood. Two design choices ensure causality for autoregressive prediction while mitigating error propagation and enabling end-to-end training: (i) a Global–Local architecture, which constrains each token to depend only on the past along the time axis while preserving rich within-frame interactions; and (ii) noise-augmented training combined with \emph{flow-score matching}, which trains a lightweight causal denoiser to recover clean samples from noisy generations. To improve efficiency, STARFlow-V employs a video-aware fixed-point iteration scheme that reformulates the inner updates as parallelizable iterations without violating the causal structure, yielding substantially faster inference. A deep–shallow autoregressive-flow hierarchy further balances capacity and stability over long videos. The same model natively supports both text-to-video (T2V) and text-/image-to-video (TI2V) generation via unified conditioning, avoiding separate pipelines. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency at markedly lower sampling cost than diffusion-only or discrete autoregressive baselines. By marrying causality, likelihood, and efficiency in a single architecture, STARFlow-V helps pave the way toward a flow-based, scalable paradigm for world modeling.
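To make the Global–Local causality constraint concrete, below is a minimal sketch (our own illustration with assumed names and shapes, not the authors' code) of the block-causal attention mask it implies: tokens attend bidirectionally within their own frame but only causally across frames.

```python
# Block-causal mask for a Global-Local design: every latent token may attend
# to all tokens within its own frame, but only to past frames otherwise.
# (Illustrative sketch; shapes and naming are assumptions, not the paper's code.)
import numpy as np

def global_local_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean attention mask of shape (T*P, T*P); True = attention allowed."""
    n = num_frames * tokens_per_frame
    frame_idx = np.arange(n) // tokens_per_frame   # frame index of each token
    # Token i may attend to token j iff j's frame is not in i's future.
    return frame_idx[:, None] >= frame_idx[None, :]

if __name__ == "__main__":
    m = global_local_mask(num_frames=3, tokens_per_frame=2)
    print(m.astype(int))
    # Within-frame blocks are fully visible; future frames are masked out,
    # so generation stays strictly causal along the time axis.
```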
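The abstract does not spell out the flow-score matching objective; for orientation, a standard denoising score-matching loss (our assumption of the general form, not the paper's exact formulation) reads

$$\mathcal{L}_{\mathrm{DSM}}(\theta) \;=\; \mathbb{E}_{x \sim p_{\mathrm{data}},\ \epsilon \sim \mathcal{N}(0, I)}\Big[\big\lVert s_\theta(x + \sigma\epsilon) + \tfrac{\epsilon}{\sigma}\big\rVert_2^2\Big],$$

where $s_\theta$ estimates the score of the noise-perturbed distribution at noise level $\sigma$; a causal denoiser of the kind described would condition $s_\theta$ only on past frames, so that cleanup never peeks ahead in time.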
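The fixed-point inference scheme can be pictured as a Jacobi-style parallel decode of a causal map: instead of solving positions one by one, all positions are refreshed in parallel until the iterates stop changing, which preserves the causal dependency structure at the fixed point. The sketch below is a toy stand-in (the update `f` and all names are hypothetical, not the paper's video-aware variant).

```python
# Jacobi-style fixed-point sampling for an autoregressive (causal) map.
# Hypothetical illustration; `f` stands in for the flow's inner update.
import numpy as np

def fixed_point_sample(f, z, num_iters=20, tol=1e-6):
    """f(z, x) must compute each position's output from z and *past* x only."""
    x = np.zeros_like(z)                     # arbitrary initialization
    for _ in range(num_iters):
        x_new = f(z, x)                      # one parallel sweep over all positions
        if np.max(np.abs(x_new - x)) < tol:  # converged to the causal solution
            return x_new
        x = x_new
    return x

# Toy causal map: x[t] = z[t] + 0.5 * x[t-1]. A sequential decode needs T steps;
# the parallel iteration reaches the same fixed point in a handful of sweeps.
def f(z, x):
    prev = np.concatenate([[0.0], x[:-1]])
    return z + 0.5 * prev

print(fixed_point_sample(f, np.ones(8)))
```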
Primary Area: generative models
Submission Number: 15077