Keywords: Continuous Tokenization, Mixture of Gaussians
TL;DR: We introduce a method for continuous-token autoregressive generation using a mixture of Gaussians.
Abstract: Autoregressive sequence models have traditionally relied on discrete tokenizations to leverage cross-entropy training, but this discretization introduces information loss that can be especially costly in high-dimensional domains such as video. Carrying more information per token enables higher-quality generations and allows an image to be represented with fewer tokens, reducing both training and inference time. We propose a continuous-token autoregressive framework that parameterizes each step's output distribution as a mixture of Gaussians. A lightweight Mixture of Gaussians head predicts mixture weights, means, and full covariance factors, and is trained end-to-end by minimizing the Gaussian negative log-likelihood of continuous latent tokens. We demonstrate our approach on conditional video generation from a single image, comparing against a discrete-token baseline and a continuous "mu-only" baseline. Our model achieves the best Fréchet Video Distance (FVD) and generates frames with greater temporal diversity, as measured by SSIM components, at a modest cost in FID.
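The per-token training objective described in the abstract (negative log-likelihood under a mixture of Gaussians with full covariance factors) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the choice of Cholesky factors for the covariances, and all shapes are assumptions.

```python
import numpy as np

def mog_nll(x, log_pi, mu, L):
    """Illustrative sketch: NLL of one continuous latent token x under a
    K-component Gaussian mixture (not the paper's actual implementation).

    x:      (d,)       continuous latent token
    log_pi: (K,)       log mixture weights (assumed normalized)
    mu:     (K, d)     component means
    L:      (K, d, d)  lower-triangular Cholesky factors, Sigma_k = L_k L_k^T
    """
    K, d = mu.shape
    log_probs = np.empty(K)
    for k in range(K):
        # Solve L_k z = (x - mu_k) rather than inverting the covariance.
        z = np.linalg.solve(L[k], x - mu[k])
        half_log_det = np.sum(np.log(np.diag(L[k])))  # 0.5 * log|Sigma_k|
        log_probs[k] = (-0.5 * z @ z - half_log_det
                        - 0.5 * d * np.log(2 * np.pi))
    # Log-sum-exp over components for numerical stability.
    a = log_pi + log_probs
    m = np.max(a)
    return -(m + np.log(np.sum(np.exp(a - m))))
```

In training, the loss would be this quantity averaged over all tokens in a sequence, with `log_pi`, `mu`, and `L` produced by the mixture head at each autoregressive step.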
Submission Number: 158