Keywords: Continuous Tokenization, Mixture of Gaussians
TL;DR: We introduce a method for continuous-token autoregressive generation using a mixture of Gaussians.
Abstract: Autoregressive sequence models have traditionally relied on discrete tokenizations to leverage cross-entropy training, but this discretization introduces information loss that can be especially costly in high-dimensional domains such as video. Carrying more information per token enables higher-quality generations and allows an image to be represented with fewer tokens, reducing both training and inference time. We propose a continuous-token autoregressive framework that parameterizes each step's output distribution as a mixture of Gaussians. A lightweight Mixture of Gaussians head predicts mixture weights, means, and full covariance factors, and is trained end-to-end by minimizing the Gaussian negative log-likelihood of continuous latent tokens. We demonstrate our approach on conditional video generation from a single image, comparing against a discrete-token baseline and a continuous "mu-only" baseline. Our model achieves the best Fréchet Video Distance (FVD) and generates frames with greater temporal diversity, as measured by SSIM components, at a modest cost in FID.
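The per-token training objective described in the abstract (negative log-likelihood under a mixture of Gaussians with full covariance factors) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the choice of Cholesky factors for the covariances, and all shapes are assumptions.

```python
import numpy as np

def mog_nll(x, log_pi, mu, L):
    """Illustrative sketch: NLL of one continuous latent token x under a
    K-component Gaussian mixture (not the paper's actual implementation).

    x:      (d,)       continuous latent token
    log_pi: (K,)       log mixture weights (assumed normalized)
    mu:     (K, d)     component means
    L:      (K, d, d)  lower-triangular Cholesky factors, Sigma_k = L_k L_k^T
    """
    K, d = mu.shape
    log_probs = np.empty(K)
    for k in range(K):
        # Solve L_k z = (x - mu_k) rather than inverting the covariance.
        z = np.linalg.solve(L[k], x - mu[k])
        half_log_det = np.sum(np.log(np.diag(L[k])))  # 0.5 * log|Sigma_k|
        log_probs[k] = (-0.5 * z @ z - half_log_det
                        - 0.5 * d * np.log(2 * np.pi))
    # Log-sum-exp over components for numerical stability.
    a = log_pi + log_probs
    m = np.max(a)
    return -(m + np.log(np.sum(np.exp(a - m))))
```

In training, the loss would be this quantity averaged over all tokens in a sequence, with `log_pi`, `mu`, and `L` produced by the mixture head at each autoregressive step.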
Submission Number: 158