Archiving Submission: No (non-archival)
Keywords: Continuous Tokenization, Mixture of Gaussians
TL;DR: We introduce a method for continuous-token autoregressive generation using a mixture of Gaussians.
Abstract: Autoregressive sequence models have traditionally relied on discrete tokenizations to leverage cross-entropy training, but this discretization introduces information loss that is costly in high-dimensional domains such as video. Higher-capacity tokens enable higher-quality generations and allow fewer tokens to represent a single image, thus improving training and inference time. We propose a continuous-token autoregressive framework that parameterizes each step's output distribution as a mixture of Gaussians. A lightweight Mixture of Gaussians (MoG) head predicts mixture weights, means, and full covariance factors, and is trained end-to-end by minimizing the Gaussian negative log-likelihood of continuous latent tokens. We demonstrate our approach on conditional video generation from a single image, comparing against a discrete-token baseline and a continuous "mu-only" baseline. Our model achieves the best Fréchet Video Distance (FVD) and generates frames with greater temporal diversity, as measured by SSIM components, at a modest cost to FID.
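The abstract's MoG head (mixture weights, means, and full covariance factors, trained by Gaussian negative log-likelihood) can be sketched roughly as follows. This is a minimal illustrative implementation, not the authors' code: all layer names, dimensions, and the Cholesky parameterization of the full covariance are assumptions.

```python
import torch
import torch.nn as nn

class MoGHead(nn.Module):
    """Illustrative Mixture-of-Gaussians output head (hypothetical names/sizes).

    For each of K components it predicts a mixture logit, a mean vector,
    and a lower-triangular Cholesky factor of a full covariance matrix.
    """
    def __init__(self, hidden_dim: int, token_dim: int, num_components: int):
        super().__init__()
        self.d, self.k = token_dim, num_components
        self.logits = nn.Linear(hidden_dim, num_components)
        self.means = nn.Linear(hidden_dim, num_components * token_dim)
        # Unconstrained entries that are mapped to a valid Cholesky factor.
        self.scale = nn.Linear(hidden_dim, num_components * token_dim * token_dim)

    def forward(self, h: torch.Tensor) -> torch.distributions.Distribution:
        b = h.shape[0]
        logits = self.logits(h)
        means = self.means(h).view(b, self.k, self.d)
        raw = self.scale(h).view(b, self.k, self.d, self.d)
        # Keep the strict lower triangle; softplus the diagonal for positivity.
        tril = torch.tril(raw, diagonal=-1)
        diag = torch.nn.functional.softplus(
            torch.diagonal(raw, dim1=-2, dim2=-1)) + 1e-4
        scale_tril = tril + torch.diag_embed(diag)
        comp = torch.distributions.MultivariateNormal(means, scale_tril=scale_tril)
        mix = torch.distributions.Categorical(logits=logits)
        return torch.distributions.MixtureSameFamily(mix, comp)

def mog_nll(head: MoGHead, h: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of continuous latent tokens under the MoG."""
    return -head(h).log_prob(targets).mean()
```

At inference, one would sample the next continuous token with `head(h).sample()` and feed it back autoregressively; the "mu-only" baseline mentioned in the abstract would correspond to dropping the mixture and covariance terms and regressing a single mean.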
Submission Number: 49