Keywords: Continuous Autoregressive Language Models, Autoencoder, Next-Vector Prediction
TL;DR: We replace discrete next-token prediction with continuous next-vector prediction as a paradigm shift to accelerate the training and inference of LLMs.
Abstract: The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, thereby reducing the number of generative steps K-fold. This paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain without access to explicit probabilities. Experiments show that CALM significantly improves the performance-compute trade-off, matching the performance of strong discrete baselines at a substantially lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models.
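To make the K-fold step reduction concrete, here is a minimal toy sketch of the chunk-to-vector idea: K token ids are packed into one continuous vector and recovered losslessly, so an autoregressive model would emit T/K vectors instead of T tokens. The encoding scheme, names, and constants below are purely illustrative and are not the paper's autoencoder.

```python
# Toy illustration (not CALM's actual autoencoder): compress chunks of
# K tokens into one continuous vector, cutting generative steps K-fold.

VOCAB = 256  # hypothetical vocabulary size
K = 4        # chunk size: each generative step covers K tokens

def encode(chunk):
    """Map K token ids to one K-dim continuous vector in [0, 1)."""
    assert len(chunk) == K
    return [tok / VOCAB for tok in chunk]

def decode(vec):
    """Invert encode(): recover the K token ids from the vector."""
    return [round(x * VOCAB) for x in vec]

tokens = [7, 42, 199, 3, 88, 121, 5, 250]  # T = 8 tokens
vectors = [encode(tokens[i:i + K]) for i in range(0, len(tokens), K)]
assert len(vectors) == len(tokens) // K  # steps reduced K-fold: 8 -> 2
recovered = [t for v in vectors for t in decode(v)]
assert recovered == tokens               # lossless round trip
```

A real high-fidelity autoencoder would of course learn this mapping rather than hand-code it, but the compute saving has the same shape: the sequential loop runs over vectors, not tokens.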
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2351