STEP-VQ: Sequence-model Agnostic Frame-level Inference with VQ-VAE for Model-based Reinforcement Learning
Keywords: Model-based reinforcement learning, VQ-VAE, Representation learning, Sequence Modeling
Abstract: Model-based reinforcement learning (MBRL) from pixels often encodes frames into discrete latent variables that serve as tokens for sequence-model backbones to learn world-model dynamics. Previous work adopts two main approaches, each with distinct limitations. Categorical bottlenecks enable fast frame-level prediction by flattening spatial features into categorical distributions, but suffer explosive parameter growth with resolution and code dimensionality. Conversely, vector-quantised variational autoencoder (VQ-VAE) methods achieve parameter efficiency through codebook quantisation but require slow token-level autoregressive prediction within each frame, shifting computational cost to the dynamics model and producing longer sequences that slow training and inference.
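To make the parameter trade-off concrete, here is a back-of-the-envelope sketch; the layer shapes, codebook size, and latent grid are illustrative assumptions, not figures from the paper:

```python
# Illustrative parameter counts for the two bottleneck styles (assumed shapes).

# Categorical bottleneck: spatial features are flattened and a dense head
# predicts a categorical distribution per latent, so the head scales with
# (flattened feature size) * (num_latents * num_classes).
feat_h, feat_w, feat_c = 6, 6, 256    # encoder output grid (assumed)
num_latents, num_classes = 32, 32     # e.g. 32 categoricals, 32 classes each

categorical_head = (feat_h * feat_w * feat_c) * (num_latents * num_classes)

# VQ-VAE bottleneck: a shared codebook of K embeddings of dimension D,
# independent of the spatial grid, so parameters stay flat as resolution grows.
codebook_size, code_dim = 512, 64     # assumed codebook shape
vq_codebook = codebook_size * code_dim

print(f"categorical head params: {categorical_head:,}")  # ~9.4M
print(f"vq codebook params:      {vq_codebook:,}")       # ~33K
```

Under these assumed shapes the dense categorical head already dwarfs the codebook, and it grows with the encoder's spatial resolution, whereas the codebook does not.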
We propose STEP-VQ, a novel frame-level VQ-VAE-based world model that predicts entire frames in a single forward pass. STEP-VQ follows the latent-imagination paradigm with two components: a world model (VQ-VAE + sequence model) and a behaviour policy. The approach is sequence-model agnostic, working with both Mamba-2 and Transformer architectures without modification. Our key insight is that preserving fine-grained spatial structure may be unnecessary for effective behaviour learning in latent space, as temporal dynamics can implicitly capture spatial patterns through frame-level prediction. We provide a rigorous theoretical analysis grounded in variational inference, showing how our training objective emerges from evidence lower bound (ELBO) optimisation and why Kullback–Leibler (KL) divergence formulations enable superior performance through bidirectional optimisation.
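For reference, the standard sequential ELBO on which such latent world models are typically trained has the form below; this is a generic textbook decomposition, not the paper's exact objective, with \(x_t\) the frame, \(a_t\) the action, and \(z_t\) the discrete latent:

```latex
% Generic sequential ELBO for a latent-variable world model (illustrative,
% not the paper's exact objective): a reconstruction term plus a KL between
% the encoder posterior and the sequence model's frame-level prior.
\ln p(x_{1:T} \mid a_{1:T}) \;\ge\; \sum_{t=1}^{T}
  \mathbb{E}_{q(z_{1:t} \mid x_{1:t},\, a_{<t})}\Big[
    \ln p(x_t \mid z_t)
    \;-\; D_{\mathrm{KL}}\!\big( q(z_t \mid x_t) \,\big\|\, p(z_t \mid z_{<t}, a_{<t}) \big)
  \Big]
```

Optimising the KL term trains both distributions at once: the prior is pulled toward the posterior and the posterior toward the prior, which is the bidirectional effect a one-sided cross-entropy loss (gradients through only one side) does not provide.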
On Atari-100k, STEP-VQ achieves competitive performance whilst dramatically improving efficiency: 11.2× faster training than a strong VQ-VAE-based baseline, a 4× parameter reduction compared to categorical bottlenecks, and growing advantages at higher resolutions (+27.4% mean improvement at 96×96). STEP-VQ reaches superhuman performance on 9 games versus 8 for categorical methods, with KL divergence providing a 24.5% improvement over cross-entropy baselines. These results demonstrate that frame-level discrete quantisation offers a practical path to efficient, scalable MBRL with modern sequence architectures.
Primary Area: reinforcement learning
Submission Number: 19531