LaMo: A Latent Motion World Model for Long-Horizon Prediction
Keywords: World models, long-horizon video prediction, latent dynamics, motion tokens, continuous autoregressive modeling, conditional flow matching
Abstract: Visual world models learned from video are often trained to predict future observations from past frames, but pixel-space forecasting is expensive and forces the model to allocate capacity to high-frequency details that matter little for dynamics. Latent-space video models reduce this cost, yet they typically predict a dense latent state at every step, repeatedly carrying largely static information and making long-horizon rollouts harder to sustain. We propose a structured latent dynamics model that instead predicts motion tokens—continuous latent variables that encode first-order change in a learned representation, analogous to optical flow but in latent space. Given the current latent state and a short history of past motion tokens, our model autoregressively samples the next motion token and advances the latent state using a frozen transition model. We instantiate the same contextual Transformer backbone with two alternative probabilistic parameterizations for next-token prediction (a Gaussian mixture head and a conditional flow-matching head) and evaluate long-horizon rollouts on BDD100K, showing improved long-term dynamics modeling in complex driving scenes.
Submission Number: 115
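The rollout loop described in the abstract — sample a motion token from a probabilistic head conditioned on the current latent state and a short motion history, then advance the state with a frozen transition model — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the names (`gmm_head`, `transition`), the toy mixture parameterization, and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8   # latent-state / motion-token dimension (assumed)
K = 3   # number of Gaussian mixture components (assumed)
H = 4   # length of the motion-token history (assumed)

def gmm_head(state, history, rng):
    """Stand-in for the learned Gaussian mixture head: map (state, history)
    to mixture parameters and sample one motion token."""
    ctx = np.concatenate([state, history.ravel()])
    weights = np.ones(K) / K                   # toy uniform mixture weights
    # Toy component means derived from the context (a real head would be a network).
    means = np.tanh(ctx[:D])[None, :] * np.arange(1, K + 1)[:, None] * 0.1
    k = rng.choice(K, p=weights)               # pick a mixture component
    return means[k] + 0.01 * rng.standard_normal(D)

def transition(state, motion):
    """Frozen transition model: advance the latent state by the sampled
    motion token (a first-order update, analogous to latent-space flow)."""
    return state + motion

state = rng.standard_normal(D)
history = np.zeros((H, D))                     # buffer of past motion tokens

for t in range(16):                            # long-horizon autoregressive rollout
    motion = gmm_head(state, history, rng)     # sample the next motion token
    state = transition(state, motion)          # advance the latent state
    history = np.roll(history, -1, axis=0)     # slide the history window
    history[-1] = motion

print(state.shape)
```

The key property the sketch illustrates is that only the (low-dimensional) motion token is predicted at each step, while the latent state is carried forward deterministically by the frozen transition model; the paper's conditional flow-matching head would replace `gmm_head` with an ODE-based sampler over the same context.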