Abstract: World models enable agents to plan within imagined environments by predicting future states conditioned on past observations and actions. However, their ability to plan over long horizons is limited by the effective memory span of the backbone architecture. This limitation leads to perceptual drift in long rollouts, degrading the model's capacity to recall recently observed scenes. In this work, we investigate the effective memory span of transformer-based world models through an analysis of memory augmentation mechanisms. We introduce a taxonomy that distinguishes between memory \emph{encoding} and memory \emph{injection} mechanisms, motivating their roles in extending the world model's memory through the lens of residual stream dynamics. We evaluate twenty combinations of four encoding methods and five injection methods in the MemoryMaze environment. Using a state recall evaluation task across multiple imagination horizons, we measure the memory recall capacity of each mechanism and analyze their respective trade-offs in reconstruction quality, latent prediction error, and computational cost. We further ablate the effect of injection depth and compare the best memory-augmented vision transformer against a pure state-space model backbone. Our central finding is that the mLSTM memory encoder outperforms all alternatives in both reconstruction and latent fidelity metrics. Paired with additive injection, it exhibits the strongest recall capabilities at a moderate computational cost while matching or slightly exceeding a pure Mamba backbone.
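To make the taxonomy concrete, the following is a minimal sketch of the "additive injection" mechanism the abstract refers to: a memory readout vector is projected and added directly into the transformer's residual stream before the block's attention and MLP sublayers. All names here (`AdditiveInjectionBlock`, `mem_proj`, `memory_readout`) and the placement of the injection are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AdditiveInjectionBlock(nn.Module):
    """Transformer block with additive memory injection into the residual stream."""
    def __init__(self, d_model: int, n_heads: int, d_mem: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # Project the memory encoder's readout into the residual-stream dimension.
        self.mem_proj = nn.Linear(d_mem, d_model)

    def forward(self, x: torch.Tensor, memory_readout: torch.Tensor) -> torch.Tensor:
        # Additive injection: the memory signal enters the residual stream directly,
        # so downstream attention and MLP sublayers can read it like any token feature.
        x = x + self.mem_proj(memory_readout)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

# Usage: one memory readout token broadcast across the rollout sequence.
blk = AdditiveInjectionBlock(d_model=64, n_heads=4, d_mem=32)
x = torch.randn(2, 10, 64)    # (batch, sequence, d_model)
mem = torch.randn(2, 1, 32)   # (batch, 1, d_mem), broadcast over the sequence
out = blk(x, mem)
print(out.shape)  # torch.Size([2, 10, 64])
```

Under this framing, an "encoding" mechanism (e.g. the mLSTM encoder studied here) would produce `memory_readout` from past observations, while the injection mechanism decides how that readout enters the backbone.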
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Andrew_Kyle_Lampinen1
Submission Number: 8547