WorldPack: Dynamic Frame Compression for Long-context Video World Modeling

Published: 10 Jun 2026, Last Modified: 10 Jun 2026
Venue: CVPR 2026 Workshop VideoWorldModel Poster
License: CC BY 4.0
Keywords: diffusion models, world models, memory
Abstract: Video world models have attracted significant attention for their ability to generate high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally and spatially consistent long-term world modeling remains a long-standing problem, unresolved even by recent state-of-the-art models, because long-context inputs incur prohibitively high computational costs. In this paper, we propose WorldPack, a video world model with an efficient compressed memory that lets the model handle more frames without increasing the number of context tokens. The compressed memory consists of two key components: trajectory packing, which enables the model to handle a significantly larger number of frames while maintaining a constant token length, and dynamic compression, which adjusts compression rates based on camera poses to incorporate 3D spatial information into memory management. Together, these mechanisms keep rollouts consistent even in their later stages, where reliable spatial reasoning is crucial. We evaluate WorldPack on LoopNav, a Minecraft benchmark specialized for long-term consistency, and on RECON, a real-world navigation dataset, and find that it notably outperforms strong state-of-the-art models in both domains.
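Illustrative sketch (not from the paper): the abstract does not specify how compression rates are chosen, so the following Python snippet is only one possible reading of the idea. It assumes a hypothetical pose format `(x, y, yaw)`, a hypothetical relevance score (positional distance plus yaw gap), and a geometric token schedule in which the r-th most relevant past frame receives `tokens_per_frame / factor**r` tokens. Under that assumed schedule the total context stays bounded by a geometric series no matter how many past frames are stored.

```python
import math


def pose_distance(pose_a, pose_b, yaw_weight=1.0):
    """Hypothetical relevance measure: planar distance plus weighted yaw gap.

    Poses are (x, y, yaw_in_radians). This scoring rule is an assumption
    made for illustration, not the rule used by WorldPack.
    """
    (xa, ya, yaw_a), (xb, yb, yaw_b) = pose_a, pose_b
    d_pos = math.hypot(xa - xb, ya - yb)
    # Wrap the yaw difference into [-pi, pi) before taking its magnitude.
    d_yaw = abs((yaw_a - yaw_b + math.pi) % (2 * math.pi) - math.pi)
    return d_pos + yaw_weight * d_yaw


def assign_token_budget(past_poses, current_pose, tokens_per_frame=256, factor=4):
    """Give fewer tokens to past frames whose pose is farther from the current one.

    The frame ranked r-th by pose distance gets tokens_per_frame // factor**r
    tokens (an assumed geometric schedule). Frames whose budget reaches 0 are
    effectively dropped from the context.
    """
    order = sorted(
        range(len(past_poses)),
        key=lambda i: pose_distance(past_poses[i], current_pose),
    )
    return {idx: tokens_per_frame // factor ** rank for rank, idx in enumerate(order)}


if __name__ == "__main__":
    past = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    budget = assign_token_budget(past, current_pose=(1.2, 0.0, 0.0))
    print(budget)                 # tokens assigned to each past frame index
    print(sum(budget.values()))   # total context tokens used by the memory
```

With `factor=4` and `tokens_per_frame=256`, the total never exceeds 256 × 4/3 tokens, which is the sense in which such a schedule keeps the context length effectively constant as the trajectory grows.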
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 2