Keywords: World Model, Diffusion Model, Memory, Generative Models, Video Generation
Abstract: World models aim to predict plausible futures consistent with past observations, a
capability central to planning and decision-making in reinforcement learning. Yet,
existing architectures face a fundamental memory trade-off: transformers preserve
local detail but are bottlenecked by quadratic attention, while recurrent and state-
space models scale more efficiently but compress history at the cost of fidelity. To
overcome this trade-off, we propose decoupling future–past consistency from any
single architecture and instead leveraging a set of specialized memory experts. We introduce
a diffusion-based framework that integrates heterogeneous memory models through
a contrastive product-of-experts formulation. Our approach instantiates three
complementary roles: a short-term memory expert that captures fine local dynamics,
a long-term memory expert that stores episodic history in external diffusion weights
via lightweight test-time finetuning, and a spatial long-term memory expert that
enforces geometric and spatial coherence. This compositional design avoids mode
collapse and scales to long contexts without incurring a quadratic cost. Across
simulated and real-world benchmarks, our method improves temporal consistency,
recall of past observations, and navigation performance, establishing a novel
paradigm for memory-augmented diffusion world models.
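Illustrative sketch: the product-of-experts formulation described in the abstract can be pictured as combining the denoising predictions of several experts at each diffusion step (under Gaussian assumptions, a product of experts corresponds to a weighted sum of their scores, or equivalently their noise predictions). The Python sketch below is hypothetical and not the authors' implementation; all names, signatures, and weight values are illustrative assumptions, including the contrastive down-weighting of individual experts.

    import numpy as np

    # Hypothetical sketch of a product-of-experts denoising step that combines
    # three memory experts (short-term, episodic/long-term, spatial).
    # Each expert is assumed to expose a noise prediction eps_i(x_t, t, cond);
    # the combined prediction is a weighted sum of these predictions.
    def poe_noise_prediction(experts, weights, x_t, t, cond):
        """Weighted combination of per-expert noise predictions.

        experts : list of callables mapping (x_t, t, cond) -> array like x_t
        weights : per-expert guidance weights (e.g., contrastive weights that
                  emphasize or down-weight individual experts)
        """
        eps = np.zeros_like(x_t)
        for expert, w in zip(experts, weights):
            eps += w * expert(x_t, t, cond)
        return eps

    # Toy usage with stand-in experts that return noise-shaped arrays.
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x_t = rng.standard_normal((4, 3, 8, 8))      # noisy latent video frames
        dummy = lambda x, t, c: rng.standard_normal(x.shape)
        experts = [dummy, dummy, dummy]              # short-term, episodic, spatial
        weights = [1.0, 0.5, 0.5]                    # illustrative guidance weights
        eps_hat = poe_noise_prediction(experts, weights, x_t, t=10, cond=None)
        print(eps_hat.shape)

In an actual sampler, each expert would be a learned network (with the long-term expert's weights updated by lightweight test-time finetuning, as the abstract describes), and the combined prediction would feed the usual diffusion update at step t.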
Supplementary Material: zip
Primary Area: generative models
Submission Number: 18747