Mnemo: Policy Learning Accelerated by Experience

Published: 02 Mar 2026, Last Modified: 15 Apr 2026
Venue: ICLR 2026 Workshop World Models
License: CC BY 4.0
Keywords: Reinforcement Learning, Experience as Operator, Wasserstein Alignment
Abstract: This paper introduces Mnemo, an experience-augmented operator framework for deep reinforcement learning that treats the history of state-dependent policy updates as an evolution trace in policy space. Instead of viewing experience only as replayed transition tuples, Mnemo logs action-level policy update signals at visited states and compresses them into a compact, state-indexed trace that biases future updates. For deterministic policies, this trace is instantiated as a similarity-gated gradient field: historical action-gradients are retrieved and smoothed across neighbouring states to produce a state-conditioned update operator. For stochastic policies, past action distributions are aggregated via a Wasserstein barycenter with a KL-type proximal anchor, yielding a data-driven, region-like operator on policies. We analyse the resulting operators theoretically: in the deterministic case, we show that memory fusion can reduce gradient variance under controlled bias; in the stochastic case, we prove that the Wasserstein–KL alignment operator is a pseudo-contraction that preserves the optimal Bellman fixed point. Empirically, integrating Mnemo into DDPG, TD3, PPO and MAPPO on MuJoCo, Box2D, Atari and SC2 benchmarks improves sample efficiency or final return on several tasks and matches the baselines elsewhere, while adding no trainable parameters and only a bounded memory overhead comparable to that of standard replay buffers. These results support a learning design paradigm in which experience is operationalised as a state-indexed operator on policy updates, that is, an action-centric trace that persistently biases future optimisation, rather than being used only as replayed data for one-step gradient estimation.
Submission Number: 2
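
To make the deterministic instantiation described in the abstract more concrete, the following is a minimal sketch of a similarity-gated gradient trace. It is not the authors' implementation: the class name `GradientTrace`, the function `fused_gradient`, the Gaussian similarity kernel, the `bandwidth` and `mix` parameters, and the FIFO eviction rule are all assumptions made for illustration. The sketch only shows the general idea of logging action-gradients at visited states, retrieving them with similarity weights, and blending the smoothed result with a fresh gradient.

```python
from collections import deque

import numpy as np


class GradientTrace:
    """Hypothetical bounded, state-indexed memory of action-gradients."""

    def __init__(self, capacity, bandwidth=1.0):
        # Assumption: FIFO eviction keeps the memory overhead bounded.
        self.states = deque(maxlen=capacity)
        self.grads = deque(maxlen=capacity)
        self.bandwidth = bandwidth

    def log(self, state, action_grad):
        """Store the action-gradient observed at a visited state."""
        self.states.append(np.asarray(state, dtype=float))
        self.grads.append(np.asarray(action_grad, dtype=float))

    def retrieve(self, state):
        """Return a similarity-weighted average of stored gradients, or None."""
        if not self.states:
            return None
        stored_states = np.stack(list(self.states))
        stored_grads = np.stack(list(self.grads))
        query = np.asarray(state, dtype=float)
        # Assumption: a Gaussian kernel on squared Euclidean distance is the gate.
        sq_dist = np.sum((stored_states - query) ** 2, axis=1)
        weights = np.exp(-sq_dist / (2.0 * self.bandwidth ** 2))
        total = weights.sum()
        if total < 1e-12:
            return None  # no stored state is similar enough to contribute
        return (weights[:, None] * stored_grads).sum(axis=0) / total


def fused_gradient(trace, state, fresh_grad, mix=0.5):
    """Blend a fresh action-gradient with the memory-smoothed one.

    `mix` is a hypothetical fusion weight: 0 ignores memory, 1 uses memory only.
    """
    fresh_grad = np.asarray(fresh_grad, dtype=float)
    memory_grad = trace.retrieve(state)
    if memory_grad is None:
        return fresh_grad
    return (1.0 - mix) * fresh_grad + mix * memory_grad
```

In an actor-critic loop, `log` would be called with the critic's action-gradient at each visited state, and `fused_gradient` would replace the raw action-gradient in the actor update. Averaging over neighbouring states is what trades a small bias for lower variance; how the weights and the fusion coefficient are actually chosen in Mnemo is not specified in the abstract.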