General Value Discrepancies Mitigate Partial Observability in Reinforcement Learning

Published: 01 Jul 2025, Last Modified: 21 Jul 2025 · Finding the Frame (RLC 2025) · CC BY 4.0
Keywords: Partial Observability, Memory Learning, General Value Functions, Successor Features, Lambda Discrepancy
TL;DR: A discrepancy between two general value function estimates is a robust learning signal for mitigating partial observability, especially in sparse reward settings
Abstract: In most realistic sequential decision-making tasks, an agent only observes partial and noisy information about the state of its environment, and must learn to summarize its history for optimal decision-making. Past work has leveraged discrepancies between different TD($\lambda$) value function estimates to reveal and mitigate partial observability. While effective in many cases, the so-called $\lambda$-discrepancy crucially relies on the reward signal to gauge partial observability. We introduce the General Value Discrepancy (GVD), a principled extension of the $\lambda$-discrepancy that computes discrepancies over arbitrary observable features using the frameworks of general value functions and successor features. Our key theoretical contribution is a proof that---unlike the $\lambda$-discrepancy---GVD can always detect partial observability if it exists, irrespective of the environment's reward structure. By minimizing GVD as an auxiliary objective in deep reinforcement learning, we create a dense and robust learning signal that improves agent performance on a range of challenging partially observable benchmarks.
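As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below shows how a GVD-style auxiliary loss could be formed: two successor-feature heads estimate the same observable cumulants, each is trained toward its own TD($\lambda$) target, and the squared difference between the two estimates is added to the RL loss. All names (`lambda_targets`, `gvd_auxiliary_loss`, `psi_a`, `psi_b`), the choice of $\lambda$ values, and the use of `detach` on bootstrap targets are assumptions for illustration only.

```python
# Minimal sketch of a GVD-style auxiliary loss (assumed design, not from the paper).
import torch
import torch.nn.functional as F


def lambda_targets(cumulants, next_psi, gamma, lam):
    """TD(lambda) targets for vector-valued (feature) cumulants.

    cumulants: [T, d] observable features phi(o_t) used as cumulants.
    next_psi:  [T, d] bootstrapped successor-feature estimates for the next step.
    Returns:   [T, d] lambda-return targets, computed backward in time.
    """
    targets = torch.zeros_like(cumulants)
    running = next_psi[-1]  # bootstrap from the final estimate
    for t in reversed(range(cumulants.shape[0])):
        # Blend the one-step bootstrap with the longer lambda-return, as in TD(lambda).
        running = cumulants[t] + gamma * ((1 - lam) * next_psi[t] + lam * running)
        targets[t] = running
    return targets


def gvd_auxiliary_loss(psi_a, psi_b, cumulants, next_psi_a, next_psi_b,
                       gamma=0.99, lam_a=0.0, lam_b=0.95):
    """TD losses for the two GVF heads plus the discrepancy between them."""
    td_a = F.mse_loss(psi_a, lambda_targets(cumulants, next_psi_a.detach(), gamma, lam_a))
    td_b = F.mse_loss(psi_b, lambda_targets(cumulants, next_psi_b.detach(), gamma, lam_b))
    gvd = F.mse_loss(psi_a, psi_b)  # the general value discrepancy term
    return td_a + td_b, gvd
```

In such a setup the total objective might look like `rl_loss + td_loss + gvd_weight * gvd`, where `gvd_weight` is a hypothetical coefficient; backpropagating the discrepancy through the agent's recurrent memory encourages state summaries under which the two estimates agree.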
Submission Number: 33