Keywords: neural reinforcement learning, efficient coding, distributional reinforcement learning
Abstract: To behave flexibly in dynamic environments, agents must learn the temporal structure of causal events. Standard value-based approaches in reinforcement learning (RL) learn estimates of temporally discounted average future reward, leading to ambiguity about future reward timing and magnitude. Recently, midbrain dopamine neurons (DANs) have been shown to resolve this ambiguity by representing distributional predictive maps of future reward over both time and magnitude in the encoding of reward prediction errors. However, the computational function of such time-magnitude distributions (TMDs) in the brain is unknown. Here we present online learning rules for acquiring information-maximising multidimensional distributional estimates, extending classic work in distributional RL from 1D return distributions to efficient representations of distributions of arbitrary dimensionality. In previous distributional RL approaches, distributional information is used largely to improve representation learning. In our framework, TMDs are the direct substrates for simple policy decoders, enabling rapid risk-sensitive action selection in environments with rich probabilistic temporal reward structure, even under distributional shifts. Finally, we present cross-species neural and behavioral evidence from rodents and humans consistent with the implementation of this theory in biological circuits. Our results advance a principled computational link between distributional RL and neural coding theory, and establish a role for multidimensional distributional predictive maps in rapidly generating sophisticated risk-sensitive policies in environments with complex, multi-modal distributions of future reward.
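To make the idea of a time-magnitude distribution serving as the direct substrate for a risk-sensitive policy decoder concrete, here is a minimal sketch. It is not the learning rule or decoder from the paper: the class name `TimeMagnitudeDistribution`, the histogram parameterisation over discretised delay and magnitude bins, the exponential-moving-average update, and the CVaR-style readout are all illustrative assumptions.

```python
import numpy as np


class TimeMagnitudeDistribution:
    """Online estimate of a discretised joint distribution over reward
    delay (time bins) and reward magnitude (magnitude bins)."""

    def __init__(self, n_time_bins, n_mag_bins, mag_range, horizon, lr=0.05):
        self.time_edges = np.linspace(0.0, horizon, n_time_bins + 1)
        self.mag_edges = np.linspace(*mag_range, n_mag_bins + 1)
        # Start from a uniform joint histogram; rows index time, columns magnitude.
        self.p = np.full((n_time_bins, n_mag_bins), 1.0 / (n_time_bins * n_mag_bins))
        self.lr = lr

    def update(self, delay, magnitude):
        """Move probability mass toward the observed (delay, magnitude) outcome."""
        t = np.clip(np.digitize(delay, self.time_edges) - 1, 0, self.p.shape[0] - 1)
        m = np.clip(np.digitize(magnitude, self.mag_edges) - 1, 0, self.p.shape[1] - 1)
        target = np.zeros_like(self.p)
        target[t, m] = 1.0
        # Exponential moving average toward the one-hot target; total mass stays 1.
        self.p += self.lr * (target - self.p)

    def risk_sensitive_value(self, discount=0.95, alpha=0.25):
        """Discount each time bin, then score the map with a crude CVaR-like
        statistic: the mean of (roughly) the worst `alpha` fraction of outcomes."""
        time_centers = 0.5 * (self.time_edges[:-1] + self.time_edges[1:])
        mag_centers = 0.5 * (self.mag_edges[:-1] + self.mag_edges[1:])
        discounted = (discount ** time_centers)[:, None] * mag_centers[None, :]
        flat_p, flat_v = self.p.ravel(), discounted.ravel()
        order = np.argsort(flat_v)
        cum = np.cumsum(flat_p[order])
        tail = cum <= alpha  # ignores the partial bin crossing alpha (sketch only)
        tail_p, tail_v = flat_p[order][tail], flat_v[order][tail]
        if tail_p.sum() == 0:
            return float(flat_v[order][0])
        return float(np.dot(tail_p, tail_v) / tail_p.sum())


# Example: one TMD per action, updated from experienced (delay, reward) outcomes,
# with a risk-averse choice read out directly from the learned maps.
rng = np.random.default_rng(0)
actions = {a: TimeMagnitudeDistribution(n_time_bins=20, n_mag_bins=20,
                                        mag_range=(0.0, 10.0), horizon=10.0)
           for a in ("safe", "risky")}
for _ in range(2000):
    actions["safe"].update(delay=rng.uniform(1, 3), magnitude=4.0)
    actions["risky"].update(delay=rng.uniform(1, 3),
                            magnitude=rng.choice([0.0, 9.0]))
choice = max(actions, key=lambda a: actions[a].risk_sensitive_value(alpha=0.25))
print(choice)  # a lower-tail (CVaR-style) decoder typically prefers "safe" here
```

The point of the sketch is only that, once a joint distribution over reward time and magnitude is maintained online, different risk-sensitive statistics (here, a lower-tail mean of discounted outcomes) can be decoded from the same representation without relearning values.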
Supplementary Material: zip
Primary Area: applications to neuroscience & cognitive science
Submission Number: 18805