Bellman Unbiasedness: Toward Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation
Abstract: Distributional reinforcement learning improves performance by capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive.
In addition, the intractability arising from the infinite dimensionality of return distributions has been overlooked.
In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting.
We first introduce a key notion, Bellman unbiasedness, which is essential for exactly learnable and provably efficient distributional updates in an online manner.
Among all types of statistical functionals for representing infinite-dimensional return distributions, our theoretical results demonstrate that only moment functionals can exactly capture the statistical information.
Second, we propose a provably efficient algorithm, SF-LSVI, that achieves a tight regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$, where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of the function class.
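For context, the bound above refers to the standard notion of cumulative regret in an episodic MDP (this definition is assumed to match the paper's; $\pi_k$ denotes the policy executed in episode $k$ and $s_1^{k}$ its initial state):

$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big( V_1^{*}(s_1^{k}) - V_1^{\pi_k}(s_1^{k}) \Big),$$

so the result states that this quantity grows as $\tilde{O}(d_E H^{3/2}\sqrt{K})$.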
Lay Summary: Reinforcement learning (RL) helps computers learn to make decisions, like choosing the best move in a game or guiding robots through tasks. Traditional RL methods focus only on the average outcome of actions, which might not be enough for safe and reliable decisions in the real world. To solve this, researchers have developed distributional RL, a method that considers all possible outcomes and their probabilities, not just the average.
However, handling these full distributions is tricky because they contain infinite information. This paper introduces a new concept called Bellman unbiasedness, which makes it possible to estimate key information about these distributions through moments, such as the mean and variance, efficiently and without bias, even when working with just a few samples. The authors also propose a new algorithm, SF-LSVI, that learns decision-making strategies effectively and without bias, even when using general function approximation (such as neural networks).
This work could make RL more trustworthy and applicable to real-world problems, such as safer robotic control, smarter navigation systems, and better AI decision-making.
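To make the "unbiased from a few samples" point concrete, here is a minimal Monte Carlo sketch (not the paper's algorithm, SF-LSVI; the exponential return distribution, sample size, and trial count are illustrative assumptions): with i.i.d. samples, the sample mean, a moment functional, is an unbiased estimator, whereas a quantile-type functional such as the median is generally biased at small sample sizes.

    import numpy as np

    # Illustrative sketch only: compare the bias of a moment functional (the mean)
    # with a quantile-type functional (the median), each estimated from 5 i.i.d. samples.
    rng = np.random.default_rng(0)

    true_mean = 1.0          # mean of the assumed Exp(1) "return" distribution
    true_median = np.log(2)  # median of Exp(1)

    n_samples, n_trials = 5, 200_000
    draws = rng.exponential(scale=1.0, size=(n_trials, n_samples))

    avg_sample_mean = draws.mean(axis=1).mean()          # expectation of the 5-sample mean
    avg_sample_median = np.median(draws, axis=1).mean()  # expectation of the 5-sample median

    print(f"bias of sample mean:   {avg_sample_mean - true_mean:+.4f}")      # close to 0 (unbiased)
    print(f"bias of sample median: {avg_sample_median - true_median:+.4f}")  # about +0.09 (biased)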
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Distributional RL, Regret Minimization
Submission Number: 14259