Abstract: Mainstream research in theoretical RL is currently focused on designing online learning algorithms whose regret bounds match the corresponding regret lower bound up to multiplicative constants (and, sometimes, logarithmic terms). In this position paper, we constructively question this trend, arguing that algorithms should be designed to at least minimize the amount of unnecessary exploration, and we highlight the significant role that constants play in an algorithm's actual performance. We also argue that this trend exacerbates the misalignment between theoretical researchers and practitioners. As an emblematic example, we consider the case of regret minimization in finite-horizon tabular MDPs. Starting from the well-known UCBVI algorithm, we improve its bonus terms and the corresponding regret analysis, and we compare our version of UCBVI with both the original algorithm and the state-of-the-art MVP algorithm. Our empirical validation demonstrates that improving the multiplicative constants has a significant positive effect on the actual empirical performance of the algorithm under analysis. This raises the question of whether ignoring constants when assessing whether an algorithm matches the lower bound is the proper approach.
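To make the role of constants concrete, below is a minimal sketch of the kind of exploration bonus used by UCBVI-style algorithms; the constant $c$ and the logarithmic term $L$ are illustrative assumptions (different analyses instantiate them differently), but the leading multiplicative constant is precisely what a tighter analysis aims to shrink.

```latex
% Hoeffding-style exploration bonus used by UCBVI-like algorithms.
% The constant c and the log term L below are illustrative assumptions;
% different analyses instantiate them with different values.
\[
  b_k(s,a) \;=\; c \, H \sqrt{\frac{L}{\max\{1,\, N_k(s,a)\}}},
  \qquad L = \ln\!\left(\frac{S A H K}{\delta}\right),
\]
% N_k(s,a): visits to the state-action pair (s,a) before episode k;
% H: horizon, S: number of states, A: number of actions, K: number of
% episodes, delta: confidence level. Any constant c yields the same
% order-wise regret bound, but the amount of optimistic exploration
% (and hence the empirical regret) scales directly with c.
```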
Lay Summary: One goal of Reinforcement Learning (RL) research is to devise algorithms that learn how to make optimal decisions, using the concept of *regret* (i.e., how much is lost with respect to always making optimal decisions) as a performance metric.
From a theoretical perspective, the focus is on matching the order of the theoretical limit on the regret, known as the *lower bound*, up to constant (and sometimes logarithmic) multiplicative terms. This, however, can lead to ignoring the effect that such lower-order terms have on the practical performance of an algorithm.
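As a point of reference, a standard way to formalize regret in the finite-horizon episodic setting is sketched below; the notation ($V_1^*$, $\pi_k$, $s_1^k$, and the number of episodes $K$) is assumed here for illustration.

```latex
% Cumulative regret over K episodes of a finite-horizon MDP (illustrative notation):
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Bigl( V_1^{*}(s_1^k) - V_1^{\pi_k}(s_1^k) \Bigr),
\]
% where V_1^* is the optimal value function, pi_k is the policy played in
% episode k, and s_1^k is the initial state of episode k. Matching the lower
% bound "up to constants" means that Regret(K) stays within a constant (or
% logarithmic) multiplicative factor of the information-theoretic lower bound.
```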
In this paper, we constructively question this approach, taking the tabular RL setting as our main case study. We compare UCBVI, the first algorithm to match the lower bound, our version of it with improved lower-order terms, and the state-of-the-art algorithm MVP, showing via an empirical validation the significant impact that lower-order terms have on an algorithm's performance.
The goal of this position paper is to highlight the importance of considering lower-order terms when transitioning algorithms from theoretical frameworks to experimental settings, aiming to reduce the gap between theoretical guarantees and real-world performance, and leading to a more integrated view within the RL community.
Link To Code: https://github.com/marcomussi/position_constants
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: Regret Bounds, Reinforcement Learning, Evaluation
Submission Number: 15