Position: Constants are Critical in Regret Bounds for Reinforcement Learning

Published: 01 May 2025 · Last Modified: 23 Jul 2025 · ICML 2025 Position Paper Track (poster) · License: CC BY 4.0
Abstract: Mainstream research in theoretical RL is currently focused on designing online learning algorithms with regret bounds that match the corresponding regret lower bound up to multiplicative constants (and, sometimes, logarithmic terms). In this position paper, we constructively question this trend, arguing that algorithms should be designed to at least minimize the amount of unnecessary exploration, and we highlight the significant role that constants play in algorithms' actual performance. This trend also exacerbates the misalignment between theoretical researchers and practitioners. As an emblematic example, we consider the case of regret minimization in finite-horizon tabular MDPs. Starting from the well-known UCBVI algorithm, we improve the bonus terms and the corresponding regret analysis. We then compare our version of UCBVI with both its original version and the state-of-the-art MVP algorithm. Our empirical validation demonstrates how improving the multiplicative constants has significant positive effects on the actual empirical performance of the algorithm under analysis. This raises the question of whether ignoring constants when assessing whether algorithms match the lower bound is the proper approach.
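To give a concrete flavour of the "bonus terms" mentioned in the abstract, here is a minimal, hedged sketch of a Hoeffding-style exploration bonus of the kind UCBVI-type algorithms add to empirical Q-values during optimistic value iteration. The function name, the specific logarithmic term, and the constants below are illustrative assumptions for exposition only; they are not the improved bonus derived in the paper (see the linked code repository for the authors' implementation).

```python
import numpy as np


def hoeffding_bonus(n_visits: int, horizon: int, n_states: int,
                    n_actions: int, n_episodes: int,
                    delta: float = 0.05) -> float:
    """Illustrative optimism bonus b(s, a) of order H * sqrt(L / N(s, a)).

    n_visits: number of times the state-action pair (s, a) has been visited.
    The bonus is added to the empirical Q-value so that the resulting value
    estimates are optimistic with probability at least 1 - delta.
    """
    # Log term covering a union bound over states, actions, steps and episodes
    # (an assumed, simplified form; the exact term matters for the constants).
    log_term = np.log(max(n_states * n_actions * horizon * n_episodes / delta, 1.0))
    return horizon * np.sqrt(log_term / max(n_visits, 1))


# Example: the bonus shrinks as (s, a) is visited more often, so the amount
# of forced exploration decreases over time.
if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, hoeffding_bonus(n, horizon=10, n_states=20,
                                 n_actions=4, n_episodes=5000))
```

Tighter multiplicative constants in this bonus translate directly into less unnecessary exploration, which is the empirical effect the paper quantifies.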
Lay Summary: One goal of Reinforcement Learning (RL) research is to devise algorithms that learn how to make optimal decisions, using the concept of *regret* (i.e., how much is lost w.r.t. always making optimal decisions) as a performance metric. From a theoretical perspective, the focus is on matching the order of the theoretical limit of the regret, known as the *lower bound*, up to constant (and sometimes logarithmic) multiplicative terms. This, however, can sometimes lead to ignoring the effects that such lower-order terms have on the practical performance of an algorithm. In this paper, we constructively question this approach, considering the tabular RL setting as the main case study. We compare UCBVI, the first algorithm to match the lower bound; our variant of it with improved lower-order terms; and the state-of-the-art algorithm MVP, showing via an empirical validation the significant impact of lower-order terms on an algorithm's performance. The goal of this position paper is to highlight the importance of considering lower-order terms when transitioning algorithms from theoretical frameworks to experimental settings, aiming to reduce the gap between theoretical guarantees and real-world performance, and leading to a more integrated view within the RL community.
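For readers who prefer a formula, the regret informally described above is commonly written, for an episodic finite-horizon MDP run for K episodes, as follows (this is the standard textbook definition, not notation quoted from the paper):

```latex
R(K) \;=\; \sum_{k=1}^{K} \Big( V_1^{*}(s_1^{k}) \;-\; V_1^{\pi_k}(s_1^{k}) \Big),
```

where $V_1^{*}$ is the optimal value function, $\pi_k$ is the policy played in episode $k$, and $s_1^{k}$ is the initial state of that episode.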
Verify Author Names: My co-authors have confirmed that their names are spelled correctly both on OpenReview and in the camera-ready PDF. (If needed, please update ‘Preferred Name’ in OpenReview to match the PDF.)
No Additional Revisions: I understand that after the May 29 deadline, the camera-ready submission cannot be revised before the conference. I have verified with all authors that they approve of this version.
Pdf Appendices: My camera-ready PDF file contains both the main text (not exceeding the page limits) and all appendices that I wish to include. I understand that any other supplementary material (e.g., separate files previously uploaded to OpenReview) will not be visible in the PMLR proceedings.
Latest Style File: I have compiled the camera ready paper with the latest ICML2025 style files <https://media.icml.cc/Conferences/ICML2025/Styles/icml2025.zip> and the compiled PDF includes an unnumbered Impact Statement section.
Paper Verification Code: ZjcyO
Link To Code: https://github.com/marcomussi/position_constants
Permissions Form: pdf
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: Regret Bounds, Reinforcement Learning, Evaluation
Submission Number: 15