Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

Published: 09 May 2025, Last Modified: 28 May 2025, RLC 2025, CC BY 4.0
Keywords: Safe Reinforcement Learning, Constrained MDPs, Regret Analysis
Abstract: This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm built on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint violation in every episode, our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar C - \bar C_b)^{-1}H^{2.5} S\sqrt{AK})$, where $\bar C$ is the cost budget for an episode, $\bar C_b$ is the expected cost under a safe baseline policy over an episode, $H$ is the horizon, and $S$, $A$, and $K$ are the numbers of states, actions, and episodes, respectively. This improves upon the best-known regret upper bound, and when $\bar C - \bar C_b = \Omega(H)$, the gap from the regret lower bound of $\Omega(H^{1.5}\sqrt{SAK})$ is $\widetilde{\mathcal{O}}(\sqrt{S})$. The reduction in the regret upper bound is a consequence of our novel reward and cost function estimators. The key is to control the error of estimating the unknown transition kernel over each episode. In particular, we provide a tighter per-episode bound on this estimation error by using a Bellman-type law of total variance to analyze the expected sum of the variances of the value function estimates. The bound is given by a function of the estimated transition kernel, whose choice can be optimized by the algorithm. This leads to a tighter dependence on the horizon in the function estimators. We also present numerical results to demonstrate the computational effectiveness of our proposed framework.
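The abstract's central mechanism is optimism on rewards and pessimism on costs via confidence-based estimators. The sketch below is a generic, Hoeffding-style illustration of that mechanism in a toy tabular constrained MDP; the variable names, toy sizes, and bonus form are assumptions for illustration only and do not reproduce the paper's tighter estimators or its transition-kernel-dependent bound.

```python
import numpy as np

# Illustrative sketch (not the paper's estimators): generic reward optimism and
# cost pessimism via Hoeffding-style confidence bonuses in a toy tabular CMDP.

S, A, H = 5, 3, 10          # toy numbers of states, actions, horizon
delta = 0.01                # illustrative confidence level
rng = np.random.default_rng(0)

# Placeholder empirical statistics accumulated over past episodes.
visit_counts = rng.integers(1, 50, size=(S, A))      # N(s, a)
emp_reward   = rng.uniform(0.0, 1.0, size=(S, A))    # empirical mean reward
emp_cost     = rng.uniform(0.0, 1.0, size=(S, A))    # empirical mean cost

# Hoeffding-style confidence radius; the log term and constants are assumptions.
bonus = np.sqrt(np.log(2 * S * A * H / delta) / (2 * visit_counts))

# Reward optimism: inflate rewards where the estimate is uncertain,
# encouraging exploration of rarely visited (s, a) pairs.
optimistic_reward = np.clip(emp_reward + bonus, 0.0, 1.0)

# Cost pessimism: inflate costs where the estimate is uncertain, so a policy
# planned against these costs stays within the episodic budget with high
# probability (no constraint violation).
pessimistic_cost = np.clip(emp_cost + bonus, 0.0, 1.0)

# A planner (e.g., a linear program over occupancy measures) would then
# maximize value under optimistic_reward subject to the budget constraint
# evaluated with pessimistic_cost.
print(optimistic_reward.shape, pessimistic_cost.shape)
```

Tighter bonuses than this generic sketch, such as the variance-aware estimators the abstract describes, shrink the pessimism/optimism gap and are what drive the improved dependence on the horizon $H$.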
Submission Number: 41