Concave Utility Reinforcement Learning with Zero-Constraint Violations

Mridul Agarwal; Qinbo Bai; Vaneet Aggarwal

Published: 12 Dec 2022, Last Modified: 17 Sept 2024. Accepted by TMLR.

Abstract: We consider the problem of tabular infinite-horizon concave utility reinforcement learning (CURL) with convex constraints. For this problem, we propose a model-based learning algorithm that also achieves zero constraint violations. Assuming that the concave objective and the convex constraints have a solution interior to the set of feasible occupation measures, we solve a tighter optimization problem to ensure that the constraints are never violated despite imprecise model knowledge and model stochasticity. We use a Bellman error-based analysis for tabular infinite-horizon setups, which allows analyzing stochastic policies. Combining the Bellman error-based analysis and the tighter optimization problem, for $T$ interactions with the environment, we obtain a high-probability regret guarantee for the objective that grows as $\tilde{O}(1/\sqrt{T})$, excluding other factors. The proposed method can be applied to optimistic algorithms to obtain high-probability regret bounds, and can also be used for posterior sampling algorithms to obtain loose Bayesian regret bounds but with a significant improvement in computational complexity.

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission:
1. Corrected citation format by replacing cite with citep
2. Corrected Assumption 3.1 language
3. Removed page breaks from Appendix
4. Added references for undiscounted setups

Assigned Action Editor: ~Lihong_Li1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 436
