Optimizing Rewards while meeting $\omega$-regular Constraints

Published: 15 May 2024, Last Modified: 14 Nov 2024
Venue: RLC 2024
License: CC BY 4.0
Keywords: omega regular, reward shaping, reinforcement learning
Abstract: This paper addresses the problem of synthesizing policies for Markov Decision Processes (MDPs) with hard $\omega$-regular constraints, which include, and are strictly more general than, safety, reachability, liveness, and fairness. The objective is to derive a policy that not only ensures the MDP satisfies the given $\omega$-regular constraint $T$ with certainty but also maximizes the expected reward. We first show that optimal policies need not exist for the general constrained MDP (CMDP) problem with $\omega$-regular constraints, in contrast to the simpler problem of CMDPs with safety requirements. Next, we show that, despite this, the optimal value can be approximated to any desired level of accuracy in polynomial time. This approximation ensures both the fulfillment of the $\omega$-regular constraint with probability $1$ and the attainment of an $\epsilon$-optimal reward for any given $\epsilon>0$. The proof identifies specific classes of policies capable of achieving these objectives and may be of independent interest. Furthermore, we introduce an approach to tackling the CMDP problem by transforming it into a classical MDP reward optimization problem, thereby enabling the application of conventional reinforcement learning. We show that proximal policy optimization (PPO) is an effective approach to identifying near-optimal policies that satisfy $\omega$-regular constraints. This result is demonstrated across multiple environments and constraint types.
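The reduction described in the abstract, turning the constrained problem into an ordinary reward-optimization problem over a product with a constraint automaton, can be illustrated with a minimal sketch. The sketch below is not the paper's implementation: the toy MDP, the two-state reachability automaton (a genuinely $\omega$-regular constraint would typically use a limit-deterministic Büchi automaton), the weight `LAMBDA`, and the use of tabular Q-learning as a stand-in for PPO are all illustrative assumptions.

```python
# Minimal sketch (assumed, not from the paper) of the reduction idea:
# form the product of an MDP with a deterministic constraint automaton,
# shape rewards with a constraint bonus, and run a standard RL algorithm.
import random

MDP_STATES = [0, 1, 2, 3]
ACTIONS = ["a", "b"]

def mdp_step(s, act):
    """Illustrative transition and task-reward function of the underlying MDP."""
    if act == "a":
        s_next = (s + 1) % 4                 # deterministic forward cycle
    else:
        s_next = random.choice(MDP_STATES)   # uniform random jump
    task_reward = 1.0 if s_next == 3 else 0.0
    return s_next, task_reward

# Two-state automaton for a simple reachability-style constraint:
# q = 0 means "not yet satisfied", q = 1 is an accepting sink.
def automaton_step(q, s_next):
    return 1 if (q == 1 or s_next == 2) else 0

ACCEPTING = {1}
LAMBDA = 10.0   # assumed weight trading off constraint bonus vs. task reward

def product_step(state, act):
    """One step in the product of the MDP and the constraint automaton."""
    s, q = state
    s_next, task_reward = mdp_step(s, act)
    q_next = automaton_step(q, s_next)
    bonus = LAMBDA if (q == 0 and q_next in ACCEPTING) else 0.0
    return (s_next, q_next), task_reward + bonus

# Plain tabular Q-learning on the shaped product MDP (stand-in for PPO).
Q = {((s, q), a): 0.0 for s in MDP_STATES for q in (0, 1) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.95, 0.1

for episode in range(2000):
    state = (0, 0)
    for t in range(50):
        if random.random() < eps:
            act = random.choice(ACTIONS)
        else:
            act = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, r = product_step(state, act)
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, act)] += alpha * (r + gamma * best_next - Q[(state, act)])
        state = nxt

greedy = {s: max(ACTIONS, key=lambda a: Q[((s, 0), a)]) for s in MDP_STATES}
print("greedy actions before the constraint is met:", greedy)
```

In this toy setting the shaped policy is pulled toward satisfying the constraint (reaching MDP state 2) before reverting to pure reward maximization; the paper's contribution is to make such a reduction sound for general $\omega$-regular constraints with probability-1 satisfaction and $\epsilon$-optimal reward.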
Submission Number: 359