Transforming Policy via Reward Advancement

CDC 2019
Abstract: Many real-world human behaviors can be characterized as sequential decision-making processes, such as urban travelers' choices of transport modes and routes [1]. Unlike choices controlled by machines, which in general follow perfect rationality and adopt the policy with the highest reward, studies have revealed that human agents make sub-optimal decisions under bounded rationality [2]. Such behaviors can be modeled using the maximum causal entropy (MCE) principle [3]. In this paper, we define and investigate a novel reward transformation problem, which we call reward advancement: recovering the range of additional reward functions that transform the agent's policy from π_o to a predefined target policy π_t under the MCE principle. We show that, given an MDP and a target policy π_t, there are infinitely many additional reward functions that can achieve the desired policy transformation. Moreover, we propose an algorithm to extract the additional reward with minimum "cost" to implement the policy transformation. We demonstrate the correctness and accuracy of our reward advancement solution using both synthetic data and a large-scale (6-month) passenger-level public transit dataset from Shenzhen, China.
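The abstract only summarizes the construction, so as a rough illustration under simplifying assumptions (a one-step decision problem with a soft-max MCE policy, rather than the paper's full MDP formulation), the sketch below shows how an additional reward of the form log π_t − log π_o shifts the MCE policy onto a target, and why per-state offsets yield infinitely many equivalent solutions. The function names and NumPy setup are ours, not the authors'.

```python
import numpy as np

def mce_policy(reward, temperature=1.0):
    """Soft-max (MCE-style) policy for a one-step decision problem.

    reward: array of shape (n_states, n_actions).
    Returns pi[s, a] proportional to exp(reward[s, a] / temperature).
    """
    logits = reward / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def additional_reward(reward, target_policy, temperature=1.0):
    """One candidate additional reward that transforms the MCE policy
    induced by `reward` into `target_policy` (one-step case only).

    Adding any state-dependent constant c(s) to the result leaves the
    induced policy unchanged, which illustrates why infinitely many
    additional rewards achieve the same transformation.
    """
    pi_o = mce_policy(reward, temperature)
    return temperature * (np.log(target_policy) - np.log(pi_o))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r = rng.normal(size=(4, 3))                   # original reward: 4 states x 3 actions
    pi_t = rng.dirichlet(np.ones(3), size=4)      # arbitrary target policy

    dr = additional_reward(r, pi_t)
    print(np.allclose(mce_policy(r + dr), pi_t))  # True: policy transformed to pi_t

    # Infinitely many solutions: a per-state shift of dr changes nothing.
    dr_shifted = dr + rng.normal(size=(4, 1))
    print(np.allclose(mce_policy(r + dr_shifted), pi_t))  # still True
```

In this simplified setting, choosing among the equivalent shifted solutions by some notion of minimum "cost" mirrors the paper's stated goal of extracting the cheapest additional reward, though the paper's actual cost criterion and algorithm are not reproduced here.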