Non-maximizing Policies that Fulfill Multi-criterion Aspirations in Expectation

Simon Dima, Simon Fischer, Jobst Heitzig, Joss Oliver

Published: 2024, Last Modified: 17 Dec 2024. Venue: ADT 2024. License: CC BY-SA 4.0

Abstract: In dynamic programming and reinforcement learning, the policy for the sequential decision making of an agent in a stochastic environment is usually determined by expressing the goal as a scalar reward function and seeking a policy that maximizes the expected total reward. However, many goals that humans care about naturally concern multiple aspects of the world, and it may not be obvious how to condense those into a single reward function. Furthermore, maximization suffers from specification gaming, where the obtained policy achieves a high expected total reward in an unintended way, often taking extreme or nonsensical actions.