Abstract: Misspecifying the reward function of a reinforcement learning agent may cause catastrophic side effects.
In this work, we investigate \textit{distance-impact penalties}: a general-purpose auxiliary reward based on a state-distance measure that captures, and thus can be used to penalise, side effects. We prove that the size of the penalty depends only on an agent's final impact on the environment.
Distance-impact penalties are scalable, general, and immediately compatible with model-free algorithms.
We analyse the sensitivity of an agent's behaviour to the choice of penalty, extending results on reward shaping, proving necessary and sufficient conditions for policy optimality to be invariant to misspecification, and providing error bounds for optimal policies.
Finally, we empirically investigate distance-impact penalties in a range of grid-world environments, demonstrating their ability to prevent side effects whilst permitting task completion.
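To make the idea concrete, the following is a minimal sketch of how a distance-impact penalty could be combined with a task reward in a model-free setting. The Euclidean distance, the initial state as the baseline, and the scaling coefficient `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def distance_impact_penalty(state, baseline_state, beta=1.0):
    """Auxiliary penalty proportional to a state-distance measure.

    Assumption: L2 distance to a fixed baseline state (e.g. the initial
    state); the paper considers more general distance measures.
    """
    return beta * np.linalg.norm(np.asarray(state) - np.asarray(baseline_state))

def shaped_reward(task_reward, state, baseline_state, beta=1.0):
    """Task reward minus the distance-impact penalty.

    This shaping term can be applied to rewards observed by any
    model-free algorithm without modifying the learner itself.
    """
    return task_reward - distance_impact_penalty(state, baseline_state, beta)
```

In this sketch, side effects that move the environment far from the baseline state incur a larger penalty, while returning the environment towards the baseline reduces it.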