Success at any cost: value constrained model-free continuous controlDownload PDF

27 Sept 2018 (modified: 05 May 2023)ICLR 2019 Conference Blind SubmissionReaders: Everyone
Abstract: Naively applying Reinforcement Learning algorithms to continuous control problems -- such as locomotion and robot control -- to maximize task reward often results in policies which rely on high-amplitude, high-frequency control signals, known colloquially as bang-bang control. While such policies can implement the optimal solution, particularly in simulated systems, they are often not desirable for real world systems since bang-bang control can lead to increased wear and tear and energy consumption and tends to excite undesired second-order dynamics. To counteract this issue, multi-objective optimization can be used to simultaneously optimize both the reward and some auxiliary cost that discourages undesired (e.g. high-amplitude) control. In principle, such an approach can yield the sought after, smooth, control policies. It can, however, be hard to find the correct trade-off between cost and return that results in the desired behavior. In this paper we propose a new constraint-based approach which defines a lower bound on the return while minimizing one or more costs (such as control effort). We employ Lagrangian relaxation to learn both (a) the parameters of a control policy that satisfies the desired constraints and (b) the Lagrangian multipliers for the optimization. Moreover, we demonstrate policy optimization which satisfies constraints either in expectation or in a per-step fashion, and we learn a single conditional policy that is able to dynamically change the trade-off between return and cost. We demonstrate the efficiency of our approach using a number of continuous control benchmark tasks as well as a realistic, energy-optimized quadruped locomotion task.
Keywords: reinforcement learning, continuous control, robotics, constrained optimization, multi-objective optimization
TL;DR: We apply constrained optimization to continuous control tasks subject to a penalty to ensure a lower bound on the return, and learn the resulting conditional Lagrangian multipliers simultaneously with the policy.
15 Replies

Loading