Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO

Published: 28 Jan 2022, Last Modified: 26 May 2025 (ICLR 2022 Submission)
Keywords: Reinforcement learning, policy optimization, hinge loss, policy improvement, PPO-clip
Abstract: Policy optimization is a fundamental principle for designing reinforcement learning algorithms. One prominent example is proximal policy optimization with a clipped surrogate objective (PPO-clip), which has been widely used in deep reinforcement learning due to its simplicity and effectiveness. Despite its strong empirical performance, PPO-clip has not been justified by theoretical analysis to date. This paper rethinks policy optimization and reinterprets the theory of PPO-clip through hinge policy optimization (HPO), which we define as achieving policy improvement via a hinge loss. Specifically, we first identify sufficient conditions for state-wise policy improvement and then recast the policy update as a large-margin classification problem with hinge loss. By leveraging various types of classifiers, the proposed design opens up a whole new family of policy-based algorithms, including PPO-clip as a special case. Based on this construction, we prove that these algorithms asymptotically attain a globally optimal policy. To our knowledge, this is the first proof of global convergence to an optimal policy for a variant of PPO-clip. We corroborate the performance of a variety of HPO algorithms through experiments and an ablation study.
One-sentence Summary: This paper proposes to rethink policy optimization and reinterpret the theory of PPO-clip through the lens of Hinge Policy Optimization (HPO), which casts policy improvement as a large-margin binary classification problem.
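
To make the large-margin view concrete, below is a minimal PyTorch sketch contrasting the standard PPO-clip surrogate with a hinge-style surrogate that treats sign(A(s, a)) as a binary label and pushes the probability ratio past a margin ε, weighted by |A(s, a)|. The function names, the use of ε as the margin, and the exact advantage weighting are illustrative assumptions; the paper's HPO formulation may differ in its precise loss and choice of classifier.

```python
import torch

def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
    """Standard PPO-clip surrogate, negated so it can be minimized."""
    ratio = torch.exp(logp - logp_old)                  # rho = pi(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clip(rho, 1-eps, 1+eps)
    return -torch.mean(torch.min(ratio * adv, clipped * adv))

def hinge_policy_loss(logp, logp_old, adv, eps=0.2):
    """Hinge-style surrogate (illustrative): classify (s, a) by the sign of the
    advantage and penalize ratios that fail to clear a margin of eps in that
    direction, weighting each sample by |A(s, a)|."""
    ratio = torch.exp(logp - logp_old)
    # max(0, eps - sign(A) * (rho - 1)): zero once the ratio clears the margin
    margin_violation = torch.relu(eps - torch.sign(adv) * (ratio - 1.0))
    return torch.mean(torch.abs(adv) * margin_violation)
```

In this sketch, for A > 0 the hinge term vanishes once the ratio exceeds 1 + ε, and for A < 0 once it falls below 1 − ε, mirroring the region where the clipped PPO objective stops contributing a gradient.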
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/hinge-policy-optimization-rethinking-policy/code)