Imitating Past Successes can be Very Suboptimal

Benjamin Eysenbach; Soumith Udatha; Ruslan Salakhutdinov; Sergey Levine

Published: 31 Oct 2022, Last Modified: 06 Apr 2025, NeurIPS 2022 Accept, Readers: Everyone

TL;DR: RL methods that imitate successful trials can learn very suboptimal behavior.

Abstract: Prior work has proposed a simple strategy for reinforcement learning (RL): label experience with the outcomes achieved in that experience, and then imitate the relabeled experience. These outcome-conditioned imitation learning methods are appealing because of their simplicity, strong performance, and close ties with supervised learning. However, it remains unclear how these methods relate to the standard RL objective, reward maximization. In this paper, we prove that existing outcome-conditioned imitation learning methods do not necessarily improve the policy. However, we show that a simple modification results in a method that does guarantee policy improvement. Our aim is not to develop an entirely new method, but rather to explain how a variant of outcome-conditioned imitation learning can be used to maximize rewards.
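The relabel-then-imitate strategy described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation, and it does not include the modification the paper proposes. It assumes that the "outcome" of a trajectory is its final state and that imitation is plain maximum-likelihood action counting. The names `relabel` and `fit_conditional_policy`, and the toy data, are hypothetical.

```python
from collections import Counter, defaultdict


def relabel(trajectory):
    """Label each (state, action) pair with the outcome the trajectory reached.

    Assumption for this sketch: the outcome is the trajectory's final state.
    A trajectory is (states, actions) with len(states) == len(actions) + 1.
    """
    states, actions = trajectory
    outcome = states[-1]
    return [(s, outcome, a) for s, a in zip(states[:-1], actions)]


def fit_conditional_policy(trajectories):
    """Tabular imitation: estimate pi(a | s, outcome) by counting actions."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for s, g, a in relabel(traj):
            counts[(s, g)][a] += 1
    policy = {}
    for key, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[key] = {a: n / total for a, n in action_counts.items()}
    return policy


# Toy data (hypothetical): integer states, "left"/"right" actions.
trajectories = [
    ([0, 1, 2], ["right", "right"]),
    ([0, 1, 0], ["right", "left"]),
    ([0, 1, 2], ["right", "right"]),
]
policy = fit_conditional_policy(trajectories)
print(policy[(1, 2)])  # action distribution in state 1 when conditioned on outcome 2
```

The learned table is conditioned on whatever outcome the data happened to reach, which is exactly the kind of procedure whose relation to reward maximization the paper examines.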

Supplementary Material: pdf

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/imitating-past-successes-can-be-very/code)
