Abstract: We propose a simple but effective batch imitation learning method. Our algorithm works by solving a sequence of two supervised learning problems: first learning a reward function, and then using a batch reinforcement learning oracle to learn a policy. We develop a highly scalable implementation using the transformer architecture and upside-down reinforcement learning. We also analyze an idealized variant of the algorithm in the tabular case and provide a finite-data regret bound. Experiments on a suite of Atari games and MuJoCo continuous control tasks demonstrate good empirical performance.
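The two-stage recipe described in the abstract can be sketched in a toy tabular setting. Everything below is an illustrative assumption, not the paper's actual implementation: the chain MDP, the degenerate expert-membership "classifier" standing in for supervised reward learning, and fitted Q-iteration standing in for the batch RL oracle (the paper itself uses transformers and upside-down reinforcement learning).

```python
# Hypothetical sketch of the two-stage batch imitation learning recipe:
# (1) fit a reward model from expert data via supervised learning,
# (2) run a batch RL step on the offline dataset with the learned reward.
from collections import defaultdict

# Toy deterministic chain MDP: states 0..4, actions move right (+1) or left (-1).
STATES, ACTIONS = range(5), (+1, -1)

def step(s, a):
    """Deterministic transition, clipped to the chain's endpoints."""
    return min(max(s + a, 0), 4)

# Offline dataset of transitions (s, a, s') from some behavior policy.
dataset = [(s, a, step(s, a)) for s in STATES for a in ACTIONS]

# Expert demonstrations: the expert always moves right toward state 4.
expert_set = {(s, +1) for s in range(4)}

# Stage 1 (supervised reward learning): a degenerate "classifier" that
# scores a state-action pair by whether the expert chose it.
def reward_hat(s, a):
    return 1.0 if (s, a) in expert_set else 0.0

# Stage 2 (batch RL oracle): fitted Q-iteration over the offline dataset,
# substituting the learned reward for the unobserved true reward.
gamma, Q = 0.9, defaultdict(float)
for _ in range(50):
    new_Q = defaultdict(float)
    for s, a, s2 in dataset:
        new_Q[(s, a)] = reward_hat(s, a) + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q = new_Q

# The greedy policy w.r.t. the learned Q imitates the expert on covered states.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
```

In this sketch the greedy policy reproduces the expert's right-moving behavior on every state the expert visits, illustrating how reward learning plus a batch RL step can recover the demonstrated policy.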
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Steven_Stenberg_Hansen1
Submission Number: 1860