Keywords: Policy Optimization, Imitation Learning, Reinforcement Learning, Offline Learning
TL;DR: Strictly Batch Offline Imitation Learning
Abstract: Imitation Learning (IL) offers a compelling framework within the broader context of Reinforcement Learning (RL) by eliminating the need for explicit reward feedback, a common requirement in RL. In this work, we address IL based solely on observed behavior, without access to transition dynamics information, reward structure, or, most importantly, any additional interactions with the environment. Our approach leverages conditional kernel density estimation and performs policy optimization so that the Markov balance equation associated with the environment is satisfied. The method performs effectively in discrete- and continuous-state environments, providing a novel solution to IL problems under strictly offline optimization settings. We establish that our estimators satisfy basic asymptotic consistency requirements. Through a series of numerical experiments on continuous-state benchmark environments, we show consistently superior empirical performance over many state-of-the-art IL algorithms.
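The sketch below is an illustrative, hedged example (not the authors' implementation) of the two ingredients named in the abstract: a conditional kernel density estimate of the transition density p(s' | s, a) built from a batch of demonstrations, and a Markov-balance residual that a candidate policy should drive toward zero. All names (gaussian_kernel, transition_density, balance_residual), the bandwidth value, and the discrete-action balance form are assumptions made for exposition only.

```python
# Illustrative sketch only; names, bandwidths, and the discrete-action
# balance form are assumptions, not the paper's actual estimators.
import numpy as np

def gaussian_kernel(u, h):
    """Isotropic Gaussian kernel with bandwidth h; u has shape (..., d)."""
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum((u / h) ** 2, axis=-1)) / ((2 * np.pi) ** (d / 2) * h ** d)

def transition_density(s_next, s, a, batch, h=0.3):
    """Nadaraya-Watson style estimate of p(s' | s, a) from offline (s, a, s') tuples."""
    S, A, S_next = batch                      # shapes (N, ds), (N, da), (N, ds)
    sa = np.concatenate([s, a])
    SA = np.concatenate([S, A], axis=1)
    w = gaussian_kernel(SA - sa, h)           # kernel weights on the (s, a) pair
    num = np.sum(w * gaussian_kernel(S_next - s_next, h))
    return num / (np.sum(w) + 1e-12)

def state_density(s, states, h=0.3):
    """Unconditional kernel density estimate of a state marginal d(s)."""
    return np.mean(gaussian_kernel(states - s, h))

def balance_residual(policy, batch, eval_states, actions_grid, h=0.3):
    """
    Average squared violation of a discrete-action Markov balance relation,
        d(s') = E_{s ~ d}[ sum_a pi(a | s) p(s' | s, a) ],
    approximated by Monte Carlo over the batch states.
    `policy(a, s)` returns the probability of action a in state s (assumed interface).
    """
    S, A, S_next = batch
    n_sub = min(50, len(S))                   # subsample source states for speed
    total = 0.0
    for s_next in eval_states:
        lhs = state_density(s_next, S_next, h)
        rhs = 0.0
        for s in S[:n_sub]:
            for a in actions_grid:
                rhs += policy(a, s) * transition_density(s_next, s, np.atleast_1d(a), batch, h)
        rhs /= n_sub
        total += (lhs - rhs) ** 2
    return total / len(eval_states)
```

Under this reading, strictly offline policy optimization amounts to minimizing `balance_residual` over a parameterized policy class using only the fixed batch, with no further environment interaction.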
Submission Number: 84