Plan Your Target and Learn Your Skills: State-Only Imitation Learning via Decoupled Policy Optimization

21 May 2021 (modified: 05 May 2023) · NeurIPS 2021 Submitted · Readers: Everyone
Keywords: reinforcement learning, imitation learning
TL;DR: We propose decoupled policy optimization (DPO), which explicitly decouples the state-to-action mapping policy into a state transition predictor and an inverse dynamics model to resolve the ambiguity in state-only imitation learning.
Abstract: State-only imitation learning (SOIL) enables agents to learn from massive demonstrations without explicit action or reward information. However, because the executed actions are unobserved, this incomplete guidance makes matching the expert's state sequences with a state-to-action mapping ambiguous. In this paper, we overcome this issue by introducing the hyper-policy, the set of policies that share the same state transition distribution, to characterize optimality in SOIL. Accordingly, we propose decoupled policy optimization (DPO), which explicitly decouples the state-to-action mapping policy into a state transition predictor and an inverse dynamics model. Intuitively, we teach the agent to plan the target state to reach and then learn its own skills to get there. Beyond simple supervised learning objectives, we also analyze the compounding error introduced by both parts and employ effective solutions to reduce it. Experiments on challenging benchmarks and a real-world driving dataset demonstrate the effectiveness of DPO and its potential for bridging the gap between reality and reinforcement learning simulations.
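As an illustration of the decoupling described in the abstract, the minimal PyTorch sketch below composes a state transition predictor ("plan your target") with an inverse dynamics model ("learn your skills") to produce an action from a state. Module names, network sizes, and the deterministic outputs are illustrative assumptions, not the paper's actual architecture or training objective.

```python
import torch
import torch.nn as nn


class StatePredictor(nn.Module):
    """Predicts a target next state s' from the current state s (the 'plan your target' part)."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state):
        return self.net(state)


class InverseDynamics(nn.Module):
    """Infers the action that moves the agent from s to s' (the 'learn your skills' part)."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, next_state):
        return self.net(torch.cat([state, next_state], dim=-1))


class DecoupledPolicy(nn.Module):
    """Composes the two parts: first plan a target state, then infer the action to reach it."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.predictor = StatePredictor(state_dim)
        self.inverse_dynamics = InverseDynamics(state_dim, action_dim)

    def forward(self, state):
        target = self.predictor(state)                   # where to go next
        action = self.inverse_dynamics(state, target)    # how to get there
        return action


# Usage sketch: sample a random state and query the composed policy.
policy = DecoupledPolicy(state_dim=8, action_dim=2)
action = policy(torch.randn(1, 8))
```

In this sketch the predictor can, in principle, be supervised from state-only expert demonstrations, while the inverse dynamics model only needs the agent's own interaction data; how the two are actually trained and regularized (e.g., to control the compounding error mentioned in the abstract) is specified in the paper, not here.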
Supplementary Material: zip
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.