Keywords: Behavior Cloning, Learning from Demonstration
Abstract: How to make imitation learning generalize when demonstrations are relatively limited has been a persistent problem in reinforcement learning (RL). Poor demonstrations lead to a narrow and biased data distribution, non-Markovian human expert demonstrations make it difficult for the agent to learn, and over-reliance on sub-optimal trajectories can make it hard for the agent to improve its performance. To solve these problems we propose a new algorithm named TD3fG that smoothly transitions from learning from experts to learning from experience. Our algorithm achieves good performance in the MuJoCo environments with limited and sub-optimal demonstrations.
We use behavior cloning to train a network as a reference action generator and utilize it both in the loss function and as exploration noise. This helps the agent extract prior knowledge from the demonstrations while reducing the detrimental effects of their poor Markovian properties. Our method outperforms the BC + fine-tuning and DDPGfD approaches, especially when the demonstrations are relatively limited. We call our method TD3fG, meaning TD3 from a Generator. A minimal sketch of the two uses of the generator follows.
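The abstract does not specify the exact loss weights or transition schedule, so the PyTorch sketch below is only illustrative: the network sizes, the linear annealing in `bc_weight`, `decay_steps`, `sigma`, and the mixing rule in `explore` are all assumptions, not details from the paper. It shows the general idea of a BC-trained generator that (a) regularizes the actor loss and (b) biases exploration, with its influence annealed so training shifts from demonstrations to experience.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 17, 6  # illustrative MuJoCo-like dimensions

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, squash=False):
        super().__init__()
        layers = [nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim)]
        if squash:
            layers.append(nn.Tanh())  # squash actions into [-1, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

generator = MLP(OBS_DIM, ACT_DIM, squash=True)  # frozen after BC pre-training
actor = MLP(OBS_DIM, ACT_DIM, squash=True)
critic = MLP(OBS_DIM + ACT_DIM, 1)

def bc_weight(step, decay_steps=100_000):
    # Anneal the generator's influence to zero: "from experts to experience".
    # (Linear schedule is an assumption for this sketch.)
    return max(0.0, 1.0 - step / decay_steps)

def actor_loss(obs, step):
    a = actor(obs)
    # Standard TD3-style actor objective: maximize Q (critic held fixed
    # during this update in full TD3; omitted here for brevity).
    rl_loss = -critic(torch.cat([obs, a], dim=-1)).mean()
    with torch.no_grad():
        a_ref = generator(obs)  # reference action from the BC generator
    bc_loss = ((a - a_ref) ** 2).mean()  # pull toward demonstrations early on
    return rl_loss + bc_weight(step) * bc_loss

def explore(obs, step, sigma=0.1):
    # Exploration biased toward the generator's action early in training,
    # decaying to ordinary Gaussian noise around the actor's action.
    w = bc_weight(step)
    with torch.no_grad():
        a = (1.0 - w) * actor(obs) + w * generator(obs)
    return torch.clamp(a + sigma * torch.randn_like(a), -1.0, 1.0)

# Example: one actor update on a dummy batch.
obs = torch.randn(32, OBS_DIM)
loss = actor_loss(obs, step=5_000)
loss.backward()
```

The key design choice this sketch captures is that a single annealing weight governs both the imitation loss and the exploration bias, so the agent relies on the generator only while its own policy is still poor.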
One-sentence Summary: A smooth transition from learning from demonstrations to learning from experience significantly improves performance on several Gym MuJoCo tasks, especially when the demonstrations are relatively limited and sub-optimal.