Time Your Rewards: Learning Temporally Consistent Rewards from a Single Video Demonstration

Published: 31 Oct 2024, Last Modified: 08 Nov 2024 · CoRL 2024 Workshop WCBM · CC BY 4.0
Keywords: Learning from Videos, Inverse Reinforcement Learning, Reward Formulation
Abstract: Designing reward functions for tasks with high-dimensional motion sequences, such as controlling humanoid robots, is difficult. A more intuitive approach is to use video demonstrations to specify the desired behavior. Recently, optimal transport (OT) has become popular for learning rewards by aligning learner and demonstration trajectories. However, OT faces two key challenges. First, it lacks temporal constraints, which are crucial for tasks where subgoals must be completed in a specific order. Second, poorly designed reward functions can lead to local minima, allowing the agent to exploit undesired behaviors. Our key insight is to structure the reward function to enforce temporal consistency. We propose a novel class of reward functions SDTW+, which uses Soft Dynamic Time Warping (SDTW) to align trajectories in the correct order and adds a cumulative reward bonus to encourage continuous progress. In experiments, agents trained with SDTW+ achieve a $91.7\%$ success rate on six sequence-following tasks in the Mujoco Humanoid-v4 environment, significantly outperforming OT-based methods.
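To make the reward formulation described in the abstract concrete, the sketch below shows one way a soft-DTW-based reward with a progress bonus could be computed. This is not the authors' implementation: the soft-DTW recursion follows the standard soft-minimum dynamic-programming formulation, while the function names (`soft_dtw`, `sdtw_plus_reward`), the `gamma` and `bonus_scale` parameters, and the particular progress-bonus term are hypothetical assumptions about how the cumulative bonus might look.

```python
# Minimal sketch (not the paper's code) of an SDTW-style reward, assuming:
#  - trajectories are arrays of shape (T, d) holding state features,
#  - a squared-Euclidean frame cost,
#  - a hypothetical cumulative progress bonus (exact form not given in the abstract).
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    scaled = np.asarray(values) / -gamma
    m = scaled.max()
    return -gamma * (m + np.log(np.exp(scaled - m).sum()))


def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW alignment cost between trajectories x (T1, d) and y (T2, d)."""
    t1, t2 = len(x), len(y)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            frame_cost = np.sum((x[i - 1] - y[j - 1]) ** 2)
            cost[i, j] = frame_cost + soft_min(
                [cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]], gamma
            )
    return cost[t1, t2]


def sdtw_plus_reward(learner_traj, demo_traj, gamma=0.1, bonus_scale=1.0):
    """Hypothetical SDTW+-style reward: negative alignment cost plus a cumulative
    progress bonus that grows with how much of the demonstration the learner has
    matched so far (an assumption, not the paper's exact bonus)."""
    alignment_cost = soft_dtw(learner_traj, demo_traj, gamma=gamma)
    # Progress proxy: fraction of demo frames whose nearest learner frame is close.
    dists = np.linalg.norm(demo_traj[:, None, :] - learner_traj[None, :, :], axis=-1)
    progress = np.mean(dists.min(axis=1) < 0.5)
    return -alignment_cost + bonus_scale * progress


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = np.cumsum(rng.normal(size=(50, 4)), axis=0)     # stand-in demonstration
    rollout = demo[:40] + 0.05 * rng.normal(size=(40, 4))  # partial, noisy rollout
    print("SDTW+-style reward:", sdtw_plus_reward(rollout, demo))
```

The key contrast with an OT-based reward is that the dynamic-programming recursion only allows monotone alignments, so a rollout that completes subgoals out of order incurs a high alignment cost, while the progress term rewards covering more of the demonstration rather than merely matching its marginal state distribution.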
Submission Number: 32