Auxiliary Reward Generation With Transition Distance Representation Learning

Published: 01 Jan 2025 · Last Modified: 15 May 2025 · IEEE Trans. Autom. Sci. Eng. 2025 · CC BY-SA 4.0
Abstract: Reinforcement learning (RL) has shown strength in challenging sequential decision-making problems. The reward function in RL is crucial to learning performance, as it quantifies the degree of task completion. In real-world problems, rewards are predominantly human-designed, which requires laborious tuning and is susceptible to human cognitive biases. To achieve automatic auxiliary reward generation, we propose a novel representation learning approach that measures the “transition distance” between states. Building upon these representations, we introduce an auxiliary reward generation technique for both single-task and skill-chaining scenarios without the need for human knowledge. Furthermore, we theoretically show that the proposed auxiliary rewards maintain the policy invariance property, i.e., the generated rewards do not hurt policy optimality under the original rewards. In the experiments, we evaluate the proposed approach in both online and offline learning settings on a wide range of tasks, including robot manipulation and locomotion. The results demonstrate the effectiveness of measuring the transition distance and the improvement induced by the auxiliary rewards, which promote better learning efficiency and greater convergence stability. Beyond that, we demonstrate that the manipulation policy learned with the auxiliary rewards in a simulator can be transferred to a real robot, as shown at https://sites.google.com/view/transition-distance-rp/tdrp.

Note to Practitioners: The motivation for this paper arises from the need for a technique that enhances robot skill-learning efficiency and performance in both single-task and skill-chaining scenarios. Our research primarily focuses on robot arm manipulation tasks. To accelerate policy learning and improve policy performance on these tasks, we introduce an auxiliary reward generation technique for both single-task and skill-chaining scenarios that requires no human expertise. This technique leverages the proposed representation learning approach, which measures the “transition distance” between states. During each policy training round, the robot receives a dense reshaped reward produced by our approach. Using the policy trained with our method, we successfully control a real Franka Panda robot arm to complete various manipulation tasks.
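To make the idea concrete, the sketch below shows one way auxiliary rewards could be generated from a learned transition-distance representation, using potential-based reward shaping (Ng et al., 1999), the standard construction that preserves policy optimality and thus matches the policy-invariance property claimed above. The encoder architecture, the Euclidean distance, and all names (`TransitionDistanceEncoder`, `auxiliary_reward`) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class TransitionDistanceEncoder(nn.Module):
    """Hypothetical state encoder phi(s); assumed to be trained so that
    distances in embedding space approximate the number of environment
    transitions needed to move between states."""

    def __init__(self, state_dim: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def transition_distance(encoder: TransitionDistanceEncoder,
                        s_a: torch.Tensor,
                        s_b: torch.Tensor) -> torch.Tensor:
    # Assumed distance metric: Euclidean norm between embeddings.
    return torch.norm(encoder(s_a) - encoder(s_b), dim=-1)


def auxiliary_reward(encoder: TransitionDistanceEncoder,
                     s: torch.Tensor,
                     s_next: torch.Tensor,
                     goal: torch.Tensor,
                     gamma: float = 0.99) -> torch.Tensor:
    """Potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s)
    with potential Phi(x) = -distance(x, goal); adding F to the original
    reward leaves the optimal policy unchanged."""
    potential_s = -transition_distance(encoder, s, goal)
    potential_s_next = -transition_distance(encoder, s_next, goal)
    return gamma * potential_s_next - potential_s
```

In a skill-chaining setting, the same construction could be reused per skill by setting `goal` to the initiation state of the next skill in the chain, so that each segment receives a dense signal for reaching the hand-off state; again, this is a plausible reading of the approach rather than its definitive implementation.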