Abstract: In this paper, we propose a novel method for learning reward functions directly from offline demonstrations.
Unlike traditional inverse reinforcement learning (IRL), our approach decouples the reward function from the learner's policy, eliminating the adversarial interaction typically required between the two.
This results in a more stable and efficient training process.
Our reward module, \textit{SR-Reward}, leverages the successor representation (SR) to encode a state based on the expected visitation of future states under the demonstration policy and transition dynamics.
By utilizing the Bellman equation, SR-Reward can be learned concurrently with most reinforcement learning (RL) algorithms without altering the existing training pipeline.
We also introduce a negative sampling strategy to mitigate overestimation errors by reducing rewards for out-of-distribution data, thereby enhancing robustness.
This strategy introduces an inherent conservative bias into RL algorithms that employ the learned reward, encouraging them to stay close to the demonstrations where the consequences of the actions are better understood.
We evaluate our method on D4RL benchmarks as well as Maniskill robot-manipulation environments, achieving results competitive with offline RL algorithms that have access to the true reward, as well as imitation learning (IL) techniques such as behavioral cloning.
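For reference, the successor representation the abstract refers to satisfies the standard Bellman recursion below; this is a generic sketch of the quantity SR-Reward builds on (written here with illustrative features $\phi$ and discount $\gamma$), not the paper's exact formulation:

$$\psi^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\,\phi(s_t,a_t)\,\Big|\,s_0=s,\,a_0=a\Big] \;=\; \phi(s,a) \;+\; \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a),\,a'\sim\pi(\cdot\mid s')}\big[\psi^{\pi}(s',a')\big],$$

which is why it can be learned with the same temporal-difference machinery used for standard RL value functions.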
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=LxXOqhPYEw&
Changes Since Last Submission: - The primary revision in this resubmission fixes a bug in our evaluation and plotting code that previously produced artificially inflated standard deviations and called the true performance of our method into question; with this issue resolved, the results now allow a more accurate assessment of its effectiveness.
- An overview plot has been added to clearly illustrate the training pipeline using SR-Reward (our proposed reward module) compared to the standard RL pipeline; this should clarify the role of SR-Reward in training offline RL agents.
- We have further revised the writing of the paper and incorporated the suggestions from the previous submission.
- Appendix F: Shows that policies trained with SR-Reward and with the true reward behave similarly (similar action trajectories).
- Appendix G: Discusses the sensitivity of the negative-sampling hyperparameters and how to choose them.
- Appendix H: Studies the performance of SR-Reward in online RL settings (TD3) compared to the true reward and discusses its limitations.
Assigned Action Editor: ~Nino_Vieillard1
Submission Number: 4139