Abstract: In this paper, we introduce a method for learning a reward function solely from offline demonstrations. Unlike inverse reinforcement learning (IRL), our reward function is learned independently of the policy. This removes the need for an adversarial relationship between the two and provides a more stable training process.
Our reward function, SR-Reward, is based on the successor representation (SR). Leveraging the structure of the SR, SR-Reward is learned via the Bellman equation and can be trained alongside most reinforcement learning (RL) methods without modifying the training pipeline. We describe our design decisions and training procedure and show how such a reward function can be trained in combination with off-the-shelf offline RL algorithms.
Additionally, we introduce a negative sampling strategy that lowers the reward for out-of-distribution data, mitigating overestimation errors and making the reward function more robust to such data. Applied to the learned reward, this strategy introduces an inherent conservatism into the RL algorithms that use it.
We evaluate our algorithm on the D4RL benchmark and find its performance competitive with offline RL algorithms that have access to the true reward, as well as with imitation learning (IL) algorithms such as behavioral cloning. Furthermore, our ablation studies over data size and data quality provide insights into the strengths and limitations of SR-Reward as a proxy for the true reward.
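To make the abstract's description concrete, below is a minimal, hypothetical sketch (not the paper's exact formulation): a tabular successor representation is fit to offline demonstrations with its Bellman equation, its summed row is reused as a reward proxy, and a simple negative-sampling step shrinks estimates for actions absent from the data. All names, shapes, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

# Hedged sketch, not the authors' exact algorithm. Assumptions: a small
# tabular MDP, demonstrations given as (s, a, s', a') transitions, and a
# reward proxy defined as the summed SR row for (s, a).
N_STATES, N_ACTIONS, GAMMA, LR = 6, 3, 0.99, 0.1
sr = np.zeros((N_STATES, N_ACTIONS, N_STATES))  # psi(s, a) over successor states


def sr_bellman_update(s, a, s_next, a_next):
    """TD update from the SR Bellman equation: psi(s,a) <- phi(s) + gamma * psi(s',a')."""
    phi = np.eye(N_STATES)[s]                     # one-hot state feature
    target = phi + GAMMA * sr[s_next, a_next]
    sr[s, a] += LR * (target - sr[s, a])


def sr_reward(s, a):
    """Reward proxy: discounted visitation mass predicted from (s, a)."""
    return float(sr[s, a].sum())


def negative_sampling(s, a_demo):
    """Shrink SR-derived rewards for actions the demonstrator did not take,
    a stand-in for discouraging overestimation on out-of-distribution actions."""
    for a in range(N_ACTIONS):
        if a != a_demo:
            sr[s, a] *= (1.0 - LR)


# Fit on a toy demonstration chain 0 -> 1 -> ... -> 5, always taking action 0.
demo = [(s, 0, s + 1, 0) for s in range(N_STATES - 1)]
for _ in range(200):
    for s, a, s_next, a_next in demo:
        sr_bellman_update(s, a, s_next, a_next)
        negative_sampling(s, a)

print(sr_reward(0, 0), sr_reward(0, 1))  # demonstrated action scores higher
```

Because only the reward table is touched, a sketch like this could in principle be trained alongside any offline RL learner that queries `sr_reward(s, a)` in place of the true reward.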
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: - Improved writing
- Expanded the Related Work section
- ManiSkill experiments: SR-Reward vs. BC
- ManiSkill experiments: negative sampling ablation
- ManiSkill experiments: comparison with Luo et al. (negative sampling)
- Clarified metrics in the Appendix
- Added the return distribution to the Appendix
- Discussed the scope of negative sampling and the rationale for focusing on the offline setting
- Realigned the claims to reflect the variance in the results
- Added Welch's t-test to the table of results
- Added a citation to support the observation that a broad return distribution in the dataset can lead to higher variance
- Clarified the evaluation setup and checkpoint selection
- Added an experiment using synthetic data on the Toy Maze environment (Appendix C)
- Extended the synthetic-data experiments to BC and RL with the true reward
Assigned Action Editor: ~Matteo_Papini1
Submission Number: 3299