Beyond Policy Transfer: Self-Supervised Reward Adaptation in New Environments

Xinhu Li; Ayush Jain; Zhaojing Yang; Erdem Biyik; Joseph J Lim

24 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission

Primary Area: transfer learning, meta learning, and lifelong learning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: reinforcement learning, self-supervised learning, domain transfer, test-time adaptation

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Abstract: To be deployed at scale, autonomous agents must perform their tasks not only in their training environment but also in environments they have never seen, as when robots move from controlled testbeds to households. Traditional approaches improve adaptability either during training, by using varied environments, or during deployment, by finetuning. However, the former often fails under unforeseen conditions, while the latter requires true reward labels, which are usually unavailable outside controlled settings. In this work, we address the challenge of adapting, without explicit reward signals, to environments whose dynamics and observations differ from those of the training environment. We observe that learned task objectives, represented by reward models, are often transferable even when policies are not, because they are more robust to changes in dynamics. However, reward models remain vulnerable in target environments to observational shifts such as lighting changes or noise. To address this, our key insight is to adapt the reward model at test time using a self-supervised learning framework. We empirically demonstrate that adapting the reward model with our method enables policies to solve tasks under new challenges, such as added noise, obstacles, or reversed dynamics, where traditional policy transfer and naive reward transfer fail.
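
The abstract does not say which self-supervised objective is used, so the following is only a minimal illustrative sketch of test-time reward-model adaptation in general, not the authors' method. It assumes a linear shared encoder `W`, a frozen reward head `v`, and, purely as a placeholder, a frozen reconstruction (autoencoder) head `D` as the self-supervised task. All names, shapes, and the synthetic target observations are hypothetical. Only the encoder is updated, using unlabeled observations from the target environment and no reward labels.

```python
import numpy as np

def ssl_loss(W, D, X):
    """Mean squared reconstruction error of observations X with shape (n, d_obs)."""
    R = X @ W @ D - X
    return float(np.sum(R ** 2) / X.shape[0])

def adapt_encoder(W, D, X, lr=0.05, steps=100):
    """Gradient descent on the reconstruction loss with respect to the encoder W only.

    The decoder D stays frozen; no reward labels are used.
    """
    W = W.copy()
    n = X.shape[0]
    for _ in range(steps):
        R = X @ W @ D - X                      # reconstruction residual, (n, d_obs)
        W -= lr * (2.0 / n) * X.T @ R @ D.T    # analytic gradient of ssl_loss w.r.t. W
    return W

def predict_reward(W, v, X):
    """Reward prediction: frozen linear head v applied to encoded observations."""
    return X @ W @ v

# Hypothetical setup: random weights stand in for a reward model trained in the source environment.
rng = np.random.default_rng(0)
d_obs, d_z = 8, 4
W = 0.1 * rng.standard_normal((d_obs, d_z))   # shared encoder (adapted at test time)
D = 0.1 * rng.standard_normal((d_z, d_obs))   # self-supervised decoder head (frozen)
v = rng.standard_normal(d_z)                  # reward head (frozen)

# Synthetic stand-in for unlabeled observations collected in the target environment.
X_target = 1.5 * rng.standard_normal((256, d_obs)) + 0.3

W_adapted = adapt_encoder(W, D, X_target)
print("self-supervised loss before:", ssl_loss(W, D, X_target))
print("self-supervised loss after: ", ssl_loss(W_adapted, D, X_target))
rewards = predict_reward(W_adapted, v, X_target)  # rewards from the adapted model
```

Keeping the reward head frozen and updating only the shared encoder is one common design for test-time adaptation. Whether the submission uses this split is not stated in the abstract.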

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 8749
