An advantage-based policy transfer algorithm for reinforcement learning with measures of transferability

TMLR Paper 3921 Authors

09 Jan 2025 (modified: 15 Jan 2025) · Under review for TMLR · CC BY 4.0
Abstract: Reinforcement learning (RL) enables sequential decision-making in complex and high-dimensional environments through direct interaction with the environment. In most real-world applications, however, a large number of interactions is infeasible. In such settings, transfer RL algorithms, which transfer knowledge from one or multiple source environments to a target environment, have been shown to increase learning speed and improve initial and asymptotic performance. However, most existing transfer RL algorithms are on-policy and sample inefficient, fail in adversarial target tasks, and often require heuristic choices in algorithm design. This paper proposes an off-policy Advantage-based Policy Transfer algorithm, APT-RL, for fixed-domain environments. Its novelty lies in using the popular notion of ``advantage'' as a regularizer, to weigh the knowledge that should be transferred from the source against new knowledge learned in the target, removing the need for heuristic choices. Further, we propose a new transfer performance measure to evaluate the performance of our algorithm and unify existing transfer RL frameworks. Finally, we present a scalable, theoretically-backed task similarity measurement algorithm to illustrate the alignment between our proposed transferability measure and the similarity between source and target environments. We compare APT-RL with several baselines, including existing transfer RL algorithms, on three high-dimensional continuous control tasks. Our experiments demonstrate that APT-RL outperforms existing transfer RL algorithms and is at least as good as learning from scratch in adversarial tasks.
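To illustrate the idea described in the abstract, the following is a minimal, hypothetical sketch (not the authors' implementation) of how an advantage term could weigh a source-knowledge regularizer against the target-task actor loss. The function name, the loss form, and the regularization coefficient `beta` are illustrative assumptions based only on the abstract.

```python
# Hypothetical sketch of an advantage-weighted transfer regularizer.
# All names (apt_rl_style_actor_loss, beta, source_advantage) are illustrative
# assumptions; the actual APT-RL objective is defined in the paper.
import numpy as np

def apt_rl_style_actor_loss(target_loss, log_prob_source_action, source_advantage, beta=1.0):
    """Combine the target-task actor loss with a source-knowledge term.

    The source term is weighted by the estimated advantage of the source
    policy's action in the target task: a positive advantage encourages the
    target policy to imitate the source action, while a negative advantage
    discourages it, so the weighting adapts without a heuristic schedule.
    """
    transfer_term = -source_advantage * log_prob_source_action
    return target_loss + beta * np.mean(transfer_term)

# Toy usage with random placeholder values for a batch of 32 transitions.
rng = np.random.default_rng(0)
loss = apt_rl_style_actor_loss(
    target_loss=1.2,                               # scalar actor loss on the target task
    log_prob_source_action=rng.normal(size=32),    # log pi_theta(a_source | s)
    source_advantage=rng.normal(size=32),          # advantage estimates of source actions
    beta=0.5,
)
print(f"combined actor loss: {loss:.3f}")
```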
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=1yRo6jwMb7
Changes Since Last Submission: We have addressed all reviewer comments, added significant additional experiments including high-dimensional complex tasks, provided additional clarification for better readability, and fixed notation. Specifically, we have made the following changes:
1. We have added clarification to address comments by reviewer 25bT.
2. A new plot in Figure 5 (a-c) shows an ablation study of the regularization parameter $\beta$, as suggested by reviewer 25bT.
3. Following the suggestions of reviewers 25bT, chQa, and Hsjo, we have added fine-tuning as a baseline in all experiments reported in the updated draft. This can be seen in the Figure 3 (a-d) plots.
4. We have reported the results in all figures in a uniform format, as pointed out by reviewers Hsjo and F9xj.
5. We fixed a minor bug in the policy update formula of our code; the newly reported results in Figures 3 and 4 show the corrected values.
6. We have added a new limitations section, as suggested by reviewers 25bT and F9xj.
7. We have added clarification for the comments provided by reviewer chQa.
8. We have taken into account the comment of reviewer Hsjo regarding measure-theoretic terms. The modified text and title of the draft now use 'measures of transferability' instead of 'metrics of transferability'. We note that our transferability measure can take negative values; this is an intentional choice to indicate poor transfer performance for less suitable source tasks.
9. The figures have been updated as suggested by reviewer Hsjo.
10. Based on the comments of reviewer F9xj, we have fixed the notation issue and added clarifications. We also address the comments on random-policy-based data collection in the newly added limitations section.
11. To address the comment of reviewer F9xj about the baseline algorithm REPAINT, we have added a plot of this baseline in Figure 1 of Appendix B. As REPAINT is an on-policy algorithm based on PPO, we compare REPAINT against PPO in this plot. While REPAINT works better than PPO on the most similar task, it performs poorly on the remaining tasks. In addition, our approach, APT-RL, significantly outperforms REPAINT, as can be seen from Figure 4 in the main draft and Figure 1 in the appendix.
12. Following the comment of reviewer F9xj, and to clearly show the capabilities of our algorithm, we have removed the simpler 'pendulum' task and added a complex 'humanoid' task to the experiments. We created several adversarial target tasks, and our algorithm still outperforms learning from scratch.
13. Based on the comment of reviewer F9xj, we have removed Table 2 of the previous draft for better readability.
Assigned Action Editor: ~Romain_Laroche1
Submission Number: 3921
