Continual Reinforcement Learning by Reweighting Bellman Targets

Ke Sun; Jun Jin; Xi Chen; Wulong Liu; Linglong Kong

22 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: reinforcement learning

Keywords: continual reinforcement learning

TL;DR: We analyze the foundations of continual RL and propose a practical algorithm that reweights Bellman targets.

Abstract: One major obstacle to building a general AI agent is the inability to solve new problems without forgetting previously acquired knowledge. This deficiency is closely linked to the fact that most reinforcement learning (RL) methods rest on the key assumption that the environment's transition dynamics and reward functions are fixed. In this paper, we study the continual RL setting by proposing a general framework for analyzing catastrophic forgetting in value-based RL, based on a defined MDP difference. Within this theoretical framework, we first show that, without incorporating any additional strategy, the Finetune algorithm, a commonly used baseline regarded as the lower bound on what a continual RL algorithm can achieve, suffers from complete catastrophic forgetting. Moreover, the sequential multi-task RL algorithm, normally viewed as a soft upper-bound baseline, can yield an optimal action-state value function estimator, but at an almost intractable computational cost in an online alternating algorithm. Motivated by these results, we propose a practical continual RL algorithm that reweights the historical and current Bellman targets to trade off between these lower-bound and upper-bound approaches. We conduct rigorous experiments in the tabular setting to support our analytical results, suggesting the potential of the proposed algorithm in real continual RL scenarios.
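To make the reweighting idea concrete, here is a minimal tabular sketch. It is not the paper's algorithm, which the abstract does not specify. It assumes the historical and current Bellman targets are combined as a convex mixture with a hypothetical weight `lam`, and that one `(reward, next_state)` sample is available for the same state-action pair under each MDP. The function names, `lam`, `gamma`, and `lr` are all illustrative assumptions.

```python
def bellman_target(q, reward, next_state, gamma):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(q[next_state])


def reweighted_update(q, state, action, old_sample, new_sample,
                      lam, gamma=0.9, lr=0.5):
    """Move Q(state, action) toward a convex mix of two Bellman targets.

    old_sample and new_sample are (reward, next_state) pairs for the same
    (state, action) under the historical and the current MDP.
    lam is a hypothetical weight on the historical target (an assumption,
    not taken from the paper): lam = 0 uses only the current target.
    """
    old_target = bellman_target(q, *old_sample, gamma)
    new_target = bellman_target(q, *new_sample, gamma)
    target = lam * old_target + (1.0 - lam) * new_target

    updated = [row[:] for row in q]  # leave the input table untouched
    updated[state][action] += lr * (target - updated[state][action])
    return updated


# Two states, two actions, Q initialised to zero.
q0 = [[0.0, 0.0], [0.0, 0.0]]
old_sample = (1.0, 1)  # the historical MDP rewarded this action
new_sample = (0.0, 1)  # the current MDP does not

q_current_only = reweighted_update(q0, 0, 1, old_sample, new_sample, lam=0.0)
q_mixed = reweighted_update(q0, 0, 1, old_sample, new_sample, lam=0.5)
```

With `lam=0.0` the update ignores the historical target entirely, which in this toy setup mirrors Finetune-style behavior. Intermediate values keep part of the old task's value signal while still learning from the current task.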

Submission Number: 6343
