Learning with Language Inference and Tips for Continual Reinforcement Learning

23 Sept 2023 (modified: 02 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: continual learning, reinforcement learning, language-instructed method
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a method that uses language instructions and tips to guide a continual reinforcement learning agent.
Abstract: Acquiring a generalizable policy by continually learning a sequence of tasks is a natural human skill, yet it remains challenging for current reinforcement learning algorithms. This is largely due to the dilemma that the agent is expected to quickly adapt to new tasks (plasticity) while retaining the common knowledge from previous tasks (stability). In this work, we present a scheme referred to as "Learning with Language Inference and Tips (LLIT)", which introduces a reward mechanism that parses and grounds human knowledge expressed in natural language into the task space and produces an interpretable policy for each task in a task-agnostic setting. LLIT trains a shared policy across tasks by inferring and embedding each task's tips and content. The language instructions inferred by a large language model (LLM) are used to pre-train an auxiliary reward model on observation embeddings, thereby extracting the semantic representations of tasks. Simultaneously, the instruction and tip embeddings are collected and organized into a prompt pool that captures the correlations among tasks. Hence, closely related tasks exhibit greater neuron overlap in the policy network, stemming from their shared semantics, which effectively curbs cross-task interference and forgetting. Given the auxiliary reward model trained on previous tasks to interpret human knowledge in natural language, adapting to a new task reduces to highly efficient tip aggregation and sub-network finetuning. In experimental studies, LLIT achieves a desirable plasticity-stability trade-off without any task-specific information. It also outperforms existing continual RL methods in terms of overall performance, forgetting reduction, and adaptation to unseen tasks. Our code is available at https://github.com/llm4crl/LLIT.
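The abstract sketches an auxiliary reward model that scores observations against LLM-inferred tips stored in a prompt pool. As a rough illustration only (the paper's actual implementation is not reproduced here), the minimal sketch below shows one plausible way such a tip-conditioned auxiliary reward could be computed: observations are encoded into the same space as the tip embeddings and rewarded by their similarity to the pool. All names (TipRewardModel, obs_dim, embed_dim) are hypothetical and not taken from the authors' code.

```python
# Hypothetical sketch of a tip-conditioned auxiliary reward; not the authors' implementation.
# Assumption: observations and LLM-inferred tips are mapped into a shared embedding space,
# and the auxiliary reward is the similarity between an observation embedding and the
# tip embeddings stored in the prompt pool for the current task.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TipRewardModel(nn.Module):
    """Scores observations against a pool of tip embeddings (illustrative only)."""

    def __init__(self, obs_dim: int, embed_dim: int):
        super().__init__()
        self.obs_encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim)
        )

    def forward(self, obs: torch.Tensor, tip_embeddings: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim); tip_embeddings: (num_tips, embed_dim)
        z = F.normalize(self.obs_encoder(obs), dim=-1)
        tips = F.normalize(tip_embeddings, dim=-1)
        # Auxiliary reward: best cosine similarity to any stored tip embedding.
        return (z @ tips.T).max(dim=-1).values


if __name__ == "__main__":
    model = TipRewardModel(obs_dim=16, embed_dim=32)
    obs = torch.randn(4, 16)           # a batch of observations
    tip_pool = torch.randn(5, 32)      # embeddings of 5 LLM-inferred tips (assumed precomputed)
    aux_reward = model(obs, tip_pool)  # shape: (4,)
    print(aux_reward)
```

In such a setup, the auxiliary reward would typically be added to the environment reward during policy optimization, which is one way the shared semantics could steer related tasks toward overlapping sub-networks as the abstract describes.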
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7685