Reinforcement learning with skills (RL with skills) is an efficient paradigm for solving sparse-reward tasks: skills are extracted from demonstration datasets, and a high-level policy is learned to select among them. Because each skill selected by the high-level policy is executed for multiple consecutive timesteps, the high-level policy is effectively learned in a temporally abstract Markov decision process (TA-MDP) built on the skills, which shortens the task horizon and reduces exploration cost. However, the extracted skills are usually sub-optimal because of the potentially low quality and low coverage of the datasets, which leads to sub-optimal performance on the downstream task. Refining the skills is the intuitive remedy, but changing the skills in turn makes the transition dynamics of the TA-MDP non-stationary, a phenomenon we name temporal abstraction shift. To resolve the dilemma between sub-optimal skills and temporal abstraction shift, we unify the optimization objectives of the entire hierarchical policy, which consists of the high-level policy and the low-level policy whose latent space embeds the skills. We theoretically prove that the unified objective guarantees performance improvement in the TA-MDP, and that optimizing performance in the TA-MDP is equivalent to optimizing a lower bound on the performance of the entire hierarchical policy in the original MDP. Furthermore, to overcome the phenomenon of skill space collapse, we propose the dynamical skill refinement (DSR) mechanism, after which our method is named. Experimental results validate the effectiveness of our method and show its advantages over state-of-the-art (SOTA) methods.
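To make the temporally abstract interaction concrete, the following is a minimal sketch (not the authors' implementation) of a rollout in which a high-level policy emits a skill latent and a low-level decoder executes it for a fixed number of primitive steps, yielding one TA-MDP transition per selection. It assumes a Gymnasium-style environment API; the class names, the skill horizon `H`, the latent dimension, and the random placeholder policies are illustrative assumptions only.

```python
import numpy as np

H = 10           # primitive steps per selected skill (assumed fixed horizon)
LATENT_DIM = 8   # dimensionality of the skill latent space (assumed)

class HighLevelPolicy:
    """Selects a skill latent z given the current state (placeholder: random)."""
    def select_skill(self, state):
        return np.random.randn(LATENT_DIM)

class SkillDecoder:
    """Low-level policy: decodes (state, skill latent) into a primitive action."""
    def __init__(self, action_dim):
        self.action_dim = action_dim

    def act(self, state, z):
        # Placeholder action; a real decoder would condition on state and z.
        return np.tanh(np.random.randn(self.action_dim))

def rollout_ta_mdp(env, high_policy, decoder, episode_len=200):
    """One episode in the temporally abstract MDP: each high-level decision
    commits to a skill for up to H consecutive primitive steps."""
    state, _ = env.reset()
    total_return, t = 0.0, 0
    terminated = truncated = False
    while t < episode_len and not (terminated or truncated):
        z = high_policy.select_skill(state)        # high-level (TA-MDP) action
        skill_reward = 0.0
        for _ in range(H):                         # low-level execution of the skill
            action = decoder.act(state, z)
            state, reward, terminated, truncated, _ = env.step(action)
            skill_reward += reward
            t += 1
            if terminated or truncated or t >= episode_len:
                break
        total_return += skill_reward               # reward of one TA-MDP transition
    return total_return
```

In this view the high-level policy only acts once every H primitive steps, which is why refining the skill decoder changes the effective transition dynamics it faces (the temporal abstraction shift described above).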