Reinforcement learning with skills (RL with skills) is an efficient paradigm for solving sparse-reward tasks: skills are extracted from demonstration datasets, and a high-level policy is learned to select among them. Because each skill selected by the high-level policy is executed for multiple consecutive timesteps, the high-level policy is effectively learned in a temporally abstract Markov decision process (TA-MDP) built on the skills, which shortens the task horizon and reduces exploration cost. However, the extracted skills are usually sub-optimal because of the potentially low quality and low coverage of the datasets, which leads to sub-optimal performance on the downstream task. Refining the skills is the intuitive remedy, but changing the skills in turn makes the transition dynamics of the TA-MDP non-stationary, a phenomenon we name temporal abstraction shift. To resolve the dilemma between sub-optimal skills and temporal abstraction shift, we unify the optimization objectives of the entire hierarchical policy, which consists of the high-level policy and the low-level policy whose latent space embeds the skills. We theoretically prove that the unified objective guarantees performance improvement in the TA-MDP, and that optimizing performance in the TA-MDP is equivalent to optimizing a lower bound on the performance of the entire hierarchical policy in the original MDP. Furthermore, to overcome the phenomenon of skill space collapse, we propose the dynamical skill refinement (DSR) mechanism, after which our method is named. Experimental results validate the effectiveness of our method and show its advantages over state-of-the-art (SOTA) methods.
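To make the temporally abstract interaction concrete, the following is a minimal sketch (not the authors' implementation) of a rollout in which a high-level policy emits a skill latent and a low-level decoder executes it for a fixed number of primitive steps, yielding one TA-MDP transition per selection. It assumes a Gymnasium-style environment API; the class names, the skill horizon `H`, the latent dimension, and the random placeholder policies are illustrative assumptions only.

```python
import numpy as np

H = 10           # primitive steps per selected skill (assumed fixed horizon)
LATENT_DIM = 8   # dimensionality of the skill latent space (assumed)

class HighLevelPolicy:
    """Selects a skill latent z given the current state (placeholder: random)."""
    def select_skill(self, state):
        return np.random.randn(LATENT_DIM)

class SkillDecoder:
    """Low-level policy: decodes (state, skill latent) into a primitive action."""
    def __init__(self, action_dim):
        self.action_dim = action_dim

    def act(self, state, z):
        # Placeholder action; a real decoder would condition on state and z.
        return np.tanh(np.random.randn(self.action_dim))

def rollout_ta_mdp(env, high_policy, decoder, episode_len=200):
    """One episode in the temporally abstract MDP: each high-level decision
    commits to a skill for up to H consecutive primitive steps."""
    state, _ = env.reset()
    total_return, t = 0.0, 0
    terminated = truncated = False
    while t < episode_len and not (terminated or truncated):
        z = high_policy.select_skill(state)        # high-level (TA-MDP) action
        skill_reward = 0.0
        for _ in range(H):                         # low-level execution of the skill
            action = decoder.act(state, z)
            state, reward, terminated, truncated, _ = env.step(action)
            skill_reward += reward
            t += 1
            if terminated or truncated or t >= episode_len:
                break
        total_return += skill_reward               # reward of one TA-MDP transition
    return total_return
```

In this view the high-level policy only acts once every H primitive steps, which is why refining the skill decoder changes the effective transition dynamics it faces (the temporal abstraction shift described above).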