SCALAR: Self-Supervised Composition and Learning of Skills with LLM Planning and RL

Published: 23 Sept 2025, Last Modified: 22 Nov 2025 · License: CC BY 4.0
Keywords: Reinforcement Learning, Large Language Models, LLM Planning, World Models, Skill Discovery, Skill Composition, Hierarchical RL, Sparse Rewards, Verifier-based Evaluation, Reward Ensemble, Reward Hacking
TL;DR: We introduce SCALAR, a framework that couples an LLM planner with an RL agent through a dynamically expanding skill library, using a bi-directional feedback loop to refine the LLM's world model and guide exploration in long-horizon tasks.
Abstract: A core challenge in reinforcement learning (RL) is effective exploration, particularly for long-horizon tasks. Recent approaches have explored the utility of large language models (LLMs), leveraging their ability to 1) decompose objectives into skills and 2) generate code such as reward functions and verifiers. However, ad hoc prompt and program designs, together with reliance on a single proxy reward, can lead to reward hacking and hallucinations. Furthermore, synthesizing correct functions remains challenging without actual environment interaction. To address these challenges, we propose **Self-Supervised Composition and Learning of Skills (SCALAR)**, an iterative, bi-directional framework that couples an LLM planner and low-level RL controllers through a skill library. The skill library is a set of skills whose compositions define the frontier of states currently reachable by the agent. In SCALAR, the library is iteratively expanded by a high-level LLM planner in conjunction with low-level RL agents. In one direction, the LLM planner uses information in the skill library to propose new skills with (1) preconditions reachable through existing skill compositions and (2) termination conditions unachievable by current skills. Reusing existing skill compositions narrows the RL agent's task to exploring (2) rather than re-reaching known states (1). In the other direction, the LLM planner refines its world knowledge *concurrently* with RL training by analyzing successful RL trajectories. We call this process **Pivotal Trajectory Analysis**. We evaluate SCALAR on the Crafter benchmark, a challenging long-horizon task, on which SCALAR achieves **86.3%** diamond-collection success, surpassing previous state-of-the-art methods in overall performance and convergence speed. These results show that frontier-guided skill composition, together with verifier-based learning and bi-directional refinement, yields substantially more reliable long-horizon control under sparse rewards.
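
For concreteness, the bi-directional loop described in the abstract can be outlined as follows. This is a minimal illustrative sketch only: all class and method names (`SkillLibrary`, `planner.propose_skill`, `planner.update_world_model`, etc.) are assumptions inferred from the abstract, not the authors' actual implementation or API.

```python
# Illustrative sketch of the SCALAR loop, assuming hypothetical planner,
# RL trainer, and environment objects supplied by the caller.

from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    precondition: str       # reachable by composing existing skills
    termination: str        # not yet achievable by any current skill
    policy: object = None   # low-level RL controller, filled in after training


@dataclass
class SkillLibrary:
    skills: list = field(default_factory=list)

    def frontier(self):
        """Termination conditions of current skills: the reachable frontier."""
        return [s.termination for s in self.skills]

    def add(self, skill):
        self.skills.append(skill)


def scalar_loop(planner, rl_trainer, library, env, iterations=10):
    """Iteratively expand the skill library via the bi-directional loop."""
    for _ in range(iterations):
        # Forward direction: the LLM planner proposes a new skill whose
        # precondition lies on the current frontier and whose termination
        # condition is not yet achievable by existing skills.
        proposal = planner.propose_skill(library.frontier())

        # Compose existing skills to reach the precondition, so the RL agent
        # only has to explore toward the new termination condition.
        prefix = planner.plan_composition(library, proposal.precondition)
        policy, trajectories = rl_trainer.train(env, prefix, proposal)

        if trajectories.success_rate > 0.0:
            proposal.policy = policy
            library.add(proposal)

            # Backward direction (Pivotal Trajectory Analysis): the planner
            # refines its world model from successful RL trajectories.
            planner.update_world_model(trajectories.successful())

    return library
```

In this sketch, skill expansion and world-model refinement alternate within each iteration, mirroring the concurrent, bi-directional refinement the abstract describes.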
Submission Type: Research Paper (4-9 Pages)
Submission Number: 101