Safe and Generalizable Reinforcement Learning via Logical Policy Composition

Published: 03 Jun 2026, Last Modified: 03 Jun 2026
Venue: ALA 2026
License: CC BY 4.0
Keywords: Reinforcement Learning, Task Composition, Hierarchical Learning, Skill Generalization, Linear Temporal Logic, LEARN
TL;DR: This work proposes Safe-Comp, a hierarchical RL framework that addresses limitations in safety and skill generalization by combining formally defined goals, specified in Linear Temporal Logic, with compositional skill reuse.
Abstract: Reinforcement learning (RL) excels at optimizing policies for narrowly defined tasks, but it often struggles with safety and generalization in complex environments. Designing a single reward function that encodes both goals and safety constraints can lead to unsafe behavior, as agents may exploit poorly shaped rewards. Similarly, standard RL agents tend to overfit to specific tasks and cannot reuse skills for new goals without retraining. In this work, we propose Safe-Comp, a hierarchical RL framework that addresses both challenges by integrating formally defined goals and compositional skill reuse. Our approach first trains generic primitive policies for fundamental skills under safety-aware rewards. These skills are then composed at a higher level to satisfy arbitrary task specifications expressed in linear temporal logic (LTL). In particular, given a new task (e.g., achieve goal 'a' while avoiding region 'b', then go to 'c'), we translate the corresponding LTL formula into a deterministic finite automaton (DFA) and plan a safe policy by combining the learned primitives according to the DFA structure. This logical composition ensures that safety rules are never violated and enables zero-shot generalization to a wide variety of novel task combinations, without additional training.
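The composition step described in the abstract can be illustrated with a toy sketch. It is not the Safe-Comp implementation: the grid, the `LABELS` map, the hand-written `DFA` table, and the functions `primitive` and `compose` are all hypothetical names introduced here. It assumes the example task reads as the LTL formula (¬b U (a ∧ F c)), so that 'b' only needs to be avoided until 'a' is reached, and it substitutes a breadth-first search planner for each learned primitive policy.

```python
from collections import deque

# Hypothetical 4x4 grid; each labelled cell carries one atomic proposition.
SIZE = 4
LABELS = {(0, 3): "a", (0, 1): "b", (1, 1): "b", (3, 3): "c"}

# Hand-written DFA for the assumed formula (!b U (a & F c)).
# Each non-accepting state maps to (goal to reach, region to avoid, next state).
DFA = {
    "q0": ("a", "b", "q1"),
    "q1": ("c", None, "acc"),
}

def primitive(start, goal, avoid):
    """Stand-in for a learned 'reach goal, avoid region' skill (BFS shortest path)."""
    queue, parent = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if LABELS.get(cell) == goal:
            # Walk the parent links back to the start to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_bounds = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
            blocked = avoid is not None and LABELS.get(nxt) == avoid
            if in_bounds and not blocked and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal unreachable without entering the avoided region

def compose(start):
    """Follow the DFA from q0 to acceptance, chaining one primitive per state."""
    state, pos, trace = "q0", start, [start]
    while state != "acc":
        goal, avoid, next_state = DFA[state]
        path = primitive(pos, goal, avoid)
        if path is None:
            return None  # no safe plan exists for this DFA edge
        trace.extend(path[1:])
        pos, state = path[-1], next_state
    return trace
```

Calling `compose((0, 0))` returns a list of grid cells that reaches 'a' without entering a 'b' cell and then continues to 'c'. A different task changes only the `DFA` table and leaves the primitives untouched, which is the sense in which the abstract describes reuse without retraining. In practice the automaton would be produced by an LTL-to-DFA translation tool rather than written by hand.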
Journal Edition Interest: No
Supplementary Material: pdf
Submission Number: 49