Preference-Guided Reinforcement Learning with Automaton-Based Specifications in Non-Markovian Settings

20 Sept 2025 (modified: 30 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reinforcement Learning, Automaton-Based Preferences, Non-Markovian, Transfer Learning
TL;DR: We introduce preference-based RL that uses automaton-generated preferences instead of human feedback, eliminating reward engineering while outperforming traditional automaton-based methods.
Abstract: Reinforcement Learning (RL) in environments with complex, history-dependent reward structures poses significant challenges for traditional methods. In this work, we introduce a novel approach that leverages automaton-based feedback to guide the learning process, replacing explicit reward functions with preferences derived from a deterministic finite automaton (DFA). Unlike conventional approaches that use automata for direct reward specification, our method employs the structure of the DFA to generate preferences over trajectories, which are then used to learn a reward function, eliminating the need for manual reward engineering. In the static variant, the learned reward function is applied directly for policy optimization. In the dynamic variant, we iteratively refine both the reward function and the policy through repeated updates until convergence. Through comprehensive experiments in both discrete and continuous environments, we demonstrate that our approach enables an RL agent to learn effective policies for tasks with temporal dependencies, outperforming traditional reward engineering and automaton-based baselines such as reward machines and LTL-guided methods. Our results highlight the advantages of automaton-based preferences in handling non-Markovian rewards, offering a scalable, efficient, and human-independent alternative to traditional reward modeling. This work opens new avenues for integrating formal methods into preference-based RL, bridging the gap between task specification and policy optimization.
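
The abstract describes the pipeline only at a high level. The snippet below is a minimal illustrative sketch, in Python/PyTorch, of the core idea as stated: a DFA scores trajectories by progress toward an accepting state, those scores induce preferences over trajectory pairs, and a reward model is fit to the preferences with a standard Bradley-Terry-style loss. All names here (DFASpec, score_trajectory, automaton_preferences, RewardNet, fit_reward) and the particular progress metric are assumptions made for illustration, not details taken from the paper.

# Hypothetical sketch of DFA-derived preferences feeding a learned reward model.
# Names and the progress metric are illustrative assumptions, not the paper's API.
import itertools
import torch
import torch.nn as nn

class DFASpec:
    """Deterministic finite automaton over high-level propositions."""
    def __init__(self, transitions, start, accepting):
        self.transitions = transitions    # dict: (state, symbol) -> next state
        self.start = start
        self.accepting = accepting        # set of accepting states

    def run(self, symbols):
        """Return the sequence of DFA states visited along a trajectory."""
        state, visited = self.start, [self.start]
        for s in symbols:
            state = self.transitions.get((state, s), state)
            visited.append(state)
        return visited

def score_trajectory(dfa, symbols, distance_to_accept):
    """Progress score: negative distance of the final DFA state from acceptance."""
    final_state = dfa.run(symbols)[-1]
    return -distance_to_accept[final_state]

def automaton_preferences(dfa, trajectories, distance_to_accept):
    """Label each trajectory pair with the index (0 or 1) of the preferred one."""
    prefs = []
    for (i, ti), (j, tj) in itertools.combinations(enumerate(trajectories), 2):
        si = score_trajectory(dfa, ti["symbols"], distance_to_accept)
        sj = score_trajectory(dfa, tj["symbols"], distance_to_accept)
        if si != sj:                      # skip ties; they carry no preference
            prefs.append((i, j, 0 if si > sj else 1))
    return prefs

class RewardNet(nn.Module):
    """Maps an observation feature vector to a scalar reward."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def fit_reward(reward_net, trajectories, prefs, epochs=200, lr=1e-3):
    """Bradley-Terry objective: the preferred trajectory should get higher return."""
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr)
    for _ in range(epochs):
        losses = []
        for i, j, label in prefs:
            ret_i = reward_net(trajectories[i]["obs"]).sum()
            ret_j = reward_net(trajectories[j]["obs"]).sum()
            logits = torch.stack([ret_i, ret_j]).unsqueeze(0)   # shape (1, 2)
            losses.append(nn.functional.cross_entropy(
                logits, torch.tensor([label])))
        opt.zero_grad()
        torch.stack(losses).mean().backward()
        opt.step()

Under this reading of the abstract, the static variant would call fit_reward once and then optimize a policy against the frozen reward model, while the dynamic variant would interleave fit_reward with policy updates until convergence.
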
Primary Area: reinforcement learning
Submission Number: 22754