Keywords: Human-in-the-loop reinforcement learning, value-based intervention, quality-aware value shaping
Abstract: Human-in-the-loop reinforcement learning (HIL-RL) incorporates real-time intervention and guidance from human experts to address brittle reward engineering and poor learning efficiency. However, existing HIL-RL methods rely primarily on direct action mimicry or rigid value alignment, and therefore suffer from a teacher-quality ceiling: because they have no mechanism for assessing the quality of guidance, their performance is bounded by the human expert's proficiency. To overcome this limitation, we propose Value-guided Intervention and Quality-aware Shaping (VIQS), a framework that combines two complementary mechanisms in a reward-free setting. This design allows the agent to break the teacher-quality ceiling by learning robustly from sparse and potentially imperfect expert guidance. First, a value-guided intervention mechanism triggers expert intervention only when the agent's chosen action has a significantly lower estimated long-term value than an expert-derived reference, preserving the agent's autonomy for strategic exploration. Second, a quality-aware shaping mechanism uses a discriminator to dynamically assess expert intervention data and incorporate it adaptively, so that the agent filters out suboptimal advice while absorbing high-quality guidance. We evaluate VIQS extensively on the challenging MetaDrive benchmark, where pre-trained agents emulate human experts of varying proficiency to guide learning. Results show that VIQS significantly outperforms prior HIL approaches while requiring up to 5x fewer interventions. Crucially, it consistently breaks the teacher-quality ceiling across all levels of expert proficiency. Furthermore, integrating our core mechanisms into existing HIL algorithms yields significant and consistent improvements across baselines.
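Illustrative sketch (not from the paper): the value-guided intervention trigger described in the abstract can be read as a comparison of two value estimates against a margin. The function name `should_intervene`, the scalar value estimates, and the `threshold` parameter are all hypothetical, chosen only to make the idea concrete; the abstract does not specify how the values are estimated or how "significantly lower" is defined.

```python
def should_intervene(agent_value: float, reference_value: float, threshold: float) -> bool:
    """Decide whether the expert should take over for this step.

    agent_value:     estimated long-term value of the agent's chosen action (assumed given)
    reference_value: estimated long-term value of the expert-derived reference (assumed given)
    threshold:       hypothetical margin defining a "significantly lower" value
    """
    # Intervene only when the agent's action trails the reference by more than the margin;
    # smaller gaps leave the agent free to explore on its own.
    value_gap = reference_value - agent_value
    return value_gap > threshold


# Example: a large gap triggers intervention, a small gap does not.
large_gap = should_intervene(agent_value=0.2, reference_value=1.0, threshold=0.5)
small_gap = should_intervene(agent_value=0.9, reference_value=1.0, threshold=0.5)
```

Under this reading, a larger threshold yields fewer interventions and more autonomous exploration, while a smaller one keeps the agent closer to the expert reference.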
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 715