Guided Actor-Critic: Off-Policy Partially Observable Reinforcement Learning with Privileged Information
Keywords: reinforcement learning, POMDPs, off-policy, teacher-student learning
Abstract: Real-world decision-making systems often operate under partial observability due to limited sensing or noisy observations, which poses significant challenges for reinforcement learning (RL).
A common strategy to mitigate this issue is to leverage privileged information—available only during training—to guide the learning process. While existing approaches such as policy distillation and asymmetric actor-critic methods make use of such information, they frequently suffer from weak supervision or suboptimal knowledge transfer.
In this work, we propose Guided Actor-Critic (GAC), a novel off-policy RL algorithm that unifies privileged policy and value learning under a guided policy iteration framework.
GAC jointly trains a fully observable policy and a partially observable policy using constrained RL and supervised learning objectives, respectively.
We theoretically establish convergence in the tabular case and empirically validate GAC on challenging benchmarks, including Brax, POPGym, and HumanoidBench, where it achieves superior sample efficiency and final performance.
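The abstract does not spell out the concrete objectives, but a minimal sketch of how such a guided update could look is given below, assuming a penalized (rather than hard-constrained) teacher step and KL-based distillation for the student. All names, dimensions, and loss forms (e.g., GaussianPolicy, update, kl_coef) are illustrative assumptions, not the paper's actual implementation; critic/Bellman updates are omitted for brevity.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Small MLP policy producing a diagonal Gaussian over actions (illustrative)."""
    def __init__(self, in_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, x):
        h = self.net(x)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

# Hypothetical dimensions: full (privileged) state vs. partial observation.
state_dim, obs_dim, act_dim = 16, 8, 4
teacher = GaussianPolicy(state_dim, act_dim)   # fully observable (privileged) policy
student = GaussianPolicy(obs_dim, act_dim)     # partially observable (deployed) policy
critic = nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))

opt_teacher = torch.optim.Adam(teacher.parameters(), lr=3e-4)
opt_student = torch.optim.Adam(student.parameters(), lr=3e-4)

def update(batch, kl_coef=1.0):
    """One illustrative guided update: the teacher maximizes Q while being penalized
    for drifting from the student (a stand-in for the constrained-RL step); the
    student is trained by supervised distillation toward the teacher."""
    state, obs = batch["state"], batch["obs"]

    # Privileged (teacher) policy: actor-critic step with a proximity penalty.
    pi_t = teacher.dist(state)
    a_t = pi_t.rsample()
    q = critic(torch.cat([state, a_t], dim=-1))
    with torch.no_grad():
        pi_s_ref = student.dist(obs)
    kl_ts = torch.distributions.kl_divergence(pi_t, pi_s_ref).sum(-1)
    teacher_loss = (-q.squeeze(-1) + kl_coef * kl_ts).mean()
    opt_teacher.zero_grad(); teacher_loss.backward(); opt_teacher.step()

    # Partially observable (student) policy: supervised distillation from the teacher.
    with torch.no_grad():
        pi_t_ref = teacher.dist(state)
    pi_s = student.dist(obs)
    student_loss = torch.distributions.kl_divergence(pi_t_ref, pi_s).sum(-1).mean()
    opt_student.zero_grad(); student_loss.backward(); opt_student.step()
    return teacher_loss.item(), student_loss.item()

# Dummy off-policy (replay) batch just to show the call signature.
batch = {"state": torch.randn(32, state_dim), "obs": torch.randn(32, obs_dim)}
print(update(batch))
```

The key design choice this sketch tries to reflect is that privileged information enters only through the teacher policy and the privileged critic during training, while the student consumes only partial observations and can therefore be deployed without privileged inputs.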
Primary Area: reinforcement learning
Submission Number: 15223