RILe: Reinforced Imitation Learning

TMLR Paper4733 Authors

26 Apr 2025 (modified: 08 Jul 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Acquiring complex behaviors is essential for artificially intelligent agents, yet learning these behaviors in high-dimensional settings poses a significant challenge due to the vast search space. Traditional reinforcement learning (RL) requires extensive manual effort for reward function engineering. Inverse reinforcement learning (IRL) uncovers reward functions from expert demonstrations but relies on an iterative process that is often computationally expensive. Imitation learning (IL) provides a more efficient alternative by directly comparing an agent’s actions to expert demonstrations; however, in high-dimensional environments, such direct comparisons often offer insufficient feedback for effective learning. We introduce RILe (Reinforced Imitation Learning), a framework that combines the strengths of imitation learning and inverse reinforcement learning to learn a dense reward function efficiently and achieve strong performance in high-dimensional tasks. RILe employs a novel trainer–student framework: the trainer learns an adaptive reward function, and the student uses this reward signal to imitate expert behaviors. By dynamically adjusting its guidance as the student evolves, the trainer provides nuanced feedback across different phases of learning. Our framework produces high-performing policies in high-dimensional tasks where direct imitation fails to replicate complex behaviors. We validate RILe in challenging robotic locomotion tasks, demonstrating that it significantly outperforms existing methods and achieves near-expert performance across multiple settings.
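The abstract describes the trainer–student architecture only at a high level. Purely as illustration, the sketch below mocks up one possible shape of such an interaction loop: a student policy collects rollouts, a discriminator compares them with expert data, and a trainer produces the student's reward while being nudged by the discriminator's feedback. The toy environment, linear models, dimensions, learning rates, and all update rules here are assumptions of this sketch, not the paper's method (which reports using SAC and PPO as the underlying RL algorithms).

```python
# Toy schematic of a trainer-student imitation loop (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 2                                 # toy state and action dimensions
expert_sa = rng.normal(size=(256, S + A))   # stand-in for expert demonstrations

student_w = np.zeros((S, A))    # toy linear student policy
trainer_w = np.zeros(S + A)     # toy linear trainer reward model
disc_w = np.zeros(S + A)        # toy linear discriminator

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(100):
    # 1) Student acts in a (toy) environment under its current policy.
    states = rng.normal(size=(256, S))
    actions = states @ student_w + 0.1 * rng.normal(size=(256, A))
    student_sa = np.concatenate([states, actions], axis=1)

    # 2) Discriminator separates expert pairs from student pairs
    #    (one logistic-regression gradient step as a stand-in for a network).
    for data, label in ((expert_sa, 1.0), (student_sa, 0.0)):
        pred = sigmoid(data @ disc_w)
        disc_w += 0.01 * data.T @ (label - pred) / len(data)

    # 3) Trainer emits the student's reward; a crude heuristic nudges its
    #    reward model toward pairs the discriminator scores as expert-like.
    #    (In the paper the trainer is itself an RL agent, not this update.)
    student_reward = student_sa @ trainer_w
    expert_likeness = sigmoid(student_sa @ disc_w)
    trainer_w += 0.01 * student_sa.T @ (expert_likeness - 0.5) / len(student_sa)

    # 4) Student update: a crude policy-gradient-style step that increases
    #    the reward received from the trainer (SAC/PPO in practice).
    grad = states.T @ (actions * student_reward[:, None]) / len(states)
    student_w += 0.001 * grad
```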
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We sincerely thank the reviewers for their insightful and constructive feedback, which has led to significant improvements in the manuscript. We have incorporated all suggestions to enhance the paper's clarity, correctness, and theoretical grounding. The key changes are summarized below:
* Strengthened Theoretical Justification: In response to all reviewers, we have strengthened the paper's theoretical foundation. We added a new Appendix B.3 that provides a detailed motivation for the trainer agent and a new Appendix B.4 that shows why the trainer's long-horizon policy differs from static reward functions.
* Methodological Corrections: We corrected a critical typo in the trainer's reward function to ensure the model's objective is mathematically sound and consistent across the main text (Equations 7, 8, 10) and the algorithm listings in Appendix K. Furthermore, the proof of Lemma 1 has been rewritten to resolve ambiguity by analyzing the myopic and long-horizon cases separately.
* New Empirical Ablation and Discussion of Limitations: To provide a more balanced perspective as requested, we have added a new Appendix I dedicated entirely to the failure modes and limitations of our framework, including potential issues such as discriminator overfitting and co-adaptation instability. Appendix J contains a new ablation study, which empirically demonstrates that an entropy-enhanced AIL baseline still produces a static reward landscape, unlike RILe.
* Enhanced Clarity and Reproducibility: For improved clarity, we now explicitly state the use of SAC and PPO as our underlying RL algorithms in Appendix G. We have also revised key figures for better readability, clarified the experimental setup for the visualizations in Appendix D.1, and refined claims throughout the paper to be more precise.
Assigned Action Editor: ~Oleg_Arenz1
Submission Number: 4733