Conflict-Averse IL-RL: Resolving Gradient Conflicts for Stable Imitation-to-Reinforcement Learning Transfer

TMLR Paper6891 Authors

07 Jan 2026 (modified: 13 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Reinforcement Learning (RL) and Imitation Learning (IL) offer complementary capabilities: RL can learn high-performing policies but is data-intensive, whereas IL enables rapid learning from demonstrations but is limited by the demonstrator's quality. Combining them offers the potential for improved sample efficiency in learning high-performing policies, yet naïve integrations often suffer from two fundamental issues: (1) negative transfer, where optimizing the IL loss hinders effective RL fine-tuning, and (2) gradient conflict, where differences in the scale or direction of IL and RL gradients lead to unstable updates. We introduce Conflict-Averse IL-RL (CAIR), a general framework that addresses both challenges by combining two key components: (1) Loss Manipulation: an adaptive annealing mechanism utilizing a convex combination of IL and RL losses. This mechanism dynamically increases the weight of the RL loss when its gradient aligns with the IL gradient and decreases it otherwise, mitigating instabilities during the transition from IL to RL. (2) Gradient Manipulation: to further reduce conflict, we incorporate CAGrad to compute a joint gradient that balances IL and RL objectives while avoiding detrimental interference. Under standard trust-region assumptions, CAIR guarantees monotonic improvement in the expected return when the loss weights are annealed monotonically. Our empirical study evaluates CAIR on five sparse-reward MuJoCo domains, where pure RL algorithms typically struggle. Compared against relevant hybrid RL baselines, CAIR improves sample efficiency in three out of five domains and asymptotic performance in two, while performing comparably on the remainder. Notably, CAIR is the only evaluated method that consistently learns to outperform the demonstrator across all five domains. These trends are consistent across multiple combinations of IL (BC, DAgger) and RL (DDPG, SAC, PPO) methods, demonstrating the robustness of the novel framework.
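The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the annealing rule, so the cosine-similarity test, the fixed `step` size, and the function names `update_rl_weight` and `cagrad_two` are assumptions made for illustration. The gradient step follows the dual form usually given for CAGrad (maximize the worst-case per-objective improvement within a ball of radius `c * ||g0||` around the average gradient `g0`), solved here with a simple grid search because there are only two objectives.

```python
import numpy as np


def cosine(a, b, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def update_rl_weight(w_rl, g_il, g_rl, step=0.05):
    """Hypothetical annealing rule (an assumption, not taken from the paper):
    raise the RL weight when the IL and RL gradients align, lower it otherwise.
    The combined loss would be (1 - w_rl) * L_IL + w_rl * L_RL."""
    delta = step if cosine(g_il, g_rl) > 0.0 else -step
    return float(np.clip(w_rl + delta, 0.0, 1.0))


def cagrad_two(g_il, g_rl, c=0.5, grid=1001):
    """CAGrad-style joint update direction for two objectives.

    Solves the one-dimensional dual problem
        min_w  <g_w, g0> + c * ||g0|| * ||g_w||,  g_w = w * g_il + (1 - w) * g_rl,
    by grid search over w in [0, 1], then returns
        d = g0 + c * ||g0|| * g_w / ||g_w||.
    """
    g0 = 0.5 * (g_il + g_rl)                      # average gradient
    radius = c * np.linalg.norm(g0)               # size of the search ball
    ws = np.linspace(0.0, 1.0, grid)
    gw = ws[:, None] * g_il[None, :] + (1.0 - ws)[:, None] * g_rl[None, :]
    gw_norm = np.linalg.norm(gw, axis=1)
    k = int(np.argmin(gw @ g0 + radius * gw_norm))
    if gw_norm[k] < 1e-12:                        # degenerate case: fall back to g0
        return g0
    return g0 + radius * gw[k] / gw_norm[k]


# Toy example with partially conflicting gradients.
g_il = np.array([1.0, 0.0])
g_rl = np.array([-0.5, 1.0])
w_rl = update_rl_weight(0.5, g_il, g_rl)          # gradients conflict, so the weight drops
d = cagrad_two(g_il, g_rl, c=0.5)
```

In a training loop, `g_il` and `g_rl` would be the flattened parameter gradients of the two losses, and `d` would replace the naive summed gradient in the optimizer step. Setting `c=0` recovers the plain average gradient.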
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We revised the manuscript to improve empirical evaluation, clarity of presentation, and explanation of the IL-to-RL transition.
1) Expanded the experimental evaluation by adding results on an additional D4RL benchmark domain (AdroitPenSparse-v1).
2) Improved figure readability by adjusting aspect ratios, decluttering plots (e.g., moving legends outside plots), and revising captions to clarify the interpretation of key results (notably Figures 3, 4, 7, and 10).
3) Clarified the IL-to-RL transition mechanism in Section 3.3, including a more precise explanation of gradient alignment and the three-stage training interpretation.
4) Improved reproducibility and exposition by clarifying the source of demonstrations, refining task descriptions, tightening the related work discussion, and clarifying the interpretation of Theorem 1.
Assigned Action Editor: ~Zhongwen_Xu1
Submission Number: 6891