Learning optimal policies through contact in differentiable simulation

TMLR Paper 1744 Authors

25 Oct 2023 (modified: 27 Jan 2024) · Rejected by TMLR
Abstract: Model-Free Reinforcement Learning (MFRL) has garnered significant attention for its effectiveness in continuous motor control tasks. However, its limitations become apparent in high-dimensional problems, where it often yields suboptimal policies even with extensive training data. Conversely, First-Order Model-Based Reinforcement Learning (FO-MBRL) methods that harness differentiable simulation offer more accurate gradients but suffer from instability caused by exploding gradients arising from the contact approximation model. We propose Adaptive Horizon Actor Critic (AHAC), a massively parallel FO-MBRL approach that truncates trajectory gradients upon encountering stiff contact, yielding more stable and accurate gradients. We demonstrate this experimentally on a variety of simulated locomotion tasks, where our method achieves up to 66% higher asymptotic episodic reward than state-of-the-art MFRL algorithms and exhibits lower hyper-parameter sensitivity than prior FO-MBRL methods. Moreover, our method scales to high-dimensional motor control tasks while maintaining better wall-clock-time efficiency.
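As a rough illustration of the gradient-truncation idea sketched in the abstract, the snippet below rolls out a differentiable dynamics model and detaches the state whenever a contact force exceeds a threshold, so that stiff-contact gradients stop flowing further back through the trajectory. The `toy_step` dynamics, `contact_threshold`, and policy are hypothetical stand-ins for illustration only, not the authors' implementation of AHAC.

```python
import torch

def toy_step(state, action):
    """Hypothetical differentiable dynamics with a synthetic contact force."""
    next_state = state + 0.1 * action - 0.01 * state
    contact_force = torch.relu(-next_state).sum()  # treat state < 0 as "contact"
    reward = -(next_state ** 2).sum()
    return next_state, reward, contact_force

def rollout_with_truncation(policy, state, horizon=32, contact_threshold=1.0):
    """Accumulate reward over a short horizon; detach the state when contact is
    stiff so exploding contact gradients do not propagate through earlier steps."""
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward, contact_force = toy_step(state, action)
        total_reward = total_reward + reward
        if contact_force.item() > contact_threshold:
            state = state.detach()  # truncate the trajectory gradient here
    return total_reward

policy = torch.nn.Linear(4, 4)
state = torch.randn(4)
loss = -rollout_with_truncation(policy, state)
loss.backward()  # gradients flow only through the non-stiff trajectory segments
```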
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Added new experimental results to section 5 with 10 seeds.
- Changed conclusion of the experiments. Notably, we drop our claims of lower variance between runs, as this is not supported by our updated results. We have also dialed down our claims of improved asymptotic performance in light of the updated results.
- Changed reported statistics to 50% IQM and 95% CI.
- Replaced table of results with aggregate metrics, which are easier to digest and reveal more about the results.
- Reformatted ablation study and figure.

New on 01/17/2024:
- Reformatted and reworded ablation study.
- Added table of results in the appendix.
- Added more ablation figures and tabular results to the appendix.
- Changed reward normalisation to use the maximum reward achieved by PPO, not the reward achieved at the end of training.
Assigned Action Editor: ~Adam_M_White1
Submission Number: 1744