Diversity-Driven Model Ensemble Adaptive Trust Region Policy Optimization

Haotian Xu, Zheng Yan, Junyu Xuan, Guangquan Zhang, Jie Lu

Published: 01 Jan 2026, Last Modified: 12 Mar 2026. Venue: IEEE Transactions on Systems, Man, and Cybernetics: Systems. License: CC BY-SA 4.0
Abstract: Compared with model-free reinforcement learning (MFRL), model-based reinforcement learning (MBRL) aims to improve sample efficiency and reduce the number of interactions with the true environment by learning an environment dynamics model. However, the success of MBRL relies heavily on two key aspects: model learning and planning. The former refers to learning an accurate model, and the latter aims to improve the behavior policy. In this article, we investigate both aspects further using model ensemble learning. As base models, we design a deep residual attention U-Net (RauNet) with fewer neurons (or weights) than the widely used shallow neural network, and we apply the Hilbert–Schmidt independence criterion (HSIC) as a regularization term to explicitly pursue diversity within the model ensemble. Furthermore, we propose an adaptive trust region policy optimization (TRPO) in which the parametric Rényi alpha divergence replaces the Kullback–Leibler (KL) divergence as the measure of difference between two successive policies, and the alpha value can be adjusted adaptively during TRPO training iterations. We call this method diversity-driven model ensemble adaptive trust region policy optimization (diversity-driven model ensemble adaptive TRPO). Detailed experiments on six benchmark environments show that the proposed approach achieves the best performance compared with five state-of-the-art RL techniques.
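To make the divergence swap concrete, the following is a minimal sketch of the standard Rényi alpha divergence between two discrete distributions, D_alpha(P || Q) = log(sum_i p_i^alpha q_i^(1 - alpha)) / (alpha - 1), which recovers the KL divergence in the limit alpha -> 1. It is not the authors' implementation: the function names, the example policies, and the restriction to discrete action distributions are assumptions made for illustration, and the abstract does not specify the rule used to adapt alpha, so none is shown.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as lists of probabilities.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def renyi_divergence(p, q, alpha):
    """Renyi alpha divergence D_alpha(P || Q) for alpha > 0.

    Falls back to the KL divergence when alpha is numerically equal to 1,
    which is the limiting case of the general formula.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-8:
        return kl_divergence(p, q)
    total = sum(
        (pi ** alpha) * (qi ** (1.0 - alpha))
        for pi, qi in zip(p, q)
        if pi > 0
    )
    return math.log(total) / (alpha - 1.0)


# Hypothetical action distributions of two successive policies in one state.
old_policy = [0.5, 0.3, 0.2]
new_policy = [0.4, 0.4, 0.2]

for alpha in (0.5, 1.0, 2.0):
    value = renyi_divergence(old_policy, new_policy, alpha)
    print(f"alpha={alpha}: D_alpha={value:.6f}")
```

Because the Rényi divergence is nondecreasing in alpha, a larger alpha penalizes the same policy change more heavily, so for a fixed trust-region bound the choice of alpha controls how conservative each policy update is.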