Enhancing the Policy Generalization on OOD Tasks via Latent Variable Distribution Enhancement Sampler

Shaobo Li, Jie Lin, Xiangyuan Yang, Hanlin Zhang, Peng Zhao

Published: 01 Jan 2025, Last Modified: 12 Nov 2025. IEEE Transactions on Neural Networks and Learning Systems. CC BY-SA 4.0.
Abstract: In standard reinforcement learning, because the uncertainty of task objectives is not adequately considered during policy training, the learned policy generalizes poorly to out-of-distribution (OOD) tasks. Although considerable effort has been devoted to improving generalization on OOD tasks, most existing methods overlook the structural information of task representations in the latent space when generating extrapolative data, producing biased and blurred data embeddings that in turn degrade policy generalization. To address this issue, we propose a context-based meta-reinforcement learning (meta-RL) method, the latent variable distribution enhancement sampler (LVDES), which enhances policy generalization on OOD tasks by providing an efficient task representation space and accurate augmented training data for OOD tasks. Specifically, LVDES consists of four modules: a task inference module, a task separation module (TSM), a latent enhancement module (LEM), and a policy module. The task inference module identifies the current task. The TSM learns a representation space with highly structured separability. The LEM generates additional task-relevant trajectories to augment the policy training data. The policy module learns a policy that solves the tasks. By combining the structured task representation space with the augmented trajectory data, LVDES improves both the exploration efficiency and the generalization of the policy on OOD tasks. Extensive experiments on the MuJoCo and Meta-World benchmarks demonstrate the effectiveness of our method against existing approaches: compared with the strongest current baseline, LVDES increases task completion accuracy on OOD tasks by 60.20% and reduces average exploration time by 62.99%, confirming that LVDES achieves strong policy generalization on OOD tasks.
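To make the four-module pipeline concrete, below is a minimal conceptual sketch in Python/PyTorch. It is not the authors' implementation, which is not given in the abstract: the network sizes, the margin-based separation loss standing in for the TSM objective, and the Gaussian latent perturbation standing in for the LEM sampler are all illustrative assumptions.

import torch
import torch.nn as nn

class TaskInference(nn.Module):
    # Encodes a context of transitions into a latent task variable z.
    def __init__(self, ctx_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ctx_dim, 64), nn.ReLU(),
                                 nn.Linear(64, latent_dim))
    def forward(self, context):
        # Mean-pool per-transition embeddings into one task embedding.
        return self.net(context).mean(dim=0)

def separation_loss(z, task_ids, margin=1.0):
    # Illustrative margin loss: pull same-task latents together,
    # push different-task latents at least `margin` apart.
    dists = torch.cdist(z, z)
    same = task_ids.unsqueeze(0) == task_ids.unsqueeze(1)
    pull = (dists * same).mean()
    push = torch.relu(margin - dists[~same]).mean()
    return pull + push

def enhance_latents(z, noise_scale=0.1, n_samples=4):
    # Sample extra latents around known task embeddings, mimicking
    # the latent enhancement module's augmentation of training data.
    return z.unsqueeze(0) + noise_scale * torch.randn(n_samples, *z.shape)

class Policy(nn.Module):
    # Task-conditioned policy acting on the state concatenated with z.
    def __init__(self, state_dim, latent_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + latent_dim, 64),
                                 nn.ReLU(), nn.Linear(64, action_dim))
    def forward(self, state, z):
        return torch.tanh(self.net(torch.cat([state, z], dim=-1)))

In this sketch, training would alternate between fitting the task encoder with the separation loss so that task latents form well-separated clusters, and training the policy on both observed latents and the enhanced samples drawn around them, which is the mechanism by which the abstract attributes improved OOD exploration and generalization.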