Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation

Peide Huang; Mengdi Xu; Jiacheng Zhu; Laixi Shi; Fei Fang; Ding Zhao

Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation

Peide Huang, Mengdi Xu, Jiacheng Zhu, Laixi Shi, Fei Fang, Ding Zhao

Published: 31 Oct 2022, Last Modified: 04 Aug 2025NeurIPS 2022 AcceptReaders: Everyone

Keywords: reinforcement learning, curriculum learning, domain adaptation

TL;DR: We formulate Curriculum Reinforcement Learning as an optimal transport problem to create a geodesic interpolation of task distributions which improve the learning efficiency.

Abstract: Curriculum Reinforcement Learning (CRL) aims to create a sequence of tasks, starting from easy ones and gradually learning towards difficult tasks. In this work, we focus on the idea of framing CRL as interpolations between a source (auxiliary) and a target task distribution. Although existing studies have shown the great potential of this idea, it remains unclear how to formally quantify and generate the movement between task distributions. Inspired by the insights from gradual domain adaptation in semi-supervised learning, we create a natural curriculum by breaking down the potentially large task distributional shift in CRL into smaller shifts. We propose GRADIENT which formulates CRL as an optimal transport problem with a tailored distance metric between tasks. Specifically, we generate a sequence of task distributions as a geodesic interpolation between the source and target distributions, which are actually the Wasserstein barycenter. Different from many existing methods, our algorithm considers a task-dependent contextual distance metric and is capable of handling nonparametric distributions in both continuous and discrete context settings. In addition, we theoretically show that GRADIENT enables smooth transfer between subsequent stages in the curriculum under certain conditions. We conduct extensive experiments in locomotion and manipulation tasks and show that our proposed GRADIENT achieves higher performance than baselines in terms of learning efficiency and asymptotic performance.

Supplementary Material: pdf

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/curriculum-reinforcement-learning-using/code)

23 Replies

Loading