Task-Specific Exploration in Meta-Reinforcement Learning via Task Reconstruction

TMLR Paper 6318 Authors

27 Oct 2025 (modified: 08 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: Reinforcement learning trains policies specialized for a single task. Meta-reinforcement learning (meta-RL) improves upon this by leveraging prior experience to train policies for few-shot adaptation to new tasks. However, existing meta-RL approaches often struggle to explore and learn tasks effectively. We introduce a novel meta-RL algorithm for learning to learn task-specific, sample-efficient exploration policies. We achieve this through task reconstruction, an original method for learning to identify and collect small but informative datasets from tasks. To leverage these datasets, we also propose learning a meta-reward that encourages policies to learn to adapt. Empirical evaluations demonstrate that our algorithm achieves higher returns than existing meta-RL methods. Additionally, we show that even with full task information, adaptation is more challenging than previously assumed. However, policies trained with our meta-reward adapt to new tasks successfully.
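To make the high-level description in the abstract concrete, the following is a minimal, hypothetical sketch of the kind of meta-training loop it describes: an exploration policy collects a small dataset from each sampled task, and that dataset is scored with a meta-reward that encourages informative exploration. All names, the toy 2D task, and the placeholder meta-reward are illustrative assumptions, not the paper's actual algorithm or implementation.

```python
# Hypothetical sketch of a meta-RL outer loop: per sampled task, an exploration
# policy gathers a small dataset, which is then scored by a learned meta-reward.
# Every function and quantity here is an illustrative placeholder.
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Stand-in for drawing a task from the meta-training distribution."""
    return {"goal": rng.uniform(-1.0, 1.0, size=2)}

def explore(task, num_steps=10):
    """Collect a small dataset of (state, action) pairs from the task.
    A learned exploration policy would pick informative actions; random
    actions are used here purely as a placeholder."""
    dataset = []
    state = np.zeros(2)
    for _ in range(num_steps):
        action = rng.uniform(-0.1, 0.1, size=2)
        state = state + action
        dataset.append((state.copy(), action.copy()))
    return dataset

def meta_reward(dataset, task):
    """Placeholder meta-reward: negative distance of the final explored state
    to the (hidden) task goal. A learned meta-reward would replace this."""
    final_state = dataset[-1][0]
    return -np.linalg.norm(final_state - task["goal"])

# Outer loop over sampled tasks: score each small exploration dataset.
for it in range(3):
    task = sample_task()
    data = explore(task)
    score = meta_reward(data, task)
    print(f"task {it}: meta-reward {score:.3f}")
    # In a full implementation, this signal would drive updates of both the
    # exploration policy and the adaptation policy.
```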
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=1tWfApVmRH
Changes Since Last Submission: We have added additional empirical results on new benchmarks, as recommended. Specifically, we have added results on Meta-World (Yu et al., 2020; McLean et al., 2025) and HopperMass, an adaptation of the classic MuJoCo Hopper environment to the meta-RL setting (Nakhaeinezhadfard et al., 2025). Meta-World is a popular meta-RL benchmark whose tasks are inspired by practical robotics scenarios. HopperMass extends Hopper with a distribution over tasks and places a heavy emphasis on out-of-distribution adaptation during meta-testing. To improve the analysis of our algorithm's computational complexity, we have added a detailed comparison between the model parameters of our algorithm and those of the baseline algorithms, showing that, while our architecture is large during meta-training, its size decreases substantially during meta-testing. Finally, we have restructured the paper, especially Sec. 5, to better present our new results and improve the narrative flow.

References:
Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, KR Zentner, Ryan Julian, JK Terry, Isaac Woungang, et al. Meta-World+: An improved, standardized, RL benchmark. arXiv preprint arXiv:2505.11289, 2025.
Mohammadreza Nakhaeinezhadfard, Aidan Scannell, and Joni Pajarinen. Entropy regularized task representation learning for offline meta-reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 19616–19623, 2025.
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094–1100. PMLR, 2020.
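For readers unfamiliar with benchmarks of this kind, the sketch below illustrates how a HopperMass-style task distribution is typically constructed: each "task" is a Hopper instance whose body masses are rescaled by a sampled factor, with meta-testing scales drawn outside the meta-training range. This is a generic illustration under assumed ranges, not the implementation of Nakhaeinezhadfard et al. (2025); it requires gymnasium with MuJoCo installed.

```python
# Illustrative (not the benchmark's actual) construction of a task distribution
# over Hopper body masses. Requires: pip install "gymnasium[mujoco]".
import gymnasium as gym
import numpy as np

def make_hopper_task(mass_scale):
    """Create a Hopper environment whose body masses are scaled by mass_scale."""
    env = gym.make("Hopper-v4")
    model = env.unwrapped.model              # underlying MuJoCo model
    model.body_mass[:] = model.body_mass * mass_scale
    return env

rng = np.random.default_rng(0)

# Meta-training tasks: mass scales from an assumed in-distribution range.
train_scales = rng.uniform(0.75, 1.25, size=5)
# Meta-testing tasks: assumed out-of-distribution scales, reflecting the
# benchmark's focus on out-of-distribution adaptation.
test_scales = np.array([0.5, 1.5])

for scale in train_scales:
    env = make_hopper_task(scale)
    obs, info = env.reset(seed=0)
    # ... run the meta-RL exploration / adaptation loop on this task ...
    env.close()
```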
Assigned Action Editor: ~Dileep_Kalathil1
Submission Number: 6318