Mixture of Autoencoder Experts Guidance using Unlabeled and Incomplete Data for Exploration in Reinforcement Learning

Published: 01 Jun 2026, Last Modified: 01 Jun 2026
Venue: IEEE ICRA 2026 Workshop Xplore (Oral)
License: CC BY 4.0
Keywords: reinforcement learning, exploration, intrinsic motivation, weak supervision, reward shaping
TL;DR: MOE-GUIDE is an assumption-light framework for leveraging weak expert observations in exploration, decoupling expert-similarity modeling from reward deployment to turn sparse, state-only, incomplete data into a reusable intrinsic guidance signal.
Abstract: Large amounts of weak expert-relevant data exist in practice, but current reinforcement learning methods often struggle to use such data because it lacks actions, rewards, and complete trajectories. We study exploration from such weak observations and propose a framework that decouples reward construction into two parts: an expert-similarity model is learned from sparse state-only observations, and a separate white-box mapping converts expert consistency into an intrinsic reward. In our instantiation, the expert-similarity model is a mixture of autoencoder experts, and the learned signal is applied to successor states with time-dependent scaling, yielding a transient exploration prior toward expert-supported regions. We evaluate the method with Soft Actor-Critic on MuJoCo locomotion using sparse, temporally subsampled state-only observations as a proxy for sparse teleoperation data. The same learned expert-similarity model and mapping family are reused across settings, with only limited adjustment of the mapping parameters. With both strong and imperfect demonstrations, the method improves exploration most clearly when the extrinsic reward is sparse, partial, or otherwise insufficient for efficient exploration. These results suggest that weak expert observations can already provide useful guidance that is reusable across different reward structures, under substantially weaker assumptions than those required by standard demonstration-based policy-learning or reward-learning methods.
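To make the decoupling described in the abstract concrete, here is a minimal sketch of how an expert-similarity score from a mixture of autoencoders could be mapped to an intrinsic reward on successor states with time-dependent scaling. It is not the authors' implementation. Everything specific is an assumption made for illustration: the PCA-based `LinearAutoencoder` stand-in for a neural autoencoder, the minimum-error aggregation over experts, the exponential mapping with `temperature`, the linear decay over `decay_steps`, and all function and parameter names.

```python
import numpy as np


class LinearAutoencoder:
    """Tiny linear autoencoder fit by PCA (hypothetical stand-in for a neural one)."""

    def __init__(self, states, latent_dim):
        self.mean = states.mean(axis=0)
        # Principal directions of the centered states act as tied encoder/decoder weights.
        _, _, vt = np.linalg.svd(states - self.mean, full_matrices=False)
        self.components = vt[:latent_dim]

    def reconstruction_error(self, state):
        centered = state - self.mean
        recon = centered @ self.components.T @ self.components
        return float(np.mean((centered - recon) ** 2))


def expert_error(experts, state):
    """Assumed aggregation rule: error of the best-matching expert in the mixture."""
    return min(e.reconstruction_error(state) for e in experts)


def intrinsic_reward(experts, next_state, step,
                     scale=1.0, temperature=0.1, decay_steps=100_000):
    """Assumed white-box mapping from expert consistency to an intrinsic reward.

    The reward is computed on the successor state and fades linearly to zero
    by `decay_steps`, so the guidance acts only as a transient prior.
    """
    similarity = np.exp(-expert_error(experts, next_state) / temperature)
    decay = max(0.0, 1.0 - step / decay_steps)
    return scale * decay * similarity


# Usage on synthetic state-only observations (no actions, no rewards).
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.1, size=(50, 4))
cluster_b = rng.normal(3.0, 0.1, size=(50, 4))
experts = [LinearAutoencoder(cluster_a, 2), LinearAutoencoder(cluster_b, 2)]

near_expert = experts[1].mean      # a state inside an expert-supported region
far_from_expert = np.full(4, 10.0) # a state far from both clusters
r_near = intrinsic_reward(experts, near_expert, step=0)
r_far = intrinsic_reward(experts, far_from_expert, step=0)
```

In an off-policy learner such as Soft Actor-Critic, a term like this would typically be added to the extrinsic reward of each transition. Because the similarity model and the mapping are separate, only `scale`, `temperature`, and `decay_steps` would need adjusting when the extrinsic reward structure changes, which mirrors the reuse claim in the abstract.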
Submission Number: 26