Offline Reinforcement Learning with Domain-Unlabeled Data

Published: 09 May 2025, Last Modified: 28 May 2025 · RLC 2025 · CC BY 4.0
Keywords: Offline RL, Positive-Unlabeled learning, Domain-Unlabeled Data
Abstract: Offline reinforcement learning (RL) is vital in areas where active data collection is expensive or infeasible, such as robotics or healthcare. In the real world, offline datasets often involve multiple “domains” that share the same state and action spaces but have distinct dynamics, and only a small fraction of samples are clearly labeled as belonging to the target domain we are interested in. For example, in robotics, precise system identification may only have been performed for part of the deployments. To address this challenge, we consider Positive-Unlabeled Offline RL (PUORL), a novel offline RL setting in which we have a small amount of target-domain data and a large pool of domain-unlabeled data from multiple domains, including the target domain. For PUORL, we propose a plug-and-play approach that leverages positive-unlabeled (PU) learning to train a domain classifier. The classifier then extracts target-domain samples from the domain-unlabeled data, augmenting the scarce labeled target-domain set. Empirical results on a modified version of the D4RL benchmark demonstrate the effectiveness of our method: even when only $1\\% \\textendash 3\\%$ of the dataset is domain-labeled, our approach accurately identifies target-domain samples and achieves high performance, even under substantial dynamics shift. Our plug-and-play algorithm seamlessly integrates PU learning with existing offline RL pipelines, enabling effective multi-domain data utilization in scenarios where comprehensive domain labeling is prohibitive.
Submission Number: 189