Keywords: RL Theory, Offline RL, Underlying MDP, Heterogeneous Data Sources, Provable Efficiency
TL;DR: This work investigated the problem of learning an underlying MDP with offline datasets from heterogeneous sources and proposed several provably efficient designs.
Abstract: Most of the existing offline reinforcement learning (RL) studies assume the available dataset is sampled directly from the target environment. However, in some practical applications, the available data are often coming from several related but heterogeneous environments. A theoretical understanding of efficient learning from heterogeneous offline datasets remains lacking. In this work, we study the problem of learning a (hidden) underlying Markov decision process (MDP) based on heterogeneous offline datasets collected from multiple randomly perturbed data sources. A novel HetPEVI algorithm is proposed, which jointly considers two types of uncertainties: sample uncertainties from the finite number of data samples per data source, and source uncertainties due to a finite number of data sources. Building on HetPEVI, we further incorporate reference-advantage decompositions and Bernstein-type penalties to propose the HetPEVI-Adv algorithm. Theoretical analysis not only proves the effectiveness of both HetPEVI and HetPEVI-Adv but also demonstrates the advantage of the latter. More importantly, the results explicitly characterize the learning loss due to the finite heterogeneously realized environments compared with sampling directly from the underlying MDP. Finally, we extend the study to MDPs with linear function approximation and propose the HetPEVI-Lin algorithm that provides additional efficiency guarantees beyond the tabular case.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)