Harnessing Density Ratios for Online Reinforcement Learning

Philip Amortila; Dylan J Foster; Nan Jiang; Ayush Sekhari; Tengyang Xie

Harnessing Density Ratios for Online Reinforcement Learning

Philip Amortila, Dylan J Foster, Nan Jiang, Ayush Sekhari, Tengyang Xie

Published: 16 Jan 2024, Last Modified: 16 Mar 2024ICLR 2024 spotlightEveryoneRevisionsBibTeX

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: reinforcement learning theory, online RL, offline RL, hybrid RL, density ratio, marginalized importance weight, weight function, general function approximation

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: The notion of density ratio modeling, an emerging topic in offline RL, has been largely absent from online RL. We show a perhaps surprising result, that density ratio-based algorithms have online counterparts.

Abstract: The theories of offline and online reinforcement learning, despite having evolved in parallel, have begun to show signs of the possibility for a unification, with algorithms and analysis techniques for one setting often having natural counterparts in the other. However, the notion of *density ratio modeling*, an emerging paradigm in offline RL, has been largely absent from online RL, perhaps for good reason: the very existence and boundedness of density ratios relies on access to an exploratory dataset with good coverage, but the core challenge in online RL is to collect such a dataset without having one to start. In this work we show---perhaps surprisingly---that density ratio-based algorithms have online counterparts. Assuming only the existence of an exploratory distribution with good coverage, a structural condition known as *coverability* (Xie et al., 2023), we give a new algorithm (GLOW) that uses density ratio realizability and value function realizability to perform sample-efficient online exploration. GLOW addresses unbounded density ratios via careful use of truncation, and combines this with optimism to guide exploration. GLOW is computationally inefficient; we complement it with a more efficient counterpart, HyGLOW, for the Hybrid RL setting (Song et al., 2023) wherein online RL is augmented with additional offline data. HyGLOW is derived as a special case of a more general meta-algorithm that provides a provable black-box reduction from hybrid RL to offline RL, which may be of independent interest.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: pdf

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: learning theory

Submission Number: 8279

Loading