Abstract: In online reinforcement learning (RL) with large state spaces, MDPs with low-rank transitions, i.e., MDPs whose transition matrix can be factored into the product of two matrices (a left matrix and a right matrix), form a highly representative structure that enables tractable exploration. When given to the learner, the left matrix enables expressive function approximation for value-based learning, and this setting has been studied extensively (e.g., in linear MDPs). Similarly, the right matrix induces powerful models for state-occupancy densities. However, using such density features to learn in low-rank MDPs has, to the best of our knowledge, never been studied, and it is a setting with interesting connections to leveraging the power of generative models in RL. In this work, we initiate the study of learning low-rank MDPs with density features. Our algorithm performs reward-free learning and builds an exploratory distribution in a level-by-level manner. It uses the density features for off-policy estimation of the policies' state distributions, and constructs the exploratory data by choosing a barycentric spanner of these distributions. From an analytical point of view, the additive error of distribution estimation is largely incompatible with the multiplicative definition of data coverage (e.g., concentrability). In the absence of strong assumptions like reachability, this incompatibility may lead to exponential or even infinite errors under standard analysis strategies, which we overcome via novel technical tools.
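To make the spanner step concrete, the following is a minimal sketch (not the paper's actual implementation) of an Awerbuch–Kleinberg-style approximate barycentric spanner computation, applied to vectors that summarize candidate policies' estimated state distributions in the density-feature basis. The function name, the approximation factor `C`, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def barycentric_spanner(vectors, C=2.0, tol=1e-12):
    """Approximate C-barycentric spanner of a finite set of d-dimensional vectors
    (greedy build + swap procedure in the style of Awerbuch & Kleinberg).

    Assuming the set spans R^d, returns the indices of d vectors such that every
    vector in the set is a linear combination of the chosen ones with
    coefficients in [-C, C].
    """
    V = np.asarray(vectors, dtype=float)   # shape (n, d)
    n, d = V.shape
    basis = np.eye(d)                      # placeholder columns, replaced below
    idx = [-1] * d

    # Phase 1: greedily fill each column with the vector maximizing |det|.
    for i in range(d):
        dets = []
        for j in range(n):
            M = basis.copy()
            M[:, i] = V[j]
            dets.append(abs(np.linalg.det(M)))
        j_star = int(np.argmax(dets))
        basis[:, i] = V[j_star]
        idx[i] = j_star

    # Phase 2: swap in any vector that grows |det| by more than a factor of C.
    improved = True
    while improved:
        improved = False
        cur = abs(np.linalg.det(basis))
        for i in range(d):
            for j in range(n):
                M = basis.copy()
                M[:, i] = V[j]
                if abs(np.linalg.det(M)) > C * cur + tol:
                    basis[:, i] = V[j]
                    idx[i] = j
                    improved = True
                    cur = abs(np.linalg.det(basis))
    return idx
```

In the intended use, each candidate policy's state distribution at a given level would be summarized by a coefficient vector over the d density features (estimated off-policy from previously collected data); the policies whose vectors are selected by the spanner are then mixed to form the exploratory data distribution for that level.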