Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation

Published: 25 Sept 2024, Last Modified: 06 Nov 2024NeurIPS 2024 posterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: function approximation, regret analysis, reinforcement learning, Markov decision process
Abstract: We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space. Despite its benefits, introducing the non-linear function raises significant challenges in both *computational* and *statistical* efficiency. The best-known result of Hwang and Oh [2023] has achieved an $\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret, where $\kappa$ is a problem-dependent quantity, $d$ is the feature dimension, $H$ is the episode length, and $K$ is the number of episodes. While this result attains the same rate in $K$ as linear cases, the method requires storing all historical data and suffers from an $\mathcal{O}(K)$ computation cost per episode. Moreover, the quantity $\kappa$ can be exponentially small in the worst case, leading to a significant gap for the regret compared to linear function approximation. In this work, we first address the computational and storage issue by proposing an algorithm that achieves the same regret with only $\mathcal{O}(1)$ cost. Then, we design an enhanced algorithm that leverages local information to enhance statistical efficiency. It not only maintains an $\mathcal{O}(1)$ computation and storage cost per episode but also achieves an improved regret of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + d^2H^2\kappa^{-1})$, nearly closing the gap with linear function approximation. Finally, we establish the first lower bound for MNL function approximation, justifying the optimality of our results in $d$ and $K$.
Primary Area: Reinforcement learning
Submission Number: 14977
Loading