Delayed MDPs with Feature Mapping

Published: 01 Jan 2024 · Last Modified: 26 Jul 2025 · IJCNN 2024 · CC BY-SA 4.0
Abstract: Modern reinforcement learning (RL) frequently suffers from large state and action spaces and from delayed feedback in the environment. Recently, feature mapping has become an effective tool for handling large state and action spaces, where agents use given features to parameterize high-dimensional value functions. However, applying feature mapping in an environment with delayed feedback is challenging and remains unsolved. In a delayed environment, using a feature mapping requires careful inference of the current environment state from the agent's observed state and the intervening action sequence. In this setting, two consecutive action sequences share overlapping actions, whose intricate dependence challenges the probabilistic arguments underlying theoretical regret analysis. In this paper, we propose a new feature-mapping-based framework that solves constant-delayed Markov Decision Processes (CDMDPs) with an $m$-step delay using the observed state and the action sequence. To address the statistical dependence introduced by the overlapping actions, we design a parameterized-transition CDMDP and a novel method to decouple the dependence induced by the action sequences. The algorithm attains a regret bound of $\tilde{O}\big((d + 2 - \gamma^{m})\sqrt{T}/(1-\gamma)^{2}\big)$, where $d$ is the number of features and $\gamma$ is the discount factor, which approaches the lower bound $\Omega\big(d\sqrt{T}/(1-\gamma)^{3/2}\big)$. The theoretical analysis demonstrates that the proposed algorithm effectively alleviates the adverse impact of delayed feedback on the agent's decision-making and yields a regret bound that is independent of the sizes of the state and action spaces.
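
To make the information structure concrete, below is a minimal sketch (not the paper's implementation) of the interaction loop in an $m$-step constant-delay MDP: at each step the agent acts on the state observed $m$ steps earlier together with the queue of actions taken since that observation, which is exactly the (observation state, action sequence) pair from which the current state must be inferred before a feature mapping can be applied. The function name, the policy signature, and the simplified gym-like `env.reset()` / `env.step(action) -> (next_state, reward)` interface are illustrative assumptions, not part of the paper.

```python
from collections import deque

def constant_delay_rollout(env, policy, delay_m, horizon):
    """Sketch of an m-step constant-delay MDP (CDMDP) interaction loop.

    At time t the agent only observes the state from m steps earlier (once
    available) plus the actions taken since then. `env` is assumed to expose a
    simplified gym-like interface: reset() -> state, step(a) -> (state, reward).
    """
    state = env.reset()
    pending_states = deque([state])          # states not yet released to the agent
    action_queue = deque(maxlen=delay_m)     # actions taken since the delayed observation
    total_reward = 0.0

    for t in range(horizon):
        delayed_obs = pending_states[0]      # state observed with (up to) m-step delay
        # The policy must infer the current state from (delayed_obs, action_queue)
        # before applying the feature mapping to the parameterized value function.
        action = policy(delayed_obs, tuple(action_queue))
        next_state, reward = env.step(action)
        total_reward += reward

        pending_states.append(next_state)
        action_queue.append(action)
        if len(pending_states) > delay_m + 1:
            pending_states.popleft()         # the next delayed observation becomes available

    return total_reward
```

In this sketch the pair `(delayed_obs, action_queue)` plays the role of the agent's information state; the framework described in the abstract applies the feature mapping only after inferring the current environment state from such a pair, and consecutive pairs share overlapping actions, which is the source of the statistical dependence the paper's analysis must decouple.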