Abstract: In reinforcement learning, algorithm performance is typically evaluated along two dimensions: computational and statistical complexity. While theoretical researchers often prioritize statistical efficiency, minimizing the number of samples needed to reach a desired accuracy, practitioners also focus on reducing computational costs, such as training time and resource consumption, in addition to sample complexity. Bridging these two perspectives requires algorithms that deliver strong statistical guarantees while remaining computationally efficient in practice. In this paper, we introduce Meta-Step, a meta-algorithm designed to enhance state-of-the-art RL algorithms by improving their computational efficiency while maintaining competitive sample efficiency. Meta-Step is based on the novel notion of a $W$-step Markov decision process (MDP), in which, instead of performing a single action and transitioning to the next state, the agent executes a sequence of $W$ actions before observing the resulting state and collecting the discounted $W$-step cumulative reward. First, we provide a theoretical analysis of the suboptimality introduced in the optimal policy's performance when planning in a $W$-step MDP, highlighting the impact of the environment's stochasticity. Second, we apply Meta-Step to GPOMDP, a well-known policy gradient method, and theoretically investigate the advantages of learning in the $W$-step MDP in terms of variance reduction and improved sample complexity. Finally, empirical evaluations confirm that Meta-Step reduces computational costs while preserving, and in certain scenarios improving, sample efficiency.
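To make the $W$-step MDP construction concrete, the following is a minimal, hypothetical Python sketch (not the paper's code) of a wrapper that executes a block of $W$ primitive actions, returns only the final observation, and accumulates the discounted $W$-step cumulative reward. It assumes a Gym-style interface where `env.step(a)` returns `(obs, reward, done, info)`; the class and parameter names are illustrative.

```python
class WStepWrapper:
    """Illustrative sketch of a W-step MDP built on top of a single-step
    environment: the agent submits W actions at once, the wrapper executes
    them in sequence, and only the state after the block is observed."""

    def __init__(self, env, w, gamma):
        self.env = env      # underlying single-step environment
        self.w = w          # number of primitive actions per meta-step
        self.gamma = gamma  # discount factor of the original MDP

    def reset(self):
        return self.env.reset()

    def step(self, actions):
        """Execute a sequence of W primitive actions and return the
        resulting state, the discounted W-step cumulative reward,
        the done flag, and the last info dict."""
        assert len(actions) == self.w
        total_reward, obs, done, info = 0.0, None, False, {}
        for k, action in enumerate(actions):
            obs, reward, done, info = self.env.step(action)
            total_reward += (self.gamma ** k) * reward
            if done:  # episode ended mid-block; stop early
                break
        return obs, total_reward, done, info
```

Under this construction, a policy in the $W$-step MDP maps states to blocks of $W$ actions, so fewer policy evaluations and gradient computations are needed per unit of interaction, which is the source of the computational savings the abstract describes.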
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Pan_Xu1
Submission Number: 5494