SwiftVI: Time-Efficient Planning and Learning with MDPs

Published: 11 Feb 2025, Last Modified: 13 May 2025, MLSys 2025, CC BY-NC-SA 4.0
Keywords: value iteration, Markov Decision Processes
TL;DR: We provide value iteration algorithms for planning and learning with MDPs that scale in the number of states and actions.
Abstract: Markov decision processes (MDPs) find application wherever a decision-making agent acts and learns in an uncertain environment, from facility management to healthcare and service provisioning. However, the tasks of learning model parameters and planning the optimal policy such an agent should follow incur high computational cost, calling for solutions that scale to large numbers of actions and states. In this paper, we propose SwiftVI, a suite of algorithms that plan and learn with MDPs scalably by organizing the set of actions for each state in priority queues and deriving bounds for backup Q-values. Our championed solution prunes the set of actions at each state utilizing a tight upper bound and a single priority queue. A thorough experimental study confirms that SwiftVI algorithms achieve high efficiency gains robustly across model parameters.
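The abstract's core idea, keeping actions in a priority queue ordered by upper bounds on their backup Q-values and pruning actions whose bound cannot beat the current best, can be sketched as follows. This is an illustrative toy, not the paper's implementation; all names (`Q_upper`, `exact_q`, `backup_with_pruning`) are assumptions for the sketch.

```python
import heapq

def backup_with_pruning(actions, Q_upper, exact_q):
    """Return max_a Q(s, a) for one state while evaluating as few
    actions as possible.

    Q_upper[a]  -- a precomputed upper bound on Q(s, a) (assumed valid)
    exact_q(a)  -- computes the exact (expensive) backup Q-value for a
    """
    # Single max-heap keyed on the upper bound (negated for heapq's min-heap).
    heap = [(-Q_upper[a], a) for a in actions]
    heapq.heapify(heap)
    best = float("-inf")
    while heap:
        neg_ub, a = heapq.heappop(heap)
        if -neg_ub <= best:
            # No remaining action's upper bound exceeds the best exact
            # Q-value found so far, so the rest can be pruned.
            break
        best = max(best, exact_q(a))
    return best
```

Because actions are popped in decreasing order of their upper bounds, the loop typically computes only a handful of exact backups per state, which is the source of the efficiency gains the paper reports.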
Submission Number: 36
