Posterior Sampling for Reinforcement Learning on Graphs

Published: 25 Mar 2025, Last Modified: 25 Mar 2025, Accepted by TMLR, CC BY 4.0
Abstract: Many Markov Decision Processes (MDPs) exhibit structure in their state and action spaces that is often left unexploited. We consider the case where this structure can be modelled as a directed acyclic graph (DAG) composed of nodes and edges. Each node has a state, and its state transition dynamics are influenced by the states and actions at its parent nodes. We propose an MDP framework, the \emph{Directed Acyclic Markov Decision Process} (DAMDP), that formalises this problem, and we develop algorithms for planning and learning in it. Crucially, DAMDPs retain many of the benefits of MDPs: we show that Dynamic Programming finds the optimal policy in known DAMDPs. We also demonstrate how to perform Reinforcement Learning in DAMDPs when the transition probabilities and the reward function are unknown. To this end, we derive a posterior sampling-based algorithm that leverages the graph structure to boost learning efficiency. Moreover, we obtain a theoretical bound on the Bayesian regret of this algorithm, which directly shows the efficiency gain from considering the graph structure. We conclude by empirically demonstrating that, by harnessing the DAMDP structure, our algorithm outperforms traditional posterior sampling for Reinforcement Learning on both a maximum flow problem and a real-world wind farm optimisation task.
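To make the factored structure described in the abstract concrete, the sketch below shows a toy DAG-structured MDP in Python: each node carries a local state whose transition depends on its own action and on its parents' states (the full DAMDP framework also conditions on parent actions). This is a minimal, hypothetical illustration under those assumptions; the names `Node` and `step` are invented here and are not the API of the released code.

```python
# Minimal sketch of DAG-structured transition dynamics as described in the
# abstract. Hypothetical illustration only -- not the released DAMDP code.
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    parents: List["Node"] = field(default_factory=list)
    state: int = 0

    def transition(self, action: int) -> None:
        # Toy local dynamics: the next local state depends on this node's
        # action and on the states of its parent nodes -- the factorisation
        # a DAMDP exploits.
        parent_signal = sum(p.state for p in self.parents)
        noise = random.choice([0, 1])
        self.state = (self.state + action + parent_signal + noise) % 5

# A small DAG in topological order: source -> middle -> sink.
source = Node("source")
middle = Node("middle", parents=[source])
sink = Node("sink", parents=[middle])
nodes = [source, middle, sink]

def step(nodes: List[Node], actions: List[int]) -> List[int]:
    # Update nodes in topological order so each child sees its parents'
    # already-updated states within the step.
    for node, action in zip(nodes, actions):
        node.transition(action)
    return [n.state for n in nodes]

print(step(nodes, actions=[1, 0, 1]))
```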
Submission Length: Long submission (more than 12 pages of main content)
Code: https://github.com/ArnaudRobert/DAMDP
Assigned Action Editor: ~Wilka_Torrico_Carvalho1
Submission Number: 3512
