Keywords: Multi-agent Reinforcement Learning, Reinforcement Learning Theory, Policy Evaluation
TL;DR: This paper establishes the mathematical foundation of value decomposition in MARL.
Abstract: Value decomposition has long been a fundamental technique in multi-agent reinforcement learning (RL) and dynamic programming. Specifically, the value function of a global state $(s_1,s_2,\ldots,s_N)$ is often approximated as a sum of local value functions: $V(s_1,s_2,\ldots,s_N)\approx\sum_{i=1}^N V_i(s_i)$. This approach has found various applications in modern RL systems. However, the theoretical justification for why this decomposition works so effectively remains underexplored. In this paper, we uncover the underlying mathematical structure that enables value decomposition. We demonstrate that a Markov decision process (MDP) permits value decomposition *if and only if* its transition matrix is not ''entangled''—a concept analogous to quantum entanglement in quantum physics. Drawing inspiration from how physicists measure quantum entanglement, we introduce a measure of ''Markov entanglement'' and show that it can be used to bound the decomposition error in general multi-agent MDPs. Using the concept of Markov entanglement, we prove that a widely used class of policies, the index policy, is weakly entangled and enjoys a sublinear $\mathcal{O}(\sqrt{N})$ decomposition error for $N$-agent systems. Finally, we show how Markov entanglement can be efficiently estimated in practice, providing practitioners with an empirical proxy for the quality of value decomposition.
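As a quick illustration of the decomposition the abstract refers to, the sketch below fits an additive approximation $V(s_1,\ldots,s_N)\approx\sum_{i=1}^N V_i(s_i)$ to a tabular joint value function by least squares and reports the residual. This is only an illustrative fit on synthetic data under assumed sizes `N` and `S`; it is not the paper's Markov entanglement measure or algorithm.

```python
import numpy as np
from itertools import product

# Minimal sketch (not the paper's method): fit an additive approximation
#   V(s_1, ..., s_N) ~= sum_i V_i(s_i)
# to a tabular joint value function by least squares, and report the
# residual as a crude proxy for how well value decomposition can work.

rng = np.random.default_rng(0)
N, S = 3, 4                       # hypothetical: N agents, S local states each
joint_states = list(product(range(S), repeat=N))

# Synthetic joint value function; in practice this would come from
# policy evaluation on the multi-agent MDP.
V = rng.normal(size=len(joint_states))

# Design matrix: one indicator column per (agent, local state) pair,
# so the additive model is linear in the local value tables V_i.
X = np.zeros((len(joint_states), N * S))
for row, s in enumerate(joint_states):
    for i, s_i in enumerate(s):
        X[row, i * S + s_i] = 1.0

# Least-squares fit of the local tables and the resulting decomposition error.
theta, *_ = np.linalg.lstsq(X, V, rcond=None)
residual = V - X @ theta
print("relative decomposition error:", np.linalg.norm(residual) / np.linalg.norm(V))
```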
Submission Number: 12