Value Decomposition Fails in Anti-Coordination: A Systematic MARL Comparison for Dynamic Spectrum Access in Dense 6G Networks
Keywords: multi-agent reinforcement learning, dynamic spectrum access, 6G, anti-coordination, value decomposition, QMIX, VDN, QPLEX, MAPPO, IPPO, COMA, multi-player bandits, cognitive radio, CTDE
TL;DR: Across 14 MARL methods on dynamic spectrum access, value decomposition (VDN, QMIX, QPLEX) underperforms even Random; the anti-coordination structure of DSA breaks Q-factorization, while policy gradient and multi-player bandits work.
Abstract: Dynamic spectrum access (DSA) requires secondary users to select different channels to avoid mutual interference, making it fundamentally an anti-coordination problem. We present the largest systematic comparison of multi-agent reinforcement learning for DSA, evaluating fourteen methods across six paradigms: non-learning baselines (Random, Greedy, Slotted ALOHA, p-CSMA), multi-player bandit algorithms (Musical Chairs, SIC-MMAB), independent learners (IDQN, IDRQN, IPPO), and centralized-training decentralized-execution methods spanning value decomposition (VDN, QMIX, QPLEX), counterfactual credit assignment (COMA-lite), and centralized-critic policy gradient (MAPPO). In a realistic simulation with Rayleigh fading and Markov-modulated primary user (PU) activity, value decomposition methods achieve only 0.34 to 0.43 bps/Hz averaged over five seeds, far below even the Random baseline at 1.10 bps/Hz (95% CI: [1.096, 1.110]) in our DSA environments. This failure persists across all interventions tested: IGM-complete decomposition (QPLEX), agent-ID symmetry breaking, increased replay ratios, richer observations via PU sensing, extended training to 2.2x the original budget (52,500 steps), and the addition of an idle action. Extended training actually worsens value decomposition performance. Critically, value decomposition remains below Random even at the highest density tested (N = 50): VDN achieves 0.157, QMIX 0.164, and QPLEX 0.168 bps/Hz, compared to Random at 0.299 bps/Hz. We show analytically that the constant PU collision rate observed across all methods (around 0.375) is consistent with reward-rational behavior rather than optimization failure: avoiding PU-active channels would concentrate secondary users (SUs) onto fewer free channels, increasing SU congestion and reducing throughput.
A counterfactual credit assignment baseline (COMA-lite) exhibits bimodal behavior across five seeds: two seeds converge to throughput matching IPPO (around 1.09 bps/Hz), while three seeds collapse to near zero (around 0.09 bps/Hz). The resulting 40% success rate highlights both the promise and the current instability of counterfactual methods. Policy gradient methods (IPPO, MAPPO) emerge as the only learned methods that consistently match or surpass stochastic baselines across all densities in our DSA environments.
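The congestion argument in the abstract (avoiding PU-active channels packs agents onto fewer free channels) can be illustrated with a toy calculation. The sketch below is not the paper's environment or analysis: it assumes agents pick channels uniformly at random and ignores fading, PU dynamics, and the actual reward. The function names and the values of `N_AGENTS`, `ALL_CHANNELS`, and `FREE_CHANNELS` are hypothetical, chosen only so that 3 of 8 channels are busy, loosely mirroring the roughly 0.375 rate quoted above.

```python
import random


def p_alone(n_agents: int, n_channels: int) -> float:
    """Closed-form probability that one agent has its channel to itself
    when every agent picks uniformly at random among n_channels."""
    return (1.0 - 1.0 / n_channels) ** (n_agents - 1)


def simulate_p_alone(n_agents: int, n_channels: int,
                     trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    alone = 0
    for _ in range(trials):
        picks = [rng.randrange(n_channels) for _ in range(n_agents)]
        # Agent 0 is alone if nobody else chose the same channel.
        if picks.count(picks[0]) == 1:
            alone += 1
    return alone / trials


# Hypothetical toy numbers (assumptions, not taken from the paper).
N_AGENTS = 10
ALL_CHANNELS = 8    # agents spread over every channel
FREE_CHANNELS = 5   # agents confined to the PU-free channels only

spread = p_alone(N_AGENTS, ALL_CHANNELS)
packed = p_alone(N_AGENTS, FREE_CHANNELS)
spread_mc = simulate_p_alone(N_AGENTS, ALL_CHANNELS)

print(f"alone when spread over {ALL_CHANNELS} channels: {spread:.3f}")
print(f"alone when packed onto {FREE_CHANNELS} channels: {packed:.3f}")
print(f"Monte Carlo check (spread): {spread_mc:.3f}")
```

With these toy numbers the chance of a collision-free slot drops from about 0.30 to about 0.13 when agents crowd onto the free channels. Whether that loss outweighs the benefit of avoiding PU collisions depends on density and on the reward, which this sketch does not model.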
Submission Number: 17