Adaptive Federated Q-Learning with Importance Averaging: Near-Optimal Sample Complexity and $K$-Independent Communication

Agents4Science 2025 Conference Submission 317 Authors

17 Sept 2025 (modified: 08 Oct 2025). Submitted to Agents4Science. License: CC BY 4.0

Keywords: Federated Reinforcement Learning

Abstract: We revisit federated tabular Q-learning with $K$ decentralized agents that interact with a common MDP under heterogeneous behavior policies and periodically synchronize with a server. We analyze a simple, practical scheme: local asynchronous Q-learning with \emph{importance averaging} at synchronization and an \emph{adaptive doubling} communication schedule. Counting \emph{total} environment steps across all agents, we show that the sample complexity matches that of a centralized learner up to logarithmic factors. It depends on the minimum entry of the \emph{average} stationary occupancy rather than on that of the worst single agent. The number of synchronization rounds is $\tilde{\mathcal{O}}\big((1-\gamma)^{-1}\log(1/\varepsilon)\big)$, independent of $K$. The proof tracks where each $(1-\gamma)$ factor originates. It combines standard tools (martingale concentration, empirical occupancy concentration for uniformly ergodic chains, and a product-chain mixing reduction), each stated in self-contained form with citations to prior literature.
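The abstract names the two ingredients of the scheme but does not define them, so the following is only a minimal sketch under stated assumptions. The sketch assumes the following forms:

* "Importance averaging" is taken to be a visit-count-weighted mean of the agents' local Q-tables.
* The "adaptive doubling" schedule is taken to run $h_0 2^r$ local steps per agent in round $r$.

All function names (`importance_average`, `local_round`, `federated_q_learning`, `min_occupancy`) and the parameters `h0` and `lr` are hypothetical. The sketch illustrates two qualitative points. First, the round count grows logarithmically in the per-agent horizon and never involves $K$. Second, agents with complementary behavior policies can each have zero coverage of some $(s,a)$ pair while their average occupancy covers every pair. It does not reproduce the paper's analysis or constants.

```python
import numpy as np


def importance_average(local_qs, visit_counts, q_prev):
    """ASSUMED importance-averaging rule (illustrative, not the paper's definition):
    each (s, a) entry is the visit-count-weighted mean of the agents' local
    estimates; entries no agent visited this round keep their previous value."""
    local_qs = np.asarray(local_qs, dtype=float)      # shape (K, S, A)
    counts = np.asarray(visit_counts, dtype=float)    # shape (K, S, A)
    total = counts.sum(axis=0)                        # shape (S, A)
    weighted = (counts * local_qs).sum(axis=0)
    # np.maximum avoids 0/0; those entries are replaced by q_prev anyway.
    return np.where(total > 0, weighted / np.maximum(total, 1.0), q_prev)


def local_round(q, P, R, policy, s, steps, gamma, lr, rng):
    """Asynchronous Q-learning along one agent's trajectory under its behavior policy."""
    q = q.copy()
    counts = np.zeros_like(q)
    n_states, n_actions = q.shape
    for _ in range(steps):
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=P[s, a])
        target = R[s, a] + gamma * q[s_next].max()
        q[s, a] += lr * (target - q[s, a])    # only the visited entry is updated
        counts[s, a] += 1
        s = s_next
    return q, counts, s


def federated_q_learning(P, R, policies, gamma, steps_per_agent, h0=8, lr=0.1, seed=0):
    """ASSUMED doubling schedule: round r runs h0 * 2**r local steps per agent."""
    rng = np.random.default_rng(seed)
    K = len(policies)
    q = np.zeros(R.shape)
    states = [0] * K
    total_counts = np.zeros((K,) + R.shape)
    done, h, rounds = 0, h0, 0
    while done < steps_per_agent:
        h = min(h, steps_per_agent - done)    # truncate the final round
        results = [local_round(q, P, R, policies[k], states[k], h, gamma, lr, rng)
                   for k in range(K)]
        # Synchronization: the server merges the local tables and broadcasts q.
        q = importance_average([res[0] for res in results], [res[1] for res in results], q)
        for k, (_, counts, s_end) in enumerate(results):
            total_counts[k] += counts
            states[k] = s_end
        done, h, rounds = done + h, 2 * h, rounds + 1
    return q, rounds, total_counts


def min_occupancy(counts):
    """Smallest entry of an empirical state-action occupancy."""
    counts = np.asarray(counts, dtype=float)
    return (counts / counts.sum()).min()


# Two agents with complementary behavior policies on a 2-state, 2-action MDP.
P = np.full((2, 2, 2), 0.5)                       # uniform transitions
R = np.array([[1.0, 0.0], [0.0, 1.0]])
policies = [np.array([[1.0, 0.0], [1.0, 0.0]]),   # agent 0 always plays action 0
            np.array([[0.0, 1.0], [0.0, 1.0]])]   # agent 1 always plays action 1
q, rounds, counts = federated_q_learning(P, R, policies, gamma=0.9, steps_per_agent=1000)
per_agent = [min_occupancy(c) for c in counts]                  # each agent misses an action
average = np.mean([c / c.sum() for c in counts], axis=0).min()  # strictly positive
print(rounds)  # 7
```

With `h0=8` and 1000 steps per agent, the rounds have lengths 8, 16, ..., 256 and then a truncated 496, giving 7 rounds. Rerunning with more agents leaves the count unchanged, because the schedule depends only on the per-agent horizon. How this logarithmic round count maps onto the stated $\tilde{\mathcal{O}}\big((1-\gamma)^{-1}\log(1/\varepsilon)\big)$ bound depends on the horizon the paper's analysis requires. The sketch makes no claim about that.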

Supplementary Material: zip

Submission Number: 317