['3c3', '< Abstract: We consider the problem of Federated Q-learning, where M agents aim to collaboratively learn the optimal Q-function of an unknown infinite horizon Markov Decision Process with finite state and action spaces. We investigate the trade-off between sample and communication complexity for the widely used class of intermittent communication algorithms. We first establish the converse result, where we show that any Federated Q-learning that offers a linear speedup with respect to number of agents in sample complexity needs to incur a communication cost of at least Ω(1/(1-γ)), where γ is the discount factor. We also propose a new Federated Q-learning algorithm, called Fed-DVR-Q, which is the first Federated Q-learning algorithm to simultaneously achieve order-optimal sample and communication complexities. Thus, together these results provide a complete characterization of the sample-communication complexity trade-off in Federated Q-learning.', '---', '> Abstract: This paper addresses the fundamental problem of Federated Q-learning, where M agents collaboratively learn the optimal Q-function of an unknown infinite horizon Markov Decision Process with finite state and action spaces. We rigorously investigate the inherent trade-off between sample and communication complexity within the widely adopted class of intermittent communication algorithms. Our contributions are twofold: First, we establish a crucial converse result, demonstrating that any Federated Q-learning algorithm achieving a linear speedup in sample complexity with respect to the number of agents must incur a communication cost of at least Ω(1/(1-γ)), where γ is the discount factor. Second, we introduce Fed-DVR-Q, a novel Federated Q-learning algorithm that, for the first time, simultaneously achieves *both* order-optimal sample and communication complexities. 
These combined results offer a complete characterization of the sample-communication complexity trade-off in Federated Q-learning, bridging a significant gap in the literature.', '6,10c6', '< Reinforcement Learning (RL) [Sutton and Barto, 2018] refers to an online sequential decision making paradigm where the learning agent aims to learn an optimal policy, i.e., a policy that maximizes the long-term reward, through repeated interactions with an unknown environment. RL finds applications across a diverse array of fields including, but not limited to, autonomous driving, games, recommendation systems, robotics and Internet of Things (IoT) [Kober et al., 2013, Yurtsever et al., 2020, Silver et al., 2016, Lim et al., 2020].', '< The primary hurdle in RL applications is often the high-dimensional nature of the decision space that necessitates the learning agent to have access to an enormous amount of data in order to have any hope of learning the optimal policy. Moreover, the sequential collection of such an enormous amount of data through a single agent is extremely time-consuming and often infeasible in practice. Consequently, practical implementations of RL involve deploying multiple agents to collect data in parallel. This decentralized approach to data collection has fueled the design and development of distributed or federated RL algorithms that can collaboratively learn the optimal policy without actually transferring the collected data to a centralized server. Such a federated approach to RL, which does not require the transfer of local data, is gaining interest due to lower bandwidth requirements and lower security and privacy risks. 
In this work, we focus on federated variants of Q-learning algorithms where the agents collaborate to directly learn the optimal Q-function without forming an estimate of the underlying unknown environment.', '< A particularly important aspect of designing Federated RL algorithms, including Federated Q-learning algorithms, is to address the natural tension between sample and communication complexity. At one end of the spectrum lies the naïve approach of running a centralized algorithm with optimal sample complexity after transferring and combining all the collected data at a central facility/server. Such an approach trivially achieves the optimal sample complexity while suffering from a very high and infeasible communication complexity. On the other hand, several recently proposed algorithms [Khodadadian et al., 2022, Woo et al., 2023] operate in more practical regimes, offering significantly lower communication complexities as compared to the naïve approach at the cost of sub-optimal sample complexities. These results suggest the existence of an underlying trade-off between sample and communication complexities of Federated RL algorithms. The primary goal of this work is to better understand this trade-off in the context of Federated Q-learning by investigating the following fundamental questions:', '< • Fundamental limit of communication: What is the minimum amount of communication required by a federated Q-learning algorithm to achieve any statistical benefit of collaboration?', '< • Optimal algorithm design: How does one design a federated Q-Learning algorithm that simultaneously offers optimal order sample and communication complexity guarantees i.e., operates on the optimal frontier of sample-communication complexity trade-off?', '---', '> Reinforcement Learning (RL) [Sutton and Barto, 2018] is a powerful paradigm for sequential decision-making, where an agent learns to maximize long-term rewards through interaction with an unknown environment. 
Its broad applicability spans autonomous driving, games, recommendation systems, robotics, and the Internet of Things (IoT) [Kober et al., 2013, Yurtsever et al., 2020, Silver et al., 2016, Lim et al., 2020]. A significant challenge in many real-world RL applications is the sheer volume of data required to learn an optimal policy, especially in high-dimensional decision spaces. Collecting such vast amounts of data sequentially by a single agent is often prohibitively time-consuming and impractical.', '11a8,15', '> To overcome this, distributed and federated RL algorithms have emerged, enabling multiple agents to collect data in parallel and collaboratively learn an optimal policy. This federated approach, which avoids centralized data transfer, offers compelling advantages such as reduced bandwidth requirements and enhanced security and privacy. Our work specifically focuses on federated Q-learning, where agents collectively learn the optimal Q-function without explicitly modeling the underlying environment.', '> ', '> A critical challenge in designing effective Federated RL algorithms is managing the inherent trade-off between sample and communication complexity. Traditional approaches often fall into two extremes: a centralized method achieves optimal sample complexity but incurs prohibitive communication costs, while some recent distributed algorithms [Khodadadian et al., 2022, Woo et al., 2023] reduce communication at the expense of sub-optimal sample efficiency. This suggests a fundamental trade-off that is not yet fully characterized. 
This paper aims to provide a comprehensive understanding of this trade-off in Federated Q-learning by addressing two fundamental questions:', '> • Fundamental limit of communication: What is the absolute minimum communication cost required for a federated Q-learning algorithm to achieve any statistical benefit from collaboration, specifically a linear speedup with respect to the number of agents?', '> • Optimal algorithm design: How can a federated Q-learning algorithm be designed to simultaneously achieve order-optimal sample and communication complexity guarantees, thereby operating on the optimal frontier of the sample-communication complexity trade-off?', '> ', '> The remainder of this paper is structured as follows: Section 2 formally introduces the problem formulation and performance measures. Section 3 presents our fundamental lower bounds on communication complexity. Section 4 details Fed-DVR-Q, our novel algorithm that achieves the optimal trade-off. Section 5 provides concluding remarks and discusses future research directions.', '> ', '13c17', '< We consider a setup where M distributed agents collaborate to learn the optimal Q-function of an infinite horizon Markov Decision Process which is defined over a finite state space S and a finite action set A, and has a discount factor of γ ∈ (0, 1). We consider a commonly considered setup in federated learning called the intermittent communication setting, where the clients intermittently share information among themselves with the help of a central server. In this work, we provide a complete characterization of the trade-off between sample and communication complexity under the aforementioned setting by providing answers to both the questions. 
The main result of this work is twofold and is summarized below.', '---', '> We consider a rigorous theoretical framework where M distributed agents collaboratively learn the optimal Q-function of an infinite horizon Markov Decision Process (MDP), characterized by a finite state space S, a finite action set A, and a discount factor γ ∈ (0, 1). Our analysis focuses on the widely adopted intermittent communication setting in federated learning, where clients periodically exchange information via a central server. This work provides a comprehensive characterization of the fundamental trade-off between sample and communication complexity within this setting, directly addressing the two key questions posed in the introduction. Our main contributions are twofold and are summarized below:', '19,21c23', '< Single agent Q-Learning. Q-Learning has been extensively studied in the single-agent setting in terms of both its asymptotic convergence [Jaakkola et al., 1993, Tsitsiklis, 1994, Szepesvári, 1997, Borkar and Meyn, 2000] and its finite-time sample complexity in both synchronous [Even-Dar and Mansour, 2004, Beck and Srikant, 2012, Wainwright, 2019a, Chen et al., 2020, Li et al., 2023] and asynchronous settings [Chen et al., 2021b, Li et al., 2023, 2021, Qu and Wierman, 2020].', '< Distributed RL. There has also been a considerable effort towards developing distributed and federated RL algorithms. The distributed variants of the classical TD learning algorithm have been investigated in a series of studies [Chen et al., 2021c, Doan et al., 2019, 2021, Sun et al., 2020, Wai, 2020, Wang et al., 2020, Zeng et al., 2021b]. The impact of environmental heterogeneity in federated TD learning was studied in Wang et al. [2023]. 
A distributed version of actor-critic algorithms was studied by Shen et al. [2023] where the authors established convergence of their algorithm and demonstrated a linear speed up in the number of agents in their sample complexity bound. Chen et al. [2022] proposed a new distributed actor-critic algorithm which improved the dependence of sample complexity on the error ε and incurs a communication cost of Õ(ε^-1). Chen et al. [2021a] have proposed a communication efficient distributed policy gradient algorithm and have analyzed its convergence and established a communication complexity of O(1/(M ε)). Xie and Song [2023] adopts a distributed policy optimization perspective, which is different from the Q-learning paradigm considered in this work. Moreover, the algorithm in Xie and Song [2023] obtains a linear communication cost, which is worse than that obtained in our work. Similarly, Zhang et al. [2024] focuses on on-policy learning and incurs a communication cost that depends polynomially on the required error ε. Several other studies [Yang et al., 2023, Zeng et al., 2021a, Lan et al., 2024] have also developed and analyzed other distributed/federated variants of the classical natural policy gradient method [Kakade, 2001]. Assran et al. [2019], Espeholt et al. [2018], Mnih et al. [2016] have developed distributed algorithms to train deep RL networks more efficiently. Table 1: Comparison of sample and communication complexity of various single-agent and Federated Q-learning algorithms for learning an ε-optimal Q-function under the synchronous setting. We hide logarithmic factors and burn-in costs for all results for simplicity of presentation. In the above table, S and A represent state and action spaces respectively and γ denotes the discount factor. We report the communication complexity only in terms of number of rounds as other algorithms assume transmission of real numbers and hence do not report bit level costs. For the lower bound, Azar et al. [2013] and this work establish the bound for sample and communication complexity respectively.', '< Distributed Q-learning. Federated Q-learning has been explored relatively recently. Khodadadian et al. 
[2022] proposed and analyzed a federated Q-learning algorithm in the asynchronous setting with a sample complexity of Õ(|S|^2 / (M µ_min^5 (1-γ)^9 ε^2)), where µ_min is the minimum entry of the stationary state-action occupancy distribution of the sample trajectories over all agents. Jin et al. [2022] study the impact of environmental heterogeneity across clients in Federated Q-learning. They propose an algorithm where the local environments are different at each client but each client knows their local environment. Under this setting, they propose an algorithm that achieves a sample and communication complexity of O(1/((1-γ)^3 ε)) and O(1/((1-γ)^3 ε)) rounds respectively. Woo et al. [2023] proposed new algorithms with improved analysis for Federated Q-learning under both synchronous and asynchronous settings. Their proposed algorithm achieves a sample complexity and communication complexity of Õ(|S||A| / (M (1-γ)^5 ε^2)) and Õ(M |S||A| / (1-γ)) real numbers respectively under the synchronous setting and that of Õ(1 / (M µ_avg (1-γ)^5 ε^2)) and Õ(M |S||A| / (1-γ)) real numbers respectively under the asynchronous setting. Here, µ_avg denotes the minimum entry of the average stationary state-action occupancy distribution of all agents. In a follow up work, Woo et al. [2024] propose a Federated Q-learning algorithm for offline RL in the finite horizon setting and establish a sample and communication complexity of Õ( Accuracy-Communication Trade-off in Federated Learning. The trade-off between communication complexity and accuracy (equivalently, sample complexity) has been studied in various federated and distributed learning problems, including stochastic approximation algorithms for convex optimization. Duchi et al. [2014], Braverman et al. [2016] establish the celebrated inverse linear relationship between the error and the communication cost for the problem of distributed mean estimation. 
Similar trade-off for distributed stochastic optimization, multi-armed bandits and linear bandits has been studied and established across numerous studies [Woodworth et al., 2018, 2021, Tsitsiklis and Luo, 1987, Shi and Shen, 2021, Salgia and Zhao, 2023].', '---', '> This section provides an overview of existing literature relevant to our work, categorizing it into single-agent Q-learning, distributed reinforcement learning (RL), and the accuracy-communication trade-off in federated learning. We highlight key developments and position our contributions within this landscape.', '22a25,32', '> Single-Agent Q-Learning. Q-learning has been extensively studied in the single-agent setting, focusing on both its asymptotic convergence properties [Jaakkola et al., 1993, Tsitsiklis, 1994, Szepesvári, 1997, Borkar and Meyn, 2000] and its finite-time sample complexity. Research in this area spans synchronous [Even-Dar and Mansour, 2004, Beck and Srikant, 2012, Wainwright, 2019a, Chen et al., 2020, Li et al., 2023] and asynchronous settings [Chen et al., 2021b, Li et al., 2023, 2021, Qu and Wierman, 2020]. These foundational works provide the theoretical underpinnings for our federated extension.', '> ', '> Distributed Reinforcement Learning. Significant efforts have been dedicated to developing distributed and federated RL algorithms. Distributed variants of the classical Temporal Difference (TD) learning algorithm have been investigated in numerous studies [Chen et al., 2021c, Doan et al., 2019, 2021, Sun et al., 2020, Wai, 2020, Wang et al., 2020, Zeng et al., 2021b]. The impact of environmental heterogeneity in federated TD learning was specifically explored by Wang et al. [2023]. Distributed actor-critic algorithms have also been a focus, with Shen et al. [2023] establishing convergence and linear speedup, and Chen et al. [2022] proposing improvements in sample complexity and communication cost. 
Communication-efficient distributed policy gradient algorithms have been developed [Chen et al., 2021a], and other distributed/federated natural policy gradient methods have been analyzed [Yang et al., 2023, Zeng et al., 2021a, Lan et al., 2024]. Works like Xie and Song [2023] and Zhang et al. [2024] explore distributed policy optimization and on-policy learning, respectively, with different communication cost characteristics compared to our Q-learning approach. Furthermore, distributed algorithms for training deep RL networks efficiently have been developed by Assran et al. [2019], Espeholt et al. [2018], and Mnih et al. [2016].', '> ', '> Distributed Q-learning. Federated Q-learning, the focus of this paper, is a relatively recent area of exploration. Khodadadian et al. [2022] analyzed a federated Q-learning algorithm in an asynchronous setting, providing sample complexity bounds. Jin et al. [2022] investigated the impact of environmental heterogeneity in Federated Q-learning, proposing an algorithm with specific sample and communication complexities. Woo et al. [2023] introduced new algorithms and improved analyses for Federated Q-learning in both synchronous and asynchronous settings, providing bounds on sample and communication complexities in terms of real numbers transmitted. A follow-up work by Woo et al. [2024] extended Federated Q-learning to offline RL in a finite horizon. Our work distinguishes itself by providing order-optimal guarantees for both sample and communication complexities, including bit-level costs, which is a novel contribution.', '> ', '> Accuracy-Communication Trade-off in Federated Learning. The interplay between communication complexity and accuracy (or sample complexity) is a well-established theme in various federated and distributed learning problems, particularly in stochastic approximation for convex optimization. Duchi et al. [2014] and Braverman et al. 
[2016] established the inverse linear relationship between error and communication cost in distributed mean estimation. Similar trade-offs have been explored for distributed stochastic optimization, multi-armed bandits, and linear bandits across numerous studies [Woodworth et al., 2018, 2021, Tsitsiklis and Luo, 1987, Shi and Shen, 2021, Salgia and Zhao, 2023]. Our research extends this line of inquiry to the more complex non-linear setting of Federated Q-learning, providing a complete characterization of this trade-off.', '> ', '1538d1547', '< ']
