Title: The Sample-Communication Complexity Trade-off in Federated Q-Learning

Abstract: We consider the problem of Federated Q-learning, where M agents aim to collaboratively learn the optimal Q-function of an unknown infinite horizon Markov Decision Process with finite state and action spaces. We investigate the trade-off between sample and communication complexity for the widely used class of intermittent communication algorithms. We first establish the converse result, where we show that any Federated Q-learning algorithm that offers a linear speedup with respect to the number of agents in sample complexity needs to incur a communication cost of at least Ω(1/(1-γ)), where γ is the discount factor. We also propose a new Federated Q-learning algorithm, called Fed-DVR-Q, which is the first Federated Q-learning algorithm to simultaneously achieve order-optimal sample and communication complexities. Thus, together these results provide a complete characterization of the sample-communication complexity trade-off in Federated Q-learning.

Section: Introduction
Reinforcement Learning (RL) [Sutton and Barto, 2018] refers to an online sequential decision making paradigm where the learning agent aims to learn an optimal policy, i.e., a policy that maximizes the long-term reward, through repeated interactions with an unknown environment. RL finds applications across a diverse array of fields including, but not limited to, autonomous driving, games, recommendation systems, robotics and Internet of Things (IoT) [Kober et al., 2013, Yurtsever et al., 2020, Silver et al., 2016, Lim et al., 2020].
The primary hurdle in RL applications is often the high-dimensional nature of the decision space, which requires the learning agent to have access to an enormous amount of data in order to have any hope of learning the optimal policy. Moreover, the sequential collection of such an enormous amount of data through a single agent is extremely time-consuming and often infeasible in practice. Consequently, practical implementations of RL involve deploying multiple agents to collect data in parallel. This decentralized approach to data collection has fueled the design and development of distributed or federated RL algorithms that can collaboratively learn the optimal policy without actually transferring the collected data to a centralized server. Such a federated approach to RL, which does not require the transfer of local data, is gaining interest due to lower bandwidth requirements and lower security and privacy risks. In this work, we focus on federated variants of Q-learning algorithms where the agents collaborate to directly learn the optimal Q-function without forming an estimate of the underlying unknown environment.
A particularly important aspect of designing Federated RL algorithms, including Federated Q-learning algorithms, is to address the natural tension between sample and communication complexity. At one end of the spectrum lies the naïve approach of running a centralized algorithm with optimal sample complexity after transferring and combining all the collected data at a central facility/server. Such an approach trivially achieves the optimal sample complexity while suffering from a very high and infeasible communication complexity. On the other hand, several recently proposed algorithms [Khodadadian et al., 2022, Woo et al., 2023] operate in more practical regimes, offering significantly lower communication complexities than the naïve approach at the cost of sub-optimal sample complexities. These results suggest the existence of an underlying trade-off between the sample and communication complexities of Federated RL algorithms. The primary goal of this work is to better understand this trade-off in the context of Federated Q-learning by investigating the following fundamental questions:
• Fundamental limit of communication: What is the minimum amount of communication required by a federated Q-learning algorithm to achieve any statistical benefit of collaboration?
• Optimal algorithm design: How does one design a Federated Q-learning algorithm that simultaneously offers order-optimal sample and communication complexity guarantees, i.e., operates on the optimal frontier of the sample-communication complexity trade-off?

Section: Main Results
We consider a setup where M distributed agents collaborate to learn the optimal Q-function of an infinite horizon Markov Decision Process which is defined over a finite state space S and a finite action set A, and has a discount factor of γ ∈ (0, 1). We adopt a setup commonly used in federated learning called the intermittent communication setting, where the clients intermittently share information among themselves with the help of a central server. In this work, we provide a complete characterization of the trade-off between sample and communication complexity under the aforementioned setting by answering both of the questions posed above. The main results of this work are twofold and are summarized below.
• Fundamental bounds on communication complexity of Federated Q-learning: We establish lower bounds on the communication complexity of Federated Q-learning, both in terms of the number of communication rounds and the overall number of bits that need to be transmitted in order to achieve any speedup in convergence with respect to the number of agents. Specifically, we show that in order for an intermittent communication algorithm to obtain any benefit of collaboration, i.e., any order of speedup w.r.t. the number of agents, the number of communication rounds must be at least Ω(1/((1-γ) log² N)) and the number of bits sent by each agent to the server must be at least Ω(|S||A|/((1-γ) log² N)), where N denotes the number of samples taken by the algorithm for each state-action pair.
• Achieving the optimal sample-communication complexity trade-off: We propose a new Federated Q-learning algorithm called Federated Doubly Variance Reduced Q-Learning, Fed-DVR-Q for short, that simultaneously achieves the optimal order of sample complexity and the minimal order of communication as dictated by the lower bound. We show that Fed-DVR-Q learns an ε-optimal Q-function in the ℓ∞ sense with Õ(|S||A|/(M ε² (1-γ)³)) i.i.d. samples from the generative model at each agent while incurring a total communication cost of Õ(|S||A|/(1-γ)) bits per agent across Õ(1/(1-γ)) rounds of communication. Thus, Fed-DVR-Q not only improves upon both the sample and communication complexities of existing algorithms, but is also the first algorithm to achieve both order-optimal sample and communication complexities (see Table 1 for a comparison).

Section: Related Work
Single agent Q-Learning. Q-Learning has been extensively studied in the single-agent setting in terms of both its asymptotic convergence [Jaakkola et al., 1993, Tsitsiklis, 1994, Szepesvári, 1997, Borkar and Meyn, 2000] and its finite-time sample complexity in both synchronous [Even-Dar and Mansour, 2004, Beck and Srikant, 2012, Wainwright, 2019a, Chen et al., 2020, Li et al., 2023] and asynchronous settings [Chen et al., 2021b, Li et al., 2023, 2021, Qu and Wierman, 2020].
Distributed RL. There has also been a considerable effort towards developing distributed and federated RL algorithms. The distributed variants of the classical TD learning algorithm have been investigated in a series of studies [Chen et al., 2021c, Doan et al., 2019, 2021, Sun et al., 2020, Wai, 2020, Wang et al., 2020, Zeng et al., 2021b]. The impact of environmental heterogeneity in federated TD learning was studied in Wang et al. [2023]. A distributed version of actor-critic algorithms was studied by Shen et al. [2023], where the authors established convergence of their algorithm and demonstrated a linear speedup in the number of agents in their sample complexity bound. Chen et al. [2022] proposed a new distributed actor-critic algorithm which improved the dependence of sample complexity on the error ε and incurs a communication cost of Õ(ε⁻¹). Chen et al. [2021a] proposed a communication-efficient distributed policy gradient algorithm, analyzed its convergence and established a communication complexity of O(1/(Mε)). Xie and Song [2023] adopt a distributed policy optimization perspective, which is different from the Q-learning paradigm considered in this work. Moreover, the algorithm in Xie and Song [2023] incurs a linear communication cost, which is worse than that obtained in our work. Similarly, Zhang et al. [2024] focus on on-policy learning and incur a communication cost that depends polynomially on the required error ε. Several other studies [Yang et al., 2023, Zeng et al., 2021a, Lan et al., 2024] have also developed and analyzed other distributed/federated variants of the classical natural policy gradient method [Kakade, 2001]. Assran et al. [2019], Espeholt et al. [2018] and Mnih et al. [2016] have developed distributed algorithms to train deep RL networks more efficiently.
Table 1: Comparison of sample and communication complexity of various single-agent and Federated Q-learning algorithms for learning an ε-optimal Q-function under the synchronous setting. We hide logarithmic factors and burn-in costs for all results for simplicity of presentation. In the table, S and A represent state and action spaces respectively and γ denotes the discount factor. We report the communication complexity only in terms of number of rounds as other algorithms assume transmission of real numbers and hence do not report bit level costs. For the lower bound, Azar et al. [2013] and this work establish the bound for sample and communication complexity respectively.
Distributed Q-learning. Federated Q-learning has been explored relatively recently. Khodadadian et al. [2022] proposed and analyzed a federated Q-learning algorithm in the asynchronous setting with a sample complexity of Õ(|S|²/(M µ_min⁵ (1-γ)⁹ ε²)), where µ_min is the minimum entry of the stationary state-action occupancy distribution of the sample trajectories over all agents. Jin et al. [2022] study the impact of environmental heterogeneity across clients in Federated Q-learning. They consider a setting where the local environments are different at each client but each client knows their local environment. Under this setting, they propose an algorithm that achieves a sample and communication complexity of O(1/((1-γ)³ε)) and O(1/((1-γ)³ε)) rounds respectively. Woo et al. [2023] proposed new algorithms with improved analysis for Federated Q-learning under both synchronous and asynchronous settings. Their proposed algorithm achieves a sample complexity and communication complexity of Õ(|S||A|/(M (1-γ)⁵ ε²)) and Õ(M|S||A|/(1-γ)) real numbers respectively under the synchronous setting and that of Õ(1/(M µ_avg (1-γ)⁵ ε²)) and Õ(M|S||A|/(1-γ)) real numbers respectively under the asynchronous setting. Here, µ_avg denotes the minimum entry of the average stationary state-action occupancy distribution of all agents. In a follow-up work, Woo et al. [2024] propose a Federated Q-learning algorithm for offline RL in the finite horizon setting and establish a sample and communication complexity of Õ(
Accuracy-Communication Trade-off in Federated Learning. The trade-off between communication complexity and accuracy (equivalently, sample complexity) has been studied in various federated and distributed learning problems, including stochastic approximation algorithms for convex optimization. Duchi et al. [2014] and Braverman et al. [2016] establish the celebrated inverse linear relationship between the error and the communication cost for the problem of distributed mean estimation. Similar trade-offs for distributed stochastic optimization, multi-armed bandits and linear bandits have been studied and established across numerous studies [Woodworth et al., 2018, 2021, Tsitsiklis and Luo, 1987, Shi and Shen, 2021, Salgia and Zhao, 2023].

Section: Problem Formulation and Preliminaries
In this section, we provide a brief background of Markov Decision Processes, outline the performance measures for Federated Q-learning algorithms and describe the class of intermittent communication algorithms considered in this work.

Section: Markov Decision Processes
In this work, we focus on an infinite-horizon Markov Decision Process (MDP), denoted by M, over a state space S and an action space A and with a discount factor γ ∈ (0, 1). Both the state and action spaces are assumed to be finite sets. In an MDP, the state s evolves dynamically under the influence of actions based on a probability transition kernel, P : (S × A) × S → [0, 1]. The entry P (s ′ |s, a) denotes the probability of moving to state s ′ when an action a is taken in the state s. An MDP is also associated with a deterministic reward function r : S × A → [0, 1], where r(s, a) denotes the immediate reward obtained for taking the action a in the state s. Thus, the transition kernel P along with the reward function r completely characterize an MDP. In this work, we consider the synchronous setting, where each agent has access to an independent generative model or simulator from which they can draw independent samples from the unknown underlying distribution P (•|s, a) for each state-action pair (s, a) [Kearns and Singh, 1998].
A policy π : S → ∆(A) is a rule for selecting actions across different states, where ∆(A) denotes the simplex over A and π(a|s) denotes the probability of choosing action a in a state s. Each policy π is associated with a state value function and a state-action value function, or the Q-function, denoted by V π and Q π respectively. V π and Q π measure the expected discounted cumulative reward attained by π starting from a particular state s and state-action pair (s, a) respectively. Mathematically, V π and Q π are given as
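$$V^{\pi}(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s\right]; \qquad Q^{\pi}(s, a) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a\right], \tag{1}$$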
where a_t ∼ π(•|s_t) and s_{t+1} ∼ P(•|s_t, a_t) for all t ≥ 0. The expectation is taken w.r.t. the randomness in the trajectory {s_t, a_t}_{t=1}^∞. Since the rewards lie in [0, 1], it follows immediately that both the value function and the Q-function lie in the range [0, 1/(1-γ)]. An optimal policy π⋆ is a policy that maximizes the value function uniformly over all the states and it has been shown that such an optimal policy π⋆ always exists [Puterman, 2014]. The optimal value function and Q-function are those corresponding to an optimal policy π⋆ and are denoted as V⋆ := V^{π⋆} and Q⋆ := Q^{π⋆}, respectively. The optimal Q-function, Q⋆, is also the unique fixed point of the Bellman operator T : ℝ^{S×A} → ℝ^{S×A}, given by
$$(T Q)(s, a) = r(s, a) + \gamma \cdot \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\max_{a' \in A} Q(s', a')\right]. \tag{2}$$
Q-learning [Watkins and Dayan, 1992] aims to learn the optimal policy by first learning Q⋆ as the solution to the fixed point equation T Q = Q and then obtaining a deterministic optimal policy via the maximization π⋆(s) = arg max_a Q⋆(s, a).
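As a small illustration of Eq. (2), the sketch below applies the Bellman operator repeatedly on a toy MDP until it reaches the fixed point Q⋆ and then reads off the greedy policy. The two-state, two-action MDP (`P`, `R`), the tolerance and the helper names are hypothetical choices made only for this example; Q-learning itself does not have access to P and works with samples instead.

```python
# Hypothetical 2-state, 2-action MDP, used only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[(s, a)][s2] is the probability of moving to s2 from s under action a.
P = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.0, 1.0],
}
# Deterministic rewards in [0, 1].
R = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 1.0}


def bellman(Q):
    """Apply the Bellman operator T of Eq. (2) to a tabular Q-function."""
    new_Q = {}
    for s in STATES:
        for a in ACTIONS:
            expected_next = sum(
                P[(s, a)][s2] * max(Q[(s2, a2)] for a2 in ACTIONS)
                for s2 in STATES
            )
            new_Q[(s, a)] = R[(s, a)] + GAMMA * expected_next
    return new_Q


def solve_fixed_point(tol=1e-10, max_iters=10_000):
    """Iterate Q <- T Q from Q = 0 until successive iterates differ by less than tol."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(max_iters):
        nxt = bellman(Q)
        gap = max(abs(nxt[k] - Q[k]) for k in Q)
        Q = nxt
        if gap < tol:
            break
    return Q


Q_STAR = solve_fixed_point()
# Greedy policy: pi*(s) = argmax_a Q*(s, a).
GREEDY = {s: max(ACTIONS, key=lambda a: Q_STAR[(s, a)]) for s in STATES}
```

In this toy instance, action 1 in state 1 is a self-loop with reward 1, so its optimal value equals the upper end 1/(1-γ) of the range noted above.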
Let Z ∈ S^{|S||A|} be a random vector whose (s, a)-th coordinate is drawn from the distribution P(•|s, a), independently of all other coordinates. We define the random operator T_Z : ℝ^{S×A} → ℝ^{S×A} as
$$(T_Z Q)(s, a) = r(s, a) + \gamma V(Z(s, a)), \tag{3}$$
where V(s′) = max_{a′∈A} Q(s′, a′). The operator T_Z can be interpreted as the sample Bellman operator, where we have the relation T Q = E_Z[T_Z Q] for all Q-functions.
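The relation T Q = E_Z[T_Z Q] can be checked numerically. The following sketch draws many independent copies of Z on a toy MDP, averages the resulting sample Bellman operators from Eq. (3), and compares the average with the exact operator. The MDP, the test Q-function `Q_example` and the number of draws are hypothetical values chosen for illustration.

```python
import random

# Hypothetical 2-state, 2-action MDP, used only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
P = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.0, 1.0],
}
R = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 1.0}


def draw_Z(rng):
    """Draw one next state per (s, a) pair, independently across pairs."""
    return {sa: rng.choices(STATES, weights=P[sa])[0] for sa in P}


def sample_bellman(Q, Z):
    """Eq. (3): (T_Z Q)(s, a) = r(s, a) + gamma * V(Z(s, a))."""
    V = {s: max(Q[(s, a)] for a in ACTIONS) for s in STATES}
    return {sa: R[sa] + GAMMA * V[Z[sa]] for sa in P}


def bellman(Q):
    """Exact Bellman operator, for comparison."""
    V = {s: max(Q[(s, a)] for a in ACTIONS) for s in STATES}
    return {
        sa: R[sa] + GAMMA * sum(P[sa][s2] * V[s2] for s2 in STATES) for sa in P
    }


def averaged_sample_bellman(Q, num_samples, seed=0):
    """Monte Carlo average of T_Z Q over independent draws of Z."""
    rng = random.Random(seed)
    total = {sa: 0.0 for sa in P}
    for _ in range(num_samples):
        TZ = sample_bellman(Q, draw_Z(rng))
        for sa in P:
            total[sa] += TZ[sa]
    return {sa: total[sa] / num_samples for sa in P}


Q_example = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.0, (1, 1): 4.0}
exact = bellman(Q_example)
approx = averaged_sample_bellman(Q_example, num_samples=20000)
max_gap = max(abs(exact[sa] - approx[sa]) for sa in P)
```

Each individual T_Z Q is noisy, but `max_gap` shrinks as the number of draws grows, which is the sense in which T_Z is an unbiased sample of T.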
Lastly, the federated learning setup considered in this work consists of M agents, where all the agents face a common, unknown MDP, i.e., the transition kernel and the reward functions are the same across agents, which is popularly known as the homogeneous setting. For a given value of ε ∈ (0, 1/(1-γ)), the objective of the agents is to collaboratively learn an ε-optimal estimate (in the ℓ∞ sense) of the optimal Q-function of the unknown MDP.

Section: Performance Measures
We measure the performance of a Federated Q-learning algorithm A using two metrics: sample complexity and communication complexity. For a given MDP M, let Q_M(A, N, M) denote the estimate of Q⋆_M, the optimal Q-function of the MDP M, returned by an algorithm A, when given access to N i.i.d. samples from the generative model for each (s, a) pair at all the M agents. The minimax error rate of the algorithm A, denoted by ER(A; N, M), is defined as
$$\mathrm{ER}(A; N, M) := \sup_{M = (P, r)} \mathbb{E}\left[\left\| Q_M(A, N, M) - Q^{\star}_M \right\|_{\infty}\right], \tag{4}$$
where the expectation is taken over the samples and any randomness in the algorithm. Given a value of ε > 0, the sample complexity of A, denoted by SC(A; ε, M), is given as
$$\mathrm{SC}(A; \varepsilon, M) := |S||A| \cdot \min\{N \in \mathbb{N} : \mathrm{ER}(A; N, M) \le \varepsilon\}. \tag{5}$$
Similarly, we can also define a high-probability version for any δ ∈ (0, 1) as follows:
$$\mathrm{SC}(A; \varepsilon, M, \delta) := |S||A| \cdot \min\left\{N \in \mathbb{N} : \Pr\left(\sup_{M} \left\| Q_M(A, N, M) - Q^{\star}_M \right\|_{\infty} \le \varepsilon\right) \ge 1 - \delta\right\}.$$
We measure the communication complexity of any federated learning algorithm both in terms of frequency of information exchange and total number of bits uploaded by the agents. For each agent m, let C^m_round(A; N) and C^m_bit(A; N) respectively denote the number of times agent m sends a message to the server and the total number of bits uploaded by agent m to the server when an algorithm A is run with N i.i.d. samples from the generative model for each (s, a) pair at each agent. The communication complexity of A, when measured in terms of frequency of communication and total number of bits exchanged, is given by
$$\mathrm{CC}_{\mathrm{round}}(A; N) := \frac{1}{M} \sum_{m=1}^{M} C^{m}_{\mathrm{round}}(A; N); \qquad \mathrm{CC}_{\mathrm{bit}}(A; N) := \frac{1}{M} \sum_{m=1}^{M} C^{m}_{\mathrm{bit}}(A; N), \tag{6}$$
respectively. Similarly, for a given value of ε ∈ (0, 1/(1-γ)), we can also define CC_round(A; ε) and CC_bit(A; ε) based on when A is run to guarantee a minimax error of at most ε.

Section: Intermittent Communication Algorithms
In this work, we consider a popular class of federated learning algorithms referred to as algorithms with intermittent communication. The intermittent communication setting provides a natural framework to extend single agent Q-learning algorithms to the distributed setting. As the name suggests, under this setting, the agents intermittently communicate with each other, sharing their updated beliefs about Q⋆. Between two communication rounds, each agent updates their belief about Q⋆ using stochastic fixed point iteration based on the locally available data, similar to a single agent setup. Such intermittent communication algorithms have been extensively studied and used to establish lower bounds on communication complexity of distributed stochastic convex optimization [Woodworth et al., 2018, 2021].
Algorithm 1: A generic algorithm A
Input: T, R, {η_t}_{t=1}^T, C = {t_r}_{r=1}^R, B
Set Q^m_0 ← 0 for all agents m
for t = 1, 2, . . . , T do
    for m = 1, 2, . . . , M do
        Compute Q^m_{t-1/2} according to Eq. (7)
        Compute Q^m_t according to Eq. (8)
    end for
end for
return Q_T
A generic Federated Q-learning algorithm with intermittent communication is outlined in Algorithm 1. It is characterized by the following five parameters: (i) the total number of updates T; (ii) the number of communication rounds R; (iii) a step size schedule {η_t}_{t=1}^T; (iv) a communication schedule {t_r}_{r=1}^R; (v) the batch size B. During the t-th iteration, each agent m computes {T_{Z_b}(Q^m_{t-1})}_{b=1}^B, a minibatch of sample Bellman operators at the current estimate Q^m_{t-1}, using B samples from the generative model for each (s, a) pair, and obtains an intermediate local estimate using the Q-learning update as follows:
$$Q^{m}_{t-\frac{1}{2}} = (1 - \eta_t) Q^{m}_{t-1} + \frac{\eta_t}{B} \sum_{b=1}^{B} T_{Z_b}(Q^{m}_{t-1}). \tag{7}$$
Here η_t ∈ (0, 1] is the step size chosen corresponding to the t-th time step. The intermediate estimates are averaged based on a communication schedule C = {t_r}_{r=1}^R consisting of R rounds, i.e.,
$$Q^{m}_{t} = \begin{cases} \frac{1}{M} \sum_{j=1}^{M} Q^{j}_{t-\frac{1}{2}} & \text{if } t \in C, \\ Q^{m}_{t-\frac{1}{2}} & \text{otherwise.} \end{cases} \tag{8}$$
In the above equation, the averaging step can also be replaced with any distributed mean estimation routine that includes compression to control the bit level costs. Without loss of generality, we assume that Q^m_0 = 0 for all agents m and t_R = T, i.e., the last iterates are always averaged. It is straightforward to note that the number of samples taken by an intermittent communication algorithm is BT, i.e., N = BT, and the communication complexity is CC_round = R.
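The local update (7) and the averaging rule (8) can be put together in a few lines. The sketch below is a minimal simulation of an intermittent communication algorithm on a toy MDP, assuming exact (uncompressed) averaging at the server. The MDP, the schedule of one round every 50 steps, the batch size and the step size η_t = 1/(1 + (1-γ)t) are hypothetical choices for illustration, not settings prescribed by the text.

```python
import random

# Hypothetical 2-state, 2-action MDP, used only for illustration.
GAMMA = 0.9
STATES, ACTIONS = [0, 1], [0, 1]
P0 = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 0.0}  # P(s' = 0 | s, a)
R = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 1.0}
PAIRS = list(R)


def minibatch_bellman(Q, B, rng):
    """Average of B sample Bellman operators at Q, with fresh draws per (s, a)."""
    V = [max(Q[(s, a)] for a in ACTIONS) for s in STATES]
    out = {}
    for sa in PAIRS:
        total = 0.0
        for _ in range(B):
            nxt = 0 if rng.random() < P0[sa] else 1
            total += R[sa] + GAMMA * V[nxt]
        out[sa] = total / B
    return out


def federated_q_learning(M, T, B, schedule, step_size, seed=0):
    """Run T local steps at M agents, averaging whenever t is in the schedule."""
    rng = random.Random(seed)
    Q = [{sa: 0.0 for sa in PAIRS} for _ in range(M)]
    for t in range(1, T + 1):
        eta = step_size(t)
        for m in range(M):
            # Local Q-learning step, Eq. (7).
            target = minibatch_bellman(Q[m], B, rng)
            Q[m] = {sa: (1 - eta) * Q[m][sa] + eta * target[sa] for sa in PAIRS}
        if t in schedule:
            # Communication round, Eq. (8): all agents adopt the average.
            avg = {sa: sum(Q[m][sa] for m in range(M)) / M for sa in PAIRS}
            Q = [dict(avg) for _ in range(M)]
    return Q


T = 1000
schedule = set(range(50, T + 1, 50))  # R = 20 rounds, the last one at t = T
agents = federated_q_learning(
    M=4, T=T, B=5, schedule=schedule,
    step_size=lambda t: 1.0 / (1.0 + (1.0 - GAMMA) * t),
)
Q_T = agents[0]
```

Here N = BT = 5000 samples are used per state-action pair at each agent and CC_round = 20. Changing `schedule` while keeping B and T fixed varies the communication cost without changing the sample budget.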

Section: Lower Bound
In this section, we investigate the first of the two questions regarding the lower bound on communication complexity. The following theorem establishes a lower bound on the communication complexity of a Federated Q-learning algorithm with intermittent communication.
Theorem 1. Assume that γ ∈ [5/6, 1) and the state and action spaces satisfy |S| ≥ 4 and |A| ≥ 2. Let A be a Federated Q-learning algorithm with intermittent communication that is run for T ≥ max{16, 1/(1-γ)} steps with a step size schedule of either η_t := 1/(1 + c_η(1-γ)t) or η_t := η for all 1 ≤ t ≤ T. If
$$R = \mathrm{CC}_{\mathrm{round}}(A; N) \le \frac{c_0}{(1-\gamma) \log^2 N} \quad \text{or} \quad \mathrm{CC}_{\mathrm{bit}}(A; N) \le \frac{c_1 |S||A|}{(1-\gamma) \log^2 N}$$
for some universal constants c_0, c_1 > 0, then, for all choices of communication schedule, batch size B, c_η > 0 and η ∈ (0, 1), the minimax error of A satisfies
$$\mathrm{ER}(A; N, M) \ge \frac{C_{\gamma}}{\log^3 N \sqrt{N}},$$
for all M ≥ 2 and N = BT. Here C_γ > 0 is a constant that depends only on γ.
The above theorem states that in order for an intermittent communication algorithm to obtain any benefit of collaboration, i.e., for the error rate ER(A; N, M) to decrease w.r.t. the number of agents, the number of communication rounds must be at least Ω(1/((1-γ) log² N)). This implies that any Federated Q-learning algorithm that offers order-optimal sample complexity, and thereby also a linear speedup with respect to the number of agents, must have at least Ω(1/((1-γ) log² N)) rounds of communication and transmit Ω(|S||A|/((1-γ) log² N)) bits of information per agent. This characterizes the converse relation for the sample-communication trade-off in Federated Q-learning.
The lower bound on the communication complexity of Federated Q-learning is a consequence of the bias-variance trade-off that governs the convergence of the algorithm. While a careful choice of step sizes alone is sufficient to balance this trade-off in the centralized setting, the choice of communication schedule also plays an important role in balancing this trade-off in the federated setting. The local steps between two communication rounds induce a positive estimation bias that depends on the standard deviation of the iterates and is a well-documented issue of "over-estimation" in Q-learning [Hasselt, 2010]. Since such a bias is driven by local updates, it does not reflect any benefit of collaboration. During a communication round, the averaging of iterates across agents allows the algorithm an opportunity to counter this bias by reducing the effective variance of the updates through averaging. In our analysis, we show that if the communication is infrequent, the local bias becomes the dominant term and averaging of iterates is insufficient to counter the impact of the positive bias induced by the local steps. As a result, we do not observe any statistical gains when the communication is infrequent. The analysis is inspired by the analysis of Q-learning by Li et al. [2023] and is based on analyzing the convergence of an intermittent communication algorithm on a specifically chosen "hard" instance of MDP. Please refer to Appendix B for a detailed proof.
Remark 1 (Communication complexity of policy evaluation). Several recent studies [Liu and Olshevsky, 2023, Tian et al., 2024] established that a single round of communication is sufficient to achieve linear speedup of TD learning for policy evaluation, which does not contradict our results focusing on Q-learning for policy learning. The latter is more involved due to the nonlinearity of the Bellman optimality operator. Specifically, if the operator whose fixed point is to be found is linear in the decision variable (e.g., the value function in TD learning), then the fixed point update only induces a variance term corresponding to the noise. However, if the operator is non-linear, then in addition to the variance term, we also obtain a bias term in the fixed point update. While the variance term can be controlled with one-shot averaging, more frequent communication is necessary to ensure that the bias term is small enough.
Remark 2 (Extension to asynchronous Q-learning). We would like to point out that our lower bound extends to the asynchronous setting [Li et al., 2023] as the assumption of i.i.d. noise corresponding to a generative model is a special case of Markovian noise observed in the asynchronous setting.
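The positive bias discussed above comes from taking a maximum over noisy estimates, and it shrinks when the estimates are averaged before the maximum is taken. The following Monte Carlo sketch illustrates this in the simplest possible case. It is not the hard instance used in the proof: the two actions with true value 0, the unit-variance Gaussian noise and the number of trials are hypothetical choices for illustration.

```python
import random


def mean_max_of_noisy_estimates(num_agents, num_actions=2, trials=20000, seed=0):
    """Monte Carlo estimate of E[max_a Qhat(a)] when every true value is 0.

    Each agent holds an unbiased estimate with unit-variance Gaussian noise.
    The agents' estimates are averaged before the max is taken, so
    num_agents = 1 corresponds to a purely local estimate.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        averaged = [
            sum(rng.gauss(0.0, 1.0) for _ in range(num_agents)) / num_agents
            for _ in range(num_actions)
        ]
        total += max(averaged)
    return total / trials


# The true value of max_a Q(a) is 0, so both numbers below are pure bias.
bias_local = mean_max_of_noisy_estimates(num_agents=1)
bias_averaged = mean_max_of_noisy_estimates(num_agents=16)
```

Both quantities are positive even though every individual estimate is unbiased, and the bias scales with the standard deviation of the estimates, so averaging over 16 agents reduces it by roughly a factor of 4. A linear operator, as in TD learning, would show no such bias.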

Section: The Fed-DVR-Q algorithm
Having characterized the lower bound on the communication complexity of Federated Q-learning, we explore our second question of interest: designing a Federated Q-learning algorithm that achieves this lower bound while simultaneously offering an optimal order of sample complexity.
We propose a new Federated Q-learning algorithm, Fed-DVR-Q, that achieves not only a communication complexity of CC_round = Õ(1/(1-γ)) and CC_bit = Õ(|S||A|/(1-γ)) but also the optimal order of sample complexity (up to logarithmic factors), thereby providing a tight characterization of the achievability frontier that matches the converse result derived in the previous section.

Section: Algorithm Description
Algorithm 2: Fed-DVR-Q
Input: Error bound ε > 0, failure probability δ > 0
k ← 1, Q^(0) ← 0
// Set parameters as described in Sec. 4.1.3
for k = 1, 2, . . . , K do
    Q^(k) ← REFINEESTIMATE(Q^(k-1), B, I, L_k, D_k, J)
    k ← k + 1
end for
return Q^(K)
Fed-DVR-Q proceeds in epochs. During an epoch k ≥ 1, the agents collaboratively update Q^(k-1), the estimate of Q⋆ obtained at the end of the previous epoch, to a new estimate Q^(k), with the aid of the sub-routine called REFINEESTIMATE. The sub-routine REFINEESTIMATE is designed to ensure that the suboptimality gap, ∥Q^(k) - Q⋆∥∞, reduces by a factor of 2 at the end of every epoch. Thus, at the end of K = O(log(1/ε)) epochs, Fed-DVR-Q obtains an ε-optimal estimate of Q⋆, which is then set to be the output of the algorithm. Please refer to Alg. 2 for the pseudocode.

Section: The REFINEESTIMATE sub-routine
REFINEESTIMATE, starting from Q, an initial estimate of Q⋆, uses variance reduced Q-learning updates to obtain an improved estimate of Q⋆. It is characterized by four parameters: the initial estimate Q, the number of local iterations I, the recentering sample size L and the batch size B, which can be appropriately tuned to control the quality of the returned estimate. Additionally, it also takes as input two parameters D and J required by the compressor.
The first step in REFINEESTIMATE is to collaboratively approximate T Q for the variance reduced updates. To this effect, each agent m builds an approximation of T Q as follows:
$$T^{(m)}_{L}(Q) := \frac{1}{\lceil L/M \rceil} \sum_{l=1}^{\lceil L/M \rceil} T_{Z^{(m)}_{l}}(Q), \tag{9}$$
where {Z^{(m)}_1, Z^{(m)}_2, . . . , Z^{(m)}_{⌈L/M⌉}} are ⌈L/M⌉ i.i.d. samples with Z^{(m)}_1 ∼ Z. Each agent sends C(T^{(m)}_L(Q) - Q; D, J), a compressed version of the difference T^{(m)}_L(Q) - Q, to the server, which collects all the estimates from the agents and constructs the estimate
$$T_{L}(Q) = Q + \frac{1}{M} \sum_{m=1}^{M} C\left(T^{(m)}_{L}(Q) - Q; D, J\right) \tag{10}$$
and sends it back to the agents for the variance reduced updates. We defer the description of the compression routine to the end of this section. Equipped with the estimate T_L(Q), REFINEESTIMATE constructs a sequence {Q_i}_{i=1}^I using the following iterative update scheme initialized with Q_0 = Q. During the i-th iteration, each agent m carries out the following update:
$$Q^{m}_{i-\frac{1}{2}} = (1 - \eta) Q_{i-1} + \eta \left[ T^{(m)}_{i} Q_{i-1} - T^{(m)}_{i} Q + T_{L}(Q) \right]. \tag{11}$$
In the above equation, η ∈ (0, 1) is the step size and
$$T^{(m)}_{i} Q := \frac{1}{B} \sum_{z \in Z^{(m)}_{i}} T_{z} Q,$$
where Z^{(m)}_i is the minibatch of B i.i.d. random variables drawn according to Z, independently at each agent m for all iterations i. Each agent then sends a compressed version of the update, i.e., C(Q^m_{i-1/2} - Q_{i-1}; D, J), to the server, which uses them to compute the next iterate
$$Q_{i} = Q_{i-1} + \frac{1}{M} \sum_{m=1}^{M} C\left(Q^{m}_{i-\frac{1}{2}} - Q_{i-1}; D, J\right), \tag{12}$$
and broadcasts it to the clients. After I such updates, the obtained iterate Q_I is returned by the routine.
A pseudocode of the REFINEESTIMATE routine is given in Algorithm 3 in Appendix A.
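To make the flow of Eqs. (9)-(12) concrete, the following is a minimal simulation sketch of the sub-routine on a toy MDP. It is not the authors' Algorithm 3: the compressor is replaced by the identity map, and the MDP, the helper names and the values of M, B, I, L and η are hypothetical choices for illustration that are much smaller than the parameter settings given in the text.

```python
import math
import random

# Hypothetical 2-state, 2-action MDP, used only for illustration.
GAMMA = 0.9
STATES, ACTIONS = [0, 1], [0, 1]
P0 = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 0.0}  # P(s' = 0 | s, a)
R = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 1.0}
PAIRS = list(R)


def draw(rng, n):
    """Draw n i.i.d. copies of Z: one next state for every (s, a) pair."""
    return [{sa: (0 if rng.random() < P0[sa] else 1) for sa in PAIRS} for _ in range(n)]


def batch_bellman(Q, draws):
    """Average of the sample Bellman operators T_z Q over the given draws."""
    V = [max(Q[(s, a)] for a in ACTIONS) for s in STATES]
    return {sa: sum(R[sa] + GAMMA * V[z[sa]] for z in draws) / len(draws) for sa in PAIRS}


def server_average(Q_ref, local_values, compress):
    """Server step shared by Eqs. (10) and (12): Q_ref plus the mean compressed difference."""
    M = len(local_values)
    return {
        sa: Q_ref[sa] + sum(compress(v[sa] - Q_ref[sa]) for v in local_values) / M
        for sa in PAIRS
    }


def refine_estimate(Q, M, B, I, L, eta, rng, compress=lambda x: x):
    """One call of the sub-routine; `compress` defaults to the identity (an assumption)."""
    per_agent = math.ceil(L / M)
    # Eqs. (9)-(10): every agent averages ceil(L / M) sample operators at Q.
    T_L = server_average(Q, [batch_bellman(Q, draw(rng, per_agent)) for _ in range(M)], compress)
    Q_i = dict(Q)
    for _ in range(I):
        local = []
        for _m in range(M):
            z = draw(rng, B)  # the same minibatch is evaluated at Q_i and at Q
            at_iter, at_ref = batch_bellman(Q_i, z), batch_bellman(Q, z)
            # Eq. (11): variance reduced local update.
            local.append({
                sa: (1 - eta) * Q_i[sa] + eta * (at_iter[sa] - at_ref[sa] + T_L[sa])
                for sa in PAIRS
            })
        Q_i = server_average(Q_i, local, compress)  # Eq. (12)
    return Q_i


rng = random.Random(0)
Q_hat = {sa: 0.0 for sa in PAIRS}
for k in range(1, 7):
    # Illustrative schedule only; it does not follow the L_k of the text.
    Q_hat = refine_estimate(Q_hat, M=4, B=20, I=40, L=400 * 2 ** k, eta=0.5, rng=rng)
```

The outer loop mirrors the epoch structure of Fed-DVR-Q: each call starts from the previous estimate and recenters around it. In this toy MDP the optimal values are Q⋆(1, 1) = 10 and Q⋆(0, 1) = 7.7/0.82, which `Q_hat` approaches after a few epochs.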

Section: The Compression Operator
The compressor, C(•; D, J), used in the proposed algorithm Fed-DVR-Q is based on the popular stochastic quantization scheme. In addition to the input vector Q to be quantized, the quantizer C takes two input parameters D and J. D corresponds to an upper bound on the ℓ∞ norm of Q, i.e., ∥Q∥∞ ≤ D. J corresponds to the resolution of the compressor, i.e., the number of bits used by the compressor to represent each coordinate of the output vector.
The compressor first splits the interval [0, D] into 2^J - 1 intervals of equal length, where 0 = d_1 < d_2 < · · · < d_{2^J} = D correspond to the end points of the intervals. Each coordinate of Q is then separately quantized as follows. The value of the n-th coordinate, C(Q)[n], is set to be d_{j_n - 1} with probability
$$\frac{d_{j_n} - Q[n]}{d_{j_n} - d_{j_n - 1}}$$
and to d_{j_n} with the remaining probability, where j_n := min{j : d_{j-1} < Q[n] ≤ d_j}, so that Q[n] lies between the two candidate values and the quantized coordinate is unbiased.
It is straightforward to note that each coordinate of C(Q) can be represented using J bits.
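A scalar version of such a stochastic quantizer can be sketched as follows. The function rounds a value in [0, D] to one of the two end points of the grid interval that contains it, with probabilities chosen so that the output is unbiased. The function name `quantize`, the 0-based grid index and the restriction to inputs in [0, D] with J ≥ 1 are assumptions of this sketch.

```python
import random


def quantize(x, D, J, rng):
    """Stochastically round x in [0, D] onto a grid of 2**J equally spaced points."""
    if not 0.0 <= x <= D:
        raise ValueError("x must lie in [0, D]")
    levels = 2 ** J              # grid points 0 = d_1 < ... < d_{2^J} = D
    width = D / (levels - 1)     # 2^J - 1 intervals of equal length
    # 0-based index of the grid point just below x (clamped so that x = D works).
    lower_idx = min(int(x / width), levels - 2)
    lower, upper = lower_idx * width, (lower_idx + 1) * width
    # Probability of rounding down: proportional to the distance from the upper point.
    p_lower = (upper - x) / width
    return lower if rng.random() < p_lower else upper


rng = random.Random(0)
samples = [quantize(0.4, D=1.0, J=2, rng=rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
```

With J = 2 the grid is {0, 1/3, 2/3, 1}, so every output can be encoded by a 2-bit index. The input 0.4 is mapped to 1/3 with probability 0.8 and to 2/3 with probability 0.2, and `empirical_mean` stays close to 0.4.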

Section: Setting the parameters
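The desired convergence of the iterates {Q^(k)} is obtained by carefully choosing the parameters of the sub-routine REFINEESTIMATE and the compression operator C. For all epochs k ≥ 1, we set the number of iterations I and the batch size B of REFINEESTIMATE and the number of bits J of the compressor C to be
$$I := \left\lceil \frac{2}{\eta(1-\gamma)} \right\rceil, \qquad B := \left\lceil \frac{2}{M} \left( \frac{12\gamma}{1-\gamma} \right)^{2} \log\left( \frac{8KI|S||A|}{\delta} \right) \right\rceil, \qquad J := \left\lceil \log_2\left( \frac{70}{\eta(1-\gamma)} \cdot \frac{2}{M} \log\left( \frac{8KI|S||A|}{\delta} \right) \right) \right\rceil,$$
respectively. The total number of epochs is set to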
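$$K = \left\lceil \frac{1}{2} \log_2\left( \frac{1}{1-\gamma} \right) \right\rceil + \left\lceil \frac{1}{2} \log_2\left( \frac{1}{(1-\gamma)\varepsilon^{2}} \right) \right\rceil.$$
The recentering sample sizes L_k and bounds D_k are set to be the following functions of the epoch index k:
$$L_k := \frac{19600}{(1-\gamma)^{2}} \log\left( \frac{8KI|S||A|}{\delta} \right) \cdot \begin{cases} 4^{k} & \text{if } k \le K_0, \\ 4^{k-K_0} & \text{if } k > K_0; \end{cases} \qquad D_k := 16 \cdot \frac{2^{-k}}{1-\gamma}, \tag{13}$$
where
$$K_0 = \left\lceil \frac{1}{2} \log_2\left( \frac{1}{1-\gamma} \right) \right\rceil.$$
The piecewise definition of L_k is crucial to obtain the optimal dependence with respect to 1/(1-γ), similar to the two-step procedure outlined in Wainwright [2019b].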
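For a concrete feel of these settings, the sketch below evaluates K_0, K, I, B, L_k and D_k for one set of inputs. The function name and the example inputs (γ = 0.75, ε = 0.5, η = 0.5, M = 10, |S| = 4, |A| = 2, δ = 0.1) are hypothetical, and the number of bits J is left out of the sketch.

```python
import math


def fed_dvr_q_parameters(gamma, eps, eta, M, num_states, num_actions, delta):
    """Evaluate the epoch, iteration, batch and recentering parameters."""
    K0 = math.ceil(0.5 * math.log2(1.0 / (1.0 - gamma)))
    K = K0 + math.ceil(0.5 * math.log2(1.0 / ((1.0 - gamma) * eps ** 2)))
    I = math.ceil(2.0 / (eta * (1.0 - gamma)))
    log_term = math.log(8 * K * I * num_states * num_actions / delta)
    B = math.ceil((2.0 / M) * (12.0 * gamma / (1.0 - gamma)) ** 2 * log_term)

    def L(k):
        # Piecewise growth: the exponent restarts once k exceeds K0.
        growth = 4 ** k if k <= K0 else 4 ** (k - K0)
        return 19600.0 / (1.0 - gamma) ** 2 * log_term * growth

    def D(k):
        return 16.0 * 2.0 ** (-k) / (1.0 - gamma)

    return {"K0": K0, "K": K, "I": I, "B": B, "L": L, "D": D}


params = fed_dvr_q_parameters(
    gamma=0.75, eps=0.5, eta=0.5, M=10, num_states=4, num_actions=2, delta=0.1
)
```

For these inputs K_0 = 1, K = 3 and I = 16, and D_k halves from D_1 = 32 with every epoch. Because of the piecewise definition, L_2 equals L_1 and the geometric growth of L_k resumes only after epoch K_0.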

Section: Performance Guarantees
The following theorem characterizes the sample and communication complexity of Fed-DVR-Q.
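Theorem 2. Consider any δ ∈ (0, 1) and ε ∈ (0, 1]. Under the federated learning setup described in Section 2.1, the sample and communication complexities of the Fed-DVR-Q algorithm, when run with the choice of parameters described in Sec. 4.1.3 and a learning rate η ∈ (0, 1), satisfy the following relations for some universal constant C_1 > 0:
$$\mathrm{SC}(\text{Fed-DVR-Q}; \varepsilon, M, \delta) \le \frac{C_1}{\eta M (1-\gamma)^{3} \varepsilon^{2}} \log_2\left( \frac{1}{(1-\gamma)\varepsilon} \right) \log\left( \frac{8KI|S||A|}{\delta} \right),$$
$$\mathrm{CC}_{\mathrm{round}}(\text{Fed-DVR-Q}; \varepsilon, \delta) \le \frac{16}{\eta(1-\gamma)} \log_2\left( \frac{1}{(1-\gamma)\varepsilon} \right),$$
$$\mathrm{CC}_{\mathrm{bit}}(\text{Fed-DVR-Q}; \varepsilon, \delta) \le \frac{32|S||A|}{\eta(1-\gamma)} \log_2\left( \frac{1}{(1-\gamma)\varepsilon} \right) \log_2\left( \frac{70}{\eta(1-\gamma)} \cdot \frac{2}{M} \log\left( \frac{8KI|S||A|}{\delta} \right) \right).$$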
A proof of Theorem 2 can be found in Appendix C. A few implications of the theorem are in order.
Optimal Sample-Communication complexity trade-off. As shown by the above theorem, Fed-DVR-Q offers a linear speedup in the sample complexity with respect to the number of agents while simultaneously achieving the same order of communication complexity as dictated by the lower bound derived in Theorem 1, both in terms of frequency and bit level complexity. Moreover, Fed-DVR-Q is the first Federated Q-learning algorithm that achieves a sample complexity with optimal dependence on all the salient parameters, i.e., |S|, |A| and 1/(1-γ), in addition to linear speedup w.r.t. the number of agents, and thereby bridges the existing gap between upper and lower bounds on sample complexity for Federated Q-learning. Thus, Theorems 1 and 2 together provide a characterization of the optimal operating point of the sample-communication complexity trade-off in Federated Q-learning.
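As a quick sanity check on the round complexity, one can count the communication rounds implied by the parameter choices and compare them with the bound of Theorem 2. The count K(I + 1) used below is our reading of the algorithm description (one recentering round plus I update rounds in each of the K epochs), and the input values are hypothetical.

```python
import math


def communication_rounds(gamma, eps, eta):
    """Rounds used under the text's choices of K and I, counting I + 1 rounds per epoch."""
    K0 = math.ceil(0.5 * math.log2(1.0 / (1.0 - gamma)))
    K = K0 + math.ceil(0.5 * math.log2(1.0 / ((1.0 - gamma) * eps ** 2)))
    I = math.ceil(2.0 / (eta * (1.0 - gamma)))
    return K * (I + 1)


def round_bound(gamma, eps, eta):
    """Right-hand side of the CC_round bound in Theorem 2."""
    return 16.0 / (eta * (1.0 - gamma)) * math.log2(1.0 / ((1.0 - gamma) * eps))


rounds = communication_rounds(gamma=0.75, eps=0.5, eta=0.5)
bound = round_bound(gamma=0.75, eps=0.5, eta=0.5)
```

For these inputs the count is 3 × 17 = 51 rounds against a bound of 384, and both quantities scale as 1/(1-γ) up to logarithmic factors.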
Role of Minibatching. The commonly adopted approach in intermittent communication algorithms is to use a local update scheme that takes multiple small (i.e., $B = O(1)$), noisy updates between communication rounds, as evident from the algorithm design in Khodadadian et al. [2022], Woo et al. [2023] and even numerous FL algorithms for stochastic optimization [McMahan et al., 2017, Haddadpour et al., 2019, Khaled et al., 2020]. In Fed-DVR-Q, we replace the local update scheme of taking multiple small, noisy updates by a single, large update with smaller variance, obtained by averaging the noisy updates over a minibatch of samples. The use of updates with smaller variance, as in variance-reduced Q-learning, gives the algorithm its name. While both approaches result in similar sample complexity guarantees, the local update scheme requires more frequent averaging across clients to ensure that the bias of the estimate, also commonly referred to as "client drift", is not too large. On the other hand, the minibatching approach does not encounter the problem of bias accumulation from local updates and hence can afford more infrequent averaging, allowing Fed-DVR-Q to achieve the optimal order of communication complexity.
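The variance reduction obtained from minibatching can be illustrated with a small Monte Carlo sketch. It is not part of Fed-DVR-Q: it only assumes that a transition probability $p$ is estimated by averaging $B$ independent Bernoulli($p$) draws, and all function names are hypothetical.

```python
import random
import statistics

def minibatch_estimate(p, batch_size, rng):
    """Average of `batch_size` Bernoulli(p) draws, i.e., a minibatch estimate of p."""
    return sum(rng.random() < p for _ in range(batch_size)) / batch_size

def estimate_variance(p, batch_size, num_trials=20000, seed=0):
    """Empirical variance of the minibatch estimate over repeated trials."""
    rng = random.Random(seed)
    draws = [minibatch_estimate(p, batch_size, rng) for _ in range(num_trials)]
    return statistics.pvariance(draws)

# A single noisy sample (B = 1) versus one averaged update over B = 25 samples.
var_single = estimate_variance(p=0.9, batch_size=1)
var_batch = estimate_variance(p=0.9, batch_size=25)
```

Under these assumptions the variance of the averaged estimate is $p(1-p)/B$, so `var_batch` is roughly 25 times smaller than `var_single`.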
Compression. Fed-DVR-Q is the first algorithm in Federated Q-learning to analyze and establish communication complexity at the bit level. All existing studies on Federated RL focus only on the frequency of communication and assume transmission of real numbers with infinite bit precision. In contrast, our analysis provides a more holistic view of communication complexity by providing bounds at the bit level, which is of great practical significance. While some other recent studies [Wang et al., 2023] also consider quantization in Federated RL, their objective is to understand the impact of message size on convergence with no constraint on the frequency of communication, unlike the holistic viewpoint adopted in this work.

Section: Conclusion and Future Directions
We presented a complete characterization of the sample-communication trade-off for Federated Q-learning algorithms with intermittent communication. We showed that no Federated Q-learning algorithm with intermittent communication can achieve a linear speedup with respect to the number of agents if its number of communication rounds is sublinear in $\frac{1}{1-\gamma}$. We also proposed a new Federated Q-learning algorithm called Fed-DVR-Q that uses variance reduction along with minibatching to achieve order-optimal sample and communication complexities. In particular, we showed that Fed-DVR-Q has a sample complexity of $\tilde{O}\left(\frac{|S||A|}{M(1-\gamma)^3\varepsilon^2}\right)$, which is order-optimal in all salient problem parameters, and a communication complexity of $\tilde{O}\left(\frac{1}{1-\gamma}\right)$ rounds and $\tilde{O}\left(\frac{|S||A|}{1-\gamma}\right)$ bits. The results in this work raise several interesting questions that are worth exploring. While we focus on the tabular setting in this work, it is of great interest to investigate the trade-off in other settings where function approximation is used to model the $Q^\star$ and $V^\star$ functions. Moreover, it is interesting to explore the trade-off in the finite horizon setting, where there is no discount factor. Furthermore, it is also worthwhile to explore whether the communication complexity can be further reduced by going beyond the class of intermittent communication algorithms.

Section: A Additional details about REFINEESTIMATE
We outline below the pseudocode of the REFINEESTIMATE routine described in Sec. 4.1.1.
    Send $\mathcal{C}(\mathcal{T}^{(m)}_L(Q) - Q; D, J)$ to the server
 6: Receive $\frac{1}{M}\sum_{m=1}^{M} \mathcal{C}(\mathcal{T}^{(m)}_L(Q) - Q; D, J)$
    ...
    Compute $Q^m_{i-\frac{1}{2}}$ according to Eqn. (11)
14: Send $\mathcal{C}(Q^m_{i-\frac{1}{2}} - Q_{i-1}; D, J)$ to the server
15: Receive $\frac{1}{M}\sum_{m=1}^{M} \mathcal{C}(Q^m_{i-\frac{1}{2}} - Q_{i-1}; D, J)$ from the server and compute $Q_i$ according to Eqn. (12)
16: end for
17: end for
18: return $Q_I$

Section: B Proof of Theorem 1
In this section, we prove the main result of the paper, the lower bound on the communication complexity of federated Q-learning algorithms. At a high level, the proof consists of the following three steps.
Introducing the "hard" MDP instance. The proof builds upon analyzing the behavior of a generic algorithm A outlined in Algorithm 1 over a particular instance of MDP. The particular choice of MDP is inspired by, and borrowed from, other lower bound proofs in the single-agent setting [Li et al., 2023] and helps highlight core issues that lie at the heart of the sample-communication complexity trade-off. Following Li et al. [2023], the construction is first over a small state-action space that allows us to focus on a simpler problem before generalizing it to larger state-action spaces.
Establishing the performance of intermittent communication algorithms. In the second step, we analyze the error of the iterates generated by an intermittent communication algorithm A . The analysis is inspired by the single-agent analysis in Li et al. [2023], which highlights the underlying bias-variance trade-off. Through careful analysis of the algorithm dynamics in the federated setting, we uncover the impact of communication on the bias-variance trade-off and the resulting error of the iterates to obtain the lower bound on the communication complexity.
Generalization to larger MDPs. As the last step, we generalize our construction of the "hard" instance to more general state-action spaces and extend our insights to obtain the statement of the theorem.

Section: B.1 Introducing the "hard" instance
We first introduce an MDP instance $\mathcal{M}_h$ that we will use throughout the proof to establish the result. Note that this MDP is identical to the one considered in Li et al. [2023] to establish the lower bounds on the performance of single-agent Q-learning algorithms. It consists of four states, $\mathcal{S} = \{0, 1, 2, 3\}$. Let $\mathcal{A}_s$ denote the action set associated with the state $s$. The probability transition kernel and the reward function of $\mathcal{M}_h$ are given as follows:
\begin{align}
&\mathcal{A}_0 = \{1\} && P(0|0,1) = 1 && && r(0,1) = 0, \tag{14a}\\
&\mathcal{A}_1 = \{1,2\} && P(1|1,1) = p && P(0|1,1) = 1-p && r(1,1) = 1, \tag{14b}\\
& && P(1|1,2) = p && P(0|1,2) = 1-p && r(1,2) = 1, \tag{14c}\\
&\mathcal{A}_2 = \{1\} && P(2|2,1) = p && P(0|2,1) = 1-p && r(2,1) = 1, \tag{14d}\\
&\mathcal{A}_3 = \{1\} && P(3|3,1) = 1 && && r(3,1) = 1, \tag{14e}
\end{align}
where the parameter $p = \frac{4\gamma - 1}{3\gamma}$. We have the following results about the optimal Q and V functions of this hard MDP instance.
Lemma 1 ([Li et al., 2023, Lemma 3]). Consider the MDP $\mathcal{M}_h$ constructed in Eqn. (14). We have,
$$V^\star(0) = Q^\star(0,1) = 0,$$
$$V^\star(1) = Q^\star(1,1) = Q^\star(1,2) = V^\star(2) = Q^\star(2,1) = \frac{1}{1-\gamma p} = \frac{3}{4(1-\gamma)},$$
$$V^\star(3) = Q^\star(3,1) = \frac{1}{1-\gamma}.$$
Throughout the next section of the proof, we focus on this MDP with four states and two actions. In Appendix B.4, we generalize the proof to larger state-action spaces.
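The values in Lemma 1 can be checked numerically. The sketch below builds the four-state instance of Eqn. (14) and runs plain Q-value iteration on it; the helper names `hard_mdp` and `q_value_iteration` are illustrative and the iteration count is an arbitrary choice that is more than sufficient for $\gamma = 0.9$.

```python
def hard_mdp(gamma):
    """Transition kernel and rewards of the hard instance in Eqn. (14)."""
    p = (4 * gamma - 1) / (3 * gamma)
    # transitions[(s, a)] is a list of (next_state, probability) pairs.
    transitions = {
        (0, 1): [(0, 1.0)],
        (1, 1): [(1, p), (0, 1 - p)],
        (1, 2): [(1, p), (0, 1 - p)],
        (2, 1): [(2, p), (0, 1 - p)],
        (3, 1): [(3, 1.0)],
    }
    rewards = {(0, 1): 0.0, (1, 1): 1.0, (1, 2): 1.0, (2, 1): 1.0, (3, 1): 1.0}
    return transitions, rewards

def q_value_iteration(transitions, rewards, gamma, num_iters=5000):
    """Iterate the Bellman optimality operator starting from Q = 0."""
    q = {sa: 0.0 for sa in transitions}
    for _ in range(num_iters):
        v = {}
        for (s, _a), value in q.items():
            v[s] = max(v.get(s, float("-inf")), value)
        q = {
            sa: rewards[sa] + gamma * sum(prob * v[s2] for s2, prob in transitions[sa])
            for sa in transitions
        }
    return q

gamma = 0.9
q_star = q_value_iteration(*hard_mdp(gamma), gamma)
```

For $\gamma = 0.9$ the fixed point agrees with Lemma 1: $Q^\star(1,1) = Q^\star(1,2) = Q^\star(2,1) = 3/(4(1-\gamma)) = 7.5$ and $Q^\star(3,1) = 1/(1-\gamma) = 10$.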

Section: B.2 Notation and preliminary results
For convenience, we first define some notation that will be used throughout the proof.
Useful relations of the learning rates. We consider two kinds of step size sequences that are commonly used in Q-learning: the constant step size schedule, i.e., $\eta_t = \eta$ for all $t \in \{1, 2, \dots, T\}$, and the rescaled linear step size schedule, i.e., $\eta_t = \frac{1}{1 + c_\eta(1-\gamma)t}$, where $c_\eta > 0$ is a universal constant that is independent of the problem parameters.
We define the following quantities:
$$\eta^{(t)}_k = \eta_k \prod_{i=k+1}^{t} (1 - \eta_i(1-\gamma p)) \quad \text{for all } 0 \le k \le t, \tag{15}$$
where we take $\eta_0 = 1$ and use the convention throughout the proof that if a product operation does not have a valid index, we take the value of that product to be 1. For any integer $0 \le \tau < t$, we have the following relation, which will be proved at the end of this subsection for completeness:
$$\prod_{k=\tau+1}^{t} (1 - \eta_k(1-\gamma p)) + (1-\gamma p)\sum_{k=\tau+1}^{t} \eta^{(t)}_k = 1. \tag{16}$$
Similarly, we also define,
$$\tilde{\eta}^{(t)}_k = \eta_k \prod_{i=k+1}^{t} (1 - \eta_i) \quad \text{for all } 0 \le k \le t, \tag{17}$$
which satisfies the relation
$$\prod_{k=\tau+1}^{t} (1 - \eta_k) + \sum_{k=\tau+1}^{t} \tilde{\eta}^{(t)}_k = 1 \tag{18}$$
for any integer $0 \le \tau < t$. The claim follows immediately by plugging $p = 0$ in (16). Note that for the constant step size, the sequence $\tilde{\eta}^{(t)}_k$ is clearly increasing in $k$. For the rescaled linear step size, we have,
$$\frac{\tilde{\eta}^{(t)}_{k-1}}{\tilde{\eta}^{(t)}_k} = \frac{\eta_{k-1}}{\eta_k}(1 - \eta_k) = 1 - \frac{(1 - c_\eta(1-\gamma))\eta_k}{1 - c_\eta(1-\gamma)\eta_k} \le 1 \tag{19}$$
whenever $c_\eta \le \frac{1}{1-\gamma}$. Thus, $\tilde{\eta}^{(t)}_k$ is an increasing sequence as long as $c_\eta \le \frac{1}{1-\gamma}$. Similarly, $\eta^{(t)}_k$ is also clearly increasing for the constant step size schedule. For the rescaled linear step size schedule, we have,
$$\frac{\eta^{(t)}_{k-1}}{\eta^{(t)}_k} = \frac{\eta_{k-1}}{\eta_k}(1 - \eta_k(1-\gamma p)) \le \frac{\eta_{k-1}}{\eta_k}(1 - \eta_k) \le 1, \quad \text{whenever } c_\eta \le \frac{1}{1-\gamma}.$$
The last bound follows from Eqn. (19).
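Proof of (16). We can show the claim using backward induction. For the base case, note that,
$$(1-\gamma p)\eta^{(t)}_t + (1-\gamma p)\eta^{(t)}_{t-1} = (1-\gamma p)\eta_t + (1-\gamma p)\eta_{t-1}(1 - (1-\gamma p)\eta_t) = 1 - (1 - \eta_t(1-\gamma p))(1 - \eta_{t-1}(1-\gamma p)) = 1 - \prod_{k=t-1}^{t} (1 - \eta_k(1-\gamma p)),$$
as required. Assume (16) is true for some $\tau$. We have,
$$\begin{aligned}
(1-\gamma p)\sum_{k=\tau}^{t} \eta^{(t)}_k &= (1-\gamma p)\eta^{(t)}_\tau + (1-\gamma p)\sum_{k=\tau+1}^{t} \eta^{(t)}_k \\
&= (1-\gamma p)\eta_\tau \prod_{k=\tau+1}^{t} (1 - \eta_k(1-\gamma p)) + 1 - \prod_{k=\tau+1}^{t} (1 - \eta_k(1-\gamma p)) \\
&= 1 - \prod_{k=\tau}^{t} (1 - \eta_k(1-\gamma p)),
\end{aligned}$$
thus completing the induction step.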
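The identity (16) is also easy to confirm numerically for either step size schedule. The following sketch evaluates its left-hand side directly from the definition (15); the function names are hypothetical and the step size schedule is passed in as a callable.

```python
import math

def rescaled_linear(t, c_eta, gamma):
    """Rescaled linear step size eta_t = 1 / (1 + c_eta (1 - gamma) t)."""
    return 1.0 / (1.0 + c_eta * (1.0 - gamma) * t)

def identity_16_lhs(eta, tau, t, gamma, p):
    """Left-hand side of (16) for a step size function eta(k), with 0 <= tau < t."""
    contraction = 1.0 - gamma * p
    product_term = math.prod(1.0 - eta(k) * contraction for k in range(tau + 1, t + 1))
    # eta_k^{(t)} = eta_k * prod_{i=k+1}^{t} (1 - eta_i (1 - gamma p)); an empty product equals 1.
    sum_term = sum(
        eta(k) * math.prod(1.0 - eta(i) * contraction for i in range(k + 1, t + 1))
        for k in range(tau + 1, t + 1)
    )
    return product_term + contraction * sum_term
```

For instance, with $\gamma = 0.9$, $p = (4\gamma-1)/(3\gamma)$ and either a constant or a rescaled linear schedule, `identity_16_lhs` returns 1 up to floating-point error; setting `p = 0` recovers the relation (18).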
$$\widehat{P}^m_t(s'|s,a) = \frac{1}{B}\sum_{b=1}^{B} P^m_{t,b}(s'|s,a), \tag{20}$$
where $P^m_{t,b}(s'|s,a) = \mathbb{1}\{Z^m_{t,b}(s,a) = s'\}$ for $s' \in \mathcal{S}$.
Preliminary relations of the iterates. We state some preliminary relations regarding the evolution of the Q-function and the value function across different agents that will be helpful for the analysis later.
We begin with state 0, where we have $Q^m_t(0,1) = V^m_t(0) = 0$ for all agents $m \in [M]$ and $t \in [T]$. This follows almost immediately from the fact that state 0 is an absorbing state with zero reward. Note that $Q^m_0(0,1) = V^m_0(0) = 0$ holds for all clients $m \in [M]$. Assuming that $Q^m_{t-1}(0,1) = V^m_{t-1}(0) = 0$ for all clients for some time instant $t-1$, by induction, we have,
$$Q^m_{t-1/2}(0,1) = (1-\eta_t)Q^m_{t-1}(0,1) + \eta_t(\gamma V^m_{t-1}(0)) = 0.$$
Consequently, $Q^m_t(0,1) = 0$ and $V^m_t(0) = 0$ for all agents $m$, irrespective of whether there is averaging.
For state 3, the iterates satisfy the following relation:
$$\begin{aligned}
Q^m_{t-1/2}(3,1) &= (1-\eta_t)Q^m_{t-1}(3,1) + \eta_t(1 + \gamma V^m_{t-1}(3)) \\
&= (1-\eta_t)Q^m_{t-1}(3,1) + \eta_t(1 + \gamma Q^m_{t-1}(3,1)) \\
&= (1 - \eta_t(1-\gamma))Q^m_{t-1}(3,1) + \eta_t,
\end{aligned}$$
where the second step follows by noting $V^m_t(3) = Q^m_t(3,1)$. Once again, one can note that the averaging step does not affect the update rule, implying that the following holds for all $m \in [M]$ and $t \in [T]$:
$$V^m_t(3) = Q^m_t(3,1) = \sum_{k=1}^{t} \eta_k \prod_{i=k+1}^{t} (1 - \eta_i(1-\gamma)) = \frac{1}{1-\gamma}\left[1 - \prod_{i=1}^{t} (1 - \eta_i(1-\gamma))\right], \tag{21}$$
where the last step follows from Eqn. (16) with $p = 1$.
Similarly, for states 1 and 2, we have,
$$Q^m_{t-1/2}(1,1) = (1-\eta_t)Q^m_{t-1}(1,1) + \eta_t(1 + \gamma \widehat{P}^m_t(1|1,1)V^m_{t-1}(1)), \tag{22}$$
$$Q^m_{t-1/2}(1,2) = (1-\eta_t)Q^m_{t-1}(1,2) + \eta_t(1 + \gamma \widehat{P}^m_t(1|1,2)V^m_{t-1}(1)), \tag{23}$$
$$Q^m_{t-1/2}(2,1) = (1-\eta_t)Q^m_{t-1}(2,1) + \eta_t(1 + \gamma \widehat{P}^m_t(2|2,1)V^m_{t-1}(2)). \tag{24}$$
Since averaging does make a difference in the update rule for these states, we analyze these updates further in the later proofs, where they are needed.

Section: B.3 Main analysis
We first focus on establishing a bound on the number of communication rounds, i.e., CC round (A ) (where we drop the dependency with other parameters for notational simplicity), and then use this lower bound to establish the bound on the bit level communication complexity CC bit (A ).
To establish the lower bound on CC round (A ) for any intermittent communication algorithm A , we analyze the convergence behavior of A on the MDP M h . We assume that the averaging step in line 6 of Algorithm 1 is carried out exactly. Since the use of compression only makes the problem harder, it is sufficient for us to consider the case where there is no loss of information in the averaging step for establishing a lower bound. Lastly, throughout the proof, without loss of generality we assume that
$$\log N \le \frac{1}{1-\gamma}, \tag{25}$$
otherwise, the lower bound in Theorem 1 reduces to the trivial lower bound.
We divide the proof into the following three parts based on the choice of learning rates and batch sizes:
1. Small learning rates: For constant learning rates, $0 \le \eta < \frac{1}{(1-\gamma)T}$, and for rescaled linear learning rates, the constant $c_\eta$ satisfies $c_\eta \ge \log T$.
2. Large learning rates with small $\eta_T/(BM)$: For constant learning rates, $\eta \ge \frac{1}{(1-\gamma)T}$, and for rescaled linear learning rates, the constant $c_\eta$ satisfies $0 \le c_\eta \le \log T \le \frac{1}{1-\gamma}$ (cf. (25)). Additionally, the ratio $\frac{\eta_T}{BM}$ satisfies $\frac{\eta_T}{BM} \le \frac{1-\gamma}{100}$.
3. Large learning rates with large $\eta_T/(BM)$: We have the same condition on the learning rates as above. However, in this case the ratio $\frac{\eta_T}{BM}$ satisfies $\frac{\eta_T}{BM} > \frac{1-\gamma}{100}$.
We consider each of the cases separately in the following three subsections.
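The case split above can be written as a small decision procedure. The following Python sketch is purely illustrative: the function name, the string labels and the tie-breaking at the boundary $c_\eta = \log T$ (assigned here to the first case) are assumptions made for the example.

```python
import math

def classify_regime(schedule, T, B, M, gamma, eta=None, c_eta=None):
    """Return which of the three cases of the proof a parameter choice falls into."""
    if schedule == "constant":
        small = eta < 1.0 / ((1.0 - gamma) * T)
        eta_T = eta
    elif schedule == "rescaled_linear":
        small = c_eta >= math.log(T)
        eta_T = 1.0 / (1.0 + c_eta * (1.0 - gamma) * T)
    else:
        raise ValueError("schedule must be 'constant' or 'rescaled_linear'")
    if small:
        return "small learning rates"
    # Large learning rates: split further on the ratio eta_T / (B M).
    if eta_T / (B * M) <= (1.0 - gamma) / 100.0:
        return "large learning rates, small eta_T/(BM)"
    return "large learning rates, large eta_T/(BM)"
```

For example, with $\gamma = 0.9$, $T = 1000$ and $B = M = 10$, a constant step size $\eta = 0.05$ falls into the second case, while $\eta = 0.5$ falls into the third.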

Section: B.3.1 Small learning rates
In this subsection, we prove the lower bound for small learning rates, which follows from arguments similar to those in Li et al. [2023].
For this case, we focus on the dynamics of state 2. We claim that the same relation established in Li et al. [2023] continues to hold, which will be established momentarily:
$$E[V^m_T(2)] = \frac{1}{M}\sum_{j=1}^{M} E[V^j_T(2)] = \sum_{k=1}^{T} \eta^{(T)}_k = \frac{1 - \eta^{(T)}_0}{1-\gamma p}. \tag{26}$$
Consequently, for all $m \in [M]$, we have
$$V^\star(2) - E[V^m_T(2)] = \frac{\eta^{(T)}_0}{1-\gamma p}. \tag{27}$$
To obtain a lower bound on $V^\star(2) - E[V^m_T(2)]$, we need a lower bound on $\eta^{(T)}_0$, which follows from [Li et al., 2023, Eqn. (120)]:
$$\log(\eta^{(T)}_0) \ge -1.5\sum_{t=1}^{T} \eta_t(1-\gamma p) \ge -2\sum_{t=1}^{T} \frac{1}{t\log T} \ge -2 \implies \eta^{(T)}_0 \ge e^{-2}$$
when $T \ge 16$ for both choices of learning rates. On plugging this bound into (27), we obtain that
$$E[\|Q^m_T - Q^\star\|_\infty] \ge E[|Q^\star(2,1) - Q^m_T(2,1)|] \ge V^\star(2) - E[V^m_T(2)] \ge \frac{3}{4e^2(1-\gamma)\sqrt{N}} \tag{28}$$
holds for all $m \in [M]$, $N \ge 1$ and $M \ge 2$. Thus, the error rate $\mathrm{ER}(\mathscr{A}; N, M)$ admits a lower bound that does not improve with the number of agents or the number of communication rounds. In other words, even with $\mathrm{CC}_{\mathrm{round}} = \Omega(T)$, we will not observe any collaborative gain if the step size is too small.
Proof of (26). Recall that from (24), we have,
$$Q^m_{t-1/2}(2,1) = (1-\eta_t)V^m_{t-1}(2) + \eta_t(1 + \gamma \widehat{P}^m_t(2|2,1)V^m_{t-1}(2)).$$
Here, $Q^m_{t-1}(2,1) = V^m_{t-1}(2)$ as state 2 has only a single action.
• When $t$ is not an averaging instant, we have,
$$V^m_t(2) = Q^m_t(2,1) = (1-\eta_t)V^m_{t-1}(2) + \eta_t(1 + \gamma \widehat{P}^m_t(2|2,1)V^m_{t-1}(2)). \tag{29}$$
On taking expectation on both sides of the equation, we obtain,
$$\begin{aligned}
E[V^m_t(2)] &= (1-\eta_t)E[V^m_{t-1}(2)] + \eta_t\left(1 + \gamma E[\widehat{P}^m_t(2|2,1)V^m_{t-1}(2)]\right) \\
&= (1-\eta_t)E[V^m_{t-1}(2)] + \eta_t\left(1 + \gamma E[\widehat{P}^m_t(2|2,1)]E[V^m_{t-1}(2)]\right) \\
&= (1-\eta_t)E[V^m_{t-1}(2)] + \eta_t\left(1 + \gamma p E[V^m_{t-1}(2)]\right) \\
&= (1 - \eta_t(1-\gamma p))E[V^m_{t-1}(2)] + \eta_t.
\end{aligned} \tag{30}$$
In the second step, we used the fact that $\widehat{P}^m_t(2|2,1)$ is independent of $V^m_{t-1}(2)$.
• Similarly, if $t$ is an averaging instant, we have,
$$V^m_t(2) = Q^m_t(2,1) = \frac{1}{M}\sum_{j=1}^{M} Q^j_{t-1/2}(2,1) = (1-\eta_t)\frac{1}{M}\sum_{j=1}^{M} V^j_{t-1}(2) + \frac{1}{M}\sum_{j=1}^{M} \eta_t(1 + \gamma \widehat{P}^j_t(2|2,1)V^j_{t-1}(2)). \tag{31}$$
Once again, upon taking expectation we obtain,
$$\begin{aligned}
E[V^m_t(2)] &= (1-\eta_t)\frac{1}{M}\sum_{j=1}^{M} E[V^j_{t-1}(2)] + \frac{1}{M}\sum_{j=1}^{M} \eta_t\left(1 + \gamma E[\widehat{P}^j_t(2|2,1)V^j_{t-1}(2)]\right) \\
&= (1-\eta_t)\frac{1}{M}\sum_{j=1}^{M} E[V^j_{t-1}(2)] + \frac{1}{M}\sum_{j=1}^{M} \eta_t\left(1 + \gamma p E[V^j_{t-1}(2)]\right) \\
&= (1 - \eta_t(1-\gamma p))\left(\frac{1}{M}\sum_{j=1}^{M} E[V^j_{t-1}(2)]\right) + \eta_t.
\end{aligned} \tag{32}$$
Eqns. (30) and (32) together imply that for all $t \in [T]$,
$$\frac{1}{M}\sum_{m=1}^{M} E[V^m_t(2)] = (1 - \eta_t(1-\gamma p))\,\frac{1}{M}\sum_{m=1}^{M} E[V^m_{t-1}(2)] + \eta_t. \tag{33}$$
On unrolling the above recursion with $V^m_0 = 0$ for all $m \in [M]$, we obtain the desired claim (26).
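The fact that the expected iterate at state 2 does not depend on when averaging happens can be illustrated by simulation. The sketch below is a simplified stand-in for the generic algorithm restricted to state 2, with a constant step size and exact averaging; the function names, the number of trials and the seeds are choices made only for this example.

```python
import random

def p_hard(gamma):
    """Transition parameter p = (4 gamma - 1) / (3 gamma) of the hard instance."""
    return (4.0 * gamma - 1.0) / (3.0 * gamma)

def run_once(gamma, eta, T, M, B, averaging_instants, rng):
    """One run of the state-2 updates (29) and (31); returns the agent-averaged V_T(2)."""
    p = p_hard(gamma)
    v = [0.0] * M
    for t in range(1, T + 1):
        half_step = []
        for m in range(M):
            # Empirical transition probability from a minibatch of B samples.
            p_hat = sum(rng.random() < p for _ in range(B)) / B
            half_step.append((1.0 - eta) * v[m] + eta * (1.0 + gamma * p_hat * v[m]))
        if t in averaging_instants:
            v = [sum(half_step) / M] * M
        else:
            v = half_step
    return sum(v) / M

def monte_carlo_mean(gamma, eta, T, M, B, averaging_instants, trials=2000, seed=0):
    rng = random.Random(seed)
    total = sum(run_once(gamma, eta, T, M, B, averaging_instants, rng) for _ in range(trials))
    return total / trials

def closed_form_mean(gamma, eta, T):
    """Right-hand side of (26) for a constant step size."""
    contraction = 1.0 - gamma * p_hard(gamma)
    return (1.0 - (1.0 - eta * contraction) ** T) / contraction
```

Running `monte_carlo_mean` with no averaging instants and with averaging every five steps gives estimates that both agree with `closed_form_mean` up to Monte Carlo error, consistent with (26).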

Section: B.3.2 Large learning rates with small η_T/(BM)
In this subsection, we prove the lower bound for the case of large learning rates when the ratio $\frac{\eta_T}{BM}$ is small. For the analysis in this part, we focus on the dynamics of state 1. Unless otherwise specified, throughout the section we implicitly assume that the state is 1.
We further define a parameter that will play a key role in the analysis:
$$\tau := \min\{k \in \mathbb{N} : \forall\, t \ge k,\ \eta_t \le \eta_k \le 3\eta_t\}. \tag{34}$$
It can be noted that $\tau = 1$ for the constant step size sequence and $\tau = T/3$ for the rescaled linear step size.
Step 1: introducing an auxiliary sequence. We define an auxiliary sequence $\widehat{Q}^m_t(a)$ for $a \in \{1, 2\}$ and all $t = 1, 2, \dots, T$ to aid our analysis, where we drop the dependency on the state $s = 1$ for simplicity. The evolution of the sequence $\widehat{Q}^m_t$ is defined in Algorithm 4, where $\widehat{V}^m_t = \max_{a \in \{1,2\}} \widehat{Q}^m_t(a)$. In other words, the iterates $\{\widehat{Q}^m_t\}$ evolve exactly as the iterates of Algorithm 1 except for the fact that the sequence $\{\widehat{Q}^m_t\}$ is initialized at the optimal Q-function of the MDP. We would like to point out that we assume that the underlying stochasticity is also identical, in the sense that the evolution of both $Q^m_t$ and $\widehat{Q}^m_t$ is governed by the same $\widehat{P}^m_t$ matrices. The following lemma controls the error between the iterates $Q^m_t$ and $\widehat{Q}^m_t$, allowing us to focus only on $\widehat{Q}^m_t$.
Algorithm 4: Evolution of $\widehat{Q}$
1: Input: $T, R, \{\eta_t\}_{t=1}^{T}, \mathcal{C} = \{t_r\}_{r=1}^{R}, B$
2: Set $\widehat{Q}^m_0(a) \leftarrow Q^\star(1, \ldots$
$$Q^m_t(1,a) - \widehat{Q}^m_t(a) \ge -\frac{1}{1-\gamma}\prod_{i=1}^{t} (1 - \eta_i(1-\gamma)).$$
By Lemma 2, bounding the error of the sequence $\widehat{Q}^m_t$ allows us to obtain a bound on the error of $Q^m_t$. To that effect, we define the following terms for any $t \le T$ and all $m \in [M]$:
$$\Delta^m_t(a) := \widehat{Q}^m_t(a) - Q^\star(1,a), \quad a = 1, 2; \qquad \Delta^m_{t,\max} := \max_{a \in \{1,2\}} \Delta^m_t(a).$$
In addition, we use $\Delta_t = \frac{1}{M}\sum_{m=1}^{M} \Delta^m_t$ to denote the error of the averaged iterate, and similarly,
$$\Delta_{t,\max} := \max_{a \in \{1,2\}} \Delta_t(a). \tag{35}$$
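We first derive a basic recursion regarding $\Delta^m_t(a)$. From the iterative update rule in Algorithm 4, we have,
$$\begin{aligned}
\Delta^m_t(a) &= (1-\eta_t)\Delta^m_{t-1}(a) + \eta_t\left(1 + \gamma \widehat{P}^m_t(1|1,a)\widehat{V}^m_{t-1} - Q^\star(1,a)\right) \\
&= (1-\eta_t)\Delta^m_{t-1}(a) + \eta_t\gamma\left(\widehat{P}^m_t(1|1,a)\widehat{V}^m_{t-1} - pV^\star(1)\right) \\
&= (1-\eta_t)\Delta^m_{t-1}(a) + \eta_t\gamma\left(p(\widehat{V}^m_{t-1} - V^\star(1)) + (\widehat{P}^m_t(1|1,a) - p)\widehat{V}^m_{t-1}\right) \\
&= (1-\eta_t)\Delta^m_{t-1}(a) + \eta_t\gamma\left(p\Delta^m_{t-1,\max} + (\widehat{P}^m_t(1|1,a) - p)\widehat{V}^m_{t-1}\right).
\end{aligned}$$
Here, the second line uses $Q^\star(1,a) = 1 + \gamma p V^\star(1)$, and in the last line we used the following relation:
$$\Delta^m_{t,\max} = \max_{a \in \{1,2\}} \left(\widehat{Q}^m_t(a) - Q^\star(1,a)\right) = \max_{a \in \{1,2\}} \widehat{Q}^m_t(a) - V^\star(1) = \widehat{V}^m_t - V^\star(1),$$
as $Q^\star(1,1) = Q^\star(1,2) = V^\star(1)$.
Recursively unrolling the above expression, and using the expression (17), we obtain the following relation for any $t' < t$ when there is no averaging during the interval $(t', t)$:
$$\Delta^m_t(a) = \left(\prod_{k=t'+1}^{t} (1-\eta_k)\right)\Delta^m_{t'}(a) + \sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma\left(p\Delta^m_{k-1,\max} + (\widehat{P}^m_k(1|1,a) - p)\widehat{V}^m_{k-1}\right). \tag{36}$$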
For any $t', t$ with $t' < t$, we define the notation
$$\varphi_{t',t} := \prod_{k=t'+1}^{t} (1-\eta_k), \tag{37}$$
$$\xi^m_{t',t}(a) := \sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma(\widehat{P}^m_k(1|1,a) - p)\widehat{V}^m_{k-1}, \quad a = 1, 2; \tag{38}$$
$$\xi^m_{t',t,\max} := \max_{a \in \{1,2\}} \xi^m_{t',t}(a). \tag{39}$$
Note that by definition, $E[\xi^m_{t',t}(a)] = 0$ for $a \in \{1,2\}$ and all $m$, $t'$ and $t$. Plugging them into the previous expression leads to the simplified expression
$$\Delta^m_t(a) = \varphi_{t',t}\Delta^m_{t'}(a) + \sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma p\Delta^m_{k-1,\max} + \xi^m_{t',t}(a).$$
We specifically choose $t'$ and $t$ to be consecutive averaging instants to analyze the behaviour of $\Delta^m_t$ across two averaging instants. Since all agents hold the averaged iterate at $t'$, we can rewrite the above equation as
$$\Delta^m_t(a) = \varphi_{t',t}\Delta_{t'}(a) + \sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma p\Delta^m_{k-1,\max} + \xi^m_{t',t}(a). \tag{40}$$
Furthermore, after averaging, we obtain,
$$\Delta_t(a) = \varphi_{t',t}\Delta_{t'}(a) + \frac{1}{M}\sum_{m=1}^{M}\sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma p\Delta^m_{k-1,\max} + \frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(a). \tag{41}$$
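Step 2: deriving a recursive bound for $E[\Delta_{t,\max}]$. Bounding (40), we obtain,
$$\Delta^m_{t,\max} \ge \varphi_{t',t}\Delta_{t',\max} + \sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma p\Delta^m_{k-1,\max} + \xi^m_{t',t,\max} - \varphi_{t',t}|\Delta_{t'}(1) - \Delta_{t'}(2)|, \tag{42a}$$
$$\Delta^m_{t,\max} \le \varphi_{t',t}\Delta_{t',\max} + \sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma p\Delta^m_{k-1,\max} + \xi^m_{t',t,\max}, \tag{42b}$$
where in the first step we used the fact that
$$\max\{a_1 + b_1, a_2 + b_2\} \ge \min\{a_1, a_2\} + \max\{b_1, b_2\} = \max\{a_1, a_2\} + \max\{b_1, b_2\} - |a_1 - a_2|. \tag{43}$$
On taking expectation, we obtain,
$$E[\Delta^m_{t,\max}] \ge \varphi_{t',t}E[\Delta_{t',\max}] + \sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma p E[\Delta^m_{k-1,\max}] + E[\xi^m_{t',t,\max}] - \varphi_{t',t}E[|\Delta_{t'}(1) - \Delta_{t'}(2)|], \tag{44a}$$
$$E[\Delta^m_{t,\max}] \le \varphi_{t',t}E[\Delta_{t',\max}] + \sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma p E[\Delta^m_{k-1,\max}] + E[\xi^m_{t',t,\max}]. \tag{44b}$$
Similarly, using (41) and (43) we can write,
$$\Delta_{t,\max} \ge \varphi_{t',t}\Delta_{t',\max} + \frac{1}{M}\sum_{m=1}^{M}\sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma p\Delta^m_{k-1,\max} - \varphi_{t',t}|\Delta_{t'}(1) - \Delta_{t'}(2)| + \max\left\{\frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(1), \frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(2)\right\} \tag{45a}$$
$$\implies E[\Delta_{t,\max}] \ge \varphi_{t',t}E[\Delta_{t',\max}] + \frac{1}{M}\sum_{m=1}^{M}\sum_{k=t'+1}^{t} \tilde{\eta}^{(t)}_k \gamma p E[\Delta^m_{k-1,\max}] - \varphi_{t',t}E[|\Delta_{t'}(1) - \Delta_{t'}(2)|] + E\left[\max\left\{\frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(1), \frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(2)\right\}\right]. \tag{45b}$$
On combining (44b) and (45b), we obtain,
$$E[\Delta_{t,\max}] \ge \frac{1}{M}\sum_{m=1}^{M}\left(E[\Delta^m_{t,\max}] - E[\xi^m_{t',t,\max}]\right) - \varphi_{t',t}E[|\Delta_{t'}(1) - \Delta_{t'}(2)|] + E\left[\max\left\{\frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(1), \frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(2)\right\}\right]. \tag{46}$$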
In order to simplify (46), we make use of the following lemmas.
Lemma 3. Let $t' < t$ be two consecutive averaging instants. Then for all $m \in [M]$,
$$E[\Delta^m_{t,\max}] - E[\xi^m_{t',t,\max}] \ge \prod_{k=t'+1}^{t} (1 - \eta_k(1-\gamma p))\, E[\Delta_{t',\max}] + E[\xi^m_{t',t,\max}]\left[\sum_{k=t'+1}^{t} \eta^{(t)}_k - 1\right]_+ - \varphi_{t',t}E[|\Delta_{t'}(1) - \Delta_{t'}(2)|],$$
where $[x]_+ = \max\{x, 0\}$.
Lemma 4. For all consecutive averaging instants $t', t$ satisfying $t - \max\{t', \tau\} \ge 1/\eta_\tau$ and all $m \in [M]$, we have,
$$E[\xi^m_{t',t,\max}] \ge \frac{1}{240\log\left(\frac{180B}{\eta_T(1-\gamma)}\right)} \cdot \frac{\nu}{\nu+1},$$
$$E\left[\max\left\{\frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(1), \frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(2)\right\}\right] \ge \frac{1}{240\log\left(\frac{180BM}{\eta_T(1-\gamma)}\right)} \cdot \frac{\nu}{\nu+\sqrt{M}},$$
where $\nu := \sqrt{\frac{20\eta_T}{3B(1-\gamma)}}$.
Lemma 5. For all $t \in \{t_r\}_{r=1}^{R}$, we have
$$E[|\Delta_t(1) - \Delta_t(2)|] \le \sqrt{\frac{8\eta_T}{3BM(1-\gamma)}}.$$
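Thus, on combining the results from Lemmas 3, 4, and 5 and plugging them into (46), we obtain the following relation for $t, t' \ge \tau$:
$$\begin{aligned}
E[\Delta_{t,\max}] &\ge \prod_{k=t'+1}^{t} (1 - \eta_k(1-\gamma p))\, E[\Delta_{t',\max}] + E[\xi^m_{t',t,\max}]\left[\sum_{k=t'+1}^{t} \eta^{(t)}_k - 1\right]_+ - 2\varphi_{t',t}E[|\Delta_{t'}(1) - \Delta_{t'}(2)|] \\
&\quad + E\left[\max\left\{\frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(1), \frac{1}{M}\sum_{m=1}^{M} \xi^m_{t',t}(2)\right\}\right] \\
&\ge (1 - \eta_\tau(1-\gamma p))^{t-t'} E[\Delta_{t',\max}] + \left(\frac{1 - (1 - \eta_\tau(1-\gamma p))^{t-t'}}{5760\log\left(\frac{180B}{\eta_T(1-\gamma)}\right)(1-\gamma p)}\right) \cdot \frac{\nu}{\nu+1} \cdot \mathbb{1}\left\{t - t' \ge \frac{8}{\eta_\tau}\right\} \\
&\quad - 2(1-\eta_T)^{t-t'}\sqrt{\frac{8\eta_T}{3BM(1-\gamma)}} + \frac{1}{240\log\left(\frac{180BM}{\eta_T(1-\gamma)}\right)} \cdot \frac{\nu}{\nu+\sqrt{M}} \cdot \mathbb{1}\left\{t - t' \ge \frac{8}{\eta_\tau}\right\},
\end{aligned} \tag{47}$$
where we used the relation $\varphi_{t',t} \le (1-\eta_T)^{t-t'}$, as well as the value of $\nu$ as defined in Lemma 4, along with the fact
$$\sum_{k=t'+1}^{t} \eta^{(t)}_k - 1 \ge \frac{1 - (1 - \eta_\tau(1-\gamma p))^{t-t'}}{24(1-\gamma p)} \tag{48}$$
for all $t, t' \ge \tau$ such that $t - t' \ge 8/\eta_\tau$.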
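Proof of (48). We have,
$$\begin{aligned}
\sum_{k=t'+1}^{t} \eta^{(t)}_k - 1 &= \sum_{k=t'+1}^{t} \eta_k \prod_{i=k+1}^{t} (1 - \eta_i(1-\gamma p)) - 1 \\
&\ge \sum_{k=t'+1}^{t} \eta_t \prod_{i=k+1}^{t} (1 - \eta_\tau(1-\gamma p)) - 1 \\
&\ge \eta_t \sum_{k=t'+1}^{t} (1 - \eta_\tau(1-\gamma p))^{t-k} - 1 \\
&\ge \eta_t \cdot \frac{1 - (1 - \eta_\tau(1-\gamma p))^{t-t'}}{\eta_\tau(1-\gamma p)} - 1 \\
&\ge \frac{1 - (1 - \eta_\tau(1-\gamma p))^{t-t'}}{3(1-\gamma p)} - 1.
\end{aligned} \tag{49}$$
To show (48), it is sufficient to show that $\frac{1 - (1 - \eta_\tau(1-\gamma p))^{t-t'}}{3(1-\gamma p)} \ge \frac{8}{7}$ for $t - t' \ge 8/\eta_\tau$, since $X - 1 \ge X/8$ whenever $X \ge 8/7$. For $t - t' \ge 8/\eta_\tau$ we have,
$$\frac{1 - (1 - \eta_\tau(1-\gamma p))^{t-t'}}{3(1-\gamma p)} \ge \frac{1 - \exp(-\eta_\tau(1-\gamma p)\cdot(t-t'))}{3(1-\gamma p)} \ge \frac{1 - \exp(-8(1-\gamma p))}{3(1-\gamma p)}. \tag{50}$$
Since $\gamma \ge 5/6$, we have $1 - \gamma p \le 2/9$. For $x \le 2/9$, the function $f(x) = \frac{1 - e^{-8x}}{3x}$ satisfies $f(x) \ge 8/7$, proving the claim.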
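The two numerical facts used at the end of this proof can be checked with a short Python sketch. The grid resolution and the function names are arbitrary choices for illustration; the check is a numerical sanity test on a grid, not a proof.

```python
import math

def f(x):
    """f(x) = (1 - exp(-8x)) / (3x), the function appearing after Eqn. (50)."""
    return (1.0 - math.exp(-8.0 * x)) / (3.0 * x)

def one_minus_gamma_p(gamma):
    """1 - gamma * p with p = (4 gamma - 1) / (3 gamma), which equals 4 (1 - gamma) / 3."""
    p = (4.0 * gamma - 1.0) / (3.0 * gamma)
    return 1.0 - gamma * p

# Evaluate f on a grid covering (0, 2/9]; f is decreasing, so the minimum is at x = 2/9.
grid = [(2.0 / 9.0) * (i + 1) / 10000 for i in range(10000)]
min_value = min(f(x) for x in grid)
```

On this grid the minimum of $f$ is attained at $x = 2/9$ and is approximately 1.246, which exceeds $8/7 \approx 1.143$; the helper `one_minus_gamma_p` returns $2/9$ at $\gamma = 5/6$.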
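Step 3: lower bounding $E[\Delta_{T,\max}]$. We are now interested in evaluating $E[\Delta_{T,\max}]$ based on the recursion (47). To this effect, we introduce some notation to simplify the presentation. Let
$$R_\tau := \min\{r : t_r \ge \tau\}. \tag{51}$$
For $r = R_\tau, \dots, R$, we define the following terms:
$$x_r := E[\Delta_{t_r,\max}], \qquad \alpha_r := (1 - \eta_\tau(1-\gamma p))^{t_r - t_{r-1}}, \qquad \beta_r := (1-\eta_T)^{t_r - t_{r-1}},$$
$$\mathcal{I}_r := \{r \ge r' > R_\tau : t_{r'} - t_{r'-1} \ge 8/\eta_\tau\},$$
$$C_1 := \frac{1}{5760\log\left(\frac{180B}{\eta_T(1-\gamma)}\right)(1-\gamma p)} \cdot \frac{\nu}{\nu+1}, \qquad C_2 := \sqrt{\frac{32\eta_T}{3BM(1-\gamma)}}, \qquad C_3 := \frac{1}{240\log\left(\frac{180BM}{\eta_T(1-\gamma)}\right)} \cdot \frac{\nu}{\nu+\sqrt{M}}.$$
With these notations in place, the recursion in (47) can be rewritten as
$$x_r \ge \alpha_r x_{r-1} - \beta_r C_2 + C_3\mathbb{1}\{r \in \mathcal{I}_r\} + (1-\alpha_r)C_1\mathbb{1}\{r \in \mathcal{I}_r\}, \tag{52}$$
for all $r \ge R_\tau$. We claim that $x_r$ satisfies the following relation for all $r \ge R_\tau + 1$ (whose proof is deferred to the end of this step):
$$x_r \ge \left(\prod_{i=R_\tau+1}^{r} \alpha_i\right)x_{R_\tau} - \sum_{k=R_\tau+1}^{r} \beta_k\left(\prod_{i=k+1}^{r} \alpha_i\right)C_2 + \sum_{k=R_\tau+1}^{r}\left(\prod_{i=k+1}^{r} \alpha_i\right)\mathbb{1}\{k \in \mathcal{I}_k\}C_3 + C_1\left(\prod_{i \notin \mathcal{I}_r} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_r} \alpha_i\right), \tag{53}$$
where we recall that if there is no valid index for a product, its value is taken to be 1.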
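Invoking (53) for $r = R$ and using the relation $x_{R_\tau - 1} \ge 0$, we obtain,
$$\begin{aligned}
x_R &\ge -\sum_{k=R_\tau}^{R} \beta_k\left(\prod_{i=k+1}^{R} \alpha_i\right)C_2 + \sum_{k=R_\tau}^{R}\left(\prod_{i=k+1}^{R} \alpha_i\right)C_3\mathbb{1}\{k \in \mathcal{I}_k\} + C_1\left(\prod_{i \notin \mathcal{I}_R} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_R} \alpha_i\right) \\
&\ge -RC_2 + C_1\left(\prod_{i \notin \mathcal{I}_R} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_R} \alpha_i\right) \\
&\ge -R \cdot \sqrt{\frac{32\eta_T}{3BM(1-\gamma)}} + \left(\prod_{i \notin \mathcal{I}_R} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_R} \alpha_i\right) \cdot \frac{1}{5760\log\left(\frac{180B}{\eta_T(1-\gamma)}\right)(1-\gamma p)} \cdot \frac{\nu}{\nu+1},
\end{aligned} \tag{54}$$
where we used the fact that $\beta_k\prod_{i=k+1}^{R} \alpha_i \le 1$ and that $C_3 \ge 0$. Consider the expression
$$\prod_{i \notin \mathcal{I}_R} \alpha_i = \prod_{i \notin \mathcal{I}_R} (1 - \eta_\tau(1-\gamma p))^{t_i - t_{i-1}} \ge 1 - \eta_\tau(1-\gamma p) \cdot \underbrace{\sum_{i \notin \mathcal{I}_R} (t_i - t_{i-1})}_{=: T_1}. \tag{55}$$
Consequently,
$$1 - \prod_{i \in \mathcal{I}_R} \alpha_i = 1 - (1 - \eta_\tau(1-\gamma p))^{T - \tau - T_1} \ge 1 - \exp\left(-\eta_\tau(1-\gamma p)(T - \tau - T_1)\right). \tag{56}$$
Note that $T_1$ satisfies the following bound:
$$T_1 := \sum_{i \notin \mathcal{I}_R} (t_i - t_{i-1}) \le (R - |\mathcal{I}_R|) \cdot \frac{8}{\eta_\tau} \le \frac{8R}{\eta_\tau}. \tag{57}$$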
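We split the remainder of the analysis based on the step size schedule.
• For the constant step size schedule, i.e., $\eta_t = \eta \ge \frac{1}{(1-\gamma)T}$, we have $R_\tau = 0$, with $\tau = 0$ and $t_0 = 0$ (as all agents start at the same point). If $R \le \frac{1}{96000(1-\gamma)\log\left(\frac{180B}{\eta(1-\gamma)}\right)}$, then (55), (56) and (57) yield the following relations:
$$T_1 \le \frac{8R}{\eta} \le \frac{T}{12000\log(180N)},$$
$$\prod_{i \notin \mathcal{I}_R} \alpha_i \ge 1 - \eta(1-\gamma p) \cdot T_1 \ge 1 - \frac{32R(1-\gamma)}{3} \ge 1 - \frac{1}{9000\log(180N)},$$
$$1 - \prod_{i \in \mathcal{I}_R} \alpha_i \ge 1 - \exp\left(-\eta(1-\gamma p)(T - T_1)\right) \ge 1 - \exp\left(-\frac{4}{3}\left(1 - \frac{1}{9000\log(180N)}\right)\right).$$
On plugging the above relations into (54), we obtain
$$x_R \ge \frac{\sqrt{40}}{96000\log\left(\frac{180B}{\eta(1-\gamma)}\right)(1-\gamma)} \cdot \left(\frac{\nu}{\nu+1} - \frac{\nu}{5\sqrt{M}}\right), \tag{58}$$
where recall that $\nu := \sqrt{\frac{20\eta}{3B(1-\gamma)}}$. Consider the function $f(x) = \frac{x}{x+1} - \frac{x}{5\sqrt{M}}$. We claim that for $x \in [0, \sqrt{M}]$ and all $M \ge 2$,
$$f(x) \ge \frac{7}{20}\min\{x, 1\}. \tag{59}$$
The proof of the above claim is deferred to the end of the section. In light of the above claim, we have,
$$x_R \ge \frac{\sqrt{40}}{96000\log\left(\frac{180B}{\eta(1-\gamma)}\right)(1-\gamma)} \cdot \frac{7}{20} \cdot \min\left\{1, \sqrt{\frac{20\eta}{3B(1-\gamma)}}\right\} \ge \frac{\sqrt{40}}{96000\log(180N)} \cdot \frac{7}{20} \cdot \min\left\{\frac{1}{1-\gamma}, \sqrt{\frac{20}{3(1-\gamma)^4 N}}\right\}, \tag{60}$$
where we used the fact that $M \ge 2$, that $\frac{\sqrt{x}}{\log(1/x)}$ is an increasing function, and the relation $\frac{\nu}{\sqrt{M}} = \sqrt{\frac{20\eta}{3BM(1-\gamma)}} \le \sqrt{\frac{1}{15}} \le 1$.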
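• Next, we consider the rescaled linear step size schedule, where $\tau = T/3$ (cf. (34)). To begin, we assume $t_{R_\tau} \le \max\left\{\frac{3T}{4}, T - \frac{1}{6\eta_\tau(1-\gamma p)}\right\}$. It is straightforward to note that
$$\max\left\{\frac{3T}{4}, T - \frac{1}{6\eta_\tau(1-\gamma p)}\right\} = \begin{cases} \frac{3T}{4} & \text{if } c_\eta \ge 3, \\ T - \frac{1}{6\eta_\tau(1-\gamma p)} & \text{if } c_\eta < 3. \end{cases}$$
If $R \le \frac{1}{384000(1-\gamma)\log\left(\frac{180B}{\eta_T(1-\gamma)}\right)\cdot(5+c_\eta)}$, then (55), (56) and (57) yield the following relations:
$$T_1 \le \frac{8R}{\eta_\tau}, \qquad \prod_{i \notin \mathcal{I}_R} \alpha_i \ge 1 - \eta_\tau(1-\gamma p) \cdot T_1 \ge 1 - \frac{32R(1-\gamma)}{3} \ge 1 - \frac{1}{36000}.$$
For $c_\eta \ge 3$, we have,
$$1 - \prod_{i \in \mathcal{I}_R} \alpha_i \ge 1 - \exp\left(-\eta_\tau(1-\gamma p)(T - t_{R_\tau} - T_1)\right) \ge 1 - \exp\left(-\frac{(1-\gamma)T}{3 + c_\eta(1-\gamma)T} + \frac{32R(1-\gamma)}{3}\right) \ge \frac{1}{2(3+c_\eta)},$$
where we used $T \ge \frac{1}{1-\gamma}$ in the second step. Similarly, for $c_\eta < 3$, we have,
$$1 - \prod_{i \in \mathcal{I}_R} \alpha_i \ge 1 - \exp\left(-\eta_\tau(1-\gamma p)(T - t_{R_\tau} - T_1)\right) \ge 1 - \exp\left(-\frac{1}{6} + \frac{32R(1-\gamma)}{3}\right) \ge \frac{1}{10}.$$
On plugging the above relations into (54), we obtain
$$\begin{aligned}
x_R &\ge \frac{18\sqrt{1.6}}{384000\log\left(\frac{180B}{\eta_T(1-\gamma)}\right)(1-\gamma)(5+c_\eta)} \cdot \left(\frac{\nu}{\nu+1} - \frac{\nu}{18\sqrt{M}}\right) \\
&\ge \frac{18\sqrt{1.6}}{384000\log\left(\frac{180B}{\eta_T(1-\gamma)}\right)(1-\gamma)(5+c_\eta)} \cdot \frac{7}{20} \cdot \min\left\{1, \sqrt{\frac{20\eta_T}{3B(1-\gamma)}}\right\} \\
&\ge \frac{18\sqrt{1.6}}{384000\log\left(\frac{180B}{\eta_T(1-\gamma)}\right)(5+c_\eta)} \cdot \frac{7}{20} \cdot \min\left\{\frac{1}{1-\gamma}, \sqrt{\frac{20\eta_T}{3B(1-\gamma)^3}}\right\} \\
&\ge \frac{18\sqrt{1.6}}{384000\log\left(180N(1+\log N)\right)(5+\log N)} \cdot \frac{7}{20} \cdot \min\left\{\frac{1}{1-\gamma}, \sqrt{\frac{20}{3B(1+\log N)(1-\gamma)^4 N}}\right\},
\end{aligned} \tag{61}$$
where we again used the facts that $M \ge 2$, $c_\eta \le \log N$, that $\frac{\sqrt{x}}{\log(1/x)}$ is an increasing function, and the relation $\frac{\nu}{\sqrt{M}} = \sqrt{\frac{20\eta_T}{3BM(1-\gamma)}} \le 1$.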
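• Last but not least, let us consider the rescaled linear step size schedule when $t_{R_\tau} > \max\left\{\frac{3T}{4}, T - \frac{1}{6\eta_\tau(1-\gamma p)}\right\}$. The condition implies that the time between the communication rounds $R_\tau - 1$ and $R_\tau$ is at least $T_0 := \max\left\{\frac{5T}{12}, \frac{2T}{3} - \frac{1}{6\eta_\tau(1-\gamma p)}\right\}$. Thus, (47) yields that
$$E[\Delta_{t_{R_\tau},\max}] \ge \left(\frac{1 - (1 - \eta_\tau(1-\gamma p))^{T_0}}{5760\log\left(\frac{180B}{\eta_T(1-\gamma)}\right)(1-\gamma p)}\right) \cdot \frac{\nu}{\nu+1} - 2(1-\eta_T)^{T_0}\sqrt{\frac{8\eta_T}{3BM(1-\gamma)}}. \tag{62}$$
Using the above relation along with (53), we can conclude that
$$x_R \ge (1 - \eta_\tau(1-\gamma p))^{T - t_{R_\tau}}\left(\frac{1 - (1 - \eta_\tau(1-\gamma p))^{T_0}}{5760\log\left(\frac{180B}{\eta_T(1-\gamma)}\right)(1-\gamma p)}\right) \cdot \frac{\nu}{\nu+1} - 2(1-\eta_T)^{T_0} \cdot (1 - \eta_\tau(1-\gamma p))^{T - t_{R_\tau}}\sqrt{\frac{8\eta_T}{3BM(1-\gamma)}} - RC_2. \tag{63}$$
In the above relation, we used the trivial bounds $C_1, C_3 \ge 0$ and a crude bound on the term corresponding to $C_2$, similar to (54). Let us first consider the case of $c_\eta \ge 3$. We have,
$$1 - (1 - \eta_\tau(1-\gamma p))^{T_0} \ge 1 - \exp\left(-\eta_\tau(1-\gamma p)\frac{5T}{12}\right) \ge 1 - \exp\left(-\frac{5(1-\gamma)T}{3(3 + c_\eta(1-\gamma)T)}\right) \ge \frac{1}{3+c_\eta},$$
$$(1 - \eta_\tau(1-\gamma p))^{T - t_{R_\tau}} \ge 1 - \eta_\tau(1-\gamma p)\frac{T}{4} \ge 1 - \frac{(1-\gamma)T}{3 + c_\eta(1-\gamma)T} \ge 1 - \frac{1}{c_\eta} \ge \frac{2}{3}.$$
Similarly, for $c_\eta < 3$, we have,
$$1 - (1 - \eta_\tau(1-\gamma p))^{T_0} \ge 1 - \exp\left(-\eta_\tau(1-\gamma p)\frac{2T}{3} + \frac{1}{6}\right) \ge 1 - \exp\left(-\frac{8(1-\gamma)T}{3(3 + c_\eta(1-\gamma)T)} + \frac{1}{6}\right) \ge 1 - e^{-5/18},$$
$$(1 - \eta_\tau(1-\gamma p))^{T - t_{R_\tau}} \ge 1 - \frac{\eta_\tau(1-\gamma p)}{6\eta_\tau(1-\gamma p)} \ge \frac{5}{6}.$$
The above relations imply that $(1 - \eta_\tau(1-\gamma p))^{T - t_{R_\tau}}\left(1 - (1 - \eta_\tau(1-\gamma p))^{T_0}\right) \ge c$ for some constant $c$, which only depends on $c_\eta$. On plugging this into (63), we obtain a relation that is identical to that in (54) up to leading constants. Thus, by using a similar sequence of arguments as used to obtain (61), we arrive at the same conclusion as for the case of $t_{R_\tau} \le \max\left\{\frac{3T}{4}, T - \frac{1}{6\eta_\tau(1-\gamma p)}\right\}$.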
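Step 4: finishing up the proof. Thus, (60) and (61), along with the above conclusion, together imply that there exists a numerical constant $c_0 > 0$ such that
$$E[|\widehat{V}^m_T(1) - V^\star(1)|] \ge E[\Delta_{T,\max}] \ge \frac{c_0}{\log^3 N} \cdot \min\left\{\frac{1}{1-\gamma}, \frac{1}{\sqrt{(1-\gamma)^4 N}}\right\}. \tag{64}$$
The above equation along with Lemma 2 implies
$$E[|V^m_T(1) - V^\star(1)|] \ge \frac{c_0}{\log^3 N} \cdot \min\left\{\frac{1}{1-\gamma}, \frac{1}{\sqrt{(1-\gamma)^4 N}}\right\} - \frac{1}{1-\gamma}\prod_{i=1}^{T} (1 - \eta_i(1-\gamma)). \tag{65}$$
On the other hand, from (21) we know that
$$E[|V^m_T(3) - V^\star(3)|] \ge \frac{1}{1-\gamma}\prod_{i=1}^{T} (1 - \eta_i(1-\gamma)). \tag{66}$$
Hence,
$$\begin{aligned}
E[\|Q^m_T - Q^\star\|_\infty] &\ge E\left[\max\{|V^m_T(3) - V^\star(3)|, |V^m_T(1) - V^\star(1)|\}\right] \\
&\ge \max\left\{E[|V^m_T(3) - V^\star(3)|], E[|V^m_T(1) - V^\star(1)|]\right\} \\
&\ge \max\left\{\frac{1}{1-\gamma}\prod_{i=1}^{T} (1 - \eta_i(1-\gamma)),\ \frac{c_0}{\log^3 N}\min\left\{\frac{1}{1-\gamma}, \frac{1}{\sqrt{(1-\gamma)^4 N}}\right\} - \frac{1}{1-\gamma}\prod_{i=1}^{T} (1 - \eta_i(1-\gamma))\right\} \\
&\ge \frac{c_0}{2\log^3 N}\min\left\{\frac{1}{1-\gamma}, \frac{1}{\sqrt{(1-\gamma)^4 N}}\right\},
\end{aligned} \tag{67}$$
where the third step follows from (65) and (66) and the fourth step uses $\max\{a, b\} \ge (a+b)/2$.
Thus, from (28) and (67) we can conclude that whenever $\mathrm{CC}_{\mathrm{round}} = O\left(\frac{1}{(1-\gamma)\log^2 N}\right)$, we have $\mathrm{ER}(\mathscr{A}; N, M) = \Omega\left(\frac{1}{\log^3 N\sqrt{N}}\right)$ for all values of $M \ge 2$. In other words, for any algorithm to achieve any collaborative gain, its communication complexity should satisfy $\mathrm{CC}_{\mathrm{round}} = \Omega\left(\frac{1}{(1-\gamma)\log^2 N}\right)$, as required.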
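Proof of (53). We now return to establish (53) using induction. For the base case, (52) yields
$$x_{R_\tau+1} \ge \alpha_{R_\tau+1}x_{R_\tau} - \beta_{R_\tau+1}C_2 + C_3\mathbb{1}\{R_\tau+1 \in \mathcal{I}_{R_\tau+1}\} + (1 - \alpha_{R_\tau+1})C_1\mathbb{1}\{R_\tau+1 \in \mathcal{I}_{R_\tau+1}\}. \tag{68}$$
Note that this is identical to the expression in (53) for $r = R_\tau + 1$ as
$$\left(\prod_{i \notin \mathcal{I}_{R_\tau+1}} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_{R_\tau+1}} \alpha_i\right) = (1 - \alpha_{R_\tau+1})\mathbb{1}\{R_\tau+1 \in \mathcal{I}_{R_\tau+1}\}$$
based on the adopted convention for products with no valid indices. For the induction step, assume (53) holds for some $r \ge R_\tau + 1$. On combining (52) and (53), we obtain,
$$\begin{aligned}
x_{r+1} &\ge \alpha_{r+1}x_r - \beta_{r+1}C_2 + C_3\mathbb{1}\{(r+1) \in \mathcal{I}_{r+1}\} + (1 - \alpha_{r+1})C_1\mathbb{1}\{(r+1) \in \mathcal{I}_{r+1}\} \\
&\ge \alpha_{r+1}\left(\prod_{i=R_\tau+1}^{r} \alpha_i\right)x_{R_\tau} - \alpha_{r+1}\sum_{k=R_\tau+1}^{r} \beta_k\left(\prod_{i=k+1}^{r} \alpha_i\right)C_2 + \alpha_{r+1}\sum_{k=R_\tau+1}^{r}\left(\prod_{i=k+1}^{r} \alpha_i\right)C_3\mathbb{1}\{k \in \mathcal{I}_k\} \\
&\quad + \alpha_{r+1}C_1\left(\prod_{i \notin \mathcal{I}_r} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_r} \alpha_i\right) - \beta_{r+1}C_2 + C_3\mathbb{1}\{(r+1) \in \mathcal{I}_{r+1}\} + (1 - \alpha_{r+1})C_1\mathbb{1}\{(r+1) \in \mathcal{I}_{r+1}\} \\
&\ge \left(\prod_{i=R_\tau+1}^{r+1} \alpha_i\right)x_{R_\tau} - \sum_{k=R_\tau+1}^{r+1} \beta_k\left(\prod_{i=k+1}^{r+1} \alpha_i\right)C_2 + \sum_{k=R_\tau+1}^{r+1}\left(\prod_{i=k+1}^{r+1} \alpha_i\right)C_3\mathbb{1}\{k \in \mathcal{I}_k\} \\
&\quad + \alpha_{r+1}C_1\left(\prod_{i \notin \mathcal{I}_r} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_r} \alpha_i\right) + (1 - \alpha_{r+1})C_1\mathbb{1}\{(r+1) \in \mathcal{I}_{r+1}\}.
\end{aligned} \tag{69}$$
If $(r+1) \notin \mathcal{I}_{r+1}$, then $1 - \prod_{i \in \mathcal{I}_r} \alpha_i = 1 - \prod_{i \in \mathcal{I}_{r+1}} \alpha_i$ and $\alpha_{r+1}\prod_{i \notin \mathcal{I}_r} \alpha_i = \prod_{i \notin \mathcal{I}_{r+1}} \alpha_i$. Consequently,
$$\alpha_{r+1}C_1\left(\prod_{i \notin \mathcal{I}_r} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_r} \alpha_i\right) + (1 - \alpha_{r+1})C_1\mathbb{1}\{(r+1) \in \mathcal{I}_{r+1}\} = C_1\left(\prod_{i \notin \mathcal{I}_{r+1}} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_{r+1}} \alpha_i\right). \tag{70}$$
On the other hand, if $(r+1) \in \mathcal{I}_{r+1}$, then $\prod_{i \notin \mathcal{I}_r} \alpha_i = \prod_{i \notin \mathcal{I}_{r+1}} \alpha_i$. Consequently, we have,
$$\begin{aligned}
\alpha_{r+1}C_1\left(\prod_{i \notin \mathcal{I}_r} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_r} \alpha_i\right) + (1 - \alpha_{r+1})C_1\mathbb{1}\{(r+1) \in \mathcal{I}_{r+1}\} &= \alpha_{r+1}C_1\left(\prod_{i \notin \mathcal{I}_{r+1}} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_r} \alpha_i\right) + (1 - \alpha_{r+1})C_1 \\
&\ge C_1\left(\prod_{i \notin \mathcal{I}_{r+1}} \alpha_i\right)\left[\alpha_{r+1}\left(1 - \prod_{i \in \mathcal{I}_r} \alpha_i\right) + (1 - \alpha_{r+1})\right] \\
&\ge C_1\left(\prod_{i \notin \mathcal{I}_{r+1}} \alpha_i\right)\left(1 - \prod_{i \in \mathcal{I}_{r+1}} \alpha_i\right).
\end{aligned} \tag{71}$$
Combining (69), (70) and (71) proves the claim.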
Proof of (59). To establish this result, we separately consider the cases x ≤ 1 and x ≥ 1.
• When $x \le 1$, we have
$$
f(x) = \frac{x}{x+1} - \frac{x}{5\sqrt{M}} \ge \frac{x}{2} - \frac{x}{5\sqrt{M}} \ge \frac{7x}{20}, \tag{72}
$$
where in the last step we used the relation $M \ge 2$.
• Let us now consider the case $x \ge 1$. The second derivative of $f$ is given by $f''(x) = -\frac{2}{(x+1)^3}$. Clearly, for all $x \ge 1$, $f'' < 0$, implying that $f$ is a concave function. It is well known that a continuous, bounded, concave function achieves its minimum over a compact interval at one of the end points of the interval (Bauer's minimum principle). For all $M \ge 2$, we have
$$
f(1) = \frac{1}{2} - \frac{1}{5\sqrt{M}} \ge \frac{7}{20}; \qquad f(\sqrt{M}) = \frac{\sqrt{M}}{\sqrt{M}+1} - \frac{1}{5} \ge \frac{7}{20}.
$$
Consequently, we can conclude that for all $x \in [1, \sqrt{M}]$,
$$
f(x) \ge \frac{7}{20}. \tag{73}
$$
Combining (72) and (73) proves the claim.
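The two cases above can be checked numerically on a grid. The sketch below is only a sanity check, not part of the proof; it assumes that $f(x) = x/(x+1) - x/(5\sqrt{M})$, which is how display (72) reads, and the helper names and grid sizes are illustrative choices.

```python
import math

def f(x: float, M: int) -> float:
    # Assumed form of f, read off from (72): x/(x+1) - x/(5*sqrt(M)).
    return x / (x + 1.0) - x / (5.0 * math.sqrt(M))

def check_claim(M: int, grid: int = 2000) -> bool:
    """Check f(x) >= 7x/20 on (0, 1] and f(x) >= 7/20 on (1, sqrt(M)]."""
    root_m = math.sqrt(M)
    ok = True
    for j in range(1, grid + 1):
        x_small = j / grid                          # grid point in (0, 1]
        ok = ok and f(x_small, M) >= 7.0 * x_small / 20.0
        x_large = 1.0 + (root_m - 1.0) * j / grid   # grid point in (1, sqrt(M)]
        ok = ok and f(x_large, M) >= 7.0 / 20.0
    return ok

all_ok = all(check_claim(M) for M in (2, 3, 10, 100, 10_000))
```

The tightest case is $M = 2$ at $x = 1$, where $f(1) \approx 0.3586$ sits just above $7/20$.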

Section: B.3.3 Large learning rates with large η_T/(BM)
In order to bound the error in this scenario, note that $\eta_T/(BM)$ controls the variance of the stochastic updates in the fixed point iteration. Thus, when $\eta_T/(BM)$ is large, the variance of the iterates is large, resulting in a large error. To demonstrate this effect, we focus on the dynamics of state 2. This part of the proof is similar to the large learning rate case of Li et al. [2023]. For all $t \in [T]$, define
$$
V_t(2) := \frac{1}{M} \sum_{m=1}^{M} V^m_t(2). \tag{74}
$$
Thus, from (33), we know that E[V t (2)] obeys the following recursion:
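$$
\mathbb{E}[V_t(2)] = (1 - \eta_t(1-\gamma p))\,\mathbb{E}[V_{t-1}(2)] + \eta_t.
$$
Upon unrolling the recursion, we obtain
$$
\mathbb{E}[V_T(2)] = \prod_{k=t+1}^{T} (1 - \eta_k(1-\gamma p))\,\mathbb{E}[V_t(2)] + \sum_{k=t+1}^{T} \eta^{(T)}_k.
$$
Thus, the above relation along with (16) and the value of $V^\star(2)$ yields
$$
V^\star(2) - \mathbb{E}[V_T(2)] = \prod_{k=t+1}^{T} (1 - \eta_k(1-\gamma p)) \Big(\frac{1}{1-\gamma p} - \mathbb{E}[V_t(2)]\Big). \tag{75}
$$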
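The recursion for $\mathbb{E}[V_t(2)]$ and the gap formula (75) can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: a constant step size, the initialization $\mathbb{E}[V_0(2)] = 0$, and the value $p = (4\gamma-1)/(3\gamma)$ used in the lower-bound construction; the function name is hypothetical.

```python
def expected_value_path(gamma: float, eta: float, steps: int, v0: float = 0.0):
    """Iterate E[V_t] = (1 - eta*(1 - gamma*p)) * E[V_{t-1}] + eta (constant eta assumed)."""
    p = (4.0 * gamma - 1.0) / (3.0 * gamma)
    v_star = 1.0 / (1.0 - gamma * p)      # fixed point; equals 3 / (4 * (1 - gamma))
    v = v0
    gaps = []
    for _ in range(steps):
        v = (1.0 - eta * (1.0 - gamma * p)) * v + eta
        gaps.append(v_star - v)           # V*(2) - E[V_t(2)]
    return v_star, gaps

gamma, eta, steps = 0.9, 0.05, 200
v_star, gaps = expected_value_path(gamma, eta, steps)

# Gap predicted by (75) with t = 0: product of contraction factors times (V* - E[V_0]).
contraction = 1.0 - eta * (1.0 - gamma * (4.0 * gamma - 1.0) / (3.0 * gamma))
predicted_gap = contraction ** steps * v_star
```

The simulated gap matches the product form in (75), which is why the size of $\prod_k (1-\eta_k(1-\gamma p))$ decides which of the two cases below applies.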
Similar to Li et al. [2023], we define
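$$
\tau' := \min\Big\{ 0 \le t' \le T-2 \;:\; \mathbb{E}[(V_t)^2] \ge \frac{1}{4(1-\gamma)^2} \text{ for all } t'+1 \le t \le T \Big\}.
$$
If such a $\tau'$ does not exist, it implies that either $\mathbb{E}[(V_T)^2] < \frac{1}{4(1-\gamma)^2}$ or $\mathbb{E}[(V_{T-1})^2] < \frac{1}{4(1-\gamma)^2}$. If the former is true, then
$$
V^\star(2) - \mathbb{E}[V_T(2)] = \frac{3}{4(1-\gamma)} - \mathbb{E}[V_T(2)] \ge \frac{3}{4(1-\gamma)} - \sqrt{\mathbb{E}[(V_T)^2]} > \frac{1}{4(1-\gamma)}. \tag{76}
$$
Similarly, if $\mathbb{E}[(V_{T-1})^2] < \frac{1}{4(1-\gamma)^2}$, it implies $\mathbb{E}[V_{T-1}] < \frac{1}{2(1-\gamma)}$. Using (33), we have
$$
\mathbb{E}[V_T(2)] = (1 - \eta_T(1-\gamma p))\,\mathbb{E}[V_{T-1}(2)] + \eta_T \le \mathbb{E}[V_{T-1}(2)] + 1 < \frac{1}{2(1-\gamma)} + \frac{1}{6(1-\gamma)} = \frac{2}{3(1-\gamma)}.
$$
Consequently,
$$
V^\star(2) - \mathbb{E}[V_T(2)] > \frac{3}{4(1-\gamma)} - \frac{2}{3(1-\gamma)} = \frac{1}{12(1-\gamma)}. \tag{77}
$$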
For the case when τ ′ exists, we divide the proof into two cases.
• We first consider the case when the learning rates satisfy:
$$
\prod_{k=\tau'+1}^{T} (1 - \eta_k(1-\gamma p)) \ge \frac{1}{2}. \tag{78}
$$
The analysis for this case is identical to that considered in Li et al. [2023]. We explicitly write the steps for completeness. Specifically,
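$$
\begin{aligned}
V^\star(2) - \mathbb{E}[V_T(2)] &= \prod_{k=\tau'+1}^{T} (1 - \eta_k(1-\gamma p)) \Big(\frac{1}{1-\gamma p} - \mathbb{E}[V_{\tau'}(2)]\Big) \\
&\ge \frac{1}{2} \cdot \Big(\frac{3}{4(1-\gamma)} - \sqrt{\mathbb{E}[(V_{\tau'}(2))^2]}\Big) \\
&\ge \frac{1}{2} \cdot \Big(\frac{3}{4(1-\gamma)} - \frac{1}{2(1-\gamma)}\Big) \ge \frac{1}{8(1-\gamma)},
\end{aligned} \tag{79}
$$
where the first line follows from (75), the second line from the condition on step sizes and the third line from the definition of $\tau'$.
• We now consider the other case where
$$
0 \le \prod_{k=\tau'+1}^{T} (1 - \eta_k(1-\gamma p)) < \frac{1}{2}. \tag{80}
$$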
Using [Li et al., 2023, Eqn.(134)], for any t ′ < t and all agents m, we have the relation
$$
V^m_t(2) = \frac{1}{1-\gamma p} - \prod_{k=t'+1}^{t} (1 - \eta_k(1-\gamma p)) \Big(\frac{1}{1-\gamma p} - V^m_{t'}(2)\Big) + \sum_{k=t'+1}^{t} \eta^{(t)}_k \gamma \big(\widehat{P}^m_k(2|2) - p\big) V^m_{k-1}(2).
$$
The above equation is directly obtained by unrolling the recursion in (24) along with noting that Q t (2, 1) = V t (2) for all t. Consequently, we have,
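$$
V_T(2) = \frac{1}{1-\gamma p} - \prod_{k=t'+1}^{T} (1 - \eta_k(1-\gamma p)) \Big(\frac{1}{1-\gamma p} - V_{t'}(2)\Big) + \frac{1}{M} \sum_{m=1}^{M} \sum_{k=t'+1}^{T} \eta^{(T)}_k \gamma \big(\widehat{P}^m_k(2|2) - p\big) V^m_{k-1}(2). \tag{81}
$$
Let $\{\mathcal{F}_t\}_{t=0}^{T}$ be a filtration such that $\mathcal{F}_t$ is the σ-algebra corresponding to $\{\{\widehat{P}^m_s(2|2)\}_{m=1}^{M}\}_{s=1}^{t}$. It is straightforward to note that
$$
\Big\{ \frac{1}{M} \sum_{m=1}^{M} \eta^{(T)}_k \gamma \big(\widehat{P}^m_k(2|2) - p\big) V^m_{k-1}(2) \Big\}_k
$$
is a martingale difference sequence adapted to the filtration $\{\mathcal{F}_k\}_k$. Thus, using the result from [Li et al., 2023, Eqn. (139)], we can conclude that
$$
\mathrm{Var}(V_T(2)) \ge \mathbb{E}\Bigg[ \sum_{k=\tau'+2}^{T} \mathrm{Var}\Big( \frac{1}{M} \sum_{m=1}^{M} \eta^{(T)}_k \gamma \big(\widehat{P}^m_k(2|2) - p\big) V^m_{k-1}(2) \,\Big|\, \mathcal{F}_{k-1} \Big) \Bigg]. \tag{82}
$$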
We have,
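$$
\begin{aligned}
\mathrm{Var}\Big( \frac{1}{M} \sum_{m=1}^{M} \eta^{(T)}_k \gamma \big(\widehat{P}^m_k(2|2) - p\big) V^m_{k-1}(2) \,\Big|\, \mathcal{F}_{k-1} \Big)
&= \frac{1}{M^2} \sum_{m=1}^{M} \mathrm{Var}\Big( \eta^{(T)}_k \gamma \big(\widehat{P}^m_k(2|2) - p\big) V^m_{k-1}(2) \,\Big|\, \mathcal{F}_{k-1} \Big) \\
&= \frac{(\eta^{(T)}_k)^2}{BM}\, \gamma^2 p(1-p) \cdot \frac{1}{M} \sum_{m=1}^{M} \big(V^m_{k-1}(2)\big)^2 \\
&\ge \frac{(1-\gamma)(4\gamma-1)}{9BM} \cdot (\eta^{(T)}_k)^2 \cdot \big(V_{k-1}(2)\big)^2,
\end{aligned} \tag{83}
$$
where the first line follows from the fact that the variance of a sum of i.i.d. random variables is the sum of their variances, the second line from the variance of a Binomial random variable and the third line from Jensen's inequality. Thus, (82) and (83) together yield
$$
\begin{aligned}
\mathrm{Var}(V_T(2)) &\ge \frac{(1-\gamma)(4\gamma-1)}{9BM} \cdot \sum_{k=\tau'+2}^{T} (\eta^{(T)}_k)^2 \cdot \mathbb{E}\big[(V_{k-1}(2))^2\big] \\
&\ge \frac{(1-\gamma)(4\gamma-1)}{9BM} \cdot \frac{1}{4(1-\gamma)^2} \cdot \sum_{k=\max\{\tau,\tau'\}+2}^{T} (\eta^{(T)}_k)^2,
\end{aligned} \tag{84}
$$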
where the second line follows from the definition of τ ′ . We focus on bounding the third term in the above relation. We have,
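$$
\begin{aligned}
\sum_{k=\max\{\tau',\tau\}+2}^{T} \big(\eta^{(T)}_k\big)^2 &\ge \sum_{k=\max\{\tau',\tau\}+2}^{T} \Big( \eta_k \prod_{i=k+1}^{T} (1 - \eta_i(1-\gamma p)) \Big)^2 \\
&\ge \sum_{k=\max\{\tau',\tau\}+2}^{T} \Big( \eta_T \prod_{i=k+1}^{T} (1 - \eta_\tau(1-\gamma p)) \Big)^2 \\
&= \eta_T^2 \sum_{k=\max\{\tau',\tau\}+2}^{T} (1 - \eta_\tau(1-\gamma p))^{2(T-k)} \\
&\ge \eta_T^2 \cdot \frac{1 - (1 - \eta_\tau(1-\gamma p))^{2(T-\max\{\tau',\tau\}-1)}}{\eta_\tau(1-\gamma p)\,(2 - \eta_\tau(1-\gamma p))} \\
&\ge \eta_T \cdot \frac{1}{4(1-\gamma)} \cdot c',
\end{aligned} \tag{85}
$$
where the second line follows from the monotonicity of $\eta_t$ and the numerical constant $c'$ in the fifth step is given by the following claim, whose proof is deferred to the end of the section:
$$
1 - (1 - \eta_\tau(1-\gamma p))^{2(T-\max\{\tau',\tau\}-1)} \ge
\begin{cases}
1 - e^{-8/9} & \text{for constant step sizes}, \\
1 - \exp\Big(-\dfrac{8}{3\max\{1, c_\eta\}}\Big) & \text{for linearly rescaled step sizes}.
\end{cases} \tag{86}
$$
Thus, (84) and (85) together imply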
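$$
\mathrm{Var}(V_T(2)) \ge \frac{(4\gamma-1)}{36BM(1-\gamma)} \cdot \sum_{k=\tau'+2}^{T} (\eta^{(T)}_k)^2 \ge \frac{c'(4\gamma-1)}{144(1-\gamma)} \cdot \frac{\eta_T}{BM(1-\gamma)} \ge \frac{c'(4\gamma-1)}{144(1-\gamma)} \cdot \frac{1}{100}, \tag{87}
$$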
where the last inequality follows from the bound on η T BM .
Thus, for all N ≥ 1, we have,
$$
\mathbb{E}\big[(V^\star(2) - V_T(2))^2\big] = \big(V^\star(2) - \mathbb{E}[V_T(2)]\big)^2 + \mathrm{Var}(V_T(2)) \ge \frac{c''}{(1-\gamma)N},
$$
for some numerical constant c ′′ . Similar to the small learning rate case, the error rate is bounded away from a constant value irrespective of the number of agents and the number of communication rounds. Thus, even with CC round = Ω(T ), we will not observe any collaborative gain in this scenario.
Proof of (86). To establish the claim, we consider two cases:
• τ ′ ≥ τ : Under this case, we have,
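$$
(1 - \eta_\tau(1-\gamma p))^{2(T-\max\{\tau',\tau\}-1)} = (1 - \eta_\tau(1-\gamma p))^{2(T-\tau'-1)} \le (1 - \eta_\tau(1-\gamma p))^{T-\tau'} \le \prod_{k=\tau'+1}^{T} (1 - \eta_k(1-\gamma p)) \le \frac{1}{2}, \tag{88}
$$
where the last inequality follows from (80).
• $\tau \ge \tau'$: For this case, we have
$$
(1 - \eta_\tau(1-\gamma p))^{2(T-\max\{\tau',\tau\}-1)} = (1 - \eta_\tau(1-\gamma p))^{2(T-\tau-1)} \le (1 - \eta_\tau(1-\gamma p))^{T-\tau} \le \exp\Big(-\frac{2T\eta_\tau(1-\gamma p)}{3}\Big). \tag{89}
$$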
For the constant stepsize schedule, we have,
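$$
\exp\Big(-\frac{2T\eta_\tau(1-\gamma p)}{3}\Big) \le \exp\Big(-\frac{2T}{3} \cdot \frac{1}{(1-\gamma)T} \cdot \frac{4(1-\gamma)}{3}\Big) = \exp\Big(-\frac{8}{9}\Big). \tag{90}
$$
For the linearly rescaled stepsize schedule, we have
$$
\exp\Big(-\frac{2T\eta_\tau(1-\gamma p)}{3}\Big) \le \exp\Big(-\frac{2T}{3} \cdot \frac{1}{1 + c_\eta(1-\gamma)T/3} \cdot \frac{4(1-\gamma)}{3}\Big) = \exp\Big(-\frac{8}{3\max\{1, c_\eta\}}\Big). \tag{91}
$$
On combining (88), (89), (90) and (91), we arrive at the claim.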

Section: B.4 Generalizing to larger state action spaces
We now elaborate on how we can extend the result to general state-action spaces along with the obtaining the lower bound on the bit level communication complexity. For the general case, we instead consider the following MDP. For the first four states {0, 1, 2, 3}, the probability transition kernel and reward function are given as follows.
\begin{align}
\mathcal{A}_0 &= \{1\}, & P(0|0,1) &= 1, & & & r(0,1) &= 0, \tag{92a} \\
\mathcal{A}_1 &= \{1, 2, \dots, |\mathcal{A}|\}, & P(1|1,a) &= p, & P(0|1,a) &= 1-p, & r(1,a) &= 1, \ \forall\, a \in \mathcal{A}, \tag{92b} \\
\mathcal{A}_2 &= \{1\}, & P(2|2,1) &= p, & P(0|2,1) &= 1-p, & r(2,1) &= 1, \tag{92c} \\
\mathcal{A}_3 &= \{1\}, & P(3|3,1) &= 1, & & & r(3,1) &= 1, \tag{92d}
\end{align}
where the parameter $p = \frac{4\gamma-1}{3\gamma}$. The overall MDP is obtained by creating $|\mathcal{S}|/4$ copies of the above MDP for all sets of the form $\{4r, 4r+1, 4r+2, 4r+3\}$ for $r \le |\mathcal{S}|/4 - 1$. It is straightforward to note that the lower bound on the number of communication rounds immediately transfers to the general case as well. Moreover, note that the bound on $\mathrm{CC}_{\mathrm{round}}$ implies the bound
$$
\mathrm{CC}_{\mathrm{bit}} = \Omega\Big(\frac{1}{(1-\gamma)\log^2 N}\Big)
$$
as every communication entails sending $\Omega(1)$ bits.
To obtain the general lower bound on bit level communication complexity, note that we can carry out the analysis in the previous section for all $|\mathcal{A}|/2$ pairs of actions in state 1 corresponding to the set of states $\{0, 1, 2, 3\}$. Moreover, the algorithm $\mathscr{A}$ needs to ensure that the error is low across all the $|\mathcal{A}|/2$ pairs. Since the errors are independent across all these pairs, each of them requires $\Omega\big(\frac{1}{(1-\gamma)\log^2 N}\big)$ bits of information to be transmitted during the learning horizon, leading to a lower bound of $\Omega\big(\frac{|\mathcal{A}|}{(1-\gamma)\log^2 N}\big)$. Note that since we require a low $\ell_\infty$ error, $\mathscr{A}$ needs to ensure that the error is low across all the pairs, resulting in a communication cost growing linearly with $|\mathcal{A}|$. Upon repeating the argument across all $|\mathcal{S}|/4$ copies of the MDP, we arrive at the lower bound of
$$
\mathrm{CC}_{\mathrm{bit}} = \Omega\Big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\log^2 N}\Big).
$$
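To make the construction in (92) concrete, the following sketch builds one four-state block and runs plain value iteration on it. It is an illustration only: the dictionary layout, the 0-indexed actions and the helper names are assumptions made for this example, and value iteration is used here merely to recover the optimal values, not as part of the lower-bound argument.

```python
def build_block(gamma: float, num_actions: int):
    """One copy of the hard MDP on states {0, 1, 2, 3}, following (92a)-(92d)."""
    p = (4.0 * gamma - 1.0) / (3.0 * gamma)
    # mdp[s][a] = (reward, {next_state: probability}); actions are 0-indexed here.
    return {
        0: {0: (0.0, {0: 1.0})},
        1: {a: (1.0, {1: p, 0: 1.0 - p}) for a in range(num_actions)},
        2: {0: (1.0, {2: p, 0: 1.0 - p})},
        3: {0: (1.0, {3: 1.0})},
    }

def optimal_values(mdp, gamma: float, iters: int = 2000):
    """Value iteration: V(s) <- max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    V = {s: 0.0 for s in mdp}
    for _ in range(iters):
        V = {
            s: max(rew + gamma * sum(pr * V[s2] for s2, pr in nxt.items())
                   for rew, nxt in acts.values())
            for s, acts in mdp.items()
        }
    return V

gamma = 0.9
V = optimal_values(build_block(gamma, num_actions=4), gamma)
```

With $\gamma = 0.9$ the sketch returns $V^\star(1) = V^\star(2) = 3/(4(1-\gamma)) = 7.5$ and $V^\star(3) = 1/(1-\gamma) = 10$, consistent with the values used in the proofs above.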

Section: B.5 Proofs of auxiliary lemmas

Section: B.5.1 Proof of Lemma 2
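Note that a similar relationship is also derived in Li et al. [2023]; since we additionally need to account for the averaging over multiple agents, we present the entire argument for completeness. We prove the claim using induction over $t$. It is straightforward to note that the claim is true for $t = 0$ and all agents $m \in \{1, 2, \dots, M\}$. For the inductive step, we assume that the claim holds for $t-1$ for all clients. Using the induction hypothesis, we have the following relation between $V^m_{t-1}(1)$ and $\widehat{V}^m_{t-1}$:
$$
V^m_{t-1}(1) = \max_{a \in \{1,2\}} Q^m_{t-1}(1,a) \ge \max_{a \in \{1,2\}} \widehat{Q}^m_{t-1}(a) - \frac{1}{1-\gamma} \prod_{i=1}^{t-1} (1 - \eta_i(1-\gamma)) = \widehat{V}^m_{t-1} - \frac{1}{1-\gamma} \prod_{i=1}^{t-1} (1 - \eta_i(1-\gamma)). \tag{93}
$$
For $t \notin \{t_r\}_{r=1}^{R}$ and $a \in \{1, 2\}$, we have
$$
\begin{aligned}
Q^m_t(1,a) - \widehat{Q}^m_t(a) &= Q^m_{t-1/2}(1,a) - \widehat{Q}^m_{t-1/2}(a) \\
&= (1-\eta_t) Q^m_{t-1}(1,a) + \eta_t\big(1 + \gamma \widehat{P}^m_t(1|1,a) V^m_{t-1}(1)\big) - \Big[(1-\eta_t) \widehat{Q}^m_{t-1}(a) + \eta_t\big(1 + \gamma \widehat{P}^m_t(1|1,a) \widehat{V}^m_{t-1}\big)\Big] \\
&= (1-\eta_t)\big(Q^m_{t-1}(1,a) - \widehat{Q}^m_{t-1}(a)\big) + \eta_t \gamma \widehat{P}^m_t(1|1,a)\big(V^m_{t-1}(1) - \widehat{V}^m_{t-1}\big) \\
&\ge -\frac{(1-\eta_t)}{1-\gamma} \prod_{i=1}^{t-1} (1 - \eta_i(1-\gamma)) - \widehat{P}^m_t(1|1,a) \cdot \frac{\eta_t \gamma}{1-\gamma} \prod_{i=1}^{t-1} (1 - \eta_i(1-\gamma)) \\
&\ge -\frac{(1-\eta_t)}{1-\gamma} \prod_{i=1}^{t-1} (1 - \eta_i(1-\gamma)) - \frac{\eta_t \gamma}{1-\gamma} \prod_{i=1}^{t-1} (1 - \eta_i(1-\gamma)) \\
&\ge -\frac{1}{1-\gamma} \prod_{i=1}^{t} (1 - \eta_i(1-\gamma)).
\end{aligned} \tag{94}
$$
For $t \in \{t_r\}_{r=1}^{R}$ and $a \in \{1, 2\}$, we have
$$
\begin{aligned}
Q^m_t(1,a) - \widehat{Q}^m_t(a) &= \frac{1}{M} \sum_{m=1}^{M} Q^m_{t-1/2}(1,a) - \frac{1}{M} \sum_{m=1}^{M} \widehat{Q}^m_{t-1/2}(a) \\
&= \frac{1}{M} \sum_{m=1}^{M} \Big[(1-\eta_t) Q^m_{t-1}(1,a) + \eta_t\big(1 + \gamma \widehat{P}^m_t(1|1,a) V^m_{t-1}(1)\big)\Big] - \frac{1}{M} \sum_{m=1}^{M} \Big[(1-\eta_t) \widehat{Q}^m_{t-1}(a) + \eta_t\big(1 + \gamma \widehat{P}^m_t(1|1,a) \widehat{V}^m_{t-1}\big)\Big] \\
&= \frac{1}{M} \sum_{m=1}^{M} \Big[(1-\eta_t)\big(Q^m_{t-1}(1,a) - \widehat{Q}^m_{t-1}(a)\big) + \eta_t \gamma \widehat{P}^m_t(1|1,a)\big(V^m_{t-1}(1) - \widehat{V}^m_{t-1}\big)\Big] \\
&\ge -\frac{1}{1-\gamma} \prod_{i=1}^{t} (1 - \eta_i(1-\gamma)),
\end{aligned} \tag{95}
$$
where the last step follows using the same set of arguments as used in (94). The inductive step follows from (94) and (95).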

Section: B.5.2 Proof of Lemma 3
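In order to bound the term $\mathbb{E}[\Delta^m_{t,\max}] - \mathbb{E}[\xi^m_{t',t,\max}]$, we make use of the relation in (44a), which we recall:
$$
\mathbb{E}[\Delta^m_{t,\max}] \ge \varphi_{t',t}\, \mathbb{E}[\Delta_{t',\max}] + \sum_{k=t'+1}^{t} \eta^{(t)}_k \gamma p\, \mathbb{E}[\Delta^m_{k-1,\max}] + \mathbb{E}[\xi^m_{t',t,\max}] - \varphi_{t',t}\, \mathbb{E}\big[|\Delta_{t'}(1) - \Delta_{t'}(2)|\big].
$$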
• To aid the analysis, we consider the following recursive relation for any fixed agent m:
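$$
y_t = (1-\eta_t) y_{t-1} + \eta_t\big(\gamma p\, y_{t-1} + \mathbb{E}[\xi^m_{t',t,\max}]\big). \tag{96}
$$
Upon unrolling the recursion, we obtain
$$
\begin{aligned}
y_t &= \prod_{k=t'+1}^{t} (1-\eta_k)\, y_{t'} + \sum_{k=t'+1}^{t} \eta_k \prod_{i=k+1}^{t} (1-\eta_i)\, \gamma p\, y_{k-1} + \sum_{k=t'+1}^{t} \eta_k \prod_{i=k+1}^{t} (1-\eta_i)\, \mathbb{E}[\xi^m_{t',t,\max}] \\
&= \varphi_{t',t}\, y_{t'} + \sum_{k=t'+1}^{t} \eta^{(t)}_k \gamma p\, y_{k-1} + \sum_{k=t'+1}^{t} \eta^{(t)}_k\, \mathbb{E}[\xi^m_{t',t,\max}].
\end{aligned} \tag{97}
$$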
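Initializing $y_{t'} = \mathbb{E}[\Delta_{t',\max}]$ in (97) and plugging this into (44a), we have
$$
\mathbb{E}[\Delta^m_{t,\max}] \ge y_t - \varphi_{t',t}\, \mathbb{E}\big[|\Delta_{t'}(1) - \Delta_{t'}(2)|\big],
$$
where we used $\sum_{k=t'+1}^{t} \eta^{(t)}_k \le 1$ (cf. (18)). We now further simplify the expression of $y_t$. By rewriting (96) as
$$
y_t = (1 - \eta_t(1-\gamma p))\, y_{t-1} + \eta_t\, \mathbb{E}[\xi^m_{t',t,\max}],
$$
it is straightforward to note that $y_t$ is given as
$$
y_t = \prod_{k=t'+1}^{t} (1 - \eta_k(1-\gamma p))\, y_{t'} + \mathbb{E}[\xi^m_{t',t,\max}] \sum_{k=t'+1}^{t} \eta^{(t)}_k. \tag{98}
$$
Consequently, we have
$$
\mathbb{E}[\Delta^m_{t,\max}] - \mathbb{E}[\xi^m_{t',t,\max}] \ge \prod_{k=t'+1}^{t} (1 - \eta_k(1-\gamma p))\, \mathbb{E}[\Delta_{t',\max}] + \mathbb{E}[\xi^m_{t',t,\max}] \Big(\sum_{k=t'+1}^{t} \eta^{(t)}_k - 1\Big) - \varphi_{t',t}\, \mathbb{E}\big[|\Delta_{t'}(1) - \Delta_{t'}(2)|\big]. \tag{99}
$$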
• We can consider a slightly different recursive sequence defined as
$$
w_t = (1-\eta_t) w_{t-1} + \eta_t (\gamma p\, w_{t-1}). \tag{100}
$$
Using a similar sequence of arguments as outlined in (96)-(98), we can conclude that if $w_{t'} = \mathbb{E}[\Delta_{t',\max}]$, then $\mathbb{E}[\Delta^m_{t,\max}] \ge w_t + \mathbb{E}[\xi^m_{t',t,\max}] - \varphi_{t',t}\, \mathbb{E}[|\Delta_{t'}(1) - \Delta_{t'}(2)|]$ and consequently,
$$
\mathbb{E}[\Delta^m_{t,\max}] \ge \prod_{k=t'+1}^{t} (1 - \eta_k(1-\gamma p))\, \mathbb{E}[\Delta_{t',\max}] + \mathbb{E}[\xi^m_{t',t,\max}] - \varphi_{t',t}\, \mathbb{E}\big[|\Delta_{t'}(1) - \Delta_{t'}(2)|\big]. \tag{101}
$$
On combining (99) and (101), we arrive at the claim.

Section: B.5.3 Proof of Lemma 4
We begin with bounding the first term E[ξ m t ′ ,t,max ]; the second bound follows in an almost identical derivation.
Step 1: applying Freedman's inequality. Using the relation $\max\{a, b\} = \frac{a + b + |a - b|}{2}$, we can rewrite $\mathbb{E}[\xi^m_{t',t,\max}]$ as
$$
\mathbb{E}[\xi^m_{t',t,\max}] = \mathbb{E}\Big[\frac{\xi^m_{t',t}(1) + \xi^m_{t',t}(2)}{2} + \frac{|\xi^m_{t',t}(1) - \xi^m_{t',t}(2)|}{2}\Big] = \frac{1}{2} \mathbb{E}\big[|\xi^m_{t',t}(1) - \xi^m_{t',t}(2)|\big] = \frac{1}{2} \mathbb{E}\Big[\Big|\underbrace{\sum_{k=t'+1}^{t} \eta^{(t)}_k \gamma \big(\widehat{P}^m_k(1|1,1) - \widehat{P}^m_k(1|1,2)\big) \widehat{V}^m_{k-1}}_{=:\ \zeta^m_{t',t}}\Big|\Big], \tag{102}
$$
where we used the definition in (38) and the fact that $\mathbb{E}[\xi^m_{t',t}(1)] = \mathbb{E}[\xi^m_{t',t}(2)] = 0$. Decompose $\zeta^m_{t',t}$ as
$$
\zeta^m_{t',t} = \sum_{k=t'+1}^{t} \sum_{b=1}^{B} \eta^{(t)}_k \frac{\gamma}{B} \big(P^m_{k,b}(1|1,1) - P^m_{k,b}(1|1,2)\big) \widehat{V}^m_{k-1} =: \sum_{l=1}^{L} z_l, \tag{103}
$$
where, for all $1 \le l \le L$,
$$
z_l := \eta^{(t)}_{k(l)} \frac{\gamma}{B} \big(P^m_{k(l),b(l)}(1|1,1) - P^m_{k(l),b(l)}(1|1,2)\big) \widehat{V}^m_{k(l)-1},
$$
with $k(l) := \lfloor (l-1)/B \rfloor + t' + 1$, $b(l) := ((l-1) \bmod B) + 1$ and $L = (t - t')B$. Let $\{\mathcal{F}_l\}_{l=1}^{L}$ be a filtration such that $\mathcal{F}_l$ is the σ-algebra corresponding to $\{P^m_{k(j),b(j)}(1|1,1), P^m_{k(j),b(j)}(1|1,2)\}_{j=1}^{l}$.
It is straightforward to note that $\{z_l\}_{l=1}^{L}$ is a martingale difference sequence adapted to the filtration $\{\mathcal{F}_l\}_{l=1}^{L}$. We will use Freedman's inequality [Freedman, 1975, Li et al., 2023] to obtain a high probability bound on $|\zeta^m_{t',t}|$.
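The flattening of the double sum over $(k, b)$ into a single index $l$ can be sketched in code. The snippet below assumes the indexing $k(l) = \lfloor (l-1)/B \rfloor + t' + 1$ and $b(l) = ((l-1) \bmod B) + 1$, which is the choice that enumerates every pair in $\{t'+1, \dots, t\} \times \{1, \dots, B\}$ exactly once; the function names are illustrative.

```python
def index_map(l: int, B: int, t_prime: int):
    """Map the flat index l in {1, ..., L} to (time step k, sample index b)."""
    k = (l - 1) // B + t_prime + 1
    b = (l - 1) % B + 1
    return k, b

def enumerate_pairs(t_prime: int, t: int, B: int):
    L = (t - t_prime) * B   # total number of martingale increments
    return [index_map(l, B, t_prime) for l in range(1, L + 1)]

pairs = enumerate_pairs(t_prime=3, t=6, B=4)
# Expected order: all B samples of step t'+1, then all samples of step t'+2, and so on.
expected = [(k, b) for k in range(4, 7) for b in range(1, 5)]
```

Ordering the increments sample by sample within each time step is what makes the partial sums adapted to the filtration $\{\mathcal{F}_l\}$.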
• To that effect, note that
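$$
\sup_l |z_l| \le \sup_l \Big| \eta^{(t)}_{k(l)} \cdot \frac{\gamma}{B} \cdot \big(P^m_{k(l),b(l)}(1|1,1) - P^m_{k(l),b(l)}(1|1,2)\big) \cdot \widehat{V}^m_{k(l)-1} \Big| \le \eta^{(t)}_{k(l)} \cdot \frac{\gamma}{B(1-\gamma)} \le \frac{\eta_t}{B(1-\gamma)}, \tag{104}
$$
where the second step follows from the bounds $|P^m_{k(l),b(l)}(1|1,1) - P^m_{k(l),b(l)}(1|1,2)| \le 1$ and $\widehat{V}^m_{k(l)-1} \le \frac{1}{1-\gamma}$, and the third step uses $c_\eta \le \frac{1}{1-\gamma}$ and the fact that $\eta^{(T)}_k$ is increasing in $k$ in this regime (cf. (19)).
• Similarly,
$$
\mathrm{Var}(z_l \mid \mathcal{F}_{l-1}) \le \big(\eta^{(t)}_{k(l)}\big)^2 \frac{\gamma^2}{B^2} \cdot \big(\widehat{V}^m_{k(l)-1}\big)^2 \cdot \mathrm{Var}\big(P^m_{k(l),b(l)}(1|1,1) - P^m_{k(l),b(l)}(1|1,2)\big) \le \frac{\big(\eta^{(t)}_{k(l)}\big)^2 \gamma^2}{B^2(1-\gamma)^2} \cdot 2p(1-p) \le \frac{2\big(\eta^{(t)}_{k(l)}\big)^2}{3B^2(1-\gamma)}. \tag{105}
$$
Using the above bounds (104) and (105) along with Freedman's inequality yields
$$
\Pr\Bigg( |\zeta^m_{t',t}| \ge \sqrt{\frac{8\log(2/\delta)}{3B^2(1-\gamma)} \sum_{l=1}^{L} \big(\eta^{(t)}_{k(l)}\big)^2} + \frac{4\eta_t \log(2/\delta)}{3B(1-\gamma)} \Bigg) \le \delta. \tag{106}
$$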
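Setting $\delta_0 = \frac{(1-\gamma)^2}{2} \cdot \mathbb{E}[|\zeta^m_{t',t}|^2]$, with probability at least $1 - \delta_0$, it holds that
$$
|\zeta^m_{t',t}| < \sqrt{\frac{8\log(2/\delta_0)}{3B(1-\gamma)} \sum_{k=t'+1}^{t} \big(\eta^{(t)}_k\big)^2} + \frac{4\eta_t \log(2/\delta_0)}{3B(1-\gamma)} =: D. \tag{107}
$$
Consequently, plugging this back into (102), we obtain
$$
\begin{aligned}
\mathbb{E}[\xi^m_{t',t,\max}] = \frac{1}{2} \mathbb{E}[|\zeta^m_{t',t}|] &\ge \frac{1}{2} \mathbb{E}\big[|\zeta^m_{t',t}|\, \mathbb{1}\{|\zeta^m_{t',t}| \le D\}\big] \\
&\ge \frac{1}{2D} \mathbb{E}\big[|\zeta^m_{t',t}|^2\, \mathbb{1}\{|\zeta^m_{t',t}| \le D\}\big] \\
&\ge \frac{1}{2D} \Big(\mathbb{E}[|\zeta^m_{t',t}|^2] - \mathbb{E}\big[|\zeta^m_{t',t}|^2\, \mathbb{1}\{|\zeta^m_{t',t}| > D\}\big]\Big) \\
&\ge \frac{1}{2D} \Big(\mathbb{E}[|\zeta^m_{t',t}|^2] - \frac{\Pr(|\zeta^m_{t',t}| > D)}{(1-\gamma)^2}\Big) \\
&\ge \frac{1}{4D} \cdot \mathbb{E}[|\zeta^m_{t',t}|^2].
\end{aligned} \tag{108}
$$
Here, the penultimate step used the fact that $|\zeta^m_{t',t}| \le \sum_{k=t'+1}^{t} \frac{\eta^{(t)}_k}{1-\gamma} \le \frac{1}{1-\gamma}$, and the last step used the definition of $\delta_0$. Thus, it is sufficient to obtain a lower bound on $\mathbb{E}[|\zeta^m_{t',t}|^2]$ in order to obtain a lower bound for $\mathbb{E}[\xi^m_{t',t,\max}]$.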
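Step 2: lower bounding $\mathbb{E}[|\zeta^m_{t',t}|^2]$. To proceed, we introduce the following lemma pertaining to lower bounding $\widehat{V}^m_t$ that will be useful later.
Lemma 6. For all time instants $t \in [T]$ and all agents $m \in [M]$:
$$
\mathbb{E}\big[(\widehat{V}^m_t)^2\big] \ge \frac{1}{2(1-\gamma)^2}.
$$
We have
$$
\begin{aligned}
\mathbb{E}[|\zeta^m_{t',t}|^2] = \mathbb{E}\Big[\sum_{l=1}^{L} \mathrm{Var}(z_l \mid \mathcal{F}_{l-1})\Big] &= \mathbb{E}\Big[\sum_{l=1}^{L} \mathbb{E}\big[z_l^2 \mid \mathcal{F}_{l-1}\big]\Big] \\
&\ge \sum_{l=1}^{L} \big(\eta^{(t)}_{k(l)}\big)^2 \frac{\gamma^2}{B^2} \cdot 2p(1-p) \cdot \mathbb{E}\big[(\widehat{V}^m_{k(l)-1})^2\big] \\
&\ge \sum_{l=1}^{L} \big(\eta^{(t)}_{k(l)}\big)^2 \frac{\gamma^2}{B^2} \cdot 2p(1-p) \cdot \frac{1}{2(1-\gamma)^2} \\
&\ge \frac{2}{9B(1-\gamma)} \cdot \sum_{k=\max\{t',\tau\}+1}^{t} \big(\eta^{(t)}_k\big)^2,
\end{aligned} \tag{109}
$$
where the third line follows from Lemma 6 and the fourth line uses $\gamma \ge 5/6$.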
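Step 3: finishing up. We finish up the proof by bounding $\sum_{k=\max\{t',\tau\}+1}^{t} (\eta^{(t)}_k)^2$ for $t - \max\{t', \tau\} \ge 1/\eta_\tau$. We have
$$
\begin{aligned}
\sum_{k=\max\{t',\tau\}+1}^{t} \big(\eta^{(t)}_k\big)^2 &\ge \sum_{k=\max\{t',\tau\}+1}^{t} \Big(\eta_k \prod_{i=k+1}^{t} (1-\eta_i)\Big)^2 \overset{(i)}{\ge} \sum_{k=\max\{t',\tau\}+1}^{t} \Big(\eta_t \prod_{i=k+1}^{t} (1-\eta_\tau)\Big)^2 \\
&= \eta_t^2 \sum_{k=\max\{t',\tau\}+1}^{t} (1-\eta_\tau)^{2(t-k)} \ge \eta_t^2 \cdot \frac{1 - (1-\eta_\tau)^{2(t-\max\{t',\tau\})}}{\eta_\tau(2-\eta_\tau)} \\
&\ge \eta_t \cdot \frac{1 - \exp(-2)}{6} \ge \frac{\eta_t}{10} \ge \frac{\eta_T}{10},
\end{aligned} \tag{110}
$$
where (i) follows from the monotonicity of $\eta_k$. Plugging (110) into the expression of $D$ (cf. (107)), we have
$$
\begin{aligned}
D &= \sqrt{\frac{8\log(2/\delta_0)}{3B(1-\gamma)} \sum_{k=t'+1}^{t} \big(\eta^{(t)}_k\big)^2} + \frac{4\eta_t \log(2/\delta_0)}{3B(1-\gamma)} \\
&\le \frac{9}{2} \mathbb{E}[|\zeta^m_{t',t}|^2] \cdot \sqrt{\frac{8\log(2/\delta_0)}{3}} \Big(\frac{1}{B(1-\gamma)} \sum_{k=t'+1}^{t} \big(\eta^{(t)}_k\big)^2\Big)^{-1/2} + 60 \cdot \mathbb{E}[|\zeta^m_{t',t}|^2] \cdot \log(2/\delta_0) \\
&\le 3\,\mathbb{E}[|\zeta^m_{t',t}|^2] \cdot \log(2/\delta_0) \Big(\sqrt{\frac{60B(1-\gamma)}{\eta_t}} + 20\Big) \\
&\le 60\,\mathbb{E}[|\zeta^m_{t',t}|^2] \cdot \log(2/\delta_0) \Big(\sqrt{\frac{3B(1-\gamma)}{20\eta_T}} + 1\Big),
\end{aligned}
$$
where the second line follows from (109) and (110), and the third line follows from (110). On combining the above bound with (108), we obtain
$$
\mathbb{E}[\xi^m_{t',t,\max}] \ge \frac{1}{240\log(2/\delta_0)} \cdot \frac{\nu}{\nu+1}, \tag{111}
$$
where $\nu := \sqrt{\frac{20\eta_T}{3B(1-\gamma)}}$. Note that we have
$$
\delta_0 = \frac{(1-\gamma)^2}{2} \cdot \mathbb{E}[|\zeta^m_{t',t}|^2] \ge \frac{(1-\gamma)}{9B} \cdot \sum_{k=t'+1}^{t} \big(\eta^{(t)}_k\big)^2 \ge \frac{\eta_T(1-\gamma)}{90B}.
$$
Combining the above bound with (111) yields the required bound.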
Step 4: repeating the argument for the second claim. We note that the second claim in the lemma, i.e., the lower bound on
$$
\mathbb{E}\Big[\max\Big\{\frac{1}{M} \sum_{m=1}^{M} \xi^m_{t',t}(1),\ \frac{1}{M} \sum_{m=1}^{M} \xi^m_{t',t}(2)\Big\}\Big],
$$
follows through an identical series of arguments where the bounds in Eqns. (104) and (105) contain an additional factor of $M$ in the denominator (effectively replacing $B$ with $BM$), which is carried through in all the following steps.

Section: B.5.4 Proof of Lemma 5
Using Eqns. (41) and (38), we can write
$$
\Delta_t(1) - \Delta_t(2) = \prod_{k=t'+1}^{t} (1-\eta_k) \big(\Delta_{t'}(1) - \Delta_{t'}(2)\big) + \frac{1}{M} \sum_{m=1}^{M} \sum_{k=t'+1}^{t} \eta_k \prod_{i=k+1}^{t} (1-\eta_i)\, \gamma \big(\widehat{P}^m_k(1|1,1) - \widehat{P}^m_k(1|1,2)\big) \widehat{V}^m_{k-1}.
$$
Upon unrolling the recursion, we obtain
$$
\Delta_t(1) - \Delta_t(2) = \sum_{k=1}^{t} \sum_{m=1}^{M} \eta_k \prod_{i=k+1}^{t} (1-\eta_i)\, \frac{\gamma}{M} \big(\widehat{P}^m_k(1|1,1) - \widehat{P}^m_k(1|1,2)\big) \widehat{V}^m_{k-1}.
$$
If we define a filtration $\mathcal{F}_k$ as the σ-algebra corresponding to $\{\widehat{P}^1_l(1|1,1), \widehat{P}^1_l(1|1,2), \dots, \widehat{P}^M_l(1|1,1), \widehat{P}^M_l(1|1,2)\}_{l=1}^{k}$, then it is straightforward to note that $\{\Delta_t(1) - \Delta_t(2)\}_t$ is a martingale sequence adapted to the filtration $\{\mathcal{F}_t\}_t$. Using Jensen's inequality, we know that if $\{Z_t\}_t$ is a martingale adapted to a filtration $\{\mathcal{G}_t\}_t$, then for a convex function $f$ such that $f(Z_t)$ is integrable for all $t$, $\{f(Z_t)\}_t$ is a sub-martingale adapted to $\{\mathcal{G}_t\}_t$. Since $f(x) = |x|$ is a convex function, $\{|\Delta_t(1) - \Delta_t(2)|\}_t$ is a sub-martingale adapted to the filtration $\{\mathcal{F}_t\}_t$. As a result,
$$
\sup_{1 \le t \le T} \mathbb{E}\big[|\Delta_t(1) - \Delta_t(2)|\big] \le \mathbb{E}\big[|\Delta_T(1) - \Delta_T(2)|\big] \le \Big(\mathbb{E}\big[(\Delta_T(1) - \Delta_T(2))^2\big]\Big)^{1/2}. \tag{112}
$$
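We use the following observation about a martingale difference sequence $\{X_i\}_{i=1}^{t}$ adapted to a filtration $\{\mathcal{G}_i\}_{i=1}^{t}$ to evaluate the above expression. We have
$$
\begin{aligned}
\mathbb{E}\Big[\Big(\sum_{i=1}^{t} X_i\Big)^2\Big] &= \mathbb{E}\Big[\mathbb{E}\Big[\Big(\sum_{i=1}^{t} X_i\Big)^2 \,\Big|\, \mathcal{G}_{t-1}\Big]\Big] \\
&= \mathbb{E}\Big[\mathbb{E}\Big[X_t^2 + 2X_t \sum_{i=1}^{t-1} X_i + \Big(\sum_{i=1}^{t-1} X_i\Big)^2 \,\Big|\, \mathcal{G}_{t-1}\Big]\Big] \\
&= \mathbb{E}\big[X_t^2\big] + \mathbb{E}\Big[\Big(\sum_{i=1}^{t-1} X_i\Big)^2\Big] = \sum_{i=1}^{t} \mathbb{E}\big[X_i^2\big],
\end{aligned} \tag{113}
$$
where the third step uses the facts that $\sum_{i=1}^{t-1} X_i$ is $\mathcal{G}_{t-1}$-measurable and $\mathbb{E}[X_t \mid \mathcal{G}_{t-1}] = 0$, and the fourth step is obtained by recursively applying the second and third steps. Using the relation in Eqn. (113) in Eqn. (112), we obtain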
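The orthogonality identity (113) can be verified exactly on a toy example. The sketch below is an illustration under assumptions of its own: it uses increments $X_i = \varepsilon_i \cdot s_i$ with independent fair signs $\varepsilon_i$ and a scale $s_i$ that depends only on the past signs, and it enumerates all sign patterns with rational arithmetic instead of sampling. None of these choices come from the paper.

```python
from fractions import Fraction
from itertools import product

def second_moments(t: int):
    """Return (E[(sum_i X_i)^2], sum_i E[X_i^2]) for a toy martingale difference sequence."""
    lhs = Fraction(0)
    rhs = Fraction(0)
    weight = Fraction(1, 2 ** t)          # each sign pattern is equally likely
    for signs in product((-1, 1), repeat=t):
        increments = []
        for i, eps in enumerate(signs):
            # The scale depends on eps_1, ..., eps_{i-1} only, so E[X_i | past] = 0.
            scale = 1 + sum(1 for s in signs[:i] if s == 1)
            increments.append(eps * scale)
        lhs += weight * sum(increments) ** 2
        rhs += weight * sum(x * x for x in increments)
    return lhs, rhs

lhs, rhs = second_moments(6)
```

The two quantities agree exactly even though the increments are dependent, because all cross terms $\mathbb{E}[X_i X_j]$ with $i < j$ vanish after conditioning on the past.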
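$$
\begin{aligned}
\sup_{1 \le t \le T} \mathbb{E}\big[|\Delta_t(1) - \Delta_t(2)|\big] &\le \Big(\mathbb{E}\big[(\Delta_T(1) - \Delta_T(2))^2\big]\Big)^{1/2} \\
&\le \Bigg(\sum_{k=1}^{T} \mathbb{E}\Big[\Big(\sum_{m=1}^{M} \eta^{(T)}_k \cdot \frac{\gamma}{M} \cdot \big(\widehat{P}^m_k(1|1,1) - \widehat{P}^m_k(1|1,2)\big) \widehat{V}^m_{k-1}\Big)^2\Big]\Bigg)^{1/2} \\
&\le \Bigg(\sum_{k=1}^{T} \big(\eta^{(T)}_k\big)^2 \cdot \frac{2\gamma^2 p(1-p)}{BM^2} \cdot \sum_{m=1}^{M} \mathbb{E}\big[(\widehat{V}^m_{k-1})^2\big]\Bigg)^{1/2} \\
&\le \Bigg(\sum_{k=1}^{T} \big(\eta^{(T)}_k\big)^2 \cdot \frac{2\gamma^2 p(1-p)}{BM(1-\gamma)^2}\Bigg)^{1/2}.
\end{aligned} \tag{114}
$$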
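Let us focus on the term involving the step sizes. We separately consider the scenario for constant step sizes and linearly rescaled step sizes. For constant step sizes, we have
$$
\sum_{k=1}^{T} \big(\eta^{(T)}_k\big)^2 = \sum_{k=1}^{T} \Big(\eta_k \prod_{i=k+1}^{T} (1-\eta_i)\Big)^2 = \sum_{k=1}^{T} \eta^2 (1-\eta)^{2(T-k)} \le \frac{\eta^2}{1 - (1-\eta)^2} \le \eta. \tag{115}
$$
Similarly, for linearly rescaled step sizes, we have
$$
\begin{aligned}
\sum_{k=1}^{T} \big(\eta^{(T)}_k\big)^2 &= \sum_{k=1}^{\tau} \big(\eta^{(T)}_k\big)^2 + \sum_{k=\tau+1}^{T} \Big(\eta_k \prod_{i=k+1}^{T} (1-\eta_i)\Big)^2 \\
&\le \sum_{k=1}^{\tau} \big(\eta^{(T)}_\tau\big)^2 + \sum_{k=\tau+1}^{T} \eta_k^2 (1-\eta_T)^{2(T-k)} \\
&\le \eta_\tau^2 (1-\eta_T)^{2(T-\tau)} \cdot \tau + \eta_\tau^2 \cdot \frac{1}{\eta_T(2-\eta_T)} \\
&\le 3\eta_T \cdot \eta_T \cdot T \cdot \exp\Big(-\frac{4T\eta_T}{3}\Big) + 3\eta_T \\
&\le \frac{9}{4e}\eta_T + 3\eta_T \le 4\eta_T,
\end{aligned} \tag{116}
$$
where the second step uses $c_\eta \le \log N \le \frac{1}{1-\gamma}$ and the fact that $\eta^{(T)}_k$ is increasing in $k$ in this regime (see Eqn. (19)), and the fifth step uses $x e^{-4x/3} \le 3/(4e)$. On plugging the results from Eqns. (115) and (116) into Eqn. (114) along with the value of $p$, we obtain
$$
\sup_{1 \le t \le T} \mathbb{E}\big[|\Delta_t(1) - \Delta_t(2)|\big] \le \sqrt{\frac{8\eta_T}{3BM(1-\gamma)}}, \tag{117}
$$
as required.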
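The elementary facts used for the step-size sums can be checked numerically. The following sketch covers the constant step-size bound (115), the scalar inequality $x e^{-4x/3} \le 3/(4e)$ and the final constant in (116); it assumes the weights $\eta^{(T)}_k = \eta(1-\eta)^{T-k}$ as written in (115), and the grid and test values are illustrative.

```python
import math

def constant_step_sum(eta: float, T: int) -> float:
    """Sum over k of (eta * (1 - eta)^(T - k))^2, the left-hand side of (115)."""
    return sum((eta * (1.0 - eta) ** (T - k)) ** 2 for k in range(1, T + 1))

etas = [0.01, 0.1, 0.5, 0.9]
tol = 1e-12
# Geometric series: the sum is at most eta^2 / (1 - (1-eta)^2) = eta / (2 - eta) <= eta.
constant_ok = all(
    constant_step_sum(eta, 500) <= eta / (2.0 - eta) + tol and eta / (2.0 - eta) <= eta
    for eta in etas
)

# x * exp(-4x/3) is maximised at x = 3/4, where it equals 3 / (4e).
grid = [j / 1000.0 for j in range(0, 20001)]
peak = max(x * math.exp(-4.0 * x / 3.0) for x in grid)
peak_ok = peak <= 3.0 / (4.0 * math.e) + tol

# Final constant in (116): 9/(4e) + 3 <= 4.
constant_chain_ok = 9.0 / (4.0 * math.e) + 3.0 <= 4.0
```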

Section: B.5.5 Proof of Lemma 6
For the proof, we fix an agent $m$. In order to obtain the required lower bound on $\widehat{V}^m_t$, we define an auxiliary sequence $\overline{Q}^m_t$ that evolves as described in Algorithm 5. Essentially, $\overline{Q}^m_t$ evolves in a manner almost identical to $\widehat{Q}^m_t$ except for the fact that there is only one action and hence there is no maximization step in the update rule.
Algorithm 5: Evolution of $\overline{Q}$
1: $r \leftarrow 1$, $\overline{Q}^m_0 = Q^\star(1,1)$ for all $m \in \{1, 2, \dots, M\}$
2: for $t = 1, 2, \dots, T$ do
3:   for $m = 1, 2, \dots, M$ do
4:     $\overline{Q}^m_{t-1/2} \leftarrow (1-\eta_t)\overline{Q}^m_{t-1} + \eta_t\big(1 + \gamma \widehat{P}^m_t(1|1,1)\overline{Q}^m_{t-1}\big)$
5:     Compute $\overline{Q}^m_t$ according to Eqn. (8)
6:   end for
7: end for
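It is straightforward to note that $\widehat{Q}^m_t(1) \ge \overline{Q}^m_t$, which can be shown using induction. From the initialization, it follows that $\widehat{Q}^m_0(1) \ge \overline{Q}^m_0$. Assuming the relation holds for $t-1$, we have
$$
\begin{aligned}
\widehat{Q}^m_{t-1/2}(1) &= (1-\eta_t)\widehat{Q}^m_{t-1}(1) + \eta_t\big(1 + \gamma \widehat{P}^m_t(1|1,1)\widehat{V}^m_{t-1}\big) \\
&\ge (1-\eta_t)\widehat{Q}^m_{t-1}(1) + \eta_t\big(1 + \gamma \widehat{P}^m_t(1|1,1)\widehat{Q}^m_{t-1}(1)\big) \\
&\ge (1-\eta_t)\overline{Q}^m_{t-1} + \eta_t\big(1 + \gamma \widehat{P}^m_t(1|1,1)\overline{Q}^m_{t-1}\big) = \overline{Q}^m_{t-1/2}.
\end{aligned}
$$
Since $\widehat{Q}^m_t$ and $\overline{Q}^m_t$ follow the same averaging schedule, it immediately follows from the above relation that $\widehat{Q}^m_t(1) \ge \overline{Q}^m_t$. Since $\widehat{V}^m_t \ge \widehat{Q}^m_t(1) \ge \overline{Q}^m_t$, we will use the sequence $\overline{Q}^m_t$ to establish the required lower bound on $\widehat{V}^m_t$. We claim that for all time instants $t$ and all agents $m$,
$$
\mathbb{E}[\overline{Q}^m_t] = \frac{1}{1-\gamma p}. \tag{118}
$$
Assuming (118) holds, we have
$$
\mathbb{E}\big[(\widehat{V}^m_t)^2\big] \ge \big(\mathbb{E}[\widehat{V}^m_t]\big)^2 \ge \big(\mathbb{E}[\overline{Q}^m_t]\big)^2 \ge \Big(\frac{1}{1-\gamma p}\Big)^2 \ge \frac{1}{2(1-\gamma)^2},
$$
as required. In the above expression, the first inequality follows from Jensen's inequality, the second from the relation $\widehat{V}^m_t \ge \overline{Q}^m_t \ge 0$ and the third from (118). We now move on to prove the claim (118) using induction. For the base case, $\mathbb{E}[\overline{Q}^m_0] = \frac{1}{1-\gamma p}$ holds by choice of initialization. Assume that $\mathbb{E}[\overline{Q}^m_{t-1}] = \frac{1}{1-\gamma p}$ holds for some $t-1$ for all $m$.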
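• If $t$ is not an averaging instant, then for any client $m$,
$$
\begin{aligned}
\overline{Q}^m_t &= (1-\eta_t)\overline{Q}^m_{t-1} + \eta_t\big(1 + \gamma \widehat{P}^m_t(1|1,1)\overline{Q}^m_{t-1}\big) \\
\implies \mathbb{E}[\overline{Q}^m_t] &= (1-\eta_t)\mathbb{E}[\overline{Q}^m_{t-1}] + \eta_t\big(1 + \gamma \mathbb{E}[\widehat{P}^m_t(1|1,1)\overline{Q}^m_{t-1}]\big) \\
&= (1-\eta_t)\mathbb{E}[\overline{Q}^m_{t-1}] + \eta_t\big(1 + \gamma p\, \mathbb{E}[\overline{Q}^m_{t-1}]\big) \\
&= \frac{(1-\eta_t)}{1-\gamma p} + \eta_t\Big(1 + \frac{\gamma p}{1-\gamma p}\Big) = \frac{1}{1-\gamma p}.
\end{aligned} \tag{119}
$$
The third line follows from the independence of $\widehat{P}^m_t(1|1,1)$ and $\overline{Q}^m_{t-1}$ and the fourth line uses the inductive hypothesis.
• If $t$ is an averaging instant, then for all clients $m$,
$$
\begin{aligned}
\overline{Q}^m_t &= \frac{(1-\eta_t)}{M} \sum_{j=1}^{M} \overline{Q}^j_{t-1} + \eta_t \frac{1}{M} \sum_{j=1}^{M} \big(1 + \gamma \widehat{P}^j_t(1|1,1)\overline{Q}^j_{t-1}\big) \\
\implies \mathbb{E}[\overline{Q}^m_t] &= \frac{(1-\eta_t)}{M} \sum_{j=1}^{M} \mathbb{E}[\overline{Q}^j_{t-1}] + \eta_t \frac{1}{M} \sum_{j=1}^{M} \big(1 + \gamma \mathbb{E}[\widehat{P}^j_t(1|1,1)\overline{Q}^j_{t-1}]\big) \\
&= \frac{(1-\eta_t)}{M} \sum_{j=1}^{M} \frac{1}{1-\gamma p} + \eta_t \frac{1}{M} \sum_{j=1}^{M} \Big(1 + \frac{\gamma p}{1-\gamma p}\Big) = \frac{1}{1-\gamma p},
\end{aligned} \tag{120}
$$
where we again make use of independence and the inductive hypothesis.
Thus, (119) and (120) taken together complete the inductive step.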
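The invariance of the mean in (119) and (120) is a one-line algebraic identity, which the following sketch checks with exact rational arithmetic. It tracks only the expectation of the auxiliary single-action sequence (averaging across agents does not change a common mean), and it assumes $p = (4\gamma-1)/(3\gamma)$ together with an arbitrary illustrative step-size schedule.

```python
from fractions import Fraction

def expected_aux_path(gamma: Fraction, etas):
    """Propagate the mean through mean <- (1 - eta) * mean + eta * (1 + gamma * p * mean)."""
    p = (4 * gamma - 1) / (3 * gamma)
    target = 1 / (1 - gamma * p)          # claimed invariant value 1 / (1 - gamma * p)
    mean = target                          # initialization at Q*(1, 1)
    path = [mean]
    for eta in etas:
        mean = (1 - eta) * mean + eta * (1 + gamma * p * mean)
        path.append(mean)
    return target, path

gamma = Fraction(9, 10)
etas = [Fraction(1, k + 1) for k in range(1, 30)]   # illustrative schedule, not from the paper
target, path = expected_aux_path(gamma, etas)

second_moment_floor = target ** 2                    # (E[Q])^2 = 9 / (16 (1 - gamma)^2)
lemma6_bound = 1 / (2 * (1 - gamma) ** 2)
```

Since $1/(1-\gamma p) = 3/(4(1-\gamma))$, the squared mean equals $9/(16(1-\gamma)^2)$, which exceeds the bound $1/(2(1-\gamma)^2)$ of Lemma 6 for every $\gamma$.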

Section: C Analysis of Fed-DVR-Q
In this section, we prove Theorem 2, which outlines the performance guarantees of Fed-DVR-Q. There are two main parts of the proof. The first part deals with establishing that, for the given choice of parameters described in Section 4.1.3, the output of the algorithm is an ε-optimal estimate of $Q^\star$ with probability $1 - \delta$. The second part deals with deriving the bounds on the sample and communication complexity based on the choice of prescribed parameters. We begin with the second part, which is the easier of the two.

Section: C.1 Establishing the sample and communication complexity bounds
Establishing the communication complexity. We begin with bounding $\mathrm{CC}_{\mathrm{round}}$. From the description of Fed-DVR-Q, it is straightforward to note that each epoch, i.e., each call to the REFINEESTIMATE routine, involves $I + 1$ rounds of communication, one for estimating $\mathcal{T}Q$ and the remaining ones during the iterative updates of the Q-function. Since there are a total of $K$ epochs,
$$
\mathrm{CC}_{\mathrm{round}}(\text{Fed-DVR-Q}; \varepsilon, M, \delta) \le (I+1)K \le \frac{16}{\eta(1-\gamma)} \log_2\Big(\frac{1}{(1-\gamma)\varepsilon}\Big),
$$
where the second bound follows from the prescribed choice of parameters in Sec. 4.1.3. Similarly, since the quantization step is designed to compress each coordinate into $J$ bits, each message transmitted by an agent has a size of no more than $J \cdot |\mathcal{S}||\mathcal{A}|$ bits. Consequently,
$$
\mathrm{CC}_{\mathrm{bit}}(\text{Fed-DVR-Q}; \varepsilon, M, \delta) \le J \cdot |\mathcal{S}||\mathcal{A}| \cdot \mathrm{CC}_{\mathrm{round}}(\text{Fed-DVR-Q}; \varepsilon, M, \delta) \le \frac{32|\mathcal{S}||\mathcal{A}|}{\eta(1-\gamma)} \log_2\Big(\frac{1}{(1-\gamma)\varepsilon}\Big) \log_2\Bigg(\frac{70}{\eta(1-\gamma)} \sqrt{\frac{4}{M} \log\Big(\frac{8KI|\mathcal{S}||\mathcal{A}|}{\delta}\Big)}\Bigg),
$$
where once again in the second step we plugged in the choice of $J$ from Sec. 4.1.3.
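To get a feel for the magnitudes involved, the round bound above can be evaluated directly. The helper below is a hypothetical convenience, not part of Fed-DVR-Q: it only evaluates the displayed upper bound on $\mathrm{CC}_{\mathrm{round}}$ and multiplies it by $J \cdot |\mathcal{S}||\mathcal{A}|$, with the number of bits per coordinate $J$ passed in as an input rather than derived from Sec. 4.1.3.

```python
import math

def cc_round_bound(eta: float, gamma: float, eps: float) -> float:
    """Upper bound 16 / (eta * (1 - gamma)) * log2(1 / ((1 - gamma) * eps)) on rounds."""
    return 16.0 / (eta * (1.0 - gamma)) * math.log2(1.0 / ((1.0 - gamma) * eps))

def cc_bit_bound(eta: float, gamma: float, eps: float,
                 num_states: int, num_actions: int, bits_per_coord: int) -> float:
    """Each message carries at most J * |S||A| bits, sent once per round."""
    return bits_per_coord * num_states * num_actions * cc_round_bound(eta, gamma, eps)

rounds = cc_round_bound(eta=0.5, gamma=0.9, eps=0.01)
bits = cc_bit_bound(eta=0.5, gamma=0.9, eps=0.01,
                    num_states=20, num_actions=5, bits_per_coord=16)
```

For these illustrative values the round bound is about $3.2 \times 10^3$; it scales as $1/(1-\gamma)$ up to the logarithmic factor and does not depend on the number of agents $M$.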
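Lemma 7. Let $\delta \in (0, 1)$. Consider the REFINEESTIMATE routine described in Algorithm 3 and let $Q^\star_{\mathcal{H}}$ denote the fixed point of the operator $\mathcal{H}$ defined in (122) for some fixed $Q$. Then the iterate $Q_I$ generated by REFINEESTIMATE satisfies
$$
\|Q_I - Q^\star_{\mathcal{H}}\|_\infty \le \frac{1}{6}\big(\|Q - Q^\star\|_\infty + \|Q^\star - Q^\star_{\mathcal{H}}\|_\infty\big) + \frac{D}{70}
$$
with probability $1 - \frac{\delta}{2K}$.
Lemma 8. Consider the REFINEESTIMATE routine described in Alg. 3 and let $Q^\star_{\mathcal{H}}$ denote the fixed point of the operator $\mathcal{H}$ defined in Eqn. (122) for a fixed $Q$. The following relation holds with probability $1 - \frac{\delta}{2K}$:
$$
\|Q^\star_{\mathcal{H}} - Q^\star\|_\infty \le \|Q - Q^\star\|_\infty \cdot \sqrt{\frac{16\kappa'}{L(1-\gamma)^2}} + \sqrt{\frac{64\kappa'}{L(1-\gamma)^3}} + \frac{2\kappa'\sqrt{2}}{3L(1-\gamma)^2} + \frac{D}{70},
$$
whenever $L \ge 32\kappa'$, where $\kappa' = \log\big(\frac{12K|\mathcal{S}||\mathcal{A}|}{\delta}\big)$.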
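Step 2: establishing the linear contraction. We now leverage the above lemmas to establish the desired contraction in (121). Instantiating the operator (122) at each $k$-th epoch by setting $Q := Q^{(k-1)}$ and $L := L_k$, we define
$$
\mathcal{H}_k(Q) := \mathcal{T}(Q) - \mathcal{T}(Q^{(k-1)}) + \mathcal{T}_{L_k}(Q^{(k-1)}), \tag{124}
$$
whose fixed point is denoted as $Q^\star_{\mathcal{H}_k}$. Using the results from Lemmas 7 and 8 with $D := D_k$ and $\mathcal{H} = \mathcal{H}_k$, we obtain that
$$
\begin{aligned}
\|Q^{(k)} - Q^\star\|_\infty &\le \|Q^{(k)} - Q^\star_{\mathcal{H}_k}\|_\infty + \|Q^\star_{\mathcal{H}_k} - Q^\star\|_\infty \\
&\le \frac{1}{6}\big(\|Q^{(k-1)} - Q^\star\|_\infty + \|Q^\star - Q^\star_{\mathcal{H}_k}\|_\infty\big) + \frac{D_k}{70} + \|Q^\star_{\mathcal{H}_k} - Q^\star\|_\infty \\
&= \frac{1}{6}\big(\|Q^{(k-1)} - Q^\star\|_\infty + 7\|Q^\star - Q^\star_{\mathcal{H}_k}\|_\infty\big) + \frac{D_k}{70} \\
&\le \|Q^{(k-1)} - Q^\star\|_\infty \Bigg(\frac{1}{6} + \frac{7}{6}\sqrt{\frac{16\kappa'}{L_k(1-\gamma)^2}}\Bigg) + \frac{7}{6}\Bigg(\sqrt{\frac{64\kappa'}{L_k(1-\gamma)^3}} + \frac{2\sqrt{2}\kappa'}{3L_k(1-\gamma)^2}\Bigg) + \frac{13D_k}{420} \\
&\le \|Q^{(k-1)} - Q^\star\|_\infty \Bigg(\frac{1}{6} + \frac{7}{6}\sqrt{\frac{16\kappa'}{L_k(1-\gamma)^2}}\Bigg) + \frac{7}{6}\sqrt{\frac{100\kappa'}{L_k(1-\gamma)^3}} + \frac{13D_k}{420}
\end{aligned} \tag{125}
$$
holds with probability $1 - \frac{\delta}{K}$. Here, we invoke Lemma 7 in the second step and Lemma 8 in the fourth step corresponding to the REFINEESTIMATE routine during the $k$-th epoch. In the last step, we used the fact that $\frac{L_k(1-\gamma)^2}{\kappa'} \ge 1$.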
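We now use induction along with the recursive relation in (125) to establish the required claim (121). Let us first consider the case $0 \le k \le K_0$. The base case, $\|Q^{(0)} - Q^\star\|_\infty \le \frac{1}{1-\gamma}$, holds by definition. Let us assume the relation holds for $k-1$. Then, from (125) and the choice of $L_k$ (Sec. 4.1.3), we have
$$
\begin{aligned}
\|Q^{(k)} - Q^\star\|_\infty &\le \|Q^{(k-1)} - Q^\star\|_\infty \Bigg(\frac{1}{6} + \frac{7}{6}\sqrt{\frac{16\kappa'}{L_k(1-\gamma)^2}}\Bigg) + \frac{7}{6}\sqrt{\frac{100\kappa'}{L_k(1-\gamma)^3}} + \frac{13D_k}{420} \\
&\le \frac{2^{-(k-1)}}{1-\gamma}\Bigg(\frac{1}{6} + 2^{-k} \cdot \frac{7}{6}\sqrt{\frac{8}{19600}}\Bigg) + 2^{-k} \cdot \frac{7}{6}\sqrt{\frac{50}{19600(1-\gamma)}} + \frac{104}{420} \cdot \frac{2^{-(k-1)}}{1-\gamma} \\
&\le \frac{2^{-(k-1)}}{1-\gamma}\Bigg(\frac{1}{6} + \frac{7}{6}\sqrt{\frac{91}{39200}} + \frac{1}{4}\Bigg) \le \frac{2^{-k}}{1-\gamma}.
\end{aligned} \tag{126}
$$
Now we move to the second case, $k > K_0$. From (125) and the choice of $L_k$ (Sec. 4.1.3), we have
$$
\begin{aligned}
\|Q^{(k)} - Q^\star\|_\infty &\le \|Q^{(k-1)} - Q^\star\|_\infty \Bigg(\frac{1}{6} + \frac{7}{6}\sqrt{\frac{16\kappa'}{L_k(1-\gamma)^2}}\Bigg) + \frac{7}{6}\sqrt{\frac{100\kappa'}{L_k(1-\gamma)^3}} + \frac{13D_k}{420} \\
&\le \frac{2^{-(k-1)}}{1-\gamma}\Bigg(\frac{1}{6} + 2^{-(k-K_0)} \cdot \frac{7}{6}\sqrt{\frac{8}{19600}}\Bigg) + 2^{-(k-K_0)} \cdot \frac{7}{6}\sqrt{\frac{50}{19600(1-\gamma)}} + \frac{104}{420} \cdot \frac{2^{-(k-1)}}{1-\gamma} \\
&\le \frac{2^{-(k-1)}}{1-\gamma}\Bigg(\frac{1}{6} + \frac{7}{6}\sqrt{\frac{1}{196}} + \frac{1}{4}\Bigg) \le \frac{2^{-k}}{1-\gamma}.
\end{aligned} \tag{127}
$$
By a union bound argument, we can conclude that the relation $\|Q^{(k)} - Q^\star\|_\infty \le \frac{2^{-k}}{1-\gamma}$ holds for all $k \le K$ with probability at least $1 - \delta$.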
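The numerical constants in the last lines of (126) and (127) can be verified directly. The sketch below assumes that the bracketed terms read $\frac{1}{6} + \frac{7}{6}\sqrt{\cdot} + \frac{1}{4}$, i.e., that the middle term carries a square root (an assumption about the typeset form), and that $D_k = 8 \cdot 2^{-(k-1)}/(1-\gamma)$ as in the compressor bound of Step 3. The variable names are illustrative.

```python
import math
from fractions import Fraction

# Last line of (127): sqrt(1/196) = 1/14 exactly, so the bracket can be computed exactly.
late_epochs = Fraction(1, 6) + Fraction(7, 6) * Fraction(1, 14) + Fraction(1, 4)

# Last line of (126): sqrt(91/39200) is irrational, so use floating point.
early_epochs = 1.0 / 6.0 + (7.0 / 6.0) * math.sqrt(91.0 / 39200.0) + 1.0 / 4.0

# Quantization term: 13 * D_k / 420 with D_k = 8 * 2^{-(k-1)} / (1 - gamma) gives 104/420.
quantization_share = Fraction(13 * 8, 420)
```

Under this reading the bracket in (127) equals exactly $1/2$, which is what turns the factor $2^{-(k-1)}$ into $2^{-k}$, while the bracket in (126) is strictly smaller than $1/2$.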
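Step 3: confirm the compressor bound. The only thing left to verify is that the inputs to the compressor are always bounded by $D_k$ during the $k$-th epoch, for all $1 \le k \le K$. The following lemma provides a bound on the input to the compressor during any run of the REFINEESTIMATE routine.
Lemma 9. Consider the REFINEESTIMATE routine described in Algorithm 3 for some fixed $Q$. For all $i \le I$ and all agents $m$, the following bound holds with probability $1 - \frac{\delta}{2K}$:
$$
\|Q^m_{i-\frac{1}{2}} - Q_{i-1}\|_\infty \le \eta\|Q - Q^\star_{\mathcal{H}}\|_\infty \Big(\frac{7}{6} \cdot (1+\gamma) + 2\gamma\Big) + \frac{\eta D(1+\gamma)}{70}.
$$
For the $k$-th epoch, it follows that
$$
\begin{aligned}
\eta\|Q^{(k-1)} - Q^\star_{\mathcal{H}_k}\|_\infty \Big(\frac{7}{6} \cdot (1+\gamma) + 2\gamma\Big) + \frac{\eta D_k(1+\gamma)}{70} &\le \frac{13}{3}\big(\|Q^{(k-1)} - Q^\star\|_\infty + \|Q^\star - Q^\star_{\mathcal{H}_k}\|_\infty\big) + \frac{D_k(1+\gamma)}{70} \\
&\le \frac{13}{3} \cdot \frac{15}{14} \cdot \|Q^{(k-1)} - Q^\star\|_\infty + \frac{2D_k}{70} \\
&\le \Big(\frac{195}{42} + \frac{16}{70}\Big) \cdot \frac{2^{-(k-1)}}{1-\gamma} \le 8 \cdot \frac{2^{-(k-1)}}{1-\gamma} =: D_k.
\end{aligned}
$$
In the third step, we used the same sequence of arguments as used in (126) and (127) and, in the fourth step, we used the bound on $\|Q^{(k-1)} - Q^\star\|_\infty$ established in Step 2.

Proof of Lemma 7. Let us begin with analyzing the evolution of the sequence $\{Q_i\}_{i=1}^{I}$ during a run of the REFINEESTIMATE routine. The sequence $\{Q_i\}_{i=1}^{I}$ satisfies the following recursion:
$$
\begin{aligned}
Q_i &= Q_{i-1} + \frac{1}{M} \sum_{m=1}^{M} \mathscr{C}\big(Q^m_{i-\frac{1}{2}} - Q_{i-1}; D, J\big) = Q_{i-1} + \frac{1}{M} \sum_{m=1}^{M} \big(Q^m_{i-\frac{1}{2}} - Q_{i-1} + \zeta^m_i\big) \\
&= \frac{1}{M} \sum_{m=1}^{M} \big(Q^m_{i-\frac{1}{2}} + \zeta^m_i\big) = (1-\eta)Q_{i-1} + \frac{\eta}{M} \sum_{m=1}^{M} \mathcal{H}^{(m)}_i(Q_{i-1}) + \underbrace{\frac{1}{M} \sum_{m=1}^{M} \zeta^m_i}_{=:\ \zeta_i}.
\end{aligned} \tag{128}
$$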
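In the above expression, $\zeta^m_i$ denotes the quantization noise introduced at agent $m$ in the $i$-th update. Subtracting $Q^\star_{\mathcal{H}}$ from both sides of (128), we obtain
$$
\begin{aligned}
Q_i - Q^\star_{\mathcal{H}} &= (1-\eta)(Q_{i-1} - Q^\star_{\mathcal{H}}) + \frac{\eta}{M} \sum_{m=1}^{M} \big(\mathcal{H}^{(m)}_i(Q_{i-1}) - Q^\star_{\mathcal{H}}\big) + \zeta_i \\
&= (1-\eta)(Q_{i-1} - Q^\star_{\mathcal{H}}) + \frac{\eta}{M} \sum_{m=1}^{M} \big(\mathcal{H}^{(m)}_i(Q_{i-1}) - \mathcal{H}^{(m)}_i(Q^\star_{\mathcal{H}})\big) + \frac{\eta}{M} \sum_{m=1}^{M} \big(\mathcal{H}^{(m)}_i(Q^\star_{\mathcal{H}}) - \mathcal{H}(Q^\star_{\mathcal{H}})\big) + \zeta_i.
\end{aligned} \tag{129}
$$
Consequently,
$$
\|Q_i - Q^\star_{\mathcal{H}}\|_\infty \le (1-\eta)\|Q_{i-1} - Q^\star_{\mathcal{H}}\|_\infty + \frac{\eta}{M} \sum_{m=1}^{M} \big\|\mathcal{H}^{(m)}_i(Q_{i-1}) - \mathcal{H}^{(m)}_i(Q^\star_{\mathcal{H}})\big\|_\infty + \eta \Big\|\frac{1}{M} \sum_{m=1}^{M} \mathcal{H}^{(m)}_i(Q^\star_{\mathcal{H}}) - \mathcal{H}(Q^\star_{\mathcal{H}})\Big\|_\infty + \|\zeta_i\|_\infty, \tag{130}
$$
where we proceed to bound each term separately.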
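• Regarding the second term, it follows that
$$
\big\|\mathcal{H}^{(m)}_i(Q) - \mathcal{H}^{(m)}_i(Q^\star_{\mathcal{H}})\big\|_\infty = \big\|\mathcal{T}^{(m)}_i(Q) - \mathcal{T}^{(m)}_i(Q^\star_{\mathcal{H}})\big\|_\infty \le \gamma\|Q - Q^\star_{\mathcal{H}}\|_\infty, \tag{131}
$$
which holds for all $Q$ since $\mathcal{T}^{(m)}_i$ is a γ-contractive operator.
• Regarding the third term, notice that
$$
\frac{1}{M} \sum_{m=1}^{M} \mathcal{H}^{(m)}_i(Q^\star_{\mathcal{H}}) - \mathcal{H}(Q^\star_{\mathcal{H}}) = \frac{1}{MB} \sum_{m=1}^{M} \sum_{z \in \mathcal{Z}^{(m)}_i} \big(\mathcal{T}_z(Q^\star_{\mathcal{H}}) - \mathcal{T}_z(Q) - \mathcal{T}(Q^\star_{\mathcal{H}}) + \mathcal{T}(Q)\big).
$$
Note that $\mathcal{T}_z(Q^\star_{\mathcal{H}}) - \mathcal{T}_z(Q) - \mathcal{T}(Q^\star_{\mathcal{H}}) + \mathcal{T}(Q)$ is a zero-mean random vector satisfying
$$
\big\|\mathcal{T}_z(Q^\star_{\mathcal{H}}) - \mathcal{T}_z(Q) - \mathcal{T}(Q^\star_{\mathcal{H}}) + \mathcal{T}(Q)\big\|_\infty \le 2\gamma\|Q - Q^\star_{\mathcal{H}}\|_\infty. \tag{132}
$$
Thus, each of its coordinates is a $(2\gamma\|Q - Q^\star_{\mathcal{H}}\|_\infty)^2$-sub-Gaussian random variable. Applying the tail bounds for a maximum of sub-Gaussian random variables [Vershynin, 2018], we obtain that
$$
\Big\|\frac{1}{M} \sum_{m=1}^{M} \mathcal{H}^{(m)}_i(Q^\star_{\mathcal{H}}) - \mathcal{H}(Q^\star_{\mathcal{H}})\Big\|_\infty \le 2\gamma\|Q - Q^\star_{\mathcal{H}}\|_\infty \cdot \sqrt{\frac{2}{MB} \log\Big(\frac{8KI|\mathcal{S}||\mathcal{A}|}{\delta}\Big)} \tag{133}
$$
holds with probability at least $1 - \frac{\delta}{4KI}$.
• Turning to the last term, by the construction of the compression routine described in Section 4.1.2, it is straightforward to note that $\zeta^m_i$ is a zero-mean random vector whose coordinates are independent, $D^2 \cdot 4^{-J}$-sub-Gaussian random variables. Thus, $\zeta_i$ is also a zero-mean random vector whose coordinates are independent, $\frac{D^2}{M \cdot 4^J}$-sub-Gaussian random variables. Hence, we can similarly conclude that
$$
\|\zeta_i\|_\infty \le D \cdot 2^{-J} \cdot \sqrt{\frac{2}{M} \log\Big(\frac{8KI|\mathcal{S}||\mathcal{A}|}{\delta}\Big)} \tag{134}
$$
holds with probability at least $1 - \frac{\delta}{4KI}$.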
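Combining the above bounds into (130), and introducing the short-hand notation $\kappa := \log\big(\frac{8KI|\mathcal{S}||\mathcal{A}|}{\delta}\big)$, we obtain with probability at least $1 - \frac{\delta}{2KI}$,
$$
\|Q_i - Q^\star_{\mathcal{H}}\|_\infty \le (1 - \eta(1-\gamma))\|Q_{i-1} - Q^\star_{\mathcal{H}}\|_\infty + 2\eta\gamma\|Q - Q^\star_{\mathcal{H}}\|_\infty \cdot \sqrt{\frac{2\kappa}{MB}} + D \cdot 2^{-J} \cdot \sqrt{\frac{2\kappa}{M}}.
$$
Unrolling the above recursion over $i = 1, \dots, I$ yields the following relation, which holds with probability at least $1 - \frac{\delta}{2K}$:
$$
\begin{aligned}
\|Q_I - Q^\star_{\mathcal{H}}\|_\infty &\le (1 - \eta(1-\gamma))^I \|Q_0 - Q^\star_{\mathcal{H}}\|_\infty + \sqrt{\frac{2\kappa}{M}} \Big(\frac{2\eta\gamma}{\sqrt{B}} \|Q - Q^\star_{\mathcal{H}}\|_\infty + D \cdot 2^{-J}\Big) \cdot \sum_{i=1}^{I} (1 - \eta(1-\gamma))^{I-i} \\
&\le (1 - \eta(1-\gamma))^I \|Q - Q^\star_{\mathcal{H}}\|_\infty + \frac{1}{\eta(1-\gamma)} \sqrt{\frac{2\kappa}{M}} \Big(\frac{2\eta\gamma}{\sqrt{B}} \|Q - Q^\star_{\mathcal{H}}\|_\infty + D \cdot 2^{-J}\Big) \\
&\le \|Q - Q^\star_{\mathcal{H}}\|_\infty \Bigg((1 - \eta(1-\gamma))^I + \frac{2\gamma}{(1-\gamma)} \sqrt{\frac{2\kappa}{MB}}\Bigg) + \frac{D \cdot 2^{-J}}{\eta(1-\gamma)} \cdot \sqrt{\frac{2\kappa}{M}}
\end{aligned} \tag{135}
$$
$$
\le \frac{\|Q - Q^\star_{\mathcal{H}}\|_\infty}{6} + \frac{D}{70} \le \frac{1}{6}\big(\|Q - Q^\star\|_\infty + \|Q^\star - Q^\star_{\mathcal{H}}\|_\infty\big) + \frac{D}{70}. \tag{136}
$$
Here, the fourth step is obtained by plugging in the prescribed values of $B$, $I$ and $J$ in Sec. 4.1.3.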
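The unrolling step behind (135) is the standard bound for a contractive recursion with a constant additive error: if $a_i \le \rho\, a_{i-1} + c$ with $\rho = 1 - \eta(1-\gamma)$, then $a_I \le \rho^I a_0 + c/(\eta(1-\gamma))$. The sketch below checks this numerically in the worst case where the recursion holds with equality; the numerical values and names are illustrative and not tied to the parameter choices of Sec. 4.1.3.

```python
def unroll(rho: float, c: float, a0: float, steps: int) -> float:
    """Iterate a_i = rho * a_{i-1} + c, the worst case of the one-step bound."""
    a = a0
    for _ in range(steps):
        a = rho * a + c
    return a

eta, gamma = 0.5, 0.9
rho = 1.0 - eta * (1.0 - gamma)       # per-iteration contraction factor
c, a0, steps = 0.01, 10.0, 400        # additive noise level, initial error, iterations I

a_final = unroll(rho, c, a0, steps)
# Geometric series: sum_{i=1}^{I} rho^(I-i) <= 1 / (1 - rho) = 1 / (eta * (1 - gamma)).
closed_form_bound = rho ** steps * a0 + c / (eta * (1.0 - gamma))
```

The first term decays geometrically in $I$, while the second term is a noise floor of size $c/(\eta(1-\gamma))$; the prescribed $B$ and $J$ are what make this floor small in (136).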

Section: C.3.2 Proof of Lemma 8
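Intuitively, the error $\|Q^\star_{\mathcal{H}} - Q^\star\|_\infty$ depends on the error term $\mathcal{T}_L(Q) - \mathcal{T}(Q)$. If the latter is small, then $\mathcal{H}(Q)$ is close to $\mathcal{T}(Q)$ and consequently so are $Q^\star_{\mathcal{H}}$ and $Q^\star$. Thus, we begin with bounding the term $\mathcal{T}_L(Q) - \mathcal{T}(Q)$. We have
$$
\begin{aligned}
\mathcal{T}_L(Q) - \mathcal{T}(Q) &= Q + \frac{1}{M} \sum_{m=1}^{M} \mathscr{C}\big(\mathcal{T}^{(m)}_L(Q) - Q\big) - \mathcal{T}(Q) = \frac{1}{M} \sum_{m=1}^{M} \big(\mathcal{T}^{(m)}_L(Q) + \zeta^{(m)}_L\big) - \mathcal{T}(Q) \\
&= \frac{1}{M} \sum_{m=1}^{M} \big(\mathcal{T}^{(m)}_L(Q) - \mathcal{T}^{(m)}_L(Q^\star) - \mathcal{T}(Q) + \mathcal{T}(Q^\star)\big) + \frac{1}{M} \sum_{m=1}^{M} \zeta^{(m)}_L + \frac{1}{M} \sum_{m=1}^{M} \big(\mathcal{T}^{(m)}_L(Q^\star) - \mathcal{T}(Q^\star)\big),
\end{aligned} \tag{137}
$$
where once again $\zeta^{(m)}_L := \mathscr{C}\big(\mathcal{T}^{(m)}_L(Q) - Q\big) - \big(\mathcal{T}^{(m)}_L(Q) - Q\big)$ denotes the quantization error at agent $m$. Similar to the arguments of (133) and (134), we can conclude that each of the following relations holds with probability at least $1 - \frac{\delta}{6K}$:
$$
\Big\|\frac{1}{M} \sum_{m=1}^{M} \big(\mathcal{T}^{(m)}_L(Q) - \mathcal{T}^{(m)}_L(Q^\star) - \mathcal{T}(Q) + \mathcal{T}(Q^\star)\big)\Big\|_\infty \le 2\gamma\|Q - Q^\star\|_\infty \cdot \sqrt{\frac{2}{L} \log\Big(\frac{12K|\mathcal{S}||\mathcal{A}|}{\delta}\Big)}, \tag{138}
$$
$$
\Big\|\frac{1}{M} \sum_{m=1}^{M} \zeta^{(m)}_L\Big\|_\infty \le D \cdot 2^{-J} \cdot \sqrt{\frac{2}{M} \log\Big(\frac{12K|\mathcal{S}||\mathcal{A}|}{\delta}\Big)}. \tag{139}
$$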
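For the third term, we can rewrite it as
$$
\frac{1}{M} \sum_{m=1}^{M} \big(\mathcal{T}^{(m)}_L(Q^\star) - \mathcal{T}(Q^\star)\big) = \frac{1}{M\lceil L/M \rceil} \sum_{m=1}^{M} \sum_{l=1}^{\lceil L/M \rceil} \big(\mathcal{T}_{Z^{(m)}_l}(Q^\star) - \mathcal{T}(Q^\star)\big).
$$
We will use the Bernstein inequality element-wise to bound the above term. Let $\sigma^\star \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$ be such that $[\sigma^\star(s,a)]^2 = \mathrm{Var}(\mathcal{T}_Z(Q^\star)(s,a))$, i.e., the $(s,a)$-th element of $\sigma^\star$ denotes the standard deviation of the random variable $\mathcal{T}_Z(Q^\star)(s,a)$. Since $\|\mathcal{T}_Z(Q^\star) - \mathcal{T}(Q^\star)\|_\infty \le \frac{1}{1-\gamma}$ a.s., the Bernstein inequality gives us that
$$
\Big|\frac{1}{M} \sum_{m=1}^{M} \mathcal{T}^{(m)}_L(Q^\star)(s,a) - \mathcal{T}(Q^\star)(s,a)\Big| \le \sigma^\star(s,a) \sqrt{\frac{2}{L} \log\Big(\frac{6K|\mathcal{S}||\mathcal{A}|}{\delta}\Big)} + \frac{2}{3L(1-\gamma)} \log\Big(\frac{6K|\mathcal{S}||\mathcal{A}|}{\delta}\Big) \tag{140}
$$
holds simultaneously for all $(s,a) \in \mathcal{S} \times \mathcal{A}$ with probability at least $1 - \frac{\delta}{6K}$. On combining (137), (138), (139) and (140), we obtain that
$$
\big|\mathcal{T}_L(Q)(s,a) - \mathcal{T}(Q)(s,a)\big| \le \|Q - Q^\star\|_\infty \cdot \sqrt{\frac{8\kappa'}{L}} + \sigma^\star(s,a) \sqrt{\frac{2\kappa'}{L}} + \frac{2\kappa'}{3L(1-\gamma)} + D \cdot 2^{-J} \cdot \sqrt{\frac{2\kappa'}{M}} \tag{141}
$$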
holds simultaneously for all $(s,a) \in S \times A$ with probability at least $1 - \frac{\delta}{2K}$, where $\kappa' = \log\left(\frac{12K|S||A|}{\delta}\right)$. We use the bound in (141) to obtain a bound on $\|Q^\star_H - Q^\star\|_\infty$ using the following lemma.
Lemma 10 (Wainwright [2019b]). Let $\pi^\star$ and $\pi^\star_H$ respectively denote the optimal policies w.r.t. $Q^\star$ and $Q^\star_H$. Then,
$$\|Q^\star_H - Q^\star\|_\infty \le \max\left\{\left\|(I - \gamma P^{\pi^\star})^{-1}\left|T_L(Q) - T(Q)\right|\right\|_\infty,\ \left\|(I - \gamma P^{\pi^\star_H})^{-1}\left|T_L(Q) - T(Q)\right|\right\|_\infty\right\}.$$
Here, for any deterministic policy $\pi$, $P^\pi \in \mathbb{R}^{|S||A|\times|S||A|}$ is given by $(P^\pi Q)(s,a) = \sum_{s' \in S} P(s'|s,a)\,Q(s', \pi(s'))$. Furthermore, it was shown in Wainwright [2019b, Proof of Lemma 4] that if the error $|T_L(Q)(s,a) - T(Q)(s,a)|$ satisfies
$$\left|T_L(Q)(s,a) - T(Q)(s,a)\right| \le z_0\|Q - Q^\star\|_\infty + z_1\sigma^\star(s,a) + z_2 \tag{142}$$
for some $z_0, z_1, z_2 \ge 0$ with $z_1 < 1$, then the bound in Lemma 10 can be simplified to
$$\|Q^\star_H - Q^\star\|_\infty \le \frac{1}{1 - z_1}\left(\frac{z_0}{1-\gamma}\|Q - Q^\star\|_\infty + \frac{z_1}{(1-\gamma)^{3/2}} + \frac{z_2}{1-\gamma}\right). \tag{143}$$
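On comparing (141) with (142), we obtain
$$z_0 \equiv \sqrt{\frac{8\kappa'}{L}}; \qquad z_1 \equiv \sqrt{\frac{2\kappa'}{L}}; \qquad z_2 \equiv \frac{2\kappa'}{3L(1-\gamma)} + D \cdot 2^{-J} \cdot \sqrt{\frac{2\kappa'}{M}}.$$
Moreover, the condition $L \ge 32\kappa'$ implies that $z_1 \le \sqrt{2/32} = \frac{1}{4} < 1$ and hence $\frac{1}{1 - z_1} \le \frac{4}{3} \le \sqrt{2}$. Thus, on plugging in the above values in (143), we can conclude that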
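\begin{align}
\|Q^\star_H - Q^\star\|_\infty &\le \|Q - Q^\star\|_\infty \cdot \sqrt{\frac{16\kappa'}{L(1-\gamma)^2}} + \sqrt{\frac{64\kappa'}{L(1-\gamma)^3}} + \frac{2\kappa'\sqrt{2}}{3L(1-\gamma)^2} + \frac{D \cdot 2^{-J}}{1-\gamma} \cdot \sqrt{\frac{4\kappa'}{M}} \nonumber\\
&\le \|Q - Q^\star\|_\infty \cdot \sqrt{\frac{8\kappa'}{L(1-\gamma)^2}} + \sqrt{\frac{32\kappa'}{L(1-\gamma)^3}} + \frac{2\sqrt{2}\kappa'}{3L(1-\gamma)^2} + \frac{D}{40}, \tag{144}
\end{align}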
where once again we use the value of J in the last step.
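The factors of $\frac{1}{1-\gamma}$ multiplying $z_0$ and $z_2$ in (143) reflect the standard fact that $(I - \gamma P)^{-1}$ has row sums equal to $\frac{1}{1-\gamma}$ whenever $P$ is row-stochastic. The following is a small numerical sketch of this fact, assuming a hypothetical $3 \times 3$ row-stochastic matrix in place of $P^\pi$ (which has size $|S||A| \times |S||A|$ in the text) and using a truncated Neumann series instead of a matrix inverse.

```python
def resolvent_times_ones(P, gamma, iters=2000):
    """Approximate (I - gamma * P)^{-1} applied to the all-ones vector via the
    Neumann series iteration v <- 1 + gamma * P v (valid for 0 <= gamma < 1)."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [1.0 + gamma * sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v


# Hypothetical row-stochastic matrix standing in for P^pi (an assumption).
P = [
    [0.8, 0.2, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.5, 0.5],
]
gamma = 0.9
row_sums = resolvent_times_ones(P, gamma)  # each entry approaches 1 / (1 - gamma)
```

The sharper $(1-\gamma)^{-3/2}$ dependence of the $z_1$ term comes from the variance argument in Wainwright [2019b] and is not captured by this sketch.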

Section: C.3.3 Proof of Lemma 9
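From the iterative update rule in (123), for any agent $m$ we have,
\begin{align*}
Q^m_{i-\frac{1}{2}} - Q_{i-1} &= \eta\left(H^{(m)}_{i-1}(Q_{i-1}) - Q_{i-1}\right) \\
&= \eta\left(H^{(m)}_{i-1}(Q_{i-1}) - H^{(m)}_{i-1}(Q^\star_H) + H^{(m)}_{i-1}(Q^\star_H) - H(Q^\star_H) + Q^\star_H - Q_{i-1}\right).
\end{align*}
Thus,
\begin{align*}
\|Q^m_{i-\frac{1}{2}} - Q_{i-1}\|_\infty &\le \eta\left(\|H^{(m)}_{i-1}(Q_{i-1}) - H^{(m)}_{i-1}(Q^\star_H)\|_\infty + \|H^{(m)}_{i-1}(Q^\star_H) - H(Q^\star_H)\|_\infty + \|Q^\star_H - Q_{i-1}\|_\infty\right) \\
&\le \eta\left(\gamma\|Q_{i-1} - Q^\star_H\|_\infty + 2\gamma\|Q - Q^\star_H\|_\infty + \|Q^\star_H - Q_{i-1}\|_\infty\right) \\
&= \eta\left((1+\gamma)\|Q_{i-1} - Q^\star_H\|_\infty + 2\gamma\|Q - Q^\star_H\|_\infty\right) \\
&\le \eta\|Q - Q^\star_H\|_\infty\left(\frac{7}{6}\cdot(1+\gamma) + 2\gamma\right) + \frac{\eta D(1+\gamma)}{70},
\end{align*}
which holds with probability at least $1 - \frac{\delta}{2KI}$. Here, the second inequality follows from (131) and (132). The last step in the above relation follows from (135) evaluated at a general value of $i$ and the prescribed value of $J$. By a union bound argument, the above relation holds for all $i$ with probability at least $1 - \frac{\delta}{2K}$.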

Section: D Numerical Experiments
In this section, we corroborate our theoretical results through simulations. For the simulations, we consider an MDP with three states and two actions, i.e., $S = \{0, 1, 2\}$ and $A = \{0, 1\}$. The discount parameter is set to $\gamma = 0.9$. The reward and transition kernel of the MDP are based on the hard instance constructed in Appendix B. Specifically, the reward and transition kernel of state 0 are given by the expression in Eqn. (14a). Similarly, the reward and transition kernels corresponding to states 1 and 2 are identical and given by Eqns. (14b) and (14c) with $p = 0.8$.
We perform three empirical studies. In the first study, we compare the proposed algorithm Fed-DVR-Q to the Fed-SynQ algorithm proposed in Woo et al. [2023]. We consider a Federated Q-learning setting with 5 agents. The parameters for both algorithms were set to the values suggested in the respective papers. Both algorithms were run with $10^7$ samples at each agent. For the communication cost of Fed-SynQ, we assume that each real number is expressed using 32 bits.
In Fig. 1a, we plot the error rate of each algorithm as a function of the number of samples used. In Fig. 1b, we plot the corresponding communication complexities. As evident from Fig. 1a, Fed-DVR-Q achieves a smaller error than Fed-SynQ under the same sample budget. Similarly, as suggested by Fig. 1b, Fed-DVR-Q also requires much less communication (measured in terms of the number of bits transmitted) than Fed-SynQ, demonstrating the effectiveness of the proposed approach and corroborating our theoretical results.
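To make the bit accounting concrete, the following is a minimal sketch of how a communication cost in bits could be tallied for an algorithm that transmits full Q-tables. It assumes, purely for illustration, that every agent uploads all $|S||A|$ entries once per communication round and that only uplink traffic is counted; the function name and this counting convention are our assumptions and are not taken from either paper.

```python
def total_uplink_bits(num_rounds, num_agents, num_states, num_actions, bits_per_entry):
    """Bits sent if every agent uploads a full Q-table in each round.

    Assumption (illustrative only): one table of num_states * num_actions
    entries per agent per round, each entry encoded with bits_per_entry bits.
    """
    entries_per_table = num_states * num_actions
    return num_rounds * num_agents * entries_per_table * bits_per_entry


# Setup of the first study: 3 states, 2 actions, 5 agents, 32 bits per real number.
bits_one_round = total_uplink_bits(
    num_rounds=1, num_agents=5, num_states=3, num_actions=2, bits_per_entry=32
)
```

Under these assumptions a single round costs $5 \cdot 6 \cdot 32 = 960$ bits; a quantized scheme would replace the 32 with its own number of bits per entry.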
In the second study, we examine the effect of the number of agents on the sample and communication complexity of Fed-DVR-Q. We vary the number of agents from 5 to 25 in multiples of 5 and record the sample and communication complexity to achieve an error rate of $\varepsilon = 0.03$. The sample and communication complexities as a function of the number of agents are plotted in Figs. 2a and 2b, respectively. The sample complexity decreases as $1/M$ while the communication complexity is independent of the number of agents. This corroborates the linear speedup phenomenon suggested by our theoretical results and the independence between communication complexity and the number of agents.
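One simple way to test for a $1/M$ trend in recorded sample complexities is to fit the slope of the data on a log-log scale, where a slope close to $-1$ indicates linear speedup. The sketch below illustrates the procedure only; the values in `synthetic` are generated from an exact $c/M$ law as a placeholder and are not measurements from our experiments.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


agents = [5, 10, 15, 20, 25]              # values of M used in the second study
synthetic = [1000.0 / m for m in agents]  # placeholder data following c / M exactly
slope = loglog_slope(agents, synthetic)   # close to -1 for a 1/M trend
```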
In the last study, we examine the communication complexity of Fed-DVR-Q as a function of the discount parameter $\gamma$. We consider the same setup as in the first study and vary the value of $\gamma$ from 0.7 to 0.9 in steps of 0.05. We run the algorithm to achieve an accuracy of $\varepsilon = 0.1$ with the parameter choices prescribed in Sec. 4.1.3. We plot the communication cost of Fed-DVR-Q against the effective horizon, i.e., $\frac{1}{1-\gamma}$, in Fig. 3. As evident from the figure, the communication cost scales linearly with the effective horizon, which matches the theoretical claim in Theorem 2.
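For reference, the effective horizons corresponding to the discount factors swept in this study can be computed directly, as in the short sketch below. The variable names are ours.

```python
# Discount factors from 0.7 to 0.9 in steps of 0.05, as in the third study.
gammas = [0.70, 0.75, 0.80, 0.85, 0.90]

# Effective horizon 1 / (1 - gamma) for each discount factor.
horizons = [1.0 / (1.0 - g) for g in gammas]

for g, h in zip(gammas, horizons):
    print(f"gamma = {g:.2f}  ->  effective horizon = {h:.2f}")
```

The sweep therefore covers effective horizons from roughly 3.33 to 10, i.e., a factor of three.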

Section: NeurIPS Paper Checklist


Section: Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: In the abstract and introduction, we describe that we study the sample-communication complexity trade-off in Federated Q-learning and derive both converse and achievability results. In Sec. 3 we derive the lower bound on communication complexity and in Sec. 4 we outline the algorithm that matches the lower bound derived earlier.

Section: Limitations
Question: Does the paper discuss the limitations of the work performed by the authors? Answer: Justification: We consider an infinite horizon MDP in the tabular setting and derive the results for the class of intermittent communication algorithms. We acknowledge that these assumptions might be restrictive for a certain class of applications and extension to more general settings is discussed as a future direction in Sec. 5.

Section: Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes] Justification: The empirical studies require no specific compute resources and can be easily completed on a regular laptop.

Section: Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA] Justification: The paper is concerned with foundational research and is theoretical in nature with no direct societal impact.

Section: Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA] Justification: The paper is theoretical in nature and does not involve release of data or code and hence poses no such risks.

Section: Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA] Justification: The paper does not use any existing assets.

Section: Acknowledgement
We would like to thank the anonymous reviewers for their constructive feedback. This work is supported in part by the grants NSF CCF-2007911, CCF-2106778, CNS-2148212, ECCS-2318441, ONR N00014-19-1-2404 and AFRL FA8750-20-2-0504, and in part by funds from federal agency and industry partners as specified in the Resilient & Intelligent NextG Systems (RINGS) program.

Section: 
Establishing the sample complexity. In order to establish the bound on the sample complexity, note that during epoch $k$, each agent takes a total of $\lceil L_k/M\rceil + I \cdot B$ samples, where the first term corresponds to approximating $T_L(Q^{(k-1)})$ and the second term corresponds to the samples taken during the iterative update scheme. Thus, the total sample complexity is obtained by summing up over all the $K$ epochs. We have,
To continue, notice that
where the first line follows from the choice of $L_k$ in Sec. 4.1.3 and the last line follows from $K_0 = \left\lceil \frac{1}{2}\log_2\left(\frac{1}{1-\gamma}\right)\right\rceil$. Plugging this relation and the choices of $I$ and $B$ (cf. Sec. 4.1.3) into the previous bound yields
Plugging in the choice of K finishes the proof.
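The per-epoch accounting described above, namely $\lceil L_k/M\rceil + I \cdot B$ samples per agent in epoch $k$, can be tallied as in the following sketch. The schedule `L_per_epoch` and the values of `M`, `I` and `B` are hypothetical placeholders chosen for illustration and are not the choices prescribed in Sec. 4.1.3.

```python
import math


def samples_per_agent(L_per_epoch, M, I, B):
    """Total samples drawn by one agent over all epochs.

    In each epoch, ceil(L_k / M) samples are used for the reference estimate
    and I * B samples for the I mini-batched iterative updates.
    """
    return sum(math.ceil(L_k / M) + I * B for L_k in L_per_epoch)


# Hypothetical schedule and parameters (assumptions for illustration only).
L_per_epoch = [100, 200, 400]
total = samples_per_agent(L_per_epoch, M=5, I=10, B=20)
```

With these placeholder values the three epochs contribute $220 + 240 + 280 = 740$ samples per agent.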

Section: C.2 Establishing the error guarantees
In this section, we show that the Q-function estimate returned by the Fed-DVR-Q algorithm is $\varepsilon$-optimal with probability at least $1 - \delta$. We claim that the estimates of the Q-function generated by the algorithm across different epochs satisfy the following relation for all $k \le K$ with probability $1 - \delta$:
The required bound on $\|Q^{(K)} - Q^\star\|_\infty$ immediately follows by plugging in the value of $K$. Thus, for the remainder of the section, we focus on establishing the above claim.
Step 1: fixed-point contraction of REFINEESTIMATE. Firstly, note that the variance-reduced update scheme carried out during the REFINEESTIMATE routine resembles that of the classic Q-learning scheme, i.e., fixed-point iteration, with a different operator defined as follows:
Thus, the update scheme at step i ≥ 1 in (11) can then be written as
where
Let $Q^\star_H$ denote the fixed point of $H$. Then the update scheme in (123) drives the sequence $\{Q^m_i\}_{i \ge 0}$ to $Q^\star_H$; further, as long as $\|Q^\star - Q^\star_H\|_\infty$ is small, the required error $\|Q_i - Q^\star\|_\infty$ can also be controlled. The following lemmas formalize these ideas and pave the path to establish the claim in (121). The proofs are deferred to Appendix C.3.
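The fixed-point view can be illustrated with a noise-free toy example. The sketch below assumes a hypothetical affine $\gamma$-contraction $H(q) = r + \gamma P q$ on $\mathbb{R}^3$ (not the operator defined in the text, which is built from sampled Bellman operators) and runs the damped iteration $q \leftarrow (1-\eta)\,q + \eta H(q)$. In this idealized setting the sup-norm distance to the fixed point shrinks by at least the factor $1 - \eta(1-\gamma)$ per step.

```python
def H(q, r, P, gamma):
    """Toy gamma-contraction in the sup norm: H(q) = r + gamma * P q."""
    n = len(q)
    return [r[i] + gamma * sum(P[i][j] * q[j] for j in range(n)) for i in range(n)]


def damped_step(q, eta, r, P, gamma):
    """One step of the damped fixed-point iteration q <- (1 - eta) q + eta H(q)."""
    hq = H(q, r, P, gamma)
    return [(1.0 - eta) * a + eta * b for a, b in zip(q, hq)]


def sup_dist(a, b):
    """Sup-norm distance between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical problem data (assumptions for illustration only).
r = [1.0, 0.0, 0.5]
P = [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3], [0.0, 0.5, 0.5]]
gamma, eta = 0.9, 0.5

# Reference fixed point obtained by iterating H itself many times.
q_star = [0.0, 0.0, 0.0]
for _ in range(5000):
    q_star = H(q_star, r, P, gamma)

# Damped iteration, recording the distance to the fixed point at every step.
q = [0.0, 0.0, 0.0]
errors = [sup_dist(q, q_star)]
for _ in range(50):
    q = damped_step(q, eta, r, P, gamma)
    errors.append(sup_dist(q, q_star))

rho = 1.0 - eta * (1.0 - gamma)  # per-step contraction factor
```

In the algorithm itself the operator is replaced by noisy, quantized estimates, which is why additional noise terms accompany this contraction factor in the analysis.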

Section: Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes] Justification: Both Theorems 1 and 2 clearly state all assumptions used in the statement of the main result. The proofs for both theorems can be found in the appendix.

Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes] Justification: We have a section with numerical experiments in Appendix D. The section contains all relevant details of our implementation to reproduce the results.


Section: Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: The paper does not have associated code or data.

Section: Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: The relevant details can be found in Appendix D.

Section: Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: The error bars associated with the plots are small and hence we omit them.

Section: New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA] Justification: The paper does not release any new assets.

Section: Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA] Justification: The paper does not involve any crowdsourcing.

Section: Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA] Justification: The paper does not involve research with human subjects.


References:
[b0] M Assran; J Romoff; N Ballas; J Pineau; M Rabbat (2019). Gossip-based actor-learner architectures for deep reinforcement learning. 
[b1] M G Azar; R Munos; H J Kappen (2013). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning
[b2] C Beck; R Srikant (2012). Error bounds for constant step-size q-learning. Systems & Control Letters
[b3] V S Borkar; S P Meyn (2000). The o.d.e. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization
[b4] M Braverman; A Garg; T Ma; H L Nguyen; D P Woodruff (2016). Communication lower bounds for statistical estimation problems via a distributed data processing inequality. 
[b5] T Chen; K Zhang; G B Giannakis; T Başar (2021). Communication-efficient policy gradient methods for distributed reinforcement learning. IEEE Transactions on Control of Network Systems
[b6] Z Chen; S T Maguluri; S Shakkottai; K Shanmugam (2020). Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. 
[b7] Z Chen; S T Maguluri; S Shakkottai; K Shanmugam (2021). A lyapunov theory for finite-sample guarantees of asynchronous q-learning and td-learning variants. 
[b8] Z Chen; Y Zhou; R Chen (2021). Multi-agent off-policy tdc with near-optimal sample and communication complexity. 
[b9] Z Chen; Y Zhou; R.-R Chen; S Zou (2022). Sample and communication-efficient decentralized actorcritic algorithms with finite-time analysis. PMLR
[b10] T Doan; S Maguluri; J Romberg (2019). Finite-time analysis of distributed td (0) with linear function approximation on multi-agent reinforcement learning. PMLR
[b11] T T Doan; S T Maguluri; J Romberg (2021). Finite-time performance of distributed temporaldifference learning with linear function approximation. SIAM Journal on Mathematics of Data Science
[b12] J C Duchi; M I Jordan; M J Wainwright; Y Zhang (2014). Optimality guarantees for distributed statistical estimation. 
[b13] L Espeholt; H Soyer; R Munos; K Simonyan; V Mnih; T Ward; Y Doron; V Firoiu; T Harley; I Dunning (2018). Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. PMLR
[b14] E Even-Dar; Y Mansour (2004). Learning rates for q-learning. Journal of Machine Learning Research
[b15] D A Freedman (1975). On tail probabilities for martingales. The Annals of Probability
[b16] F Haddadpour; M M Kamani; M Mahdavi; V R Cadambe (2019). Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. 
[b17] H V Hasselt (). Double q-learning. 
[b18] T Jaakkola; M Jordan; S Singh (1993). Convergence of stochastic iterative dynamic programming algorithms. 
[b19] H Jin; Y Peng; W Yang; S Wang; Z Zhang (2022). Federated reinforcement learning with environment heterogeneity. PMLR
[b20] S M Kakade (2001). A natural policy gradient. 
[b21] M Kearns; S Singh (1998). Finite-sample convergence rates for q-learning and indirect algorithms. 
[b22] A Khaled; K Mishchenko; P Richtárik (2020). Tighter Theory for Local SGD on Identical and Heterogeneous Data. PMLR
[b23] S Khodadadian; P Sharma; G Joshi; S T Maguluri (2022). Federated reinforcement learning: Linear speedup under markovian sampling. PMLR
[b24] J Kober; J A Bagnell; J Peters (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research
[b25] G Lan; D.-J Han; A Hashemi; V Aggarwal; C G Brinton (2024). Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis. 
[b26] G Li; Y Wei; Y Chi; Y Gu; Y Chen (2021). Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction. IEEE Transactions on Information Theory
[b27] G Li; C Cai; Y Chen; Y Wei; Y Chi (2023). Is q-learning minimax optimal? a tight sample complexity analysis. Operations Research
[b28] H.-K Lim; J.-B Kim; J.-S Heo; Y.-H Han (2020). Federated reinforcement learning for training control policies on multiple iot devices. Sensors
[b29] R Liu; A Olshevsky (2023). Distributed TD(0) with almost no communication. IEEE Control Systems Letters
[b30] B Mcmahan; E Moore; D Ramage; S Hampson; B A Arcas (2017). Communication-efficient learning of deep networks from decentralized data. PMLR
[b31] V Mnih; A P Badia; M Mirza; A Graves; T Lillicrap; T Harley; D Silver; K Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning. PMLR
[b32] M Puterman (2014). Markov decision processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons
[b33] G Qu; A Wierman (2020). Finite-time analysis of asynchronous stochastic approximation and q-learning. PMLR
[b34] S Salgia; Q Zhao (2023). Distributed linear bandits under communication constraints. PMLR
[b35] H Shen; K Zhang; M Hong; T Chen (2023). Towards understanding asynchronous advantage actorcritic: Convergence and linear speedup. IEEE Transactions on Signal Processing
[b36] C Shi; C Shen (2021). Federated Multi-Armed Bandits. 
[b37] D Silver; A Huang; C Maddison; A Guez; L Sifre; G Van Den Driessche; J Schrittwieser; I Antonoglou; V Panneershelvam; M Lanctot; S Dieleman; D Grewe; J Nham; N Kalchbrenner; I Sutskever; T Lillicrap; M Leach; K Kavukcuoglu; T Graepel; D Hassabis (2016). Mastering the game of go with deep neural networks and tree search. Nature
[b38] J Sun; G Wang; G B Giannakis; Q Yang; Z Yang (2020). Finite-time analysis of decentralized temporal-difference learning with linear function approximation. PMLR
[b39] R Sutton; A Barto (2018). Reinforcement learning: An introduction. MIT Press
[b40] C Szepesvári (1997). The asymptotic convergence-rate of q-learning. 
[b41] H Tian; I C Paschalidis; A Olshevsky (2024). One-shot averaging for distributed TD (λ) under Markov sampling. IEEE Control Systems Letters
[b42] J N Tsitsiklis (1994). Asynchronous stochastic approximation and q-learning. Machine learning
[b43] J N Tsitsiklis; Z Q Luo (1987). Communication complexity of convex optimization. Journal of Complexity
[b44] R Vershynin (2018). High-dimensional probability: An introduction with applications in data science. Cambridge University Press
[b45] H.-T Wai (2020). On the convergence of consensus algorithms with Markovian noise and gradient bias. IEEE
[b46] M Wainwright (2019). Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning.
[b47] M Wainwright (2019). Variance-reduced Q-learning is minimax optimal.
[b48] G Wang; S Lu; G Giannakis; G Tesauro; J Sun (2020). Decentralized TD tracking with linear function approximation and its finite-time analysis.
[b49] H Wang; A Mitra; H Hassani; G J Pappas; J Anderson (2023). Federated temporal difference learning with linear function approximation under environmental heterogeneity. 
[b50] C J Watkins; P Dayan (1992). Q-learning. Machine Learning
[b51] J Woo; G Joshi; Y Chi (2023). The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond.
[b52] J Woo; L Shi; G Joshi; Y Chi (2024). Federated offline reinforcement learning: Collaborative single-policy coverage suffices. 
[b53] B Woodworth; J Wang; A Smith; B McMahan; N Srebro (2018). Graph oracle models, lower bounds, and gaps for parallel stochastic optimization.
[b54] B Woodworth; B Bullins; O Shamir; N Srebro (2021). The min-max complexity of distributed stochastic convex optimization with intermittent communication. PMLR
[b55] Z Xie; S Song (2023). FedKL: Tackling data heterogeneity in federated reinforcement learning by penalizing KL divergence. IEEE Journal on Selected Areas in Communications
[b56] T Yang; S Cen; Y Wei; Y Chen; Y Chi (2023). Federated natural policy gradient methods for multi-task reinforcement learning. 
[b57] E Yurtsever; J Lambert; A Carballo; K Takeda (2020). A survey of autonomous driving: Common practices and emerging technologies. IEEE Access
[b58] S Zeng; M A Anwar; T T Doan; A Raychowdhury; J Romberg (2021). A decentralized policy gradient approach to multi-task reinforcement learning. PMLR
[b59] S Zeng; T T Doan; J Romberg (2021). Finite-time analysis of decentralized stochastic approximation with applications in multi-agent and multi-task learning. IEEE
[b60] C Zhang; H Wang; A Mitra; J Anderson (2024). Finite-time analysis of on-policy heterogeneous federated reinforcement learning. 

Figures:
Figure fig_0: 3
Type: figure
Caption: Algorithm 3: REFINEESTIMATE(Q, B, I, L, D, J)
1: Input: Initial estimate Q, batch size B, number of iterations I, recentering sample size L, quantization bound D, message size J
2: // Build an approximation for T Q which is to be used for variance reduced updates
3: for m = 1, 2, . . . , M do
4: Draw ⌈L/M⌉ i.i.d. samples from the generative model for each (s, a) pair and evaluate T
Data: 

Figure fig_1: 
Type: figure
Caption: from the server and compute T_L(Q) according to Eqn. (10)
7: end for
8: Q_0 ← Q
9: // Variance reduced updates with minibatching
10: for i = 1, 2, . . . , I do
11: for m = 1, 2, . . . , M do
12: Draw B i.i.d. samples from the generative model for each (s, a) pair
13:
Data: 

Figure fig_2: 
Type: figure
Caption: a) for a ∈ {1, 2} and all agents m // Different initialization 3: for t = 1, 2, . . . , T do The following relation holds for all agents m ∈ [M ], all t ∈ [T ] and a ∈ {1, 2}:
Data: 

Figure fig_3: 
Type: figure
Caption: 121) and the prescribed value of D k . C.3 Proof of auxiliary lemmas C.3.1 Proof of Lemma 7
Data: 

Figure fig_5: 1
Type: figure
Caption: Figure 1: Comparison between sample and communication complexities of Fed-DVR-Q and the algorithm Fed-SynQ from Woo et al. [2023].
Data: 

Figure fig_6: 23
Type: figure
Caption: Figure 2: Dependence of sample and communication complexities of Fed-DVR-Q on the number of agents.
Data: 



Formulas:
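Formula formula_0: $V^{\pi}(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s\right]; \quad Q^{\pi}(s, a) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s,\, a_0 = a\right]$, (1)

Formula formula_1: $V^{\star} := V^{\pi^{\star}}$ and $Q^{\star} := Q^{\pi^{\star}}$

Formula formula_2: $(\mathcal{T}Q)(s, a) = r(s, a) + \gamma \cdot \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[\max_{a' \in \mathcal{A}} Q(s', a')\right]$.

Formula formula_3: $(\mathcal{T}_Z Q)(s, a) = r(s, a) + \gamma V(Z(s, a))$, (3)

Formula formula_4: $\mathcal{T}Q = \mathbb{E}_Z[\mathcal{T}_Z Q]$ for all Q-functions.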
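The relation between the Bellman operator and its sampled counterpart can be checked numerically. The following is a minimal sketch, assuming a tabular MDP stored as nested lists and a generative model simulated with `random.choices`; the toy MDP, the function names, and the Monte Carlo check are illustrative assumptions, not part of the paper.

```python
import random

def bellman(Q, P, r, gamma):
    """Exact operator: (TQ)(s,a) = r(s,a) + gamma * E_{s'~P(.|s,a)}[max_a' Q(s',a')]."""
    V = [max(row) for row in Q]  # V(s') = max_a' Q(s', a')
    n = len(Q)
    return [[r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n))
             for a in range(len(Q[s]))] for s in range(n)]

def sample_next_states(P, rng):
    """One generative-model draw Z(s,a) ~ P(.|s,a) for every (s,a) pair."""
    n = len(P)
    return [[rng.choices(range(n), weights=P[s][a])[0] for a in range(len(P[s]))]
            for s in range(n)]

def empirical_bellman(Q, Z, r, gamma):
    """Sampled operator: (T_Z Q)(s,a) = r(s,a) + gamma * V(Z(s,a))."""
    V = [max(row) for row in Q]
    return [[r[s][a] + gamma * V[Z[s][a]] for a in range(len(Q[s]))]
            for s in range(len(Q))]

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
r = [[1.0, 0.0], [0.5, 0.2]]
Q = [[0.3, 1.2], [2.0, 0.7]]
gamma = 0.9

def monte_carlo_average(n_draws=20000, seed=0):
    """Average T_Z Q over independent draws of Z; should approach T Q."""
    rng = random.Random(seed)
    acc = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_draws):
        TZ = empirical_bellman(Q, sample_next_states(P, rng), r, gamma)
        for s in range(2):
            for a in range(2):
                acc[s][a] += TZ[s][a] / n_draws
    return acc
```

Comparing `bellman(Q, P, r, gamma)` with `monte_carlo_average()` entry by entry illustrates the unbiasedness property; the agreement is only up to Monte Carlo error.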

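Formula formula_5: $\mathrm{ER}(\mathscr{A}; N, M) := \sup_{\mathcal{M}=(P,r)} \mathbb{E}\left[\left\| Q_{\mathcal{M}}(\mathscr{A}, N, M) - Q^{\star}_{\mathcal{M}} \right\|_{\infty}\right]$, (4)

Formula formula_6: $\mathrm{SC}(\mathscr{A}; \varepsilon, M) := |\mathcal{S}||\mathcal{A}| \cdot \min\{N \in \mathbb{N} : \mathrm{ER}(\mathscr{A}; N, M) \le \varepsilon\}$. (5)

Formula formula_7: $\mathrm{SC}(\mathscr{A}; \varepsilon, M, \delta) := |\mathcal{S}||\mathcal{A}| \cdot \min\left\{N \in \mathbb{N} : \Pr\left(\sup_{\mathcal{M}} \left\| Q_{\mathcal{M}}(\mathscr{A}, N, M) - Q^{\star}_{\mathcal{M}} \right\|_{\infty} \le \varepsilon\right) \ge 1 - \delta\right\}$.

Formula formula_8: $\mathrm{CC}_{\mathrm{round}}(\mathscr{A}; N) := \frac{1}{M}\sum_{m=1}^{M} C^{m}_{\mathrm{round}}(\mathscr{A}; N); \quad \mathrm{CC}_{\mathrm{bit}}(\mathscr{A}; N) := \frac{1}{M}\sum_{m=1}^{M} C^{m}_{\mathrm{bit}}(\mathscr{A}; N)$, (6)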

Formula formula_9: 1: Input : T, R, {η t } T t=1 , C = {t r } R r=1 , B 2: Set Q m 0 ← 0 for all agents m 3: for t = 1, 2, . . . , T do 4: for m = 1, 2, . . . , M do 5: Compute Q m t-

Formula formula_10: {η t } T t=1 ; (iv) a communication schedule {t r } R r=1 ; (v) batch size B. During the t th iteration, each agent m computes { T Z b (Q m t-1 )} B b=1

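Formula formula_11: $Q^{m}_{t-1/2} = (1 - \eta_t) Q^{m}_{t-1} + \frac{\eta_t}{B} \sum_{b=1}^{B} \mathcal{T}_{Z_b}(Q^{m}_{t-1})$. (7)

Formula formula_12: $Q^{m}_{t} = \begin{cases} \frac{1}{M}\sum_{j=1}^{M} Q^{j}_{t-1/2} & \text{if } t \in \mathcal{C}, \\ Q^{m}_{t-1/2} & \text{otherwise.} \end{cases}$ (8)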

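Formula formula_14: $\eta_t := \frac{1}{1 + c_{\eta}(1-\gamma)t}$ or $\eta_t := \eta$ for all $1 \le t \le T$. If $R = \mathrm{CC}_{\mathrm{round}}(\mathscr{A}; N) \le \frac{c_0}{(1-\gamma)\log^{2} N}$; or $\mathrm{CC}_{\mathrm{bit}}(\mathscr{A}; N) \le \frac{c_1 |\mathcal{S}||\mathcal{A}|}{(1-\gamma)\log^{2} N}$

Formula formula_15: $\mathrm{ER}(\mathscr{A}; N, M) \ge \frac{C_{\gamma}}{\log^{3} N \sqrt{N}}$,

Formula formula_16: $\frac{1}{(1-\gamma)\log^{2} N}$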

Formula formula_17: Q (k) ← REFINEESTIMATE(Q (k-1) , B, I, L k , D k , J) 6: k ← k + 1 7: end for 8: return Q (K)

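Formula formula_18: $\mathcal{T}^{(m)}_{L}(Q) := \frac{1}{\lceil L/M \rceil} \sum_{l=1}^{\lceil L/M \rceil} \mathcal{T}_{Z^{(m)}_{l}}(Q)$, (9)

Formula formula_19: $\{Z^{(m)}_{1}, Z^{(m)}_{2}, \ldots, Z^{(m)}_{\lceil L/M \rceil}\}$ are $\lceil L/M \rceil$ i.i.d. samples with $Z^{(m)}_{1} \sim Z$. Each agent sends $\mathscr{C}\left(\mathcal{T}^{(m)}_{L}(Q) - Q; D, J\right)$, a compressed version of the difference $\mathcal{T}^{(m)}_{L}$

Formula formula_20: $\mathcal{T}_{L}(Q) = Q + \frac{1}{M}\sum_{m=1}^{M} \mathscr{C}\left(\mathcal{T}^{(m)}_{L}(Q) - Q; D, J\right)$ (10)

Formula formula_21: $Q^{m}_{i-1/2} = (1-\eta) Q_{i-1} + \eta \left[ \mathcal{T}^{(m)}_{i} Q_{i-1} - \mathcal{T}^{(m)}_{i} Q + \mathcal{T}_{L}(Q) \right]$. (11)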
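The variance-reduced update (11) can be sketched as a single function. This is a minimal illustration, assuming tabular Q-functions as nested lists and a minibatch given as a list of sampled next-state tables; the names `empirical_operator` and `variance_reduced_step` are hypothetical.

```python
def empirical_operator(Q, batch, r, gamma):
    """Minibatch operator (1/B) * sum_z T_z Q; each z[s][a] is a sampled next state."""
    V = [max(row) for row in Q]
    B = len(batch)
    return [[r[s][a] + gamma * sum(V[z[s][a]] for z in batch) / B
             for a in range(len(Q[s]))] for s in range(len(Q))]

def variance_reduced_step(Q_prev, Q_ref, TL_ref, batch, r, gamma, eta):
    """One local step of Eqn. (11).

    Q_prev is the current iterate Q_{i-1}, Q_ref the reference point Q,
    TL_ref the recentering estimate T_L(Q). The same minibatch is used
    for both empirical operators so that their noise largely cancels.
    """
    T_prev = empirical_operator(Q_prev, batch, r, gamma)
    T_ref = empirical_operator(Q_ref, batch, r, gamma)
    return [[(1 - eta) * Q_prev[s][a]
             + eta * (T_prev[s][a] - T_ref[s][a] + TL_ref[s][a])
             for a in range(len(Q_prev[s]))] for s in range(len(Q_prev))]
```

When `Q_prev` equals `Q_ref`, the two minibatch terms cancel exactly and the step depends only on the recentering estimate, which is the mechanism that reduces the variance of the update near the reference point.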

Formula formula_22: T (m) i Q := 1 B z∈Z (m) i T z Q, where Z (m) i

Formula formula_23: m i-1 2 -Q i-1 ; D, J ,

Formula formula_24: $Q_{i} = Q_{i-1} + \frac{1}{M}\sum_{m=1}^{M} \mathscr{C}\left(Q^{m}_{i-1/2} - Q_{i-1}; D, J\right)$, (12)

Formula formula_25: $0 = d_1 < d_2 < \cdots < d_{2^{J}} = D$ correspond

Formula formula_26: $\frac{d_{j_n} - Q[n]}{d_{j_n} - d_{j_n - 1}}$

Formula formula_27: $d_j < Q[i] \le d_{j+1}\}$.
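A randomized quantizer on a grid $0 = d_1 < \cdots < d_{2^J} = D$ can be sketched as unbiased stochastic rounding to the two neighbouring grid points. The sketch below assumes a uniform grid, clips inputs to $[0, D]$, and ignores sign handling; these choices and the name `quantize` are assumptions made for illustration and may differ from the compressor used in the paper.

```python
import random

def quantize(x, D, J, rng):
    """Stochastically round each coordinate of x to a uniform grid on [0, D].

    The grid has 2**J points, so each coordinate is representable with J bits.
    A value v between grid points lo < hi is mapped to hi with probability
    (v - lo) / (hi - lo) and to lo otherwise, which makes the output unbiased
    for inputs already inside [0, D].
    """
    levels = 2 ** J
    step = D / (levels - 1)
    out = []
    for v in x:
        v = min(max(v, 0.0), D)                 # assumption: clip out-of-range inputs
        j = min(int(v // step), levels - 2)     # index of the lower grid point
        lo, hi = j * step, (j + 1) * step
        p_hi = (v - lo) / step
        out.append(hi if rng.random() < p_hi else lo)
    return out
```

Averaging many independent quantizations of the same in-range vector recovers that vector up to sampling error, which is the property that lets the server average compressed messages without introducing bias.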

Formula formula_28: 2 η(1-γ) ⌉, ⌈ 2 M ( 12γ (1-γ) ) 2 log( 8KI|S||A| δ )⌉ and ⌈log 2 ( 70 η(1-γ) 2 M log( 8KI|S||A|δ

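Formula formula_29: $K = \left\lceil \frac{1}{2}\log_2\left(\frac{1}{1-\gamma}\right) \right\rceil + \left\lceil \frac{1}{2}\log_2\left(\frac{1}{(1-\gamma)\varepsilon^{2}}\right) \right\rceil$.

Formula formula_30: $L_k := \frac{19600}{(1-\gamma)^{2}} \log\left(\frac{8KI|\mathcal{S}||\mathcal{A}|}{\delta}\right) \cdot \begin{cases} 4^{k} & \text{if } k \le K_0 \\ 4^{k-K_0} & \text{if } k > K_0 \end{cases}; \quad D_k := 16 \cdot \frac{2^{-k}}{1-\gamma}$, (13)

Formula formula_32: $K_0 = \left\lceil \frac{1}{2}\log_2\left(\frac{1}{1-\gamma}\right) \right\rceil$.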
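The parameter schedule can be tabulated directly from these expressions. The sketch below is a hedged reading of the extracted formulas: it assumes $I = \lceil 2/(\eta(1-\gamma)) \rceil$ and $B = \lceil \frac{2}{M} (\frac{12\gamma}{1-\gamma})^2 \log(8KI|S||A|/\delta) \rceil$ as read from the partially garbled parameter expression, treats `log` as the natural logarithm, and leaves out the message size $J$ because its expression is not fully recoverable here. The function name is hypothetical.

```python
import math

def fed_dvr_q_params(gamma, eps, delta, eta, M, n_states, n_actions):
    """Epoch count, per-epoch recentering sizes L_k and quantization bounds D_k."""
    K0 = math.ceil(0.5 * math.log2(1.0 / (1.0 - gamma)))
    K = K0 + math.ceil(0.5 * math.log2(1.0 / ((1.0 - gamma) * eps ** 2)))
    I = math.ceil(2.0 / (eta * (1.0 - gamma)))            # assumed reading
    log_term = math.log(8 * K * I * n_states * n_actions / delta)
    B = math.ceil((2.0 / M) * (12 * gamma / (1 - gamma)) ** 2 * log_term)  # assumed reading
    base = 19600.0 / (1.0 - gamma) ** 2 * log_term
    # L_k grows as 4^k up to epoch K0, then restarts as 4^(k - K0), per Eqn. (13)
    L = [base * (4 ** k if k <= K0 else 4 ** (k - K0)) for k in range(1, K + 1)]
    # D_k halves every epoch, matching the shrinking error of the estimate
    D = [16.0 * 2.0 ** (-k) / (1.0 - gamma) for k in range(1, K + 1)]
    return {"K0": K0, "K": K, "I": I, "B": B, "L": L, "D": D}
```

For example, `fed_dvr_q_params(gamma=0.75, eps=0.5, delta=0.1, eta=0.5, M=10, n_states=2, n_actions=2)` gives three epochs with quantization bounds 32, 16 and 8.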

Formula formula_33: SC(Fed-DVR-Q; ε, M, δ) ≤ C 1 ηM (1 -γ) 3 ε 2 log 2 1 (1 -γ)ε log 8KI|S||A| δ , CC round (Fed-DVR-Q; ε, δ) ≤ 16 η(1 -γ) log 2 1 (1 -γ)ε , CC bit (Fed-DVR-Q; ε, δ) ≤ 32|S||A| η(1 -γ) log 2 1 (1 -γ)ε log 2 70 η(1 -γ) 2 M log 8KI|S||A| δ .

Formula formula_34: C ( T (m) L (Q) -Q; D, J) to the server 6: Receive 1 M M m=1 C ( T (m) L (Q) -Q; D, J)

Formula formula_35: Compute Q m i-1 2 according to Eqn. (11) 14: Send C (Q m i-1 2 -Q i-1 ; D, J) to the server 15: Receive 1 M M m=1 C (Q m i -Q i-1 ; D, J)

Formula formula_36:
$\mathcal{A}_0 = \{1\}$: $P(0|0,1) = 1$, $r(0,1) = 0$, (14a)
$\mathcal{A}_1 = \{1, 2\}$: $P(1|1,1) = p$, $P(0|1,1) = 1-p$, $r(1,1) = 1$, (14b)
$P(1|1,2) = p$, $P(0|1,2) = 1-p$, $r(1,2) = 1$, (14c)
$\mathcal{A}_2 = \{1\}$: $P(2|2,1) = p$, $P(0|2,1) = 1-p$, $r(2,1) = 1$, (14d)
$\mathcal{A}_3 = \{1\}$: $P(3|3,1) = 1$, $r(3,1) = 1$, (14e)

Formula formula_37: $V^{\star}(0) = Q^{\star}(0,1) = 0$; $\quad V^{\star}(1) = Q^{\star}(1,1) = Q^{\star}(1,2) = V^{\star}(2) = Q^{\star}(2,1) = \frac{1}{1-\gamma p} = \frac{3}{4(1-\gamma)}$; $\quad V^{\star}(3) = Q^{\star}(3,1) = \frac{1}{1-\gamma}$.
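The optimal values of the hard instance can be verified by value iteration. The sketch below builds the four-state MDP from (14a) to (14e) and takes $p = (4\gamma - 1)/(3\gamma)$, which is the value implied by the identity $\frac{1}{1-\gamma p} = \frac{3}{4(1-\gamma)}$ (it requires $\gamma \ge 1/4$). The dictionary layout and function names are assumptions made for illustration.

```python
def hard_mdp(gamma):
    """Transition and reward tables of the lower-bound instance, Eqns. (14a)-(14e)."""
    p = (4 * gamma - 1) / (3 * gamma)  # solves 1 / (1 - gamma * p) = 3 / (4 * (1 - gamma))
    # P[s][a] maps next state -> probability; r[s][a] is the reward
    P = {
        0: {1: {0: 1.0}},
        1: {1: {1: p, 0: 1 - p}, 2: {1: p, 0: 1 - p}},
        2: {1: {2: p, 0: 1 - p}},
        3: {1: {3: 1.0}},
    }
    r = {0: {1: 0.0}, 1: {1: 1.0, 2: 1.0}, 2: {1: 1.0}, 3: {1: 1.0}}
    return P, r, p

def value_iteration(P, r, gamma, iters=2000):
    """Repeatedly apply the Bellman optimality operator starting from V = 0."""
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {s: max(r[s][a] + gamma * sum(q * V[s2] for s2, q in P[s][a].items())
                    for a in P[s]) for s in P}
    return V
```

Running `value_iteration(*hard_mdp(0.9)[:2], 0.9)` should return values close to 0, 7.5, 7.5 and 10 for states 0 to 3, in line with the closed forms above.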

Formula formula_38: η (t) k = η k t i=k+1 (1 -η i (1 -γp)) for all 0 ≤ k ≤ t,(15)

Formula formula_39: t k=τ +1 (1 -η k (1 -γp)) + (1 -γp) t k=τ +1 η (t) k = 1.(16)
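Identity (16) is a telescoping statement and is easy to confirm numerically. The sketch below evaluates its left-hand side for an arbitrary step-size sequence, using the weights from (15) with the factor $(1 - \gamma p)$; the function name and the indexing convention (`etas[k-1]` stores $\eta_k$) are assumptions of this illustration.

```python
def identity_16_lhs(etas, gamma, p, tau, t):
    """Return prod_{k=tau+1}^t (1 - eta_k c) + c * sum_{k=tau+1}^t eta_k^{(t)}, c = 1 - gamma*p."""
    c = 1.0 - gamma * p

    def weight(k):
        # eta_k^{(t)} = eta_k * prod_{i=k+1}^t (1 - eta_i * c), as in (15)
        w = etas[k - 1]
        for i in range(k + 1, t + 1):
            w *= 1.0 - etas[i - 1] * c
        return w

    prod = 1.0
    for k in range(tau + 1, t + 1):
        prod *= 1.0 - etas[k - 1] * c
    return prod + c * sum(weight(k) for k in range(tau + 1, t + 1))
```

For any step sizes and any $0 \le \tau < t$ the returned value should equal 1 up to floating-point error, for instance with the rescaled linear schedule $\eta_t = 1/(1 + c_\eta (1-\gamma) t)$.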

Formula formula_40: η (t) k = η k t i=k+1 (1 -η i ) for all 0 ≤ k ≤ t,(17)

Formula formula_41: t k=τ +1 (1 -η k ) + t k=τ +1 η (t) k = 1. (18

Formula formula_42: )

Formula formula_43: (t)

Formula formula_44: η (t) k-1 η (t) k = η k η k-1 (1 -η k ) = 1 - (1 -c η (1 -γ))η k 1 -c η (1 -γ)η k ≤ 1 (19) whenever c η ≤ 1 1-γ . Thus, η(t)

Formula formula_45: c η ≤ 1 1-γ . Similarly, η(t)

Formula formula_46: η (t) k-1 η (t) k = η k η k-1 (1 -η k (1 -γp)) ≤ η k η k-1 (1 -η k ) ≤ 1, whenever c η ≤ 1 1-γ .

Formula formula_47: (1 -γp)η (t) t + (1 -γp)η (t) t-1 = (1 -γp)η t + (1 -γp)η t-1 (1 -(1 -γp)η t ) = 1 -(1 -η t (1 -γp))(1 -η t-1 (1 -γp)) = 1 - t k=t-1 (1 -η k (1 -γp)),

Formula formula_48: (1 -γp) t k=τ η (t) k = (1 -γp)η t τ + (1 -γp) t k=τ +1 η (t) k = (1 -γp)η τ t k=τ +1 (1 -η k (1 -γp)) + 1 - t k=τ +1 (1 -η k (1 -γp)) = 1 - t k=τ (1 -η k (1 -γp)),

Formula formula_49: $P^{m}_{t}(s'|s, a) = \frac{1}{B}\sum_{b=1}^{B} P^{m}_{t,b}(s'|s, a)$, (20)

Formula formula_50: $P^{m}_{t,b}(s'|s, a) = \mathbb{1}\{Z^{m}_{t,b}(s, a) = s'\}$ for $s' \in \mathcal{S}$.

Formula formula_51: $Q^{m}_{0}(0, 1) = V^{m}_{0}(0) = 0$ holds for all clients $m \in [M]$. Assuming that $Q^{m}_{t-1}(0, 1) = V^{m}_{t-1}($

Formula formula_52: $Q^{m}_{t-1/2}(0, 1) = (1-\eta_t) Q^{m}_{t-1}(0, 1) + \eta_t\left(\gamma V^{m}_{t-1}(0)\right) = 0$. Consequently, $Q^{m}$

Formula formula_53:
\[
\begin{aligned}
Q^{m}_{t-1/2}(3, 1) &= (1-\eta_t) Q^{m}_{t-1}(3, 1) + \eta_t\left(1 + \gamma V^{m}_{t-1}(3)\right) \\
&= (1-\eta_t) Q^{m}_{t-1}(3, 1) + \eta_t\left(1 + \gamma Q^{m}_{t-1}(3, 1)\right) \\
&= (1-\eta_t(1-\gamma)) Q^{m}_{t-1}(3, 1) + \eta_t,
\end{aligned}
\]

Formula formula_54: $V^{m}_{t}(3) = Q^{m}_{t}(3, 1) = \sum_{k=1}^{t} \eta_k \prod_{i=k+1}^{t} (1-\eta_i(1-\gamma)) = \frac{1}{1-\gamma}\left[1 - \prod_{i=1}^{t} (1-\eta_i(1-\gamma))\right]$, (21)

Formula formula_55: $Q^{m}_{t-1/2}(1, 1) = (1-\eta_t) Q^{m}_{t-1}(1, 1) + \eta_t\left(1 + \gamma P^{m}_{t}(1|1, 1) V^{m}_{t-1}(1)\right)$, (22)

Formula formula_56: $Q^{m}_{t-1/2}(1, 2) = (1-\eta_t) Q^{m}_{t-1}(1, 2) + \eta_t\left(1 + \gamma P^{m}_{t}(1|1, 2) V^{m}_{t-1}(1)\right)$, (23)

Formula formula_57: $Q^{m}_{t-1/2}(2, 1) = (1-\eta_t) Q^{m}_{t-1}(2, 1) + \eta_t\left(1 + \gamma P^{m}_{t}(2|2, 1) V^{m}_{t-1}(2)\right)$. (24)

Formula formula_59: $\log N \le \frac{1}{1-\gamma}$, (25)

Formula formula_60: E[V m T (2)] =   1 M M j=1 E[V j T (2)]   = T k=1 η (t) k = 1 -η (T ) 0 1 -γp .(26)

Formula formula_61: V ⋆ (2) -E[V m T (2)] = η (T ) 0 1 -γp . (27

Formula formula_62: )

Formula formula_63: V ⋆ (2) -E[V m T (2)],

Formula formula_64: (T ) 0 ) ≥ -1.5 T t=1 η(1 -γp) ≥ -2 T t=1 1 t log T ≥ -2 =⇒ η (T ) 0 ≥ e -2

Formula formula_65: E[∥Q m T -Q ⋆ ∥ ∞ ] ≥ E[|Q ⋆ (2) -Q m T (2)|] ≥ V ⋆ (2) -E[V m T (2)] ≥ 3 4e 2 (1 -γ) √ N(28)

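Formula formula_66: $Q^{m}_{t-1/2}(2, 1) = (1-\eta_t) V^{m}_{t-1}(2) + \eta_t\left(1 + \gamma P^{m}_{t}(2|2, 1) V^{m}_{t-1}(2)\right)$.

Formula formula_67: $V^{m}_{t}(2) = Q^{m}_{t}(2, 1) = (1-\eta_t) V^{m}_{t-1}(2) + \eta_t\left(1 + \gamma P^{m}_{t}(2|2, 1) V^{m}_{t-1}(2)\right)$. (29)

Formula formula_69:
\[
\begin{aligned}
\mathbb{E}[V^{m}_{t}(2)] &= (1-\eta_t)\mathbb{E}[V^{m}_{t-1}(2)] + \eta_t\left(1 + \gamma \mathbb{E}[P^{m}_{t}(2|2, 1) V^{m}_{t-1}(2)]\right) \\
&= (1-\eta_t)\mathbb{E}[V^{m}_{t-1}(2)] + \eta_t\left(1 + \gamma \mathbb{E}[P^{m}_{t}(2|2, 1)]\, \mathbb{E}[V^{m}_{t-1}(2)]\right) \\
&= (1-\eta_t)\mathbb{E}[V^{m}_{t-1}(2)] + \eta_t\left(1 + \gamma p\, \mathbb{E}[V^{m}_{t-1}(2)]\right) \\
&= (1-\eta_t(1-\gamma p))\mathbb{E}[V^{m}_{t-1}(2)] + \eta_t.
\end{aligned}
\] (30)

Formula formula_70: $P^{m}_{t}(2|2, 1)$ is independent of $V^{m}_{t-1}(2)$. • Similarly, if $t$ is an averaging instant, we have, $V^{m}_{t}(2) = Q^{m}_{t}(2, 1) = \frac{1}{M}\sum_{j=1}^{M} Q^{j}_{t-1/2}(2, 1) = (1-\eta_t)\frac{1}{M}\sum_{j=1}^{M} V^{j}_{t-1}(2) + \frac{1}{M}\sum_{j=1}^{M} \eta_t\left(1 + \gamma P^{j}_{t}(2|2, 1) V^{j}_{t-1}(2)\right)$. (31)

Formula formula_71:
\[
\begin{aligned}
\mathbb{E}[V^{m}_{t}(2)] &= (1-\eta_t)\frac{1}{M}\sum_{j=1}^{M} \mathbb{E}[V^{j}_{t-1}(2)] + \frac{1}{M}\sum_{j=1}^{M} \eta_t\left(1 + \gamma \mathbb{E}[P^{j}_{t}(2|2, 1) V^{j}_{t-1}(2)]\right) \\
&= (1-\eta_t)\frac{1}{M}\sum_{j=1}^{M} \mathbb{E}[V^{j}_{t-1}(2)] + \frac{1}{M}\sum_{j=1}^{M} \eta_t\left(1 + \gamma p\, \mathbb{E}[V^{j}_{t-1}(2)]\right) \\
&= (1-\eta_t(1-\gamma p))\left(\frac{1}{M}\sum_{j=1}^{M} \mathbb{E}[V^{j}_{t-1}(2)]\right) + \eta_t.
\end{aligned}
\] (32)

Formula formula_72: $\frac{1}{M}\sum_{m=1}^{M} \mathbb{E}[V^{m}_{t}(2)] = (1-\eta_t(1-\gamma p)) \frac{1}{M}\sum_{m=1}^{M} \mathbb{E}[V^{m}_{t-1}(2)] + \eta_t$. (33)

Formula formula_73: $\tau := \min\{k \in \mathbb{N} : \forall\, t \ge k,\ \eta_t \le \eta_k \le 3\eta_t\}$. (34)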

Formula formula_75: Q m t is defined in Algorithm 4, where V m t = max a∈{1,2} Q m t (a).

Formula formula_76: 1: Input : T, R, {η t } T t=1 , C = {t r } R r=1 , B 2: Set Q m 0 (a) ← Q ⋆ (1,

Formula formula_77: Q m t (1, a) -Q m t (a) ≥ - 1 1 -γ t i=1 (1 -η i (1 -γ)).

Formula formula_78: ∆ m t (a) := Q m t (a) -Q ⋆ (1, a); a = 1, 2; ∆ m t,max = max a∈{1,2}

Formula formula_79: ∆ t,max := max a∈{1,2} ∆ t (a).(35)

Formula formula_80: ∆ m t (a) = (1 -η t )∆ m t-1 (a) + η t (1 + γ P m t (1|1, a) V m t-1 -Q ⋆ (1, a)) = (1 -η t )∆ m t-1 (a) + η t γ( P m t (1|1, a) V m t-1 -pV ⋆ (1)) = (1 -η t )∆ m t-1 (a) + η t γ(p( V m t-1 -V ⋆ (1)) + ( P m t (1|1, a) -p) V t-1 ) = (1 -η t )∆ m t-1 (a) + η t γ(p∆ m t-1,max + ( P m t (1|1, a) -p) V m t-1

Formula formula_81: ∆ m t,max = max a∈{1,2} ( Q m t (a) -Q ⋆ (1, a)) = max a∈{1,2} Q m t (a) -V ⋆ (1) = V m t-1 -V ⋆ (1), as Q ⋆ (1, 1) = Q ⋆ (1, 2) = V ⋆ (1).

Formula formula_82: ∆ m t (a) = t k=t ′ +1 (1 -η k ) ∆ m t ′ (a) + t k=t ′ +1 η (t) k γ(p∆ m k-1,max + ( P m k (1|1, a) -p) V m k-1 ). (36

Formula formula_83: )

Formula formula_84: $\varphi_{t',t} := \prod_{k=t'+1}^{t} (1-\eta_k)$, (37)

Formula formula_85: ξ m t ′ ,t (a) := t k=t ′ +1 η (t) k γ( P m k (1|1, a) -p) V m k-1 , a = 1, 2;(38)

Formula formula_86: ξ m t ′ ,t,max := max a∈{1,2} ξ m t ′ ,t (a).(39)

Formula formula_87: ∆ m t (a) = φ t ′ ,t ∆ m t ′ (a) + t k=t ′ +1 η (t) k γp∆ m k-1,max + ξ m t ′ ,t (a).

Formula formula_88: ∆ m t (a) = φ t ′ ,t ∆ t ′ (a) + t k=t ′ +1 η (t) k γp∆ m k-1,max + ξ m t ′ ,t (a).(40)

Formula formula_89: ∆ t (a) = φ t ′ ,t ∆ t ′ (a) + 1 M M m=1 t k=t ′ +1 η (t) k γp∆ m k-1,max + 1 M M m=1 ξ m t ′ ,t (a).(41)

Formula formula_90: ∆ m t,max ≥ φ t ′ ,t ∆ t ′ ,max + t k=t ′ +1 η (t) k γp∆ m k-1,max + ξ m t ′ ,t,max -φ t ′ ,t |∆ t ′ (1) -∆ t ′ (2)|, (42a) ∆ m t,max ≤ φ t ′ ,t ∆ t ′ ,max + t k=t ′ +1 η (t) k γp∆ m k-1,max + ξ m t ′ ,t,max ,(42b)

Formula formula_91: $\max\{a_1 + b_1, a_2 + b_2\} \ge \min\{a_1, a_2\} + \max\{b_1, b_2\} = \max\{a_1, a_2\} + \max\{b_1, b_2\} - |a_1 - a_2|$. (43)

Formula formula_93: E[∆ m t,max ] ≥ φ t ′ ,t E[∆ t ′ ,max ] + t k=t ′ +1 η (t) k γpE[∆ m k-1,max ] + E[ξ m t ′ ,t,max ] -φ t ′ ,t E[|∆ t ′ (1) -∆ t ′ (2)|],(44a)

Formula formula_94: E[∆ m t,max ] ≤ φ t ′ ,t E[∆ t ′ ,max ] + t k=t ′ +1 η (t) k γpE[∆ m k-1,max ] + E[ξ m t ′ ,t,max ].(44b)

Formula formula_95: ∆ t,max ≥ φ t ′ ,t ∆ t ′ ,max + 1 M M m=1 t k=t ′ +1 η (t) k γp∆ m k-1,max -φ t ′ ,t |∆ t ′ (1) -∆ t ′ (2)| + max 1 M M m=1 ξ m t ′ ,t (1), 1 M M m=1 ξ m t ′ ,t (2) (45a) =⇒ E[∆ t,max ] ≥ φ t ′ ,t E[∆ t ′ ,max ] + 1 M M m=1 t k=t ′ +1 η (t) k γpE[∆ m k-1,max ] -φ t ′ ,t E[|∆ t ′ (1) -∆ t ′ (2)|] + E max 1 M M m=1 ξ m t ′ ,t (1), 1 M M m=1 ξ m t ′ ,t (2) . (45b

Formula formula_96: )

Formula formula_97: E[∆ t,max ] ≥ 1 M M m=1 E[∆ m t,max ] -E[ξ m t ′ ,t,max ] -φ t ′ ,t E[|∆ t ′ (1) -∆ t ′ (2)|] + E max 1 M M m=1 ξ m t ′ ,t (1), 1 M M m=1 ξ m t ′ ,t (2) . (46

Formula formula_98: )

Formula formula_99: E[∆ m t,max ] -E[ξ m t ′ ,t,max ] ≥ t k=t ′ +1 (1 -η k (1 -γp)) E[∆ t ′ ,max ] + E[ξ m t ′ ,t,max ] t k=t ′ +1 η (t) k -1 + -φ t ′ ,t E[|∆ t ′ (1) -∆ t ′ (2)|],

Formula formula_100: E[ξ m t ′ ,t,max ] ≥ 1 240 log 180B η T (1-γ) • ν ν + 1 , E max 1 M M m=1 ξ m t ′ ,t (1), 1 M M m=1 ξ m t ′ ,t (2) ≥ 1 240 log 180BM η T (1-γ) • ν ν + √ M ,

Formula formula_101: E[|∆ t (1) -∆ t (2)|] ≤ 8η T 3BM (1 -γ)

Formula formula_102: E[∆ t,max ] ≥ t k=t ′ +1 (1 -η k (1 -γp)) E[∆ t ′ ,max ] + E[ξ m t ′ ,t,max ] t k=t ′ +1 η (t) k -1 + -2φ t ′ ,t E[|∆ t ′ (1) -∆ t ′ (2)|] + E max 1 M M m=1 ξ m t ′ ,t (1), 1 M M m=1 ξ m t ′ ,t(2)

Formula formula_103: ≥ (1 -η τ (1 -γp)) t-t ′ E[∆ t ′ ,max ] +   1 -(1 -η τ (1 -γp)) t-t ′ 5760 log 180B η T (1-γ) (1 -γp)   • ν ν + 1 • 1 t -t ′ ≥ 8 η τ -2(1 -η T ) t-t ′ 8η T 3BM (1 -γ) + 1 240 log 180BM η T (1-γ) • ν ν + √ M • 1 t -t ′ ≥ 8 η τ ,(47)

Formula formula_104: t k=t ′ +1 η (t) k -1 ≥ 1 -(1 -η τ (1 -γp)) t-t ′ 24(1 -γp)(48)

Formula formula_105: t k=t ′ +1 η (t) k -1 = t k=t ′ +1 η k t i=k+1 (1 -η i (1 -γp)) -1 ≥ t k=t ′ +1 η t t i=k+1 (1 -η τ (1 -γp)) -1 ≥ η t t k=t ′ +1 (1 -η τ (1 -γp)) t-k -1 ≥ η t • 1 -(1 -η τ (1 -γp)) t-t ′ η τ (1 -γp) -1 ≥ 1 -(1 -η τ (1 -γp)) t-t ′ 3(1 -γp) -1. (49

Formula formula_106: )

Formula formula_107: 1 -(1 -η τ (1 -γp)) t-t ′ 3(1 -γp) ≥ 8 7 for t -t ′ ≥ 8/η τ . Thus, for t -t ′ ≥ 8/η τ we have, 1 -(1 -η τ (1 -γp)) t-t ′ 3(1 -γp) ≥ 1 -exp(-η τ (1 -γp) • (t -t ′ )) 3(1 -γp) ≥ 1 -exp(-8(1 -γp)) 3(1 -γp) . (50

Formula formula_108: )

Formula formula_109: $R_{\tau} := \min\{r : t_r \ge \tau\}$. (51)

Formula formula_110: x r := E[∆ tr,max ], α r := (1 -η τ (1 -γp)) tr-tr-1 , β r := (1 -η T ) tr-tr-1 , I r := {r ≥ r ′ > R τ : t r ′ -t r ′ -1 ≥ 8/η τ }, C 1 := 1 5760 log 180B η T (1-γ) (1 -γp) • ν ν + 1 , C 2 := 32η T 3BM (1 -γ) , C 3 := 1 240 log 180BM η T (1-γ) • ν ν + √ M .

Formula formula_111: x r ≥ α r x r-1 -β r C 2 + C 3 1{r ∈ I r } + (1 -α r )C 1 1{r ∈ I r },(52)

Formula formula_112: x r ≥ r i=Rτ +1 α i x Rτ - r k=Rτ +1 β k r i=k+1 α i C 2 + r k=Rτ +1 r i=k+1 α i 1{k ∈ I k }C 3 + C 1   i / ∈Ir α i   1 - i∈Ir α i ,(53)

Formula formula_113: x R ≥ - R k=Rτ β k R i=k+1 α i C 2 + R k=Rτ R i=k+1 α i C 3 1{k ∈ I k } + C 1   i / ∈I R α i   1 - i∈I R α i ≥ -RC 2 + C 1   i / ∈I R α i   1 - i∈I R α i ≥ -R • 32η T 3BM (1 -γ) +   i / ∈I R α i   1 - i∈I R α i • 1 5760 log 180B η T (1-γ) (1 -γp) • ν ν + 1 ,(54)

Formula formula_114: β k R i=k+1 α i ≤ 1 and that C 3 ≥ 0. Consider the expression i / ∈I R α i = i / ∈I R (1 -η τ (1 -γp)) ti-ti-1 ≥ 1 -η τ (1 -γp) • i / ∈I R (t i -t i-1 ) =:T1 . (55

Formula formula_115: )

Formula formula_116: 1 - i∈I R α i = 1 -(1 -η τ (1 -γp)) T -τ -T1 ≥ 1 -exp (-η τ (1 -γp) (T -τ -T 1 )) . (56

Formula formula_117: )

Formula formula_118: T 1 := i / ∈I R (t i -t i-1 ) ≤ (R -|I R |) • 8 η τ ≤ 8R η τ . (57

Formula formula_119: )

Formula formula_120: ). If R ≤ 1 96000(1-γ) log( 180B η(1-γ) )

Formula formula_121: T 1 ≤ 8R η ≤ T 12000 log(180N ) , i / ∈I R α i ≥ 1 -η(1 -γp) • T 1 ≥ 1 - 32R(1 -γ) 3 ≥ 1 - 1 9000 log(180N ) , 1 - i∈I R α i ≥ 1 -exp (-η(1 -γp) (T -T 1 )) ≥ 1 -exp - 4 3 1 - 1 9000 log(180N )

Formula formula_122: x R ≥ √ 40 96000 log 180B η(1-γ) (1 -γ) • ν ν + 1 - ν 5 √ M(58)

Formula formula_123: $f(x) = \frac{x}{x+1} - \frac{x}{5\sqrt{M}}$. We claim that for $x \in [0, \sqrt{M}]$ and all $M \ge 2$, $f(x) \ge \frac{7}{20}\min\{x, 1\}$. (59)
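Claim (59) can be checked on a grid before reading its proof. The following sketch evaluates $f(x) = \frac{x}{x+1} - \frac{x}{5\sqrt{M}}$ on equally spaced points of $[0, \sqrt{M}]$ and compares it with $\frac{7}{20}\min\{x, 1\}$; the grid resolution, the tolerance, and the function names are assumptions made for this illustration.

```python
import math

def f(x, M):
    """f(x) = x / (x + 1) - x / (5 * sqrt(M))."""
    return x / (x + 1) - x / (5 * math.sqrt(M))

def claim_holds(M, grid=2000):
    """Check f(x) >= (7/20) * min{x, 1} on a uniform grid over [0, sqrt(M)]."""
    top = math.sqrt(M)
    for i in range(grid + 1):
        x = i * top / grid
        if f(x, M) < 0.35 * min(x, 1.0) - 1e-12:  # small slack for rounding at x = 0
            return False
    return True
```

The tightest case is $M = 2$ at $x = 1$, where $f(1) = \frac{1}{2} - \frac{1}{5\sqrt{2}} \approx 0.359$, only slightly above $\frac{7}{20}$.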

Formula formula_124: x R ≥ √ 40 96000 log 180B η(1-γ) (1 -γ) • 7 20 • min 1, 20η 3B(1 -γ) ≥ √ 40 96000 log (180N ) • 7 20 • min 1 1 -γ , 20 3(1 -γ) 4 N ,(60)

Formula formula_125: √ x log(1/x) is an increasing function and the relation ν M = 20η 3BM (1 -γ) ≤ 1 15 ≤ 1.

Formula formula_126: 1 6ητ (1-γp) }. It is straightforward to note that max 3T 4 , T - 1 6η τ (1 -γp) = 3T 4 if c η ≥ 3 T - 1 6ητ (1-γp) if c η < 3. If R ≤ 1 384000(1-γ) log 180B η T (1-γ) •(5+cη)

Formula formula_127: T 1 ≤ 8R η τ , i / ∈I R α i ≥ 1 -η τ (1 -γp) • T 1 ≥ 1 - 32R(1 -γ) 3 ≥ 1 - 1 36000

Formula formula_128: 1 - i∈I R α i ≥ 1 -exp (-η τ (1 -γp) (T -t Rτ -T 1 )) ≥ 1 -exp - (1 -γ)T (3 + c η (1 -γ)T ) + 32R(1 -γ) 3 ≥ 1 2(3 + c η )

Formula formula_129: 1 - i∈I R α i ≥ 1 -exp (-η τ (1 -γp) (T -t Rτ -T 1 )) ≥ 1 -exp - 1 6 + 32R(1 -γ) 3 ≥ 1 10 .

Formula formula_130: x R ≥ 18 √ 1.6 384000 log 180B η T (1-γ) (1 -γ)(5 + c η ) • ν ν + 1 - ν 18 √ M ≥ 18 √ 1.6 384000 log 180B η T (1-γ) (1 -γ)(5 + c η ) • 7 20 • min 1, 20η T 3B(1 -γ) ≥ 18 √ 1.6 384000 log 180B η T (1-γ) (5 + c η ) • 7 20 • min 1 1 -γ , 20η T 3B(1 -γ) 3 ≥ 18 √ 1.6 384000 log (180N (1 + log N )) (5 + log N ) • 7 20 • min 1 1 -γ , 20 3B(1 + log N )(1 -γ) 4 N ,(61)

Formula formula_131: √ x log(1/x) is an increasing function and the relation ν M = 20η T 3BM (1 -γ) ≤ 1.

Formula formula_132: T 0 := max{ 5T 12 , 2T 3 - 1 6ητ (1-γp) }. Thus, (47) yields that E[∆ t Rτ ] ≥   1 -(1 -η τ (1 -γp)) T0 5760 log 180 Bη T (1-γ) (1 -γp)   • ν ν + 1 -2(1 -η T ) T0 8η T 3BM (1 -γ) .(62)

Formula formula_133: x R ≥ (1 -η τ (1 -γp)) T -t Rτ   1 -(1 -η τ (1 -γp)) T0 5760 log 180 Bη T (1-γ) (1 -γp)   • ν ν + 1 -2(1 -η T ) T0 • (1 -η τ (1 -γp)) T -t Rτ 8η T 3BM (1 -γ) -RC 2 . (63

Formula formula_134: )

Formula formula_135: 1 -(1 -η τ (1 -γp)) T0 ≥ 1 -exp (-η τ (1 -γp)5T /12) ≥ 1 -exp - 5(1 -γ)T 3(3 + c η (1 -γ)T ) ≥ 1 3 + c η , (1 -η τ (1 -γp)) T -t Rτ ≥ 1 -η τ (1 -γp) T 4 ≥ 1 - (1 -γ)T (3 + c η (1 -γ)T ) ≥ 1 - 1 c η ≥ 2 3 .

Formula formula_136: 1 -(1 -η τ (1 -γp)) T0 ≥ 1 -exp -η τ (1 -γp) 2T 3 + 1 6 ≥ 1 -exp - 8(1 -γ)T 3(3 + c η (1 -γ)T ) + 1 6 ≥ 1 -e -5/18 , (1 -η τ (1 -γp)) T -t Rτ ≥ 1 - η τ (1 -γp) 6η τ (1 -γp) ≥ 5 6 .

Formula formula_137: -η τ (1 -γp)) T -t Rτ (1 -(1 -η τ (1 -γp)) T0 ) ≥ c

Formula formula_138: 1 6ητ (1-γp) }.

Formula formula_139: E[| V m T (1) -V ⋆ (1)|] ≥ E[∆ T,max ] ≥ c 0 log 3 N • min 1 1 -γ , 1 (1 -γ) 4 N .(64)

Formula formula_140: E[|V m T -V ⋆ (1)|] ≥ c 0 log 3 N • min 1 1 -γ , 1 (1 -γ) 4 N - 1 1 -γ T i=1 (1 -η i (1 -γ)). (65)

Formula formula_141: E[|V m T (3) -V ⋆ (3)|] ≥ 1 1 -γ T i=1 (1 -η i (1 -γ)).(66)

Formula formula_142: E[∥Q m T -Q ⋆ ∥ ∞ ] ≥ E [max {|V m T (3) -V ⋆ (3)|, |V m T (1) -V ⋆ (1)|}] ≥ max {E [|V m T (3) -V ⋆ (3)|] , E [|V m T (1) -V ⋆ (1)|]} ≥ max 1 1 -γ T i=1 (1 -η i (1 -γ)), min 1 1 -γ , 1 (1 -γ) 4 N - 1 1 -γ T i=1 (1 -η i (1 -γ)) ≥ 1 2 min 1 1 -γ , 1 (1 -γ) 4 N ,(67)

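Formula formula_143: $\mathrm{CC}_{\mathrm{round}} = O\left(\frac{1}{(1-\gamma)\log^{2} N}\right)$, $\mathrm{ER}(\mathscr{A}; N, M) = \Omega\left(\frac{1}{\log^{3} N \sqrt{N}}\right)$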

Formula formula_144: x Rτ +1 ≥ α Rτ +1 x Rτ -β Rτ +1 C 2 + C 3 1{R τ + 1 ∈ I Rτ +1 } + (1 -α Rτ +1 )C 1 1{R τ + 1 ∈ I Rτ +1 }.

Formula formula_145:   i / ∈I Rτ +1 α i     1 - i∈I Rτ +1 α i   = (1 -α Rτ +1 )1{R τ + 1 ∈ I Rτ +1 }

Formula formula_146: x r+1 ≥ α r+1 x r -β r+1 C 2 + C 3 1{(r + 1) ∈ I r+1 } + (1 -α r+1 )C 1 1{r + 1 ∈ I r+1 } ≥ α r+1 r i=Rτ +1 α i x Rτ -α r+1 r k=Rτ +1 β k r i=k+1 α i C 2 + α r+1 r k=Rτ +1 r i=k+1 α i C 3 1{k ∈ I k } + α r+1 C 1   i / ∈Ir α i   1 - i∈Ir α i -β r+1 C 2 + C 3 1{(r + 1) ∈ I r+1 } + (1 -α r+1 )C 1 1{(r + 1) ∈ I r+1 } ≥ r+1 i=Rτ +1 α i x Rτ - r+1 k=Rτ +1 β k r+1 i=k+1 α i C 2 + r+1 k=Rτ +1 r+1 i=k+1 α i C 3 1{k ∈ I k } + α r+1 C 1   i / ∈Ir α i   1 - i∈Ir α i + (1 -α r+1 )C 1 1{(r + 1) ∈ I r+1 }.(69)

Formula formula_147: If (r + 1) / ∈ I r+1 , then 1 -i∈Ir α i = 1 -i∈Ir+1 α i and α r+1 i / ∈Ir α i = i / ∈Ir+1 α i . Consequently, α r+1 C 1   i / ∈Ir α i   1 - i∈Ir α i + (1 -α r+1 )C 1 1{(r + 1) ∈ I r+1 } = C 1   i / ∈Ir+1 α i     1 - i∈Ir+1 α i   .(70)

Formula formula_148: ∈Ir α i = i /

Formula formula_149: α r+1 C 1   i / ∈Ir α i   1 - i∈Ir α i + (1 -α r+1 )C 1 1{(r + 1) ∈ I r+1 } = α r+1 C 1   i / ∈Ir+1 α i   1 - i∈Ir α i + (1 -α r+1 )C 1 ≥ C 1   i / ∈Ir+1 α i   α r+1 1 - i∈Ir α i + (1 -α r+1 ) ≥ C 1   i / ∈Ir+1 α i     1 - i∈Ir+1 α i   .(71)

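Formula formula_150: $f(x) = \frac{x}{x+1} - \frac{x}{5\sqrt{M}} \ge x \cdot \frac{1}{2} - \frac{x}{5\sqrt{M}} \ge \frac{7x}{20}$, (72)

Formula formula_151: $f(1) = \frac{1}{2} - \frac{1}{5\sqrt{M}} \ge \frac{7}{20}; \quad f(\sqrt{M}) = \frac{\sqrt{M}}{\sqrt{M}+1} - \frac{1}{5} \ge \frac{7}{20}$.

Formula formula_152: $x \in [1, \sqrt{M}]$, $f(x) \ge \frac{7}{20}$. (73)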

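Formula formula_153: $V_{t}(2) := \frac{1}{M}\sum_{m=1}^{M} V^{m}_{t}(2)$. (74)

Formula formula_154: $\mathbb{E}[V_{t}(2)] = (1-\eta_t(1-\gamma p))\mathbb{E}[V_{t-1}(2)] + \eta_t$.

Formula formula_155: $\mathbb{E}[V_{T}(2)] = \left(\prod_{k=t+1}^{T} (1-\eta_k(1-\gamma p))\right) \mathbb{E}[V_{t}(2)] + \sum_{k=t+1}^{T} \eta^{(T)}_{k}$.

Formula formula_156: $V^{\star}(2) - \mathbb{E}[V_{T}(2)] = \prod_{k=t+1}^{T} (1-\eta_k(1-\gamma p)) \left[\frac{1}{1-\gamma p} - \mathbb{E}[V_{t}(2)]\right]$. (75)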

Formula formula_157: τ ′ := min 0 ≤ t ′ ≤ T -2 E[(V t ) 2 ] ≥ 1 4(1 -γ) 2 for all t ′ + 1 ≤ t ≤ T . If such a τ ′ does not exist, it implies that either E[(V T ) 2 ] < 1 4(1-γ) 2 or E[(V T -1 ) 2 ] < 1 4(1-γ) 2 . If the former is true, then, V ⋆ (2) -E[V T (2)] = 3 4(1 -γ) -E[(V T ) 2 ] > 1 4(1 -γ) . (76

Formula formula_158: ) Similarly, if E[(V T -1 ) 2 ] < 1 4(1-γ) 2 , it implies E[V T -1 ] < 1 2(1-γ)

Formula formula_159: E[V T (2)] = (1 -η T (1 -γp))E[V T -1 (2)] + η T ≤ E[V T -1 (2)] + 1 < 1 2(1 -γ) + 1 6(1 -γ) = 2 3(1 -γ) .

Formula formula_160: V ⋆ (2) -E[V T (2)] > 3 4(1 -γ) - 2 3(1 -γ) > 1 12(1 -γ) . (77

Formula formula_161: )

Formula formula_162: T k=τ ′ +1 (1 -η k (1 -γp)) ≥ 1 2 . (78

Formula formula_163: )

Formula formula_164: V ⋆ (2) -E[V T (2)] = T k=τ ′ +1 (1 -η k (1 -γp)) 1 1 -γp -E[V τ ′ (2)] ≥ 1 2 • 3 4(1 -γ) -E[(V τ ′ (2)) 2 ] ≥ 1 2 • 3 4(1 -γ) - 1 2(1 -γ) ≥ 1 8(1 -γ) ,(79)

Formula formula_165: 0 ≤ T k=τ ′ +1 (1 -η k (1 -γp)) < 1 2 . (80

Formula formula_166: )

Formula formula_167: V m t (2) = 1 1 -γp - t k=t ′ +1 (1 -η k (1 -γp)) 1 1 -γp -V m t ′ (2) + k=t ′ +1 η (t) k γ( P m k (2|2) -p)V m k-1 (2).

Formula formula_168: V T (2) = 1 1 -γp - T k=t ′ +1 (1 -η k (1 -γp)) 1 1 -γp -V t ′ (2) + 1 M M m=1 T k=t ′ +1 η (T ) k γ( P m k (2|2) -p)V m k-1 (2). (81

Formula formula_169: ) Let {F t } T t=0 be a filtration such that F t is the σ-algebra corre- sponding to {{ P m s (2|2)} M m=1 } t s=1 . It is straightforward to note that 1 M M m=1 η (T ) k γ( P m k (2|2) -p)V m k-1 (2)

Formula formula_170: Var(V T (2)) ≥ E T k=τ ′ +2 Var 1 M M m=1 η (T ) k γ( P m k (2|2) -p)V m k-1 (2) F k-1 . (82

Formula formula_171: )

Formula formula_172: Var 1 M M m=1 η (T ) k γ( P m k (2|2) -p)V m k-1 (2) F k-1 = 1 M 2 M m=1 Var η (T ) k γ( P m k (2|2) -p)V m k-1 (2) F k-1 = (η (T ) k ) 2 BM γ 2 p(1 -p) 1 M M m=1 (V m k-1 (2)) 2 ≥ (1 -γ)(4γ -1) 9BM • (η (T ) k ) 2 • (V k-1 (2)) 2 , (83

Formula formula_173: )

Formula formula_174: Var(V T (2)) ≥ (1 -γ)(4γ -1) 9BM • T k=τ ′ +2 (η (T ) k ) 2 • E[(V k-1 (2)) 2 ] ≥ (1 -γ)(4γ -1) 9BM • 1 4(1 -γ) 2 • T k=max{τ,τ ′ }+2 (η (T ) k ) 2 , (84

Formula formula_175: )

Formula formula_176: T k=max{τ ′ ,τ }+2 η (T ) k 2 ≥ T k=max{τ ′ ,τ }+2 η k T i=k+1 (1 -η i (1 -γp) 2 ≥ T k=max{τ ′ ,τ }+2 η T t i=k+1 (1 -η τ (1 -γp)) 2 = η 2 T T k=max{τ ′ ,τ }+2 (1 -η τ (1 -γp)) 2(t-k) ≥ η 2 T • 1 -(1 -η τ (1 -γp)) 2(T -max{τ ′ ,τ }-1) η τ (1 -γp)(2 -η τ (1 -γp)) ≥ η T • 1 4(1 -γ) • c ′ , (85

Formula formula_177: )

Formula formula_178: 1 -(1 -η τ (1 -γp)) 2(T -max{τ ′ ,τ }-1) ≥ 1 -e -8/9

Formula formula_179: 1 -exp - 8 3 max{1,cη}

Formula formula_180: Var(V T (2)) ≥ (4γ -1) 36BM (1 -γ) • T k=τ ′ +2 (η (T ) k ) 2 ≥ c ′ (4γ -1) 144(1 -γ) • η T BM (1 -γ) ≥ c ′ (4γ -1) 144(1 -γ) • 1 100 , (87

Formula formula_181: )

Formula formula_182: E[(V ⋆ (2) -V T (2)) 2 ] = E[(V ⋆ (2) -E[V T (2)]) 2 ] + Var(V T (2)) ≥ c ′′ (1 -γ)N ,

Formula formula_183: (1 -η τ (1 -γp)) 2(T -max{τ ′ ,τ }-1) = (1 -η τ (1 -γp)) 2(T -τ ′ -1) ≤ (1 -η τ (1 -γp)) T -τ ′ ≤ T k=τ ′ +1 (1 -η k (1 -γp)) ≤ 1 2 , (88

Formula formula_184: )

Formula formula_185: (1 -η τ (1 -γp)) 2(T -max{τ ′ ,τ }-1) = (1 -η τ (1 -γp)) 2(T -τ -1) ≤ (1 -η τ (1 -γp)) T -τ ≤ exp - 2T η τ (1 -γp) 3 . (89

Formula formula_186: )

Formula formula_187: exp - 2T η τ (1 -γp) 3 ≤ exp - 2T 3 • 1 (1 -γ)T • 4(1 -γ) 3 = exp - 8 9 (90

Formula formula_188: )

Formula formula_189: exp - 2T η τ (1 -γp) 3 ≤ exp - 2T 3 • 1 1 + c η (1 -γ)T /3 • 4(1 -γ) 3 = exp - 8 3 max{1, c η } (91)

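Formula formula_190:
$\mathcal{A}_1 = \{1, 2, \ldots, |\mathcal{A}|\}$: $P(1|1, a) = p$, $P(0|1, a) = 1-p$, $r(1, a) = 1$, $\forall\, a \in \mathcal{A}$ (92b)
$\mathcal{A}_2 = \{1\}$: $P(2|2, 1) = p$, $P(0|2, 1) = 1-p$, $r(2, 1) = 1$, (92c)
$\mathcal{A}_3 = \{1\}$: $P(3|3, 1) = 1$, $r(3, 1) = 1$, (92d)

Formula formula_191: $\mathrm{CC}_{\mathrm{bit}} = \Omega\left(\frac{1}{(1-\gamma)\log^{2} N}\right)$

Formula formula_192: $\mathrm{CC}_{\mathrm{bit}} = \Omega\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\log^{2} N}\right)$.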

Formula formula_193: V m t-1 (1) = max a∈{1,2} Q m t-1 (1, a) ≥ max a∈{1,2} Q m t-1 (a) - 1 1 -γ t-1 i=1 (1 -η i (1 -γ)) = V m t-1 - 1 1 -γ t-1 i=1 (1 -η i (1 -γ)). (93) For t / ∈ {t r } R r=1 and a ∈ {1, 2}, we have, Q m t (1, a) -Q m t (a) = Q m t-1/2 (1, a) -Q m t-1/2 (a) = (1 -η t )Q m t-1 (1, a) + η t (1 + γ P m t (1|1, a)V m t-1 (1)) -(1 -η t ) Q m t-1 (a) + η t (1 + γ P m t (1|1, a) V m t-1 ) = (1 -η t )(Q m t-1 (1|1, a) -Q m t-1 (a)) + η t γ P m t (1|1, a)(V m t-1 (1) -V m t-1 ) ≥ - (1 -η t ) 1 -γ t-1 i=1 (1 -η i (1 -γ)) -P m t (1|1, a) • η t γ 1 -γ t-1 i=1 (1 -η i (1 -γ)) ≥ - (1 -η t ) 1 -γ t-1 i=1 (1 -η i (1 -γ)) - η t γ 1 -γ t-1 i=1 (1 -η i (1 -γ)) ≥ - 1 1 -γ t i=1 (1 -η i (1 -γ)). (94

Formula formula_194: )

Formula formula_195: Q m t (1, a) -Q m t (a) = 1 M M m=1 Q m t-1/2 (1, a) - 1 M M m=1 Q m t-1/2 (a) = 1 M M m=1 (1 -η t )Q m t-1 (1, a) + η t (1 + γ P m t (1|1, a)V m t-1 (1)) - 1 M M m=1 (1 -η t ) Q m t-1 (a) + η t (1 + γ P m t (1|1, a) V m t-1 ) = 1 M M m=1 (1 -η t )(Q m t-1 (1, a) -Q m t-1 (a)) + η t γ P m t (1|1, a)(V m t-1 (1) -V m t-1 ) ≥ - 1 1 -γ t i=1 (1 -η i (1 -γ)), (95

Formula formula_196: )

Formula formula_197: E[∆ m t,max ] ≥ φ t ′ ,t E[∆ t ′ ,max ] + t k=t ′ +1 η (t) k γpE[∆ m k-1,max ] + E[ξ m t ′ ,t,max ] -φ t ′ ,t E[|∆ t ′ (1) -∆ t ′ (2)|].

Formula formula_198: y t = (1 -η t )y t-1 + η t (γpy t-1 + E[ξ m t ′ ,t,max ]). (96

Formula formula_199: )

Formula formula_200: y t = t k=t ′ +1 (1 -η k ) y t ′ + t k=t ′ +1 η k t i=k+1 (1 -η i ) γpy k-1 + t k=t ′ +1 η k t i=k+1 (1 -η i ) E[ξ m t ′ ,t,max ] = φ t ′ ,t y t ′ + t k=t ′ +1 η (t) k γpy k-1 + t k=t ′ +1 η (t) k E[ξ m t ′ ,t,max ]. (97

Formula formula_201: )

Formula formula_202: y t ′ = E[∆ t ′ ,max ] in (

Formula formula_203: E[∆ m t,max ] ≥ y t -φ t ′ ,t E[|∆ t ′ (1) -∆ t ′ (2)|],

Formula formula_204: t k=t ′ +1 η (t)

Formula formula_205: y t = (1 -η t (1 -γp))y t-1 + η t E[ξ m t ′ ,t,max ],

Formula formula_206: y t = t k=t ′ +1 (1 -η k (1 -γp)) y t ′ + E[ξ m t ′ ,t,max ] t k=t ′ +1 η (t) k . (98

Formula formula_207: )

Formula formula_208: E[∆ m t,max ] -E[ξ m t ′ ,t,max ] ≥ t k=t ′ +1 (1 -η k (1 -γp)) E[∆ t ′ ,max ] + E[ξ m t ′ ,t,max ] t k=t ′ +1 η (t) k -1 -φ t ′ ,t E[|∆ t ′ (1) -∆ t ′ (2)|].(99)

Formula formula_209: w t = (1 -η t )w t-1 + η t (γpw t-1 ). (100

Formula formula_210: )

Formula formula_211: w t ′ = E[∆ t ′ ,max ], then E[∆ m t,max ] ≥ w t + E[ξ m t ′ ,t,max ] -φ t ′ ,t E[|∆ t ′ (1) -∆ t ′ (2)|] and consequently, E[∆ m t,max ] ≥ t k=t ′ +1 (1 -η k (1 -γp)) E[∆ t ′ ,max ] + E[ξ m t ′ ,t,max ] -φ t ′ ,t E[|∆ t ′ (1) -∆ t ′ (2)|].(101)

Formula formula_212: E[ξ m t ′ ,t,max ] = E ξ m t ′ ,t (1) + ξ m t ′ ,t (2) 2 + ξ m t ′ ,t (1) -ξ m t ′ ,t (2) 2 = 1 2 E ξ m t ′ ,t (1) -ξ m t ′ ,t (2) 2 = 1 2 E t k=t ′ +1 η (t) k γ( P m k (1|1, 1) -P m k (1|1, 2)) V m k-1 =:ζ m t ′ ,t ,(102)

Formula formula_213: E[ξ m t ′ ,t (1)] = E[ξ m t ′ ,t (2)] = 0. Decompose ζ m t ′ ,tas

Formula formula_214: ζ m t ′ ,t = t k=t ′ +1 B b=1 η (t) k γ B (P m k,b (1|1, 1) -P m k,b (1|1, 2)) V m k-1 =: L l=1 z l ,(103)

Formula formula_215: 1 ≤ l ≤ L z l := γ B (P m k(l),b(l) (1|1, 1) -P m k(l),b(l) (1|1, 2)) V m k(l)-1 with k(l) := ⌊l/B⌋ + t ′ + 1; b(l) = ((l -1) mod B) + 1; L = (t -t ′ )B. Let {F l } L l=1 be a filtration such that F l is the σ-algebra corresponding to {P m k(j),b(j) (1|1, 1), P m k(j),b(j) (1|1, 2)} l j=1 .

Formula formula_216: sup l |z l | ≤ sup l η (t) k(l) • γ B • (P m k(l),b(l) (1|1, 1) -P m k(l),b(l) (1|1, 2)) • V m k(l)-1 ≤ η (t) k(l) • γ B(1 -γ) ≤ η t B(1 -γ) ,(104)

Formula formula_217: |(P m k(l),b(l) (1|1, 1) -P m k(l),b(l) (1|1, 2))| ≤ 1 and V m k(l)-1 ≤ 1 1-γ and the third step uses c η ≤ 1 1-γ and the fact that η (T ) k is increasing in k in this regime. (cf. (19)). • Similarly, Var(z l |F l-1 ) ≤ η (t) k(l) 2 γ 2 B 2 • V m k(l)-1 2 • Var(P m k(l),b(l) (1|1, 1) -P m k(l),b(l) (1|1, 2)) ≤ η (t) k(l) 2 γ 2 B 2 (1 -γ) 2 • 2p(1 -p) ≤ 2 η (t) k(l) 2 3B 2 (1 -γ) .(105)

Formula formula_218: Pr   |ζ m t ′ ,t | ≥ 8 log(2/δ) 3B 2 (1 -γ) L l=1 η (t) k(l) 2 + 4η t log(2/δ) 3B(1 -γ)   ≤ δ.(106)

Formula formula_219: δ 0 = (1-γ) 2 2 • E[|ζ m t ′ ,t | 2 ]

Formula formula_220: |ζ m t ′ ,t | ≥ 8 log(2/δ 0 ) 3B(1 -γ) t k=t ′ +1 η (t) k 2 + 4η t log(2/δ 0 ) 3B(1 -γ) =: D.(107)

Formula formula_221: E[ξ m t ′ ,t,max ] = 1 2 E[|ζ m t ′ ,t |] ≥ 1 2 E[|ζ m t ′ ,t |1{|ζ m t ′ ,t | ≤ D}] ≥ 1 2D E[|ζ m t ′ ,t | 2 1{|ζ m t ′ ,t | ≤ D}] ≥ 1 2D E[|ζ m t ′ ,t | 2 ] -E[|ζ m t ′ ,t | 2 1{|ζ m t ′ ,t | > D}] ≥ 1 2D E[|ζ m t ′ ,t | 2 ] - Pr(|ζ m t ′ ,t | > D) (1 -γ) 2 ≥ 1 4D • E[|ζ m t ′ ,t | 2 ].(108)

Formula formula_222: |ζ m t ′ ,t | ≤ t k=t ′ +1 η (t) k (1 -γ) ≤ 1 (1 -γ)

Formula formula_223: E[|ζ m t ′ ,t | 2 ] in order obtain a lower bound for E[ξ m t ′ ,t,max ].

Formula formula_224: E[|ζ m t ′ ,t | 2 ].

Formula formula_225: E V m t 2 ≥ 1 2(1 -γ) 2 .

Formula formula_226: E[|ζ m t ′ ,t | 2 ] = E L l=1 Var (z l |F l-1 ) = E L l=1 E z 2 l |F l-1 ≥ L l=1 η (t) k(l) 2 γ 2 B 2 • 2p(1 -p) • E V m k(l)-1 2 ≥ L l=1 η (t) k(l) 2 γ 2 B 2 • 2p(1 -p) • 1 2(1 -γ) 2 ≥ 2 9B(1 -γ) • t k=max{t ′ ,τ }+1 η (t) k 2 ,(109)

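Formula formula_227: $\sum_{k=\max\{t',\tau\}+1}^{t} \bigl(\eta^{(t)}_k\bigr)^2$ for $t - \max\{t',\tau\} \ge 1/\eta_\tau$. We have $$\begin{aligned} \sum_{k=\max\{t',\tau\}+1}^{t} \bigl(\eta^{(t)}_k\bigr)^2 &\ge \sum_{k=\max\{t',\tau\}+1}^{t} \Bigl(\eta_k \prod_{i=k+1}^{t} (1-\eta_i)\Bigr)^2 \overset{(i)}{\ge} \sum_{k=\max\{t',\tau\}+1}^{t} \Bigl(\eta_t \prod_{i=k+1}^{t} (1-\eta_\tau)\Bigr)^2 \\ &= \eta_t^2 \sum_{k=\max\{t',\tau\}+1}^{t} (1-\eta_\tau)^{2(t-k)} \ge \eta_t^2 \cdot \frac{1-(1-\eta_\tau)^{2(t-\max\{t',\tau\})}}{\eta_\tau(2-\eta_\tau)} \\ &\ge \eta_t \cdot \frac{1-\exp(-2)}{6} \ge \frac{\eta_t}{10} \ge \frac{\eta_T}{10}, \end{aligned} \tag{110}$$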

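Formula formula_228: $$\begin{aligned} D &= \sqrt{\frac{8\log(2/\delta_0)}{3B(1-\gamma)} \sum_{k=t'+1}^{t} \bigl(\eta^{(t)}_k\bigr)^2} + \frac{4\eta_t \log(2/\delta_0)}{3B(1-\gamma)} \\ &\le \frac92\, \mathbb{E}[|\zeta^m_{t',t}|^2] \cdot \sqrt{\frac{8\log(2/\delta_0)}{3}} \cdot \Bigl(\frac{1}{B(1-\gamma)} \sum_{k=t'+1}^{t} \bigl(\eta^{(t)}_k\bigr)^2\Bigr)^{-1/2} + 60 \cdot \mathbb{E}[|\zeta^m_{t',t}|^2] \cdot \log(2/\delta_0) \\ &\le 3\,\mathbb{E}[|\zeta^m_{t',t}|^2] \cdot \log(2/\delta_0) \Bigl( \sqrt{\frac{60B(1-\gamma)}{\eta_t}} + 20 \Bigr) \le 60\,\mathbb{E}[|\zeta^m_{t',t}|^2] \cdot \log(2/\delta_0) \Bigl( \sqrt{\frac{3B(1-\gamma)}{20\eta_T}} + 1 \Bigr), \end{aligned}$$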

Formula formula_229: $$\mathbb{E}[\xi^m_{t',t,\max}] \ge \frac{1}{240\log(2/\delta_0)} \cdot \frac{\nu}{\nu+1}, \tag{111}$$

Formula formula_230: $\delta_0 = \frac{(1-\gamma)^2}{2} \cdot \mathbb{E}[|\zeta^m_{t',t}|^2] \ge \frac{(1-\gamma)}{9B} \cdot \sum_{k=t'+1}^{t} \bigl(\eta^{(t)}_k\bigr)^2 \ge \frac{\eta_T(1-\gamma)}{90B}$.

Formula formula_231: E max 1 M M m=1 ξ m t ′ ,t (1), 1 M M m=1 ξ m t ′ ,t(2

Formula formula_232: $$\Delta_t(1) - \Delta_t(2) = \prod_{k=t'+1}^{t} (1-\eta_k)\,\bigl(\Delta_{t'}(1) - \Delta_{t'}(2)\bigr) + \frac{1}{M} \sum_{m=1}^{M} \sum_{k=t'+1}^{t} \eta_k \prod_{i=k+1}^{t} (1-\eta_i)\, \gamma \bigl(P^m_k(1|1,1) - P^m_k(1|1,2)\bigr) V^m_{k-1}.$$

Formula formula_233: $$\Delta_t(1) - \Delta_t(2) = \sum_{k=1}^{t} \sum_{m=1}^{M} \eta_k \prod_{i=k+1}^{t} (1-\eta_i)\, \frac{\gamma}{M} \bigl(P^m_k(1|1,1) - P^m_k(1|1,2)\bigr) V^m_{k-1}.$$

Formula formula_234: $f$ such that $f(Z_t)$ is integrable for all $t$, $\{f(Z_t)\}_t$ is a submartingale adapted to $\{\mathcal{G}_t\}_t$. Since $f(x) = |x|$ is a convex function, $\{|\Delta_t(1) - \Delta_t(2)|\}_t$ is a submartingale adapted to the filtration $\{\mathcal{F}_t\}_t$. As a result, $$\sup_{1 \le t \le T} \mathbb{E}\bigl[|\Delta_t(1) - \Delta_t(2)|\bigr] \le \mathbb{E}\bigl[|\Delta_T(1) - \Delta_T(2)|\bigr] \le \Bigl(\mathbb{E}\bigl[(\Delta_T(1) - \Delta_T(2))^2\bigr]\Bigr)^{1/2}. \tag{112}$$

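Formula formula_236: $$\begin{aligned} \mathbb{E}\Bigl[\Bigl(\sum_{i=1}^{t} X_i\Bigr)^2\Bigr] &= \mathbb{E}\Bigl[\mathbb{E}\Bigl[\Bigl(\sum_{i=1}^{t} X_i\Bigr)^2 \Bigm| \mathcal{G}_{t-1}\Bigr]\Bigr] = \mathbb{E}\Bigl[\mathbb{E}\Bigl[X_t^2 + 2X_t \sum_{i=1}^{t-1} X_i + \Bigl(\sum_{i=1}^{t-1} X_i\Bigr)^2 \Bigm| \mathcal{G}_{t-1}\Bigr]\Bigr] \\ &= \mathbb{E}\bigl[X_t^2\bigr] + \mathbb{E}\Bigl[\Bigl(\sum_{i=1}^{t-1} X_i\Bigr)^2\Bigr] = \sum_{i=1}^{t} \mathbb{E}\bigl[X_i^2\bigr], \end{aligned} \tag{113}$$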
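The orthogonality identity (113) can be checked exactly on a small example. The sketch below is illustrative only: it assumes a hypothetical martingale difference sequence $X_i = c_i \varepsilon_i$ with i.i.d. fair signs $\varepsilon_i \in \{-1, +1\}$ and a scale $c_i$ that depends only on earlier signs (the function `predictable_scale` is made up for this purpose), and it enumerates all sign paths rather than sampling.

```python
from fractions import Fraction
from itertools import product


def predictable_scale(history):
    # Hypothetical G_{i-1}-measurable scale: depends only on past signs.
    return 1 + sum(1 for s in history if s > 0)


def moments(t):
    """Return exact (E[(sum_i X_i)^2], sum_i E[X_i^2]) over all 2^t sign paths."""
    lhs_total = 0
    rhs_total = 0
    for signs in product((-1, 1), repeat=t):
        # X_i = c_i(past signs) * eps_i is a martingale difference.
        xs = [predictable_scale(signs[:i]) * signs[i] for i in range(t)]
        lhs_total += sum(xs) ** 2
        rhs_total += sum(x * x for x in xs)
    n_paths = 2 ** t
    return Fraction(lhs_total, n_paths), Fraction(rhs_total, n_paths)


lhs, rhs = moments(6)
print(lhs, rhs, lhs == rhs)
```

The cross terms cancel because each $\varepsilon_j$ has zero mean given the past, which is the same step used in (113).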

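Formula formula_237: $$\begin{aligned} \sup_{1 \le t \le T} \mathbb{E}\bigl[|\Delta_t(1) - \Delta_t(2)|\bigr] &\le \Bigl(\mathbb{E}\bigl[(\Delta_T(1) - \Delta_T(2))^2\bigr]\Bigr)^{1/2} \\ &\le \Bigl( \sum_{k=1}^{T} \mathbb{E}\Bigl[\Bigl( \sum_{m=1}^{M} \eta^{(T)}_k \cdot \frac{\gamma}{M} \cdot \bigl(P^m_k(1|1,1) - P^m_k(1|1,2)\bigr) V^m_{k-1} \Bigr)^2\Bigr] \Bigr)^{1/2} \\ &\le \Bigl( \sum_{k=1}^{T} \bigl(\eta^{(T)}_k\bigr)^2 \cdot \frac{2\gamma^2 p(1-p)}{BM^2} \cdot \sum_{m=1}^{M} \mathbb{E}\bigl[(V^m_{k-1})^2\bigr] \Bigr)^{1/2} \le \Bigl( \sum_{k=1}^{T} \bigl(\eta^{(T)}_k\bigr)^2 \cdot \frac{2\gamma^2 p(1-p)}{BM(1-\gamma)^2} \Bigr)^{1/2}. \end{aligned} \tag{114}$$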

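Formula formula_239: $$\sum_{k=1}^{T} \bigl(\eta^{(T)}_k\bigr)^2 = \sum_{k=1}^{T} \Bigl(\eta_k \prod_{i=k+1}^{T} (1-\eta_i)\Bigr)^2 = \sum_{k=1}^{T} \eta^2 (1-\eta)^{2(T-k)} \le \frac{\eta^2}{1-(1-\eta)^2} \le \eta. \tag{115}$$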
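As a quick numerical sanity check of (115), the following sketch evaluates the sum of squared effective step sizes for a constant step size and compares it with the geometric-series cap $\eta^2 / (1 - (1-\eta)^2) = \eta/(2-\eta)$. The function names and the test grid of $(\eta, T)$ values are illustrative choices, not taken from the paper.

```python
def squared_weight_sum(eta, T):
    """Sum over k = 1..T of (eta * (1 - eta)**(T - k))**2 (constant step size)."""
    return sum((eta * (1 - eta) ** (T - k)) ** 2 for k in range(1, T + 1))


def geometric_cap(eta):
    """Closed-form cap eta^2 / (1 - (1 - eta)^2), which equals eta / (2 - eta)."""
    return eta ** 2 / (1 - (1 - eta) ** 2)


for eta in (0.01, 0.1, 0.5, 0.9):
    for T in (1, 10, 1000):
        total = squared_weight_sum(eta, T)
        # Both comparisons should print True; 1e-12 absorbs float rounding.
        print(eta, T, total <= geometric_cap(eta) + 1e-12, geometric_cap(eta) <= eta)
```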

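Formula formula_240: $$\begin{aligned} \sum_{k=1}^{T} \bigl(\eta^{(T)}_k\bigr)^2 &= \sum_{k=1}^{\tau} \bigl(\eta^{(T)}_k\bigr)^2 + \sum_{k=\tau+1}^{T} \Bigl(\eta_k \prod_{i=k+1}^{T} (1-\eta_i)\Bigr)^2 \le \sum_{k=1}^{\tau} \bigl(\eta^{(T)}_\tau\bigr)^2 + \sum_{k=\tau+1}^{T} \eta_k^2 (1-\eta_T)^{2(T-k)} \\ &\le \eta_\tau^2 (1-\eta_T)^{2(T-\tau)} \cdot \tau + \eta_\tau^2 \cdot \frac{1}{\eta_T(2-\eta_T)} \le 3\eta_T \cdot \eta_T \cdot T \cdot \exp\Bigl(-\frac{4T\eta_T}{3}\Bigr) + 3\eta_T \\ &\le \frac{9}{4e}\,\eta_T + 3\eta_T \le 4\eta_T, \end{aligned} \tag{116}$$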
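The second-to-last step of (116) relies on the elementary fact that $x \mapsto x\,e^{-4x/3}$ is maximized at $x = 3/4$ with value $3/(4e)$, applied with $x = \eta_T T$. A minimal numerical sketch of that step follows; the grid resolution and range are arbitrary choices made here for illustration.

```python
import math


def g(x):
    """The function x * exp(-4x/3) that appears with x = eta_T * T."""
    return x * math.exp(-4.0 * x / 3.0)


# Coarse grid search on [0, 20]; the grid contains the maximizer x = 0.75.
grid_max = max(g(i / 1000.0) for i in range(0, 20001))
closed_form = 3.0 / (4.0 * math.e)

print(grid_max, closed_form)
# Constant in front of eta_T after the substitution: 3 * max g + 3.
print(3.0 * closed_form + 3.0 <= 4.0)
```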

Formula formula_241: (T ) k

Formula formula_242: $$\sup_{1 \le t \le T} \mathbb{E}\bigl[|\Delta_t(1) - \Delta_t(2)|\bigr] \le \sqrt{\frac{8\eta_T}{3BM(1-\gamma)}}, \tag{117}$$

Formula formula_243: 1: r ← 1, Q m 0 = Q ⋆ (1, 1

Formula formula_244: Q m t-1/2 ← (1 -η t )Q m t-1 (a) + η t (1 + P m t (1|1, 1)Q m t-1 ) 5:

Formula formula_245: It is straightforward to note that Q m t (1) ≥ Q m t

Formula formula_246: Q m t-1/2 (1) = (1 -η t ) Q m t-1 (1) + η t (1 + γ P m t (1|1, 1) V m t-1 ) ≥ (1 -η t ) Q m t-1 (1) + η t (1 + γ P m t (1|1, 1) Q m t-1 (1)) ≥ (1 -η t )Q m t-1 + η t (1 + γ P m t (1|1, 1)Q m t-1 ) = Q m t-1/2 . Since Q m t and Q m t

Formula formula_247: Q m t (1) ≥ Q m t . Since V m t ≥ Q m t (1) ≥ Q m t

Formula formula_248: $$\mathbb{E}[Q^m_t] = \frac{1}{1-\gamma p}. \tag{118}$$

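Formula formula_249: $$\mathbb{E}\bigl[(V^m_t)^2\bigr] \ge \bigl(\mathbb{E}[V^m_t]\bigr)^2 \ge \bigl(\mathbb{E}[Q^m_t]\bigr)^2 \ge \Bigl(\frac{1}{1-\gamma p}\Bigr)^2 \ge \frac{1}{2(1-\gamma)^2},$$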
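The last inequality in the chain above depends on how $p$ is tied to $\gamma$ in the hard instance, which is not restated in this part of the appendix. The sketch below assumes the choice $p = (4\gamma - 1)/(3\gamma)$; this is an assumption made here for illustration and should be checked against the instance definition earlier in the paper. Under that assumption $1 - \gamma p = 4(1-\gamma)/3$, so the ratio of the two sides is the constant $9/8$.

```python
from fractions import Fraction


def p_of(gamma):
    # ASSUMPTION: hard-instance parameter p = (4*gamma - 1) / (3*gamma).
    return (4 * gamma - 1) / (3 * gamma)


def ratio(gamma):
    """Ratio of (1 / (1 - gamma*p))^2 to 1 / (2 * (1 - gamma)^2)."""
    p = p_of(gamma)
    lhs = (1 / (1 - gamma * p)) ** 2
    rhs = 1 / (2 * (1 - gamma) ** 2)
    return lhs / rhs


for gamma in (Fraction(3, 4), Fraction(9, 10), Fraction(99, 100)):
    print(gamma, ratio(gamma))
```

A ratio of at least 1 for every tested $\gamma$ is what the inequality requires.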

Formula formula_250: m 0 ] = 1 1-γp holds by choice of initialization. Assume that E[Q m t-1 ] = 1

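Formula formula_251: $$\begin{aligned} Q^m_t &= (1-\eta_t) Q^m_{t-1} + \eta_t \bigl(1 + \gamma P^m_t(1|1,1)\, Q^m_{t-1}\bigr) \\ \implies \mathbb{E}[Q^m_t] &= (1-\eta_t)\, \mathbb{E}[Q^m_{t-1}] + \eta_t \bigl(1 + \gamma\, \mathbb{E}[P^m_t(1|1,1)\, Q^m_{t-1}]\bigr) = (1-\eta_t)\, \mathbb{E}[Q^m_{t-1}] + \eta_t \bigl(1 + \gamma p\, \mathbb{E}[Q^m_{t-1}]\bigr) \\ &= \frac{1-\eta_t}{1-\gamma p} + \eta_t \Bigl(1 + \frac{\gamma p}{1-\gamma p}\Bigr) = \frac{1}{1-\gamma p}. \end{aligned} \tag{119}$$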
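The induction step in (119) says that $1/(1-\gamma p)$ is a fixed point of the expected update for any step size. A small exact-arithmetic sketch of this is given below; the values of $\gamma$, $p$ and the step-size schedule $\eta_t = 1/(t+1)$ are arbitrary illustrative choices, not the ones used in the paper.

```python
from fractions import Fraction


def expected_update(mean_q, eta, gamma, p):
    """E[Q_t] given E[Q_{t-1}] = mean_q, when the fresh transition sample is
    independent of Q_{t-1} and has mean p."""
    return (1 - eta) * mean_q + eta * (1 + gamma * p * mean_q)


gamma, p = Fraction(9, 10), Fraction(2, 3)  # illustrative values only
mean_q = 1 / (1 - gamma * p)                # initialization E[Q_0]
for t in range(1, 50):
    eta = Fraction(1, t + 1)                # hypothetical step-size schedule
    mean_q = expected_update(mean_q, eta, gamma, p)

print(mean_q, mean_q == 1 / (1 - gamma * p))
```

Averaging across agents, as in (120), does not change this computation because every agent has the same mean.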

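Formula formula_253: $$\begin{aligned} Q^m_t &= \frac{(1-\eta_t)}{M} \sum_{j=1}^{M} Q^j_{t-1} + \eta_t \frac{1}{M} \sum_{j=1}^{M} \bigl(1 + \gamma P^j_t(1|1,1)\, Q^j_{t-1}\bigr) \\ \implies \mathbb{E}[Q^m_t] &= \frac{(1-\eta_t)}{M} \sum_{j=1}^{M} \mathbb{E}[Q^j_{t-1}] + \eta_t \frac{1}{M} \sum_{j=1}^{M} \bigl(1 + \gamma\, \mathbb{E}[P^j_t(1|1,1)\, Q^j_{t-1}]\bigr) \\ &= \frac{(1-\eta_t)}{M} \sum_{j=1}^{M} \frac{1}{1-\gamma p} + \eta_t \frac{1}{M} \sum_{j=1}^{M} \Bigl(1 + \frac{\gamma p}{1-\gamma p}\Bigr) = \frac{1}{1-\gamma p}, \end{aligned} \tag{120}$$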

Formula formula_255: $$\mathrm{CC}_{\mathrm{round}}(\text{Fed-DVR-Q}; \varepsilon, M, \delta) \le (I+1)K \le \frac{16}{\eta(1-\gamma)} \log_2\Bigl(\frac{1}{(1-\gamma)\varepsilon}\Bigr),$$

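Formula formula_256: $$\mathrm{CC}_{\mathrm{bit}}(\text{Fed-DVR-Q}; \varepsilon, M, \delta) \le J \cdot |S||A| \cdot \mathrm{CC}_{\mathrm{round}}(\text{Fed-DVR-Q}; \varepsilon, M, \delta) \le \frac{32|S||A|}{\eta(1-\gamma)} \log_2\Bigl(\frac{1}{(1-\gamma)\varepsilon}\Bigr) \log_2\Bigl(\frac{70}{\eta(1-\gamma)} \sqrt{\frac{4}{M} \log\Bigl(\frac{8KI|S||A|}{\delta}\Bigr)}\Bigr),$$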

Formula formula_257: $\|Q_I - Q^\star_H\|_\infty \le \frac{1}{6}\bigl(\|Q - Q^\star\|_\infty + \|Q^\star - Q^\star_H\|_\infty\bigr) + \frac{D}{70}$

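Formula formula_258: $$\|Q^\star_H - Q^\star\|_\infty \le \|Q - Q^\star\|_\infty \cdot \sqrt{\frac{16\kappa'}{L(1-\gamma)^2}} + \sqrt{\frac{64\kappa'}{L(1-\gamma)^3}} + \frac{2\kappa'\sqrt{2}}{3L(1-\gamma)^2} + \frac{D}{70},$$ whenever $L \ge 32\kappa'$, where $\kappa' = \log\Bigl(\frac{12K|S||A|}{\delta}\Bigr)$.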

Formula formula_259: Q := Q (k

Formula formula_260: $$H_k(Q) := T(Q) - T(Q^{(k-1)}) + T_{L_k}(Q^{(k-1)}), \tag{124}$$

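Formula formula_261: $$\begin{aligned} \|Q^{(k)} - Q^\star\|_\infty &\le \|Q^{(k)} - Q^\star_{H_k}\|_\infty + \|Q^\star_{H_k} - Q^\star\|_\infty \\ &\le \frac16\bigl(\|Q^{(k-1)} - Q^\star\|_\infty + \|Q^\star - Q^\star_{H_k}\|_\infty\bigr) + \frac{D_k}{70} + \|Q^\star_{H_k} - Q^\star\|_\infty \\ &= \frac16\bigl(\|Q^{(k-1)} - Q^\star\|_\infty + 7\|Q^\star - Q^\star_{H_k}\|_\infty\bigr) + \frac{D_k}{70} \\ &\le \|Q^{(k-1)} - Q^\star\|_\infty \Bigl(\frac16 + \frac76 \sqrt{\frac{16\kappa'}{L_k(1-\gamma)^2}}\Bigr) + \frac76 \Bigl(\sqrt{\frac{64\kappa'}{L_k(1-\gamma)^3}} + \frac{2\sqrt{2}\kappa'}{3L_k(1-\gamma)^2}\Bigr) + \frac{13D_k}{420} \\ &\le \|Q^{(k-1)} - Q^\star\|_\infty \Bigl(\frac16 + \frac76 \sqrt{\frac{16\kappa'}{L_k(1-\gamma)^2}}\Bigr) + \frac76 \sqrt{\frac{100\kappa'}{L_k(1-\gamma)^3}} + \frac{13D_k}{420}, \end{aligned} \tag{125}$$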

Formula formula_262: $\frac{L_k(1-\gamma)^2}{\kappa'} \ge 1$.

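Formula formula_263: $$\begin{aligned} \|Q^{(k)} - Q^\star\|_\infty &\le \|Q^{(k-1)} - Q^\star\|_\infty \Bigl(\frac16 + \frac76 \sqrt{\frac{16\kappa'}{L_k(1-\gamma)^2}}\Bigr) + \frac76 \sqrt{\frac{100\kappa'}{L_k(1-\gamma)^3}} + \frac{13D_k}{420} \\ &\le \frac{2^{-(k-1)}}{1-\gamma} \Bigl(\frac16 + 2^{-k} \cdot \frac76 \sqrt{\frac{8}{19600}}\Bigr) + 2^{-k} \cdot \frac76 \sqrt{\frac{50}{19600(1-\gamma)}} + \frac{104}{420} \cdot \frac{2^{-(k-1)}}{1-\gamma} \\ &\le \frac{2^{-(k-1)}}{1-\gamma} \Bigl(\frac16 + \frac76 \sqrt{\frac{91}{39200}} + \frac14\Bigr) \le \frac{2^{-k}}{1-\gamma}. \end{aligned} \tag{126}$$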

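Formula formula_265: $$\begin{aligned} \|Q^{(k)} - Q^\star\|_\infty &\le \|Q^{(k-1)} - Q^\star\|_\infty \Bigl(\frac16 + \frac76 \sqrt{\frac{16\kappa'}{L_k(1-\gamma)^2}}\Bigr) + \frac76 \sqrt{\frac{100\kappa'}{L_k(1-\gamma)^3}} + \frac{13D_k}{420} \\ &\le \frac{2^{-(k-1)}}{1-\gamma} \Bigl(\frac16 + 2^{-(k-K_0)} \cdot \frac76 \sqrt{\frac{8}{19600}}\Bigr) + 2^{-(k-K_0)} \cdot \frac76 \sqrt{\frac{50}{19600(1-\gamma)}} + \frac{104}{420} \cdot \frac{2^{-(k-1)}}{1-\gamma} \\ &\le \frac{2^{-(k-1)}}{1-\gamma} \Bigl(\frac16 + \frac76 \sqrt{\frac{1}{196}} + \frac14\Bigr) \le \frac{2^{-k}}{1-\gamma}. \end{aligned} \tag{127}$$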

Formula formula_267: ∥Q (k) -Q ⋆ ∥ ∞ ≤ 2 -k

Formula formula_268: $$\|Q^m_{i-\frac12} - Q_{i-1}\|_\infty \le \eta \|Q - Q^\star_H\|_\infty \Bigl(\frac76 \cdot (1+\gamma) + 2\gamma\Bigr) + \frac{\eta D(1+\gamma)}{70}.$$

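Formula formula_269: $$\begin{aligned} \eta \|Q^{(k-1)} - Q^\star_{H_k}\|_\infty \Bigl(\frac76 \cdot (1+\gamma) + 2\gamma\Bigr) + \frac{\eta D_k(1+\gamma)}{70} &\le \frac{13}{3} \bigl(\|Q^{(k-1)} - Q^\star\|_\infty + \|Q^\star - Q^\star_{H_k}\|_\infty\bigr) + \frac{D_k(1+\gamma)}{70} \\ &\le \frac{13}{3} \cdot \frac{15}{14} \cdot \|Q^{(k-1)} - Q^\star\|_\infty + \frac{2D_k}{70} \\ &\le \Bigl(\frac{195}{42} + \frac{16}{70}\Bigr) \cdot \frac{2^{-(k-1)}}{1-\gamma} \le 8 \cdot \frac{2^{-(k-1)}}{1-\gamma} =: D_k. \end{aligned}$$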

Formula formula_270: ∥Q (k-1) -Q ⋆ ∥ ∞ from(

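Formula formula_271: $$\begin{aligned} Q_i &= Q_{i-1} + \frac{1}{M} \sum_{m=1}^{M} \mathcal{C}\bigl(Q^m_{i-\frac12} - Q_{i-1}; D, J\bigr) = Q_{i-1} + \frac{1}{M} \sum_{m=1}^{M} \bigl(Q^m_{i-\frac12} - Q_{i-1} + \zeta^m_i\bigr) = \frac{1}{M} \sum_{m=1}^{M} \bigl(Q^m_{i-\frac12} + \zeta^m_i\bigr) \\ &= (1-\eta) Q_{i-1} + \frac{\eta}{M} \sum_{m=1}^{M} H^{(m)}_i(Q_{i-1}) + \underbrace{\frac{1}{M} \sum_{m=1}^{M} \zeta^m_i}_{=:\,\zeta_i}. \end{aligned} \tag{128}$$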

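Formula formula_273: $$\begin{aligned} Q_i - Q^\star_H &= (1-\eta)(Q_{i-1} - Q^\star_H) + \eta \Bigl(\frac{1}{M} \sum_{m=1}^{M} H^{(m)}_i(Q_{i-1}) - Q^\star_H\Bigr) + \zeta_i \\ &= (1-\eta)(Q_{i-1} - Q^\star_H) + \frac{\eta}{M} \sum_{m=1}^{M} \bigl(H^{(m)}_i(Q_{i-1}) - H^{(m)}_i(Q^\star_H)\bigr) + \frac{\eta}{M} \sum_{m=1}^{M} \bigl(H^{(m)}_i(Q^\star_H) - H(Q^\star_H)\bigr) + \zeta_i. \end{aligned} \tag{129}$$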

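Formula formula_274: $$\|Q_i - Q^\star_H\|_\infty \le (1-\eta) \|Q_{i-1} - Q^\star_H\|_\infty + \frac{\eta}{M} \sum_{m=1}^{M} \bigl\|H^{(m)}_i(Q_{i-1}) - H^{(m)}_i(Q^\star_H)\bigr\|_\infty + \eta \Bigl\|\frac{1}{M} \sum_{m=1}^{M} \bigl(H^{(m)}_i(Q^\star_H) - H(Q^\star_H)\bigr)\Bigr\|_\infty + \|\zeta_i\|_\infty, \tag{130}$$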

Formula formula_275: $$\bigl\|H^{(m)}_i(Q) - H^{(m)}_i(Q^\star_H)\bigr\|_\infty = \bigl\|T^{(m)}_i(Q) - T^{(m)}_i(Q^\star_H)\bigr\|_\infty \le \gamma \|Q - Q^\star_H\|_\infty, \tag{131}$$

Formula formula_276: $$\frac{1}{M} \sum_{m=1}^{M} \bigl(H^{(m)}_i(Q^\star_H) - H(Q^\star_H)\bigr) = \frac{1}{MB} \sum_{m=1}^{M} \sum_{z \in Z^{(m)}_i} \bigl(T_z(Q^\star_H) - T_z(Q) - T(Q^\star_H) + T(Q)\bigr).$$ Note that $T_z(Q^\star_H) - T_z(Q) - T(Q^\star_H) + T(Q)$ is a zero-mean random vector satisfying $$\|T_z(Q^\star_H) - T_z(Q) - T(Q^\star_H) + T(Q)\|_\infty \le 2\gamma \|Q - Q^\star_H\|_\infty. \tag{132}$$

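Formula formula_277: $$\Bigl\|\frac{1}{M} \sum_{m=1}^{M} \bigl(H^{(m)}_i(Q^\star_H) - H(Q^\star_H)\bigr)\Bigr\|_\infty \le 2\gamma \|Q - Q^\star_H\|_\infty \cdot \sqrt{\frac{2}{MB} \log\Bigl(\frac{8KI|S||A|}{\delta}\Bigr)} \tag{133}$$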

Formula formula_278: $$\|\zeta_i\|_\infty \le D \cdot 2^{-J} \cdot \sqrt{\frac{2}{M} \log\Bigl(\frac{8KI|S||A|}{\delta}\Bigr)} \tag{134}$$

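Formula formula_279: $$\|Q_i - Q^\star_H\|_\infty \le (1-\eta(1-\gamma)) \|Q_{i-1} - Q^\star_H\|_\infty + 2\eta\gamma \|Q - Q^\star_H\|_\infty \cdot \sqrt{\frac{2\kappa}{MB}} + D \cdot 2^{-J} \cdot \sqrt{\frac{2\kappa}{M}}.$$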

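Formula formula_280: $$\begin{aligned} \|Q_I - Q^\star_H\|_\infty &\le (1-\eta(1-\gamma))^I \|Q_0 - Q^\star_H\|_\infty + \sqrt{\frac{2\kappa}{M}} \Bigl(\frac{2\eta\gamma}{\sqrt{B}} \|Q - Q^\star_H\|_\infty + D \cdot 2^{-J}\Bigr) \cdot \sum_{i=1}^{I} (1-\eta(1-\gamma))^{I-i} \\ &\le (1-\eta(1-\gamma))^I \|Q - Q^\star_H\|_\infty + \frac{1}{\eta(1-\gamma)} \sqrt{\frac{2\kappa}{M}} \Bigl(\frac{2\eta\gamma}{\sqrt{B}} \|Q - Q^\star_H\|_\infty + D \cdot 2^{-J}\Bigr) \\ &\le \|Q - Q^\star_H\|_\infty \Bigl((1-\eta(1-\gamma))^I + \frac{2\gamma}{(1-\gamma)} \sqrt{\frac{2\kappa}{MB}}\Bigr) + \frac{D \cdot 2^{-J}}{\eta(1-\gamma)} \cdot \sqrt{\frac{2\kappa}{M}} && (135) \\ &\le \frac{\|Q - Q^\star_H\|_\infty}{6} + \frac{D}{70} \le \frac16 \bigl(\|Q - Q^\star\|_\infty + \|Q^\star - Q^\star_H\|_\infty\bigr) + \frac{D}{70}. && (136) \end{aligned}$$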
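The first two steps of (135) unroll a one-step contraction of the form $x_i \le (1-a)\,x_{i-1} + c$ with $a = \eta(1-\gamma)$ and bound the resulting geometric sum by $1/a$. The sketch below checks this unrolling in exact arithmetic for the worst case where the recursion holds with equality; the numeric values of `x0`, `a` and `c` are placeholders chosen for illustration, not quantities from the paper.

```python
from fractions import Fraction


def unroll(x0, a, c, steps):
    """Iterate x_i = (1 - a) * x_{i-1} + c for the given number of steps."""
    x = x0
    for _ in range(steps):
        x = (1 - a) * x + c
    return x


def closed_form_bound(x0, a, c, steps):
    """(1 - a)^steps * x0 + c / a, using sum_i (1 - a)^(steps - i) <= 1 / a."""
    return (1 - a) ** steps * x0 + c / a


x0, a, c = Fraction(1), Fraction(1, 10), Fraction(1, 100)  # placeholder values
for steps in (1, 5, 50):
    print(steps, unroll(x0, a, c, steps) <= closed_form_bound(x0, a, c, steps))
```

In (135) the additive term $c$ collects the sampling and quantization errors, so dividing by $a = \eta(1-\gamma)$ is what produces the $1/(\eta(1-\gamma))$ factor in the second line.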

Formula formula_281: $\|Q^\star_H - Q^\star\|_\infty$ depends on the error term $T_L(Q) - T(Q)$.

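Formula formula_282: $$\begin{aligned} T_L(Q) - T(Q) &= Q + \frac{1}{M} \sum_{m=1}^{M} \mathcal{C}\bigl(T^{(m)}_L(Q) - Q\bigr) - T(Q) = \frac{1}{M} \sum_{m=1}^{M} \bigl(T^{(m)}_L(Q) + \zeta^{(m)}_L\bigr) - T(Q) \\ &= \frac{1}{M} \sum_{m=1}^{M} \bigl(T^{(m)}_L(Q) - T^{(m)}_L(Q^\star) - T(Q) + T(Q^\star)\bigr) + \frac{1}{M} \sum_{m=1}^{M} \zeta^{(m)}_L + \frac{1}{M} \sum_{m=1}^{M} \bigl(T^{(m)}_L(Q^\star) - T(Q^\star)\bigr), \end{aligned} \tag{137}$$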

Formula formula_283: L := T (m) L (Q) -Q -C T (m) L

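Formula formula_284: $$\Bigl\|\frac{1}{M} \sum_{m=1}^{M} \bigl(T^{(m)}_L(Q) - T^{(m)}_L(Q^\star) - T(Q) + T(Q^\star)\bigr)\Bigr\|_\infty \le 2\gamma \|Q - Q^\star\|_\infty \cdot \sqrt{\frac{2}{L} \log\Bigl(\frac{12K|S||A|}{\delta}\Bigr)}, \tag{138}$$

Formula formula_285: $$\Bigl\|\frac{1}{M} \sum_{m=1}^{M} \zeta^{(m)}_L\Bigr\|_\infty \le D \cdot 2^{-J} \cdot \sqrt{\frac{2}{M} \log\Bigl(\frac{12K|S||A|}{\delta}\Bigr)}. \tag{139}$$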

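Formula formula_286: $$\frac{1}{M} \sum_{m=1}^{M} \bigl(T^{(m)}_L(Q^\star) - T(Q^\star)\bigr) = \frac{1}{M \lceil L/M \rceil} \sum_{m=1}^{M} \sum_{l=1}^{\lceil L/M \rceil} \bigl(T_{Z^{(m)}_l}(Q^\star) - T(Q^\star)\bigr).$$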

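Formula formula_287: Z (Q ⋆ )(s, a). Since $\|T_Z(Q^\star) - T(Q^\star)\|_\infty \le \frac{1}{1-\gamma}$ a.s., Bernstein's inequality gives $$\Bigl|\frac{1}{M} \sum_{m=1}^{M} T^{(m)}_L(Q^\star)(s,a) - T(Q^\star)(s,a)\Bigr| \le \sigma^\star(s,a) \sqrt{\frac{2}{L} \log\Bigl(\frac{6K|S||A|}{\delta}\Bigr)} + \frac{2}{3L(1-\gamma)} \log\Bigl(\frac{6K|S||A|}{\delta}\Bigr). \tag{140}$$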

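Formula formula_288: $$\bigl|T_L(Q)(s,a) - T(Q)(s,a)\bigr| \le \|Q - Q^\star\|_\infty \cdot \sqrt{\frac{8\kappa'}{L}} + \sigma^\star(s,a) \sqrt{\frac{2\kappa'}{L}} + \frac{2\kappa'}{3L(1-\gamma)} + D \cdot 2^{-J} \cdot \sqrt{\frac{2\kappa'}{M}}, \tag{141}$$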

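Formula formula_289: $Q^\star$ and $Q^\star_H$. Then, $$\|Q^\star_H - Q^\star\|_\infty \le \max\Bigl\{ \bigl\|(I - \gamma P^{\pi^\star})^{-1} \bigl|T_L(Q) - T(Q)\bigr|\bigr\|_\infty,\ \bigl\|(I - \gamma P^{\pi^\star_H})^{-1} \bigl|T_L(Q) - T(Q)\bigr|\bigr\|_\infty \Bigr\}.$$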

Formula formula_290: $(P^\pi Q)(s,a) = \sum_{s' \in S} P(s'|s,a)\, Q(s', \pi(s'))$. Furthermore, it was shown in Wainwright [2019b, Proof of Lemma 4] that if the error $|T_L(Q)(s,a) - T(Q)(s,a)|$ satisfies $$\bigl|T_L(Q)(s,a) - T(Q)(s,a)\bigr| \le z_0 \|Q - Q^\star\|_\infty + z_1 \sigma^\star(s,a) + z_2 \tag{142}$$

Formula formula_291: $$\|Q^\star_H - Q^\star\|_\infty \le \frac{1}{1-z_1} \Bigl( \frac{z_0}{1-\gamma} \|Q - Q^\star\|_\infty + \frac{z_1}{(1-\gamma)^{3/2}} + \frac{z_2}{1-\gamma} \Bigr). \tag{143}$$

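Formula formula_292: $z_0 \equiv \sqrt{\frac{8\kappa'}{L}}$; $z_1 \equiv \sqrt{\frac{2\kappa'}{L}}$; $z_2 \equiv \frac{2\kappa'}{3L(1-\gamma)} + D \cdot 2^{-J} \cdot \sqrt{\frac{2\kappa'}{M}}$.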

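Formula formula_293: $$\begin{aligned} \|Q^\star_H - Q^\star\|_\infty &\le \|Q - Q^\star\|_\infty \cdot \sqrt{\frac{16\kappa'}{L(1-\gamma)^2}} + \sqrt{\frac{64\kappa'}{L(1-\gamma)^3}} + \frac{2\kappa'\sqrt{2}}{3L(1-\gamma)^2} + \frac{D \cdot 2^{-J}}{(1-\gamma)} \cdot \sqrt{\frac{4\kappa'}{M}} \\ &\le \|Q - Q^\star\|_\infty \cdot \sqrt{\frac{8\kappa'}{L(1-\gamma)^2}} + \sqrt{\frac{32\kappa'}{L(1-\gamma)^3}} + \frac{2\sqrt{2}\kappa'}{3L(1-\gamma)^2} + \frac{D}{40}, \end{aligned} \tag{144}$$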

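Formula formula_294: $$Q^m_{i-\frac12} - Q_{i-1} = \eta\bigl(H^{(m)}_{i-1}(Q_{i-1}) - Q_{i-1}\bigr) = \eta\bigl(H^{(m)}_{i-1}(Q_{i-1}) - H^{(m)}_{i-1}(Q^\star_H) + H^{(m)}_{i-1}(Q^\star_H) - H(Q^\star_H) + Q^\star_H - Q_{i-1}\bigr).$$ Thus, $$\begin{aligned} \|Q^m_{i-\frac12} - Q_{i-1}\|_\infty &\le \eta \Bigl( \bigl\|H^{(m)}_{i-1}(Q_{i-1}) - H^{(m)}_{i-1}(Q^\star_H)\bigr\|_\infty + \bigl\|H^{(m)}_{i-1}(Q^\star_H) - H(Q^\star_H)\bigr\|_\infty + \|Q^\star_H - Q_{i-1}\|_\infty \Bigr) \\ &\le \eta \bigl( \gamma \|Q_{i-1} - Q^\star_H\|_\infty + 2\gamma \|Q - Q^\star_H\|_\infty + \|Q^\star_H - Q_{i-1}\|_\infty \bigr) \\ &= \eta \bigl( (1+\gamma) \|Q_{i-1} - Q^\star_H\|_\infty + 2\gamma \|Q - Q^\star_H\|_\infty \bigr) \\ &\le \eta \|Q - Q^\star_H\|_\infty \Bigl(\frac76 \cdot (1+\gamma) + 2\gamma\Bigr) + \frac{\eta D(1+\gamma)}{70}, \end{aligned}$$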

