\section{\large{\BQ} Policy with full information feedback}  \label{algorithm}
%In this section, we propose \textsc{BanditQ} - an online learning policy that solves the above constrained prediction problem. 
%Although our major result pertains to the bandit feedback setting, 
%for simplicity of exposition, we first 
In this section, we consider the full-information setup, in which the entire reward vector is revealed to the learner at the end of each round. Apart from a technical result regarding the diameter of an auxiliary random process (Proposition \ref{uniform_bd_lemma} in the Appendix), the extension of the full-information policy to the bandit setting requires no substantially new ideas and will be dealt with in the following section. 
At a high level, the \textsc{BanditQ} policy first defines a \emph{queueing} dynamics to take into account the gap between the target reward and the reward accrued by the policy for each arm so far. It then extends the \emph{drift-plus-penalty} framework of \citet[Chapter 4]{neely2010stochastic} to simultaneously achieve a small regret and meet the long-term constraints. However, to make this overall scheme work, we must adapt the asymptotic stochastic setting of \cite{neely2010stochastic} to the non-asymptotic adversarial setup with online information. This extension turns out to be highly non-trivial and requires new proof and algorithmic techniques, which are very different from those underlying the \textsf{Max-Weight} policy proposed by \citet{neely2010stochastic}.

We associate a non-negative state variable $Q_i(t)$ to each protected arm $i \in \mathcal{P}.$ Under the action of an online policy $\pi = \{\bm{x}(t)\}_{t \geq 1},$ the state variables evolve according to the following queueing dynamics, known as the Lindley recursion \citep{lindley1952theory}:
\begin{eqnarray} \label{q-ev}
	Q_i(t)=\big(Q_i(t-1)+ \lambda_i - r_i(t)x_i(t)\big)^+, ~Q_i(0)=0,
\end{eqnarray}
where we adopt the standard notation $(y)^+\equiv \max(0,y).$  We set $Q_i(t)=0, \forall t, \forall i \notin \mathcal{P}$. To get an intuition for Eq.\  \eqref{q-ev}, imagine that on every round $t,$ a fixed deterministic amount of work $\lambda_i$ arrives at the queue $Q_i.$ Then, under the action $\bm{x}(t)$ of an online policy, an amount of work equal to $\min(Q_i(t-1)+ \lambda_i, r_i(t) x_i(t))$ departs from $Q_i.$ It is intuitive that to stabilize the queues, the long-term service rates must be at least as large as the long-term arrival rates. Thus, any online policy stabilizing the queues would automatically satisfy the target rate requirements. However, since we are also interested in achieving a small regret, meeting the rate constraints alone is not enough (\emph{cf.} \cite{huang2023queue}). Our online policy must also perform competitively in terms of the cumulative rewards against every feasible stationary action given by \eqref{feas}. 
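
As a small illustration of the recursion \eqref{q-ev}, consider the following hypothetical numbers, chosen only for exposition: a target rate $\lambda_i = 0.3$ and services $r_i(t)x_i(t) = 0.5,\, 0.1,\, 0,\, 0.6$ on rounds $t=1,\ldots,4.$ Starting from $Q_i(0)=0,$ the queue evolves as
\begin{eqnarray*}
	Q_i(1) &=& (0 + 0.3 - 0.5)^+ = 0, \\
	Q_i(2) &=& (0 + 0.3 - 0.1)^+ = 0.2, \\
	Q_i(3) &=& (0.2 + 0.3 - 0)^+ = 0.5, \\
	Q_i(4) &=& (0.5 + 0.3 - 0.6)^+ = 0.2.
\end{eqnarray*}
Over the four rounds, the total arrival ($1.2$) equals the total service ($1.2$), yet $Q_i(4)=0.2>0$: the surplus service of $0.2$ on round $1$ is lost because the queue was empty, and the residual backlog records the shortfall accumulated over rounds $2$ to $4.$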

Towards this goal, let us first define the following quadratic potential function (\emph{a.k.a.} Lyapunov function in the queueing theory parlance):
\begin{eqnarray} \label{potential_def}
	\Phi(t) = \sum_{i \in \mathcal{P}} Q_i^2(t).  
\end{eqnarray}
We now derive an upper bound on the change of the potential under the action of a policy. From \eqref{q-ev}, we have 
\begin{eqnarray*}
	&&Q_i^2(t) \\
	 &\leq& \big(Q_i(t-1)+ \lambda_i - r_i(t)x_i(t)\big)^2 \\
	&\leq& Q_i^2(t-1) + \lambda_i + x_i(t)+ 2Q_i(t-1)(\lambda_i - r_i(t) x_i(t)), 
\end{eqnarray*}
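where, in the last inequality, we have expanded the square and used the bound $\big(\lambda_i - r_i(t)x_i(t)\big)^2 \leq \lambda_i^2 + r_i^2(t)x_i^2(t) \leq \lambda_i + x_i(t),$ which follows from the fact that $0\leq \lambda_i, r_i(t), x_i(t) \leq 1, \forall i, t.$ Summing up the above inequality for each $i \in \mathcal{P}$, we have the following upper bound for the change of the potential on round $t$: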
\begin{eqnarray} \label{potential_change}
	\Phi(t) - \Phi(t-1) \leq 2 + 2\sum_{i \in \mathcal{P}} Q_i(t-1)\big(\lambda_i - r_i(t) x_i(t)\big),
\end{eqnarray}
where we have used the fact that $\sum_i \lambda_i \leq 1, \sum_i x_i(t) \leq 1,$ where the first inequality follows from the non-emptiness of $\Omega(\bm{\vec{\lambda}}).$  Eqn.\ \eqref{potential_change} suggests that running a MAB policy to maximize virtual cumulative rewards such that pulling the $i$\textsuperscript{th} arm on round $t$ yields a virtual reward of $Q_i(t-1)r_i(t)$ will help minimize the change of the potential on round $t$ and, hence, meet the target rates. However, this does not explicitly take into account our other goal, namely to minimize the regret. To achieve both goals, motivated by the drift-plus-penalty framework of \cite{neely2010stochastic}, we now define an instance of the standard online linear optimization (OLO) problem $\Xi$ with action set $\Delta_N$, where the surrogate reward of the $i$\textsuperscript{th} arm on round $t$ is defined as: 
\begin{eqnarray} \label{reward_def}
	r'_i(t) \equiv \big(Q_i(t-1) + V\big)r_i(t), ~\forall i \in [N].
\end{eqnarray}
In the above, $V>0$ is a hyper-parameter, to be fixed later, that depends only on the length of the horizon $T.$
 %$\{V_t\}_{t\geq 1}$ is a sequence of non-negative parameters. In our theoretical results, we will primarily consider a constant sequence $V_t=V=\Theta(\sqrt{T}), \forall t.$ 
 Intuitively, the surrogate reward vector $\bm{r}'(t)$ strikes a balance between attaining the target rates (through the first term) and achieving a small regret (through the second term). However, the definition of rewards \eqref{reward_def} leads to two significant technical challenges in learning the surrogate rewards. First, due to the presence of the queue variables, the reward vectors $\bm{r}'(t)$ are not bounded \emph{a priori}, which critically affects the regret bound for the surrogate problem $\Xi$. Second, although the original reward sequence $\{\bm{r}(t)\}_{t\geq 1}$ is i.i.d., the reward sequence $\{\bm{r}'(t)\}_{t \geq 1}$ for the problem $\Xi$ is \emph{not} i.i.d. any more, again due to the presence of the queue variables, which are temporally correlated via Eqn.\ \eqref{q-ev}. The second difficulty prompts us to use an adversarial online learning policy for the auxiliary OLO problem $\Xi.$ 
 %To recapitulate the information structure, the coefficients $Q_i(t-1)+V_t$ are known, but the reward vector $\bm{r}(t)$ is unknown to the online policy before it makes the decision $\bm{x}(t) \in \Delta_N$ on round $t$. 
 \paragraph{The \BQ policy:} 
 The proposed \textsc{BanditQ} policy can use any adaptive no-regret policy with a second-order regret bound for the auxiliary problem $\Xi.$ This includes policies such as Online Gradient Ascent (\textsc{OGA}) with adaptive step sizes \citep{orabona2019modern} \footnote{Since ours is a maximization problem, we use a gradient ascent step rather than descent.} and \textsc{Squint} \citep{koolen2015second}. To fix ideas, in this paper, we will use the \textsc{OGA} policy due to its simplicity. This online policy, which is closely related to the AdaGrad policy \citep{duchi2011adaptive}, updates the sampling distribution on each round using the usual gradient step with an adaptive step size:
\begin{eqnarray} \label{oga-update}
	\bm{x}(t+1) \gets \Pi_{\Delta_N}\bigg(\bm{x}(t) + \frac{\bm{r}'(t)}{\sqrt{2 \sum_{\tau=1}^{t}||\bm{r}'(\tau)||_2^2}} \bigg).
\end{eqnarray}
%where the reward sequence $\{\bm{r}'(t)\}_{t\geq 1}$ is defined in Eq.\ \eqref{reward_def}.
%, and the state variables $\{\bm{Q}(t)\}_{t \geq 1}$ evolve as in Eq. \eqref{q-ev}. 
In the above, the $\Pi_{\Delta_N}(\cdot)$ function, which denotes the Euclidean projection operator on the standard simplex $\Delta_N,$ can be efficiently implemented in $O(N\log N)$ time \citep{wang2013projection}. The complete \BQ policy in the full-information setting is summarized in Algorithm \ref{fair-MAB-full-info}.  


  \begin{algorithm}
\caption{\BQ Policy with full information}
\label{fair-MAB-full-info}
\begin{algorithmic}[1]
\State \algorithmicrequire{ Target reward rate vector $\bm{\vec{\lambda}}$,} Euclidean projection oracle $\Pi_{\Delta_N}(\cdot)$ onto the simplex $\Delta_N.$ 
\State $\bm{Q} \gets \bm{0}, \bm{x} \gets [\nicefrac{1}{N}, \nicefrac{1}{N}, \ldots, \nicefrac{1}{N}], V\gets \sqrt{T}, S\gets 0.$ \algorithmiccomment{\emph{Initialization}}
\ForEach {round $t=1:T$}
\State Sample an arm $I_t$ from the distribution $\bm{x}$.  
\State Observe the \emph{entire} reward vector $\bm{r}(t)$\algorithmiccomment{\emph{Full-information feedback}}
%\ForEach {arms $i \in \mathcal{P}$:}
\State $r'_i(t) \gets \big(Q_i + V\big)r_i(t), ~\forall i \in [N]$ \algorithmiccomment{\emph{Computing the surrogate rewards with the queue lengths $\bm{Q}(t-1)$}}
\State $Q_i \gets \big(Q_i+ \lambda_i - r_i(t)x_i\big)^+, ~\forall i\in \mathcal{P}$ \algorithmiccomment{\emph{Updating the queue lengths}}
\State $S \gets S + ||\bm{r}'(t)||_2^2.$ \algorithmiccomment{\emph{Accumulating the squared norms of the past gradients}}
	\State $\bm{x}\gets \Pi_{\Delta_N}\bigg(\bm{x}+ \frac{\bm{r}'(t)}{\sqrt{2S}} \bigg)$ \algorithmiccomment{\emph{Online gradient ascent}}
\EndForEach
\end{algorithmic}
\end{algorithm}
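
To make one pass of Algorithm \ref{fair-MAB-full-info} concrete, the following worked round uses purely illustrative numbers (none are taken from any experiment): $N=2,$ $\mathcal{P}=\{1\},$ $\lambda_1=0.3,$ and $T=4,$ so that $V=\sqrt{T}=2.$ Starting from $\bm{Q}=\bm{0},$ $\bm{x}=(0.5, 0.5),$ and $S=0,$ suppose that the revealed reward vector is $\bm{r}(1)=(1, 0.5).$ Then
\begin{eqnarray*}
	Q_1 &=& (0 + 0.3 - 1\cdot 0.5)^+ = 0, \\
	\bm{r}'(1) &=& (0+2)\,(1, 0.5) = (2, 1), \\
	S &=& 2^2 + 1^2 = 5, \\
	\bm{x} + \frac{\bm{r}'(1)}{\sqrt{2S}} &=& \Big(0.5 + \frac{2}{\sqrt{10}},\; 0.5 + \frac{1}{\sqrt{10}}\Big) \approx (1.132,\, 0.816).
\end{eqnarray*}
The two coordinates sum to approximately $1.949,$ so the Euclidean projection onto $\Delta_2$ subtracts $(1.949-1)/2 \approx 0.474$ from each coordinate, which gives $\bm{x} \approx (0.658,\, 0.342).$ In this round, the queue is empty both before and after its update, so the surrogate reward reduces to $V\bm{r}(1)$ and the policy simply shifts probability mass towards the arm with the larger observed reward.
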
In our analysis, we will use the following standard second-order regret bound achieved by the \textsc{OGA} policy with the above adaptive step sizes. 
% \begin{framed} 
\begin{theorem}[\citet{orabona2019modern}, Theorem 4.14] \label{ref_th}
	Let $X \subseteq \mathbb{R}^{d}$ be a convex set with a finite Euclidean diameter $D.$ Consider an arbitrary sequence of linear reward functions with gradients $\{\bm{g}_t\}_{t \geq 1}.$ Assume that the Online Gradient Ascent policy is run with step sizes\footnote{Without any loss of generality, we set $\eta_t=0$ if $\bm{g}_t=0.$} $\eta_t= \frac{D}{\sqrt{2\sum_{\tau=1}^t|| \bm{g}_\tau||_2^2}}, 1\leq t \leq T.$ Then the regret of the policy can be upper-bounded as follows:
	\begin{eqnarray} \label{data-dep-bd}
	 \textrm{Regret}_T \leq D \sqrt{2\sum_{t=1}^T||\bm{g}_t||_2^2}. 		
	\end{eqnarray}
\end{theorem}  
%\end{framed}
 %for the problem $\Xi,$
 % which is defined by replacing the reward $\bm{r}(t)$ with $\bm{r}'(t)$ in Eq.\ \eqref{regret_def}:
 %A useful property of the above regret bound, which we exploit in our analysis, is that 
 It is important to note that the above bound is \emph{scale-free}, \emph{i.e.,} no \emph{a priori} bounds on the gradients are needed for the above result \citep{putta2022scale, hadiji2023adaptation}. Specializing Theorem \ref{ref_th} to our surrogate problem $\Xi$, we obtain the following regret bound, which depends on the sequence of queue variables:
\begin{eqnarray} \label{regret-bound}
	\textrm{Regret}^\Xi_t &\leq& 2 \sqrt{\sum_{\tau=1}^t \sum_i (Q_i(\tau-1)+V)^2r_i(\tau)^2} \nonumber\\
	&\leq&  2\sqrt{2\sum_{\tau=1}^t\sum_iQ_i^2(\tau) } + 2V \sqrt{2Nt}.
\end{eqnarray}
  In the above, we have used the fact that $0\leq r_{i}(t)\leq 1, \forall t,i,$ and the elementary inequalities $(a+b)^2 \leq 2(a^2+b^2),$ $\sqrt{x+y} \leq \sqrt{x}+ \sqrt{y}, x,y\geq 0$.
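
For completeness, the intermediate steps behind \eqref{regret-bound} can be sketched as follows, under our reading that Theorem \ref{ref_th} is applied with $D=\sqrt{2},$ the Euclidean diameter of the simplex $\Delta_N.$ The first inequality is then $D\sqrt{2\sum_{\tau}||\bm{r}'(\tau)||_2^2} = 2\sqrt{\sum_{\tau}||\bm{r}'(\tau)||_2^2}.$ For the second, since $Q_i(0)=0$ and the queue lengths are non-negative, 
\begin{eqnarray*}
	\sum_{\tau=1}^t \sum_{i=1}^N \big(Q_i(\tau-1)+V\big)^2 r_i(\tau)^2 &\leq& \sum_{\tau=1}^t \sum_{i=1}^N \big(2Q_i^2(\tau-1) + 2V^2\big) \\
	&\leq& 2\sum_{\tau=1}^t \sum_{i} Q_i^2(\tau) + 2NtV^2,
\end{eqnarray*}
and taking square roots with $\sqrt{x+y}\leq \sqrt{x}+\sqrt{y}$ yields the two terms on the right-hand side of \eqref{regret-bound}.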

\subsection{Analysis} \label{analysis}
Unlike the analyses in \citet{patil2021achieving} and \cite{cai2018online}, which proceed by constructing stochastic confidence intervals for the mean rewards of each arm, we directly make use of the regret bound \eqref{regret-bound} via an ``adversarial-style'' analysis, which critically makes use of a new \emph{self-bounding} inequality derived below. Since the state variables $\{\bm{Q}(t)\}_{t \geq 1}$ %corresponding to the arms in class $\mathcal{P}$ 
evolve according to the recursion \eqref{q-ev}, we do not immediately have an explicit control on the regret bound \eqref{regret-bound}, which depends on the queue lengths. Hence, to bound the regret, we take an indirect approach. Fix any feasible distribution $\bm{x}^* \in \Omega$.
From Eq. \eqref{potential_change}, we have
\begin{eqnarray*}
&&\Phi(\tau)-\Phi(\tau-1)  - 2V \sum_i r_i(\tau)x_i(\tau) \\
&\leq& 2+ 2 \sum_i Q_i(\tau-1) \lambda_i -\\
&&2 \sum_{i} \underbrace{(Q_i(\tau-1)+V) r_i(\tau)}_{r_i'(\tau)} x_i(\tau).
\end{eqnarray*}
%Taking conditional expectation of both sides with respect to the randomness of the reward vector $\bm{r}(\tau)$, we have 
%\begin{eqnarray*}
%	d
%\end{eqnarray*}
%
%Changing the dummy variable from $t$ to $\tau$ and 
Summing up the above inequalities from $\tau=1$ to $\tau=t$ and recalling that $\Phi(t)=\sum_i Q_i^2(t), \Phi(0)=0,$ we obtain
%from $\tau=1$ to $\tau=t$, 
%we obtain: 
\begin{eqnarray} \label{main_ineq}
	&&\sum_{i} Q_i^2(t) + 2 \sum_{\tau=1}^t V \sum_i r_i(\tau) (x_i^*-x_i(\tau)) \nonumber \\
	 &\leq& 2t + 2 \sum_{\tau=1}^t \sum_{i} Q_i(\tau-1) \big(\lambda_i-r_i(\tau) x_i^* \big)+2\textrm{Regret}^{\Xi}_t,
	%&&+ \textrm{Regret}^{\Xi}_t \nonumber \\
	%&&\leq  2t +  4 \sqrt{2\sum_{\tau=1}^t\sum_{i} Q_i^2(\tau) } + 4\sqrt{2N\sum_{\tau=1}^t V_\tau^2}, \\
\end{eqnarray} 
where $\textrm{Regret}^{\Xi}_t$ denotes the worst-case regret for the surrogate problem $\Xi$ (defined analogously to Eq.\ \eqref{regret_def}). 
Note that, in the above, the regret bound on the RHS is random as it depends on the magnitude of the random process $\{\bm{Q}(\tau)\}_{\tau}$. 
%In our analysis, we will exclusively consider a constant $\{V_t\}_{t\geq 1}$ sequence, where $V_t=V, \forall t\geq 1$ for some appropriate $V \geq 0$ to be fixed later.
Let $\{\mathcal{F}_\tau\}_{\tau \geq 0}$ be the natural filtration generated by the sequence of rewards $\{\bm{r}(\tau)\}_{\tau \geq 0}.$ Taking expectations, we have the following set of inequalities for any benchmark distribution $\bm{x}^* \in \Omega(\bm{\vec{\lambda}})$:
%Taking expectation of both sides with respect to the randomness of the reward process, we have:
\begin{eqnarray} \label{main-ineq3}
&&\sum_{i} \mathbb{E}Q_i^2(t) + 2V \textrm{Regret}_t (\bm{x}^*) \nonumber \\
		&=&\sum_{i} \mathbb{E}Q_i^2(t) + 2V \sum_{\tau=1}^t  \mathbb{E}\sum_i r_i(\tau) (x_i^*-x_i(\tau)) \nonumber \\
		 &\stackrel{(a)}{\leq} & 2t + 2 \sum_{\tau=1}^t \mathbb{E}\sum_{i} Q_i(\tau-1) \big(\lambda_i-x_i^*\mathbb{E}[r_i(\tau) |\mathcal{F}_{\tau-1}] \big)+ \nonumber \\
		 &&2 \mathbb{E} \big[\textrm{Regret}^{\Xi}_t\big]\nonumber\\
	 &\stackrel{(b)}{\leq}& 2t + 2 \sum_{\tau=1}^t \mathbb{E}\sum_{i} Q_i(\tau-1) \big(\lambda_i-\mu_i x_i^* \big)+2 \mathbb{E}\big[\textrm{Regret}^{\Xi}_t\big]\nonumber\\
	 &\stackrel{(c)}{\leq} & 2t + 2 \mathbb{E}\big[ \textrm{Regret}^{\Xi}_t\big]\nonumber\\
	&\stackrel{(d)}{\leq} & 2t +  4\sqrt{2\sum_{\tau=1}^t\sum_i \mathbb{E}Q_i^2(\tau) } + 4V \sqrt{2Nt} ,
\end{eqnarray}
where in (a), we have taken the expectation of both sides of \eqref{main_ineq} with respect to the i.i.d.\ reward process $\{\bm{r}(t)\}_{t \geq 1},$ and used the law of iterated expectations; in (b), we have used the i.i.d.\ nature of the reward process; in (c) we have used the feasibility condition of the benchmark $\bm{x}^*$ from Eq.\ \eqref{feas}; in (d), we have used the second-order regret bound from Eq.\ \eqref{regret-bound} in conjunction with Jensen's inequality for the square root function. We emphasize that step $(d)$ is the \emph{only} place where we use any property of the online learning subroutine. In other words, our reduction is \emph{universal} in the sense that any online learning subroutine for $\Xi$, which could be very different from OGA but has a data-dependent regret bound similar to \eqref{data-dep-bd}, can be used with \BQ.

 Inequality \eqref{main-ineq3} constitutes the key step in our analysis. It shows that the queue-length process $\{\bm{Q}(t)\}_{t \geq 1}$ possesses a \emph{self-bounding} property in the sense that the expected queue-length squared at any round $t$ is bounded by the square root of the sum of expected queue-length squared up to round $t$ plus other auxiliary terms. %Inequality \eqref{main-ineq3} leads to the following bound on the second moments of the queue variables.
% $\{\bm{Q}(t)\}_{t \geq 1}$. 
The regret decomposition inequality \eqref{main-ineq3} will be used to prove our main result in the full information setting.
%for a specific setting of the $\{V_t\}_{t \geq 1}$ sequence. 
\begin{theorem}\label{q_bd}
	%Upon setting $V=\Theta(\sqrt{T}), \forall t\geq 1$, 
	The \BQ policy described in Algorithm \ref{fair-MAB-full-info} achieves the following regret and rate violation bounds:
	\begin{eqnarray*}
		\textrm{Regret}_T = O(\max(\frac{T}{\sqrt{V}}, \sqrt{NT})), \mathbb{V}(T) = O(\sqrt{VT}).
	\end{eqnarray*}
	In particular, upon setting $V=\sqrt{T},$ we obtain 
		\begin{eqnarray*}
		\textrm{Regret}_T = O(\max(T^{\nicefrac{3}{4}},\sqrt{NT})), ~ \mathbb{V}(T) = O(T^{\nicefrac{3}{4}}).
	\end{eqnarray*}
		%we obtain $\mathbb{E}Q^2_i(t) = O(\sqrt{N}T^{3/2}), 1\leq t\leq T.$ 
\end{theorem} 
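
The choice $V=\sqrt{T}$ can be motivated by a short calculation, given here only as an illustration. Parametrize $V=T^{\alpha}$ with $0\leq \alpha \leq 1$ and suppress the dependence on $N.$ The bounds in Theorem \ref{q_bd} then read
\begin{eqnarray*}
	\textrm{Regret}_T = O\big(T^{1-\nicefrac{\alpha}{2}}\big), ~~ \mathbb{V}(T) = O\big(T^{\nicefrac{(1+\alpha)}{2}}\big).
\end{eqnarray*}
Increasing $\alpha$ lowers the regret exponent and raises the violation exponent. The two exponents coincide when $1-\nicefrac{\alpha}{2} = \nicefrac{(1+\alpha)}{2},$ \emph{i.e.,} at $\alpha=\nicefrac{1}{2},$ where both are equal to $\nicefrac{3}{4}.$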
%\cmt{Improve the regret bound.}
The proof given below involves solving a non-linear sequential inequality to obtain a sublinear bound for the queue lengths. The resulting queue length bound is then used to control the regret. 
\begin{proof}
	First, we will derive a sublinear bound for the expected queue lengths under the \BQ policy. The rate violation and regret bounds will follow from this result.
\paragraph{1 (a). Bounding the queue lengths:}
Since the reward components are bounded in $[0,1],$ using the fact that $\sum_i r_i(\tau)(x_i(\tau)-x_i^*) \leq 1, \forall \tau,$ we have that $\textrm{Regret}_t(\bm{x}^*)\geq -t.$ Hence, from Eq.\ \eqref{main-ineq3}, we have for all $t\geq 1:$
\begin{eqnarray} \label{ineq1}
	\sum_i\mathbb{E}Q_i^2(t) \leq  2(V+1)t + 4 \sqrt{2\sum_{\tau=1}^t\sum_{i}\mathbb{E}Q_i^2(\tau) } \nonumber \\ + 4V \sqrt{2Nt}.
\end{eqnarray}
Hence, for any round $1\leq \tau \leq t,$ we have that 
\begin{eqnarray*}
		\sum_i \mathbb{E}Q_i^2(\tau) \leq 2(V+1)t + 4 \sqrt{2\sum_{s=1}^t\sum_{i}\mathbb{E}Q_i^2(s) } + 4V \sqrt{2Nt}.
\end{eqnarray*}
Summing up the above inequalities for all $\tau \in [1,t],$ we have 
\begin{eqnarray*}
	R^2(t) \leq 2(V+1)t^2 + 4 \sqrt{2N}V t^{\nicefrac{3}{2}} + 4\sqrt{2} t R(t).
\end{eqnarray*}
where we have defined $R(t) \equiv \sqrt{\sum_{\tau=1}^t \sum_{i=1}^N \mathbb{E}Q_i^2(\tau)}.$ Solving the above quadratic inequality in $R(t)$, we obtain
\begin{eqnarray} \label{r-bd}
	 R(t) = O(t)+ O(t\sqrt{V})+O(N^{\nicefrac{1}{4}}\sqrt{V}t^{\nicefrac{3}{4}})= O(t\sqrt{V}).
\end{eqnarray} 
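The step leading to \eqref{r-bd} can be spelled out as follows; this is a sketch of the arithmetic, with constants that are not optimized. Any $R\geq 0$ satisfying $R^2 \leq bR + c$ with $b, c\geq 0$ obeys $R \leq \frac{1}{2}\big(b+\sqrt{b^2+4c}\big) \leq b + \sqrt{c}.$ Taking $b=4\sqrt{2}t$ and $c = 2(V+1)t^2 + 4\sqrt{2N}Vt^{\nicefrac{3}{2}}$ gives
\begin{eqnarray*}
	R(t) \leq 4\sqrt{2}\,t + \sqrt{2(V+1)}\,t + 2(2N)^{\nicefrac{1}{4}}\sqrt{V}\,t^{\nicefrac{3}{4}}.
\end{eqnarray*}
The three terms correspond, in order, to the three terms in \eqref{r-bd}. The final simplification to $O(t\sqrt{V})$ appears to use $V\geq 1$ and $N\leq t,$ so that the second term dominates the first and the third.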
Plugging the above bound in \eqref{ineq1}, we have for each $i \in \mathcal{P}:$
\begin{eqnarray} \label{q-sq-bd}
 &&\mathbb{E}Q_i^2(t) = O(Vt) + O(t\sqrt{V})+ O(V\sqrt{Nt}) = O(Vt) \nonumber \\
 && \stackrel{\textrm{(Jensen's ineq.)}}{\implies}  \mathbb{E}Q_i(t) = O(\sqrt{Vt}).	
\end{eqnarray}
\paragraph{1 (b). Bounding the rate violation penalty $\mathbb{V}(T)$:}
	Upon expanding \eqref{q-ev}, we obtain the following well-known representation for the Lindley recursion \citep[p.~92]{asmussen2003applied}: 
	\begin{eqnarray} \label{q-len-bd}
		Q_i(t) = \max\Big(0, \max_{1\leq \tau \leq t}\Big(\lambda_i \tau - \sum_{z=t-\tau+1}^t r_i(z) x_i(z)\Big)\Big), ~ \forall i \in \mathcal{P}.
	\end{eqnarray}
Combining Eq.\ \eqref{q-len-bd} with the bound \eqref{q-sq-bd}, we can bound the constraint violation penalty as 
	\begin{eqnarray*}
		\mathbb{V}(T) \leq \max_{i \in \mathcal{P}} \mathbb{E}Q_i(T) = O(\sqrt{VT}). 
	\end{eqnarray*} 
\paragraph{2. Bounding the regret:}
	Substituting \eqref{r-bd} into the inequality \eqref{main-ineq3} and using the fact that $Q_i^2(T) \geq 0, \forall i,$ we have for any $\bm{x}^* \in \Omega:$ 
	\begin{eqnarray*}
	2V \textrm{Regret}_T (\bm{x}^*)  
		\leq O(T) + O(T\sqrt{V})+ O(V\sqrt{NT}).
	\end{eqnarray*}
	This yields the following regret bound 
	\begin{eqnarray*}
		\textrm{Regret}_T(\bm{x}^*)  &=& O(\frac{T}{V})+ O(\frac{T}{\sqrt{V}})+ O(\sqrt{NT})\\
		&=& O(\max(\frac{T}{\sqrt{V}}, \sqrt{NT})).
	\end{eqnarray*}
\end{proof}

%\begin{proof}
%From Eq.\ \eqref{main-ineq3}, we have that for all $i \in \mathcal{P}$ and all $t\geq 1:$
%\begin{eqnarray} \label{ineq1}
%	\mathbb{E}Q_i^2(t) \leq 2(V+1)t + 4 \sqrt{2\sum_{\tau=1}^t\sum_{i}\mathbb{E}Q_i^2(\tau) } + 4V \sqrt{2Nt},
%\end{eqnarray}
%where we have used the fact that $\sum_i r_i(\tau)(x_i(\tau)-x_i^*) \leq 1, \forall \tau.$ Which implies that $\textrm{Regret}_t(\bm{x}^*)\geq -t.$ Furthermore, since $\lambda_i \leq 1,$ from Eq.\ \eqref{q-ev}, we trivially have $Q_i(\tau) \stackrel{a.s.}{\leq} \tau, \forall i,\tau$ We now improve upon this trivial upper bound on the queue lengths by substituting it on the RHS of Eq.\ \eqref{ineq1}, which yields:
%\begin{eqnarray*}
%	\mathbb{E}Q_i^2(t) &\leq& O(T^{3/2}) + O\big(\sqrt{N \sum_{\tau=1}^t \tau^2}\big)+ O(\sqrt{N} T) \\
%	&=& O(\sqrt{N} T^{3/2}),  \forall i, t \in [T].
%\end{eqnarray*}
%\end{proof}

\paragraph{Remarks:} It may appear from the statement of Theorem \ref{q_bd} that \BQ achieves a sub-optimal $O(T^{3/4})$ regret bound even for the standard regret minimization problem with no specified target reward rates, \emph{i.e.,} $\bm{\lambda}=\bm{0}.$ However, as we show in Section \ref{bq_no_lambda} of the Appendix, the \BQ policy actually achieves the optimal instance-independent $O(\sqrt{T})$ regret bound for both full-information and bandit feedback settings for $\bm{\lambda} = \bm{0}$. 

As an immediate corollary of Theorem \ref{q_bd}, the following result shows that under the action of the \textsc{BanditQ} policy, the target reward accrual rates are met asymptotically for each arm $i \in \mathcal{P}:$
%while incurring a reward violation penalty of $O(T^{3/4}).$ 
%Our result improves upon the $O(T^{5/6})$ violation penalty established by \citet[Theorem 1]{cai2018online} under independence assumptions.
\begin{proposition} \label{rate-prop}
	Upon setting $V=\sqrt{T},$ for any interval $\mathcal{I} \subseteq [T]$ such that $T^{3/4}=o(|\mathcal{I}|),$ the \textsc{BanditQ} policy in the full-information setting yields: 
	\[\liminf_{|\mathcal{I}| \to \infty} |\mathcal{I}|^{-1}\mathbb{E}\sum_{t \in \mathcal{I}} r_i(t)x_i(t) 
	  \geq \lambda_i, ~ \forall i \in \mathcal{P}.\]
\end{proposition}
See Appendix \ref{rate-prop-proof} for the proof. 
%\begin{proof}
%	Upon expanding \eqref{q-ev}, we obtain the following well-known representation of the Lindley recursion \citep{ross1995stochastic}: 
%	\begin{eqnarray} \label{q-len-bd}
%		Q_i(t) = \sup_{1\leq \tau \leq t}(0, \lambda_i \tau - \sum_{z=t-\tau+1}^t r_i(z) x_i(z)), ~ \forall i \in \mathcal{P}.
%	\end{eqnarray}
%	
%Using Proposition \ref{q_bd}, we have that $\mathbb{E}Q_i(t) \stackrel{\textrm{(Jensen's ineq.)}}{\leq}\sqrt{\mathbb{E}Q_i^2(t)} = O(T^{3/4}), ~\forall i\in \mathcal{P}, t\in [T].$
%	
%	Substituting the above bound in Eq.\ \eqref{q-len-bd}, we have for any $i \in \mathcal{P}:$
%	\begin{eqnarray*}
%		\inf_{1\leq t\leq T}\mathbb{E}\inf_{1\leq \tau \leq t}\big(\tau^{-1}\sum_{z=t-\tau+1}^t r_i(z) x_i(z)\big) \geq \lambda_i  - O(\frac{T^{3/4}}{\tau}),
%	\end{eqnarray*}
%	which gives a finite-time guarantee for the expected reward accrual rate for each arm in the protected set $\mathcal{P}.$
%	Hence, as long as $T^{3/4}/|\mathcal{I}| \to 0,$ we have
%	\begin{eqnarray*}
%		\liminf_{|\mathcal{I}| \to \infty} |\mathcal{I}|^{-1}\mathbb{E} \big[\sum_{t \in \mathcal{I}} r_i(t)x_i(t)\big] \geq \lambda_i, ~\forall i \in \mathcal{P}.
%	\end{eqnarray*} 
%	Finally, using Eq.\ \eqref{q-len-bd} once again, we can bound the violation penalty as 
%	\begin{eqnarray*}
%		\mathbb{V}(T) \leq \max_{i \in \mathcal{P}} \mathbb{E}Q_i(T) = O(T^{3/4}). 
%	\end{eqnarray*}
%\end{proof}
%Next we investigate the regret bound achieved by the \textsc{BanditQ} policy.
%Using Proposition \ref{q_bd} once again, we now derive a sublinear regret bound achieved by the \textsc{BanditQ} policy. 
%\begin{theorem}\label{regret_prop}
%Upon setting $V=\Theta(\sqrt{T}), 1 \leq t\leq T,$ the \textsc{BanditQ} policy achieves a regret bound of $O((NT)^{3/4})$ in the full-information setting.  
%%	\begin{eqnarray*}
%		%$\textrm{Regret}_T = O(T^{3/4}).$
%%	\end{eqnarray*}
%\end{theorem}
%See Appendix \ref{regret_prop_proof} for the proof.
%\begin{proof}
%	Substituting the queue length bound from Proposition \eqref{q_bd} into the inequality \eqref{main_ineq} and using the fact that $Q_i^2(T) \geq 0, \forall i, t,$ we have 
%	\begin{eqnarray*}
%		%&&2V \textrm{Regret}_T \\
%	2V \textrm{Regret}_T &\leq&  \sum_{i} \mathbb{E}Q_i^2(T) + 2V  \sum_{\tau=1}^T \sum_i r_i(\tau) (x_i^*-x_i(t)) \\ 
%		&\leq& 2T + O(T^{5/4}) + O(T)= O(T^{5/4}). 
%	\end{eqnarray*}
%	Since $V=\Theta(\sqrt{T}),$ the above inequality immediately yields
%	%\begin{eqnarray*}
%		$\textrm{Regret}_T = O(T^{3/4}).$
%	%\end{eqnarray*}
%\end{proof}
%\vspace{-15pt}
%Note that, unlike the standard MAB problem, in this case, the worst-case regret could be negative on some rounds. This stems from the fact that, unlike the offline benchmark, the \BQ policy is not \emph{required} to always take actions from the set $\Omega,$ which is unknown to the policy. This poses a technical difficulty in proving an $O(\sqrt{T})$ worst-case regret bound starting from Eq.\ \eqref{main-ineq3}.
Although Theorem \ref{q_bd} gives an $O(T^{\nicefrac{3}{4}})$ regret bound compared to the minimax regret bound of $O(\sqrt{T \log N})$ for the unconstrained problem \citep[Theorem 4]{zhao2019stochastic}, the next result shows that the proposed \BQ policy achieves a substantially stronger $O(\sqrt{T})$ bound for the
\emph{average} regret, where the regret is averaged over the entire time horizon $T$. 
 
\begin{proposition} \label{avg-regret}
	In the full information setting, under the \BQ policy with $V=\sqrt{T},$ we have 	%\begin{eqnarray*}
		$\frac{1}{T}\sum_{t=1}^T \textrm{Regret}_t(\bm{x}^*) = O(\sqrt{NT}),$ for any $\bm{x}^* \in \Omega$ and $\mathbb{V}(T)=O(T^{\nicefrac{3}{4}}).$ 
	%\end{eqnarray*}
\end{proposition}
%See Appendix \ref{avg-regret-proof} for the proof. 
\begin{proof}
	Define $S_t^2 \equiv \sum_i \mathbb{E} Q_i^2(t).$ From Eq.\ \eqref{main-ineq3}, for all $t \in [T],$  we have 
\begin{eqnarray*}
	S_t^2 + 2V \textrm{Regret}_t(\bm{x}^*) &\leq& 2t + 4 \sqrt{2 \sum_{\tau=1}^t S_\tau^2} + 4V \sqrt{2Nt} \\
	&\leq& 2T+ 4 \sqrt{2 \sum_{\tau=1}^T S_\tau^2}+4V \sqrt{2NT}.
\end{eqnarray*}
Summing up the above inequalities from $t=1$ to $t=T$ and defining $z_T\equiv \sqrt{\sum_{\tau=1}^T S_\tau^2},$   we obtain
\begin{eqnarray} \label{avg-ineq2}
 z_T^2 - 4\sqrt{2}Tz_T + 2V \sum_{t=1}^T\textrm{Regret}_t(\bm{x}^*) \leq 2T^2 + 4V\sqrt{2N}T^{3/2}.
\end{eqnarray}
Upon completing the square, we have  $z_T^2 - 4\sqrt{2}Tz_T = (z_T-2\sqrt{2}T)^2 -8T^2 \geq -8T^2. $ Hence, from \eqref{avg-ineq2}, we have $2V\sum_{t=1}^T\textrm{Regret}_t(\bm{x}^*) \leq 10T^2 + 4V\sqrt{2N}T^{3/2},$ and dividing both sides by $2VT,$ we conclude that: 
\begin{eqnarray*}
	 \frac{1}{T}\sum_{t=1}^T\textrm{Regret}_t(\bm{x}^*) \leq 5\frac{T}{V} + 2\sqrt{2NT}.
\end{eqnarray*}
The final result follows upon setting $V=\sqrt{T}.$ 
\end{proof}
%The reader should compare the above bound with the  $\Omega(\sqrt{T\log N})$ minimax regret lower bound for stochastic rewards with the full-information feedback \citep[Theorem 4]{zhao2019stochastic}.  
 Finally, if one is only interested in achieving the target rate vector $\vec{\bm{\lambda}}$ while completely disregarding the regret, the following proposition shows that the queue-length bound, and hence, the rate violation penalty given in Theorem \ref{q_bd} can be further improved to $O(\sqrt{T})$ upon setting $V=0.$ 
 \begin{proposition} \label{improved_bd}
 The cumulative constraint violation under the \BQ policy in the full-information setting with $V=0$ can be bounded as follows:
 %the second and first moments of the state variables $\{\bm{Q}(t)\}_{t \geq 1}$ can be bounded as:
% \[\mathbb{E}Q_i^2(t) \leq 64Nt \implies \mathbb{E}Q_i(t) \leq 8\sqrt{Nt}, ~ \forall i \in \mathcal{P}, \forall t\geq 1.\]
\[ \mathbb{V}(T) \leq \max_i \mathbb{E}Q_i(T) \leq 6 \sqrt{T}. \]
 \end{proposition}
 \begin{proof}
 From Eq.\ \eqref{main-ineq3}, we have for any fixed $t$ and any $1\leq \tau \leq t:$
\begin{eqnarray} \label{new-eq}
	\sum_i \mathbb{E}Q_i^2(\tau) \leq 2t + 4 \sqrt{2\sum_{s=1}^t\sum_{i} \mathbb{E}Q_i^2(s) }.
\end{eqnarray}
Summing up the above inequalities for $1\leq \tau \leq t$ and defining $z_t^2 \equiv  \sum_{\tau=1}^t \sum_{i} \mathbb{E}Q_i^2(\tau),$ we have 
\begin{eqnarray*}
	z_t^2 \leq 2t^2 + 4 \sqrt{2}tz_t.
\end{eqnarray*}
Solving the above quadratic inequality, we conclude that 
\begin{eqnarray*}
	\sqrt{\sum_{\tau=1}^t \sum_{i} \mathbb{E}Q_i^2(\tau)} = z_t \leq  6t. 
\end{eqnarray*}
Substituting the above bound in \eqref{new-eq} and using Jensen's inequality, we conclude that $\mathbb{E}Q_i(t) \leq 6\sqrt{t}, \forall i \in [N].$
\end{proof}
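The constants in the proof above can be checked directly; the following is a sketch of the arithmetic. The quadratic inequality $z_t^2 \leq 2t^2 + 4\sqrt{2}tz_t$ gives
\begin{eqnarray*}
	z_t \leq \frac{4\sqrt{2} + \sqrt{32+8}}{2}\, t = \big(2\sqrt{2}+\sqrt{10}\big)\, t \approx 5.99\, t \leq 6t.
\end{eqnarray*}
Using $\mathbb{E}Q_i^2(t) \leq \sum_j \mathbb{E}Q_j^2(t)$ and \eqref{new-eq} with $\tau=t,$ we then have $\mathbb{E}Q_i^2(t) \leq 2t + 4\sqrt{2}\cdot 6t = (2+24\sqrt{2})\,t \approx 35.95\, t \leq 36t,$ so that $\mathbb{E}Q_i(t) \leq \sqrt{\mathbb{E}Q_i^2(t)} \leq 6\sqrt{t}.$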
% See Appendix \ref{improved_bd_proof} for the proof. 
 %In the setting of Proposition \ref{improved_bd}, we obtain $\mathbb{V}(T) = O(\sqrt{NT})$.
  
  \paragraph{Sharper regret bound under a monotonicity assumption:} \label{stronger-bd}
  The regret and constraint violation bounds derived above hold unconditionally. We now show that the \BQ policy achieves the minimax optimal $O(\sqrt{T})$ regret under a mild monotonicity assumption on the queue length sequence stated below.  
  %These bounds can be strengthened under the following assumption.
%  \begin{assumption}[(Non-negativity of the regret)] \label{non-neg-regret}
%  	Assume that there exists some feasible distribution $\bm{p}^* \in \Omega(\bm{\vec{\lambda}})$ such that under the action of the \BQ policy, we have  $\textrm{Regret}_t(\bm{p}^*) \geq 0, \forall t \geq 1.$ Note that $\bm{p}^*$ need not be known to the policy.
%  \end{assumption}
\begin{assumption}[Monotonicity in expectation] \label{mon-q}
	Under the action of the chosen OLO subroutine, the sequence $Q^2(t)\equiv \sum_i \mathbb{E}Q_i^2(t), t \geq 1,$ is non-decreasing in $t$.
\end{assumption}
  %\citet{?} made a similar assumption.
  \begin{theorem} \label{mon-q-thm}
  	Under Assumption \ref{mon-q}, the regret of the \BQ policy in the full-information setting is bounded as \[\textrm{Regret}_t\leq \frac{5t}{V}+ 2 \sqrt{2Nt}, ~1\leq t \leq T.\]
  	In particular, with $V=\sqrt{T},$ we have $\textrm{Regret}_t = O(\sqrt{Nt})$ for any $t \in [T].$ 
  	\end{theorem}
%  	\begin{eqnarray*}
%		\textrm{Regret}_T(\bm{x}^*) = O(\sqrt{NT}), ~ \mathbb{V}(T) = O(N^{\nicefrac{1}{4}}\sqrt{T}), \forall \bm{x}^*\in \Omega.
%	\end{eqnarray*}
%  \end{theorem}
   See Appendix \ref{mon-q-thm-pf} for the proof. Assumption \ref{mon-q} is related to a stochastic monotonicity assumption. Many closely related Markov chains, \emph{e.g.,} the birth-death chain, which is a continuous-time model of a queue with zero initial states, are known to be stochastically monotone \citep[Proposition 9.2.4]{ross1995stochastic}, \citep[Theorem 6.1]{van1980stochastic}, \citep{keilson1977monotone}.  
   %assumption (Assumption \ref{q-mon-bandit2}), which we discuss further in the bandit setup.

 %All of the above result holds without imposing any additional restriction on the problem. The following proposition shows that if there exists an optimal action $\bm{x}^* \in \Omega$ such that the pseudo-regret $\textrm{Regret}_t(\bm{x}^*)$ is non-negative throughout (except, possibly, a finite number of rounds), then the regret can be improved to the optimal minimax rate $O(\sqrt{T}).$
 
 %\cmt{Discuss what happens when there exists an $\bm{x}^*$ with positive regret throughout.}

















