\section{The Policy: \large{\texttt{BanditQ}}}  \label{algorithm}
In this section, we propose \texttt{BanditQ}, an online learning policy that solves the above constrained prediction problem. 
At a high level, we define natural queueing dynamics to account for the rate constraints. We then extend the drift-plus-penalty framework of \cite{neely2010stochastic} to simultaneously achieve a small regret and ensure that the long-term rate constraints are met for the users in the subset $\mathcal{P}.$ To make this approach work, however, we must adapt the stochastic full-information setting of \cite{neely2010stochastic} to the adversarial setting with online information. This extension is non-trivial and requires new proof and algorithmic techniques that differ substantially from the celebrated \texttt{Max-Weight} policy.

We associate a non-negative state variable $Q_i(t)$ with each user $i$ in the protected set $\mathcal{P}.$ Under the action of an online policy $\pi = \{\bm{x}(t)\}_{t \geq 1},$ the state variables $\{Q_i(t)\}_{i \in \mathcal{P}}$ evolve according to the following queueing dynamics, known as the Lindley recursion:
\begin{eqnarray} \label{q-ev}
	Q_i(t)=\big(Q_i(t-1)+ \lambda_i - r_i(t)x_i(t)\big)^+, ~Q_i(0)=0,
\end{eqnarray}
where we adopt the usual notation $(y)^+\equiv \max(0,y).$ To gain intuition for the above dynamics, imagine that on every round $t$ a fixed deterministic amount of work $\lambda_i$ arrives at the queue $Q_i.$ Then, under the action $\bm{x}(t)$ of an online policy, $\min(Q_i(t-1)+ \lambda_i, r_i(t) x_i(t))$ amount of work departs from $Q_i.$ Hence, any online policy stabilizing the queues satisfies the long-term rate requirements. However, since we are also interested in achieving a small regret, meeting the rate constraints alone is not enough (\emph{cf.} \cite{huang2023queue}). Our online policy must also achieve a small regret in terms of cumulative rewards against the feasible stationary policies given by \eqref{feas}. To this end, we define the following quadratic potential function (\emph{a.k.a.} Lyapunov function in queueing theory):
\begin{eqnarray*}
	\Phi(t) = \sum_{i \in \mathcal{P}} Q_i^2(t).  
\end{eqnarray*}
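For concreteness, consider a single protected user with the hypothetical values $\lambda_i = 0.3,$ $r_i(1)x_i(1)=0.1,$ and $r_i(2)x_i(2)=0.6$ (these numbers are illustrative assumptions and do not come from any experiment). The recursion \eqref{q-ev} then gives
\begin{eqnarray*}
	Q_i(1) &=& (0 + 0.3 - 0.1)^+ = 0.2, \\
	Q_i(2) &=& (0.2 + 0.3 - 0.6)^+ = (-0.1)^+ = 0,
\end{eqnarray*}
so the backlog accumulated by under-serving the user on round $1$ is cleared by the surplus service on round $2,$ and this user's contribution to the potential falls from $Q_i^2(1)=0.04$ to $Q_i^2(2)=0.$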
We now upper-bound the change of potential under the action of a policy. We have 
\begin{eqnarray*}
	Q_i^2(t) &\leq& \big(Q_i(t-1)+ \lambda_i - r_i(t)x_i(t)\big)^2 \\
	&\leq& Q_i^2(t-1) + \lambda_i + x_i(t)+ 2Q_i(t-1)(\lambda_i - r_i(t) x_i(t)), 
\end{eqnarray*}
where we have used the fact that $0\leq \lambda_i, r_i(t), x_i(t) \leq 1, \forall i, t.$ Summing up the above inequality, we have the following expression for the change of potential on round $t$:
\begin{eqnarray} \label{potential_change}
	\Phi(t) - \Phi(t-1) \leq 2 + 2\sum_{i \in \mathcal{P}} Q_i(t-1)\big(\lambda_i - r_i(t) x_i(t)\big),
\end{eqnarray}
where we have used the fact that $\sum_i \lambda_i \leq 1, \sum_i x_i(t) \leq 1.$ Motivated by the drift-plus-penalty framework of \cite{neely2010stochastic}, we now define an instance of the online prediction problem $\Xi$ where the reward on round $t$ of the $i$\textsuperscript{th} user is defined as 
\begin{eqnarray} \label{reward_def}
	r'_i(t) \equiv \big(Q_i(t-1) + V_t\big)r_i(t), ~\forall i \in [N].
\end{eqnarray}
In the above, $\{V_t\}_{t\geq 1}$ is a sequence of non-negative, non-decreasing parameters to be fixed later. We set $\lambda_i=0, Q_i(t)=0, \forall t, \forall i \notin \mathcal{P}$. 
 %To recapitulate the information structure, the coefficients $Q_i(t-1)+V_t$ are known, but the reward vector $\bm{r}(t)$ is unknown to the online policy before it makes the decision $\bm{x}(t) \in \Delta_N$ on round $t$. 
 Intuitively, the surrogate reward $\bm{r}'(t)$ defined in \eqref{reward_def} strikes a balance between meeting the target rates (through the first term) and achieving a small regret (through the second term).
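 As a purely illustrative instance of \eqref{reward_def} (the numbers are hypothetical), suppose that $V_t = 2,$ that a protected user $i$ has backlog $Q_i(t-1)=3,$ and that an unprotected user $j$ has $Q_j(t-1)=0$ by the convention above. Then
\begin{eqnarray*}
	r'_i(t) = (3+2)\, r_i(t) = 5\, r_i(t), \qquad r'_j(t) = (0+2)\, r_j(t) = 2\, r_j(t),
\end{eqnarray*}
so a unit of reward earned on the backlogged user counts $2.5$ times as much in the auxiliary problem $\Xi$ as a unit earned on the unprotected user. As the backlog drains, the two weights approach the common value $V_t,$ and the surrogate reduces to a scaled copy of the original reward.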
 Our proposed \texttt{BanditQ} policy simply uses any adaptive no-regret policy, \emph{e.g.,} Online Gradient Ascent (\texttt{OGA}) \cite{orabona2019modern} or \texttt{Squint} \citep{koolen2015second}, for the auxiliary problem $\Xi$ defined above. For simplicity, we use the \texttt{OGA} policy, which updates the sampling distribution as follows:
\begin{eqnarray} \label{oga-update}
	\bm{x}(t+1) \gets \Pi_{\Delta_N}\bigg(\bm{x}(t) + \frac{\bm{r}'(t)}{\sqrt{2 \sum_{\tau=1}^{t}||\bm{r}'(\tau)||_2^2}} \bigg),
\end{eqnarray}
where the reward sequence $\{\bm{r}'(t)\}_{t\geq 1}$ is defined in \eqref{reward_def}.
%, and the state variables $\{\bm{Q}(t)\}_{t \geq 1}$ evolve as in Eq. \eqref{q-ev}. 
In Eq.\ \eqref{oga-update}, $\Pi_{\Delta_N}(\cdot)$ denotes the Euclidean projection operator on the standard simplex $\Delta_N$. 
 In our analysis, we only need the following adaptive bound \cite[Theorem 4.14]{orabona2019modern} achieved by \texttt{OGA} on the regret for the problem $\Xi,$ which is defined by replacing the reward vector $\bm{r}(t)$ with $\bm{r}'(t)$ in Eq.\ \eqref{regret_def}:
\begin{eqnarray} \label{regret-bound}
	\textrm{Regret}^\Xi_t \leq 2 \sqrt{\sum_{\tau=1}^t \sum_i (Q_i(\tau-1)+V_\tau)^2r_i(\tau)^2} 
	\leq  2\sqrt{2\sum_{\tau=1}^t\sum_iQ_i^2(\tau) } + 2 \sqrt{2N\sum_{\tau=1}^t V_\tau^2},
\end{eqnarray}
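   where we have used the facts that $0\leq r_{i}(t)\leq 1, \forall t,i$, that $(a+b)^2 \leq 2a^2+2b^2$ and $\sqrt{a+b} \leq \sqrt{a}+\sqrt{b}$ for all $a, b \geq 0$, and that $Q_i(0)=0$, which implies $\sum_{\tau=1}^t Q_i^2(\tau-1) \leq \sum_{\tau=1}^t Q_i^2(\tau)$.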

\subsection{Analysis and Regret Bounds}  
Since the state variables $\{\bm{Q}(t)\}_{t \geq 1}$ corresponding to the users in class $\mathcal{P}$ evolve according to the recursion \eqref{q-ev}, we do not immediately have explicit control of the magnitude of the reward vector $\bm{r}'(t)$ and the regret bound for the auxiliary problem $\Xi.$ To control the queue lengths, we take an indirect route by making use of the regret bound \eqref{regret-bound}. Consider any stationary action $\bm{x}^* \in \Omega$.
Subtracting $2V_\tau \sum_i r_i(\tau)x_i(\tau)$ from both sides of Eq.\ \eqref{potential_change}, written for round $\tau$, we have 
\begin{eqnarray*}
\Phi(\tau)-\Phi(\tau-1)  - 2V_\tau \sum_i r_i(\tau)x_i(\tau) 
\leq 2+ 2 \sum_i Q_i(\tau-1) \lambda_i -2 \sum_{i} \underbrace{(Q_i(\tau-1)+V_\tau) r_i(\tau)}_{r_i'(\tau)} x_i(\tau).
\end{eqnarray*}
%Changing the dummy variable from $t$ to $\tau$ and 
Summing up the above inequality from $\tau=1$ to $\tau=t$ and recalling that $\Phi(t)=\sum_i Q_i^2(t), \Phi(0)=0$, 
%from $\tau=1$ to $\tau=t$, 
we obtain: 
\begin{eqnarray} \label{main_ineq}
	&& \sum_{i} Q_i^2(t) + 2 \sum_{\tau=1}^t V_\tau \sum_i r_i(\tau) (x_i^*-x_i(\tau))  \nonumber \\
	&\leq& 2t + \underbrace{2 \sum_{\tau=1}^t \sum_{i} Q_i(\tau-1) \big(\lambda_i-r_i(\tau) x_i^* \big)}_{\leq 0 ~(\textrm{from Eq.}\ \eqref{feas})}+2\textrm{Regret}^{\Xi}_t \nonumber\\
	%&&+ \textrm{Regret}^{\Xi}_t \nonumber \\
	&\leq&  2t +  4 \sqrt{2\sum_{\tau=1}^t\sum_{i} Q_i^2(\tau) } + 4\sqrt{2N\sum_{\tau=1}^t V_\tau^2},
\end{eqnarray} 
where we have added and subtracted the term $2\sum_{\tau=1}^t \sum_i r_i'(\tau) x_i^*$ on the right-hand side, and then used the feasibility of $\bm{x}^*$ from Eq.\ \eqref{feas} and the adaptive regret bound from Eq.\ \eqref{regret-bound}. Next, we claim the following deterministic bound on the state variables $\{\bm{Q}(t)\}_{t \geq 1}$, obtained by choosing a constant parameter sequence. 
%for a specific setting of the $\{V_t\}_{t \geq 1}$ sequence. 
\begin{proposition}\label{q_bd}
	Setting $V_t=V=\Theta(\sqrt{T}), \forall t\geq 1$, we obtain $Q_i(t) = O(T^{3/4}), \forall i \in \mathcal{P}, 1\leq t\leq T.$ 
\end{proposition} 
\begin{proof}
From Eq.\ \eqref{main_ineq}, we have that for all $i \in \mathcal{P}$ and all $t\geq 1:$
\begin{eqnarray} \label{ineq1}
	Q_i^2(t) \leq 2(V+1)t + 4 \sqrt{2\sum_{\tau=1}^t\sum_{i}Q_i^2(\tau) } + 4V \sqrt{2Nt},
\end{eqnarray}
where we have used the fact that $\sum_i r_i(\tau)(x_i(\tau)-x_i^*) \leq 1, \forall \tau.$ Furthermore, since $\lambda_i \leq 1,$ we trivially have $Q_i(\tau) \leq \tau, \forall i,\tau.$ We now improve upon this trivial upper bound by substituting it in Eq.\ \eqref{ineq1}, yielding:
\begin{eqnarray*}
	Q_i^2(t) \leq O(T^{3/2}) + O(T^{3/2})+ O(T) = O(T^{3/2}),  \forall i, t.
\end{eqnarray*}
\emph{i.e.,}
$Q_i(t)=O(T^{3/4}), \forall i, 1\leq t\leq T.$ 
%
%By substituting the above bound again in Eq.\ \eqref{ineq1}, we have the following improved estimate:
%\begin{eqnarray*}
%	Q^2_i(T) \leq 
%\end{eqnarray*}
\end{proof}
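To make the orders appearing in the proof of Proposition \ref{q_bd} explicit, the following sketch takes $V=\sqrt{T}$ exactly (an assumption made only for concreteness; the proposition requires just $V=\Theta(\sqrt{T})$) and uses the fact that at most $N$ queues are non-zero. For any $1 \leq t \leq T,$ the three terms on the right-hand side of \eqref{ineq1} can be bounded as
\begin{eqnarray*}
	2(V+1)t &\leq& 2T^{3/2} + 2T, \\
	4 \sqrt{2\sum_{\tau=1}^t\sum_{i}Q_i^2(\tau)} &\leq& 4\sqrt{2N\sum_{\tau=1}^t \tau^2} \leq 4\sqrt{2N t^3} \leq 4\sqrt{2N}\, T^{3/2}, \\
	4V\sqrt{2Nt} &\leq& 4\sqrt{2N}\, T.
\end{eqnarray*}
Adding these and using $T \leq T^{3/2}$ yields $Q_i^2(t) \leq 4\big(1+2\sqrt{2N}\big) T^{3/2},$ \emph{i.e.,} $Q_i(t) \leq 2\sqrt{1+2\sqrt{2N}}\; T^{3/4}.$ The constants in this sketch are not optimized and serve only to illustrate the scaling.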
The following proposition shows that the rate constraints of the users belonging to the protected classes are met.
\begin{proposition} \label{rate-prop}
	Under the \texttt{BanditQ} policy with $V_t=V=\Theta(\sqrt{T}),$ for any interval $\mathcal{I} \subseteq [T]$ such that $T^{3/4}=o(|\mathcal{I}|),$ we have 
	\[\liminf_{|\mathcal{I}| \to \infty} |\mathcal{I}|^{-1}\sum_{t \in \mathcal{I}} r_i(t)x_i(t) 
	  \geq \lambda_i, ~\forall i \in \mathcal{P}.\]
\end{proposition}
\begin{proof}
	Upon expanding \eqref{q-ev}, we obtain the following well-known representation for the Lindley recursion: 
	\begin{eqnarray*}
		Q_i(t) = \max\Big(0, \max_{1\leq \tau \leq t}\Big\{\lambda_i \tau - \sum_{z=t-\tau+1}^t r_i(z) x_i(z)\Big\}\Big).
	\end{eqnarray*}
	Substituting the bound from Proposition \ref{q_bd} in the above, we have for any $1\leq t\leq T$ and any $1\leq \tau \leq t:$
	\begin{eqnarray*}
		\tau^{-1}\sum_{z=t-\tau+1}^t r_i(z) x_i(z) \geq \lambda_i  - O\Big(\frac{T^{3/4}}{\tau}\Big),
	\end{eqnarray*}
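	which gives a finite-time rate guarantee for each user $i \in \mathcal{P}.$
	Taking $t$ to be the right endpoint of $\mathcal{I}$ and $\tau=|\mathcal{I}|,$ the window $\{t-\tau+1, \ldots, t\}$ coincides with $\mathcal{I}.$ Hence, as long as $T^{3/4}/|\mathcal{I}| \to 0,$ we have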
	\begin{eqnarray*}
		\liminf_{T \to \infty} |\mathcal{I}|^{-1}\sum_{t \in \mathcal{I}} r_i(t)x_i(t)  \geq \lambda_i, ~\forall i \in \mathcal{P}.
	\end{eqnarray*} 
\end{proof}
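The finite-time guarantee in the proof above can be read as follows. Writing $c$ for the (unspecified) constant hidden in the $O(\cdot)$ bound of Proposition \ref{q_bd}, so that $Q_i(t) \leq cT^{3/4}$ (the symbol $c$ is introduced here only for illustration), every interval $\mathcal{I} \subseteq [T]$ satisfies
\begin{eqnarray*}
	|\mathcal{I}|^{-1}\sum_{t \in \mathcal{I}} r_i(t)x_i(t) \geq \lambda_i - \frac{cT^{3/4}}{|\mathcal{I}|}, ~\forall i \in \mathcal{P}.
\end{eqnarray*}
For instance, for windows of length $|\mathcal{I}| = T^{3/4+\epsilon}$ with $0 < \epsilon \leq 1/4,$ the shortfall relative to the target rate $\lambda_i$ is at most $cT^{-\epsilon},$ and over the full horizon $|\mathcal{I}|=T$ it is at most $cT^{-1/4}.$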
%Next we investigate the regret bound achieved by the \texttt{BanditQ} policy.
Using Proposition \ref{q_bd}, we now derive a sublinear regret bound achieved by the \texttt{BanditQ} policy. 
\begin{proposition} \label{regret_prop}
Upon setting $V_t=V=\Theta(\sqrt{T}), 1 \leq t\leq T,$ the \texttt{BanditQ} policy achieves $O(T^{3/4})$ regret as defined in Eq.\ \eqref{regret_def}.
%	\begin{eqnarray*}
		%$\textrm{Regret}_T = O(T^{3/4}).$
%	\end{eqnarray*}
\end{proposition}
\begin{proof}
	Substituting the queue length bound from Proposition \ref{q_bd} into the inequality \eqref{main_ineq} and using the fact that $Q_i^2(T) \geq 0, \forall i,$ we have 
	\begin{eqnarray*}
		%&&2V \textrm{Regret}_T \\
	2V \textrm{Regret}_T &\leq&  \sum_{i} Q_i^2(T) + 2V  \sum_{\tau=1}^T \sum_i r_i(\tau) (x_i^*-x_i(\tau)) \\ 
		&\leq& 2T + O(T^{5/4}) + O(T)= O(T^{5/4}). 
	\end{eqnarray*}
	Since $V=\Theta(\sqrt{T}),$ the above inequality immediately yields
	%\begin{eqnarray*}
		$\textrm{Regret}_T = O(T^{3/4}).$
	%\end{eqnarray*}
\end{proof}
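The order $O(T^{5/4})$ in the proof above can be traced as follows. Here $c$ denotes the unspecified constant from Proposition \ref{q_bd}, so that $Q_i(\tau) \leq cT^{3/4},$ and we take $V=\sqrt{T}$ for concreteness (both are labeling assumptions made only for this sketch):
\begin{eqnarray*}
	4\sqrt{2\sum_{\tau=1}^T\sum_i Q_i^2(\tau)} &\leq& 4\sqrt{2 N T \cdot c^2 T^{3/2}} = 4c\sqrt{2N}\, T^{5/4}, \\
	4\sqrt{2N\sum_{\tau=1}^T V^2} &=& 4V\sqrt{2NT} = 4\sqrt{2N}\, T.
\end{eqnarray*}
Dividing the resulting bound $2T + 4c\sqrt{2N}\,T^{5/4} + 4\sqrt{2N}\,T$ by $2V = 2\sqrt{T}$ gives $\textrm{Regret}_T \leq \sqrt{T} + 2c\sqrt{2N}\,T^{3/4} + 2\sqrt{2N}\sqrt{T},$ in which the term arising from the queue lengths dominates.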
%\vspace{-15pt}
 If one is only interested in satisfying the rate constraints while disregarding the cumulative rewards, the following proposition shows that the bound in Proposition \ref{q_bd} can be further improved to $O(\sqrt{t})$ by setting $V_t=0, \forall t.$ 
 \begin{proposition} \label{improved_bd}
 Setting $V_t=0, \forall t \geq 1,$ the state variables $\{\bm{Q}(t)\}_{t \geq 1}$ can be bounded as:
 \[Q_i(t) \leq 16\sqrt{kt}, ~\forall t\geq 1.\]
 \end{proposition}
 See Appendix \ref{improved_bd_proof} for the proof. 
%This implies a tighter finite-time bounds for the rate. 
%To show this, from Eq.\ \eqref{main_ineq}, we have:
%\begin{eqnarray} \label{new-eq}
%	Q_i^2(t) \leq 2t + 4 \sqrt{2\sum_{\tau=1}^t\sum_{i} Q_i^2(\tau) }~~ \forall t, \forall i \in \mathcal{P}.
%\end{eqnarray} 
%Substituting the trivial bound $Q_i(\tau) \leq \tau, \forall i, \tau,$ on the RHS of the above inequality, we have the following improved bound on the queue lengths:
%\begin{eqnarray*}
%	Q_i^2(t) \leq 2t+ O(t^{3/2}) = O(t^{3/2}),
%\end{eqnarray*}
%\emph{i.e.,} $Q_i(t) = O(t^{3/4}), \forall i,t.$ Plugging back the improved queue-length estimates in \eqref{new-eq}, we improve the bound further to $Q_i(t) = O(t^{5/8}), \forall i,t.$ Continuing this refinement process, we finally arrive at the bound $Q_i(t)= O(\sqrt{t}), \forall i \in \mathcal{P}, t.$ This result immediately improves the bound in Proposition \ref{rate-prop}, where we can take the averaging interval to be as small as $O(\sqrt{T}).$
%



















