\section{Introduction} \label{statement} 
In this paper, we consider the classic Prediction with Expert Advice problem (\emph{a.k.a.} the Prediction problem) in the adversarial setting with full information, in the presence of rate constraints on pulling the arms. We present a new framework, based on elementary queueing theory, to solve this problem. In particular, we consider two concrete problems, namely the Fair Experts problem and the Bandit with Knapsacks problem, and show how both can be solved using the same queueing framework.


Most of the learning algorithms deployed in practice aim to maximize the cumulative profit (\emph{e.g.,} the number of clicks on an ad). Consequently, they could end up discriminating against a subset of users that the algorithm believes to be less rewarding. In a typical case of algorithmic discrimination, Facebook was sued for targeting ads on housing, credit, and employment by race, gender, and religion, all of which are protected classes under US law \cite{hao2019facebook}. A similar problem of fair allocation of resources arises in wireless scheduling, where schedulers that maximize the sum rate could end up not serving a subset of users with relatively poor channel conditions. The classic Proportional Fair scheduler solves the fair scheduling problem when the current channel conditions are known \cite{stolyar2005asymptotic}. However, maximizing the sum rate under user-specific rate constraints, a central problem in 5G network slicing, remains open under unknown and adversarial channel conditions. In the context of online learning, several papers have considered the fair allocation problem in the stochastic setting. The paper \cite{huang2023queue} studied the scheduling problem in the adversarial environment in the context of stabilizing the queues. However, it does not consider the regret minimization problem under the queue stability constraints.
%, which can be trivially solved in our setting by a randomized policy. 

\subsection{Related work} \label{related_work}
The paper \cite{immorlica2022adversarial} adapts a variant of the stochastic Bandit with Knapsacks (\texttt{BwK}) policy to the adversarial setting. In contrast, our policy is tailored strictly to the adversarial setup and depends on the adaptive regret bound of the Online Gradient Descent policy. Whether a variant of our proposed policy can also be made to work in the stochastic setup is non-trivial and currently open.

\subsection{Technical Innovation} \label{techs}
To the best of our knowledge, all of the previously proposed online policies for the \texttt{BwK} problem rely on the \emph{zeroth-order} $O(\sqrt{T})$ regret bound of their online prediction subroutines. As an example, the algorithm proposed by \cite{immorlica2022adversarial} uses \texttt{EXP3.P} and \texttt{Hedge} as subroutines in its online primal-dual framework. In contrast, our proposed \texttt{BanditQ} policy exploits a second-order regret bound that depends on the sum of squares of the gradient components. A major obstacle with this approach is that the gradient components are not bounded a priori and depend on the past actions of the online policy itself. A key technical contribution of this paper is to simultaneously control both the norm of the gradients and the regret bound. We believe our proof technique is new and could be useful elsewhere.
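
For concreteness, a data-dependent bound of this kind typically takes the following generic form. The symbols $\bm{g}_t$ (the gradient observed on round $t$) and $D$ (the diameter of the feasible set) are illustrative notation introduced here, and the display is only a sketch of the standard adaptive guarantee for Online Gradient Descent, not the exact statement proved for \texttt{BanditQ}:
\begin{eqnarray*}
	\textrm{Regret}_T = O\bigg(D \sqrt{\sum_{t=1}^T \|\bm{g}_t\|_2^2}\bigg).
\end{eqnarray*}
If $\|\bm{g}_t\|_2 \leq G$ for all $t$, the right-hand side reduces to the familiar $O(DG\sqrt{T})$ rate. The second-order form remains meaningful even when no such uniform bound $G$ is available in advance, which is the situation described above.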

%We consider the standard no-regret prediction problem with expert's advice in the adversarial setting with an additional long-term rate constraint for a subset of experts. 

%\textcolor{red}{Replace agents by users?}

 
%More specifically, we use the strong $(w, \epsilon)$-admissibility criterion with respect to a fixed schedule. The problem is interesting both in the full information and bandit setting and we tackle the full information version first. 

%We will also consider the simpler problem where there is a  given rate vector $\bm{r}$ such that arm $i$ must be pulled at least $r_i$ fraction of times. We seek a uniform algorithmic framework for both these problems. 


\section{Problem Formulation}  \label{model} 
In this paper, we consider the no-regret online prediction problem in the adversarial setting with an additional fairness constraint, which requires that each user in a protected class $\mathcal{P}$ receive a guaranteed rate of reward accrual, assuming that the rates are feasible. Formally, assume that there is a set of $N$ users which, on round $t$, receives the reward vector $\bm{r}(t),$ where $0\leq r_i(t) \leq 1, \forall i,t.$ On round $t$, an online prediction policy samples a user according to a probability distribution $\bm{x}(t),$ where $\bm{x}(t)$ belongs to the standard probability simplex on $N$ atoms, denoted by $\Delta_N$. This short paper focuses on the full-information setting, where the entire reward vector is revealed to the learner at the end of each round. As a result of the policy's action, user $i$ receives an (expected) reward of $x_i(t)r_i(t), \forall i \in [N],$ and the online policy receives a reward of $\langle \bm{x}(t), \bm{r}(t) \rangle.$ Let $\mathcal{P} \subseteq [N]$ be a subset of $k$ users who belong to the \emph{protected classes}. We require that the long-term rate of rewards accrued by the $i$\textsuperscript{th} user in the class $\mathcal{P}$ be at least $\lambda_i >0, \forall i\in \mathcal{P}.$ We emphasize that guaranteeing lower bounds on the rate of reward accrual of the individual users is different from, and more challenging than, merely constraining the frequency of sampling each user without considering their accrued rewards. To make the rate constraints feasible, 
%we assume that there is an offline probability distribution $\bm{x}^*$ over the users with the above property. 
we assume that there exists a feasible offline static allocation $\bm{x}^* \in \Delta_N$ that achieves the desired reward rates on all rounds. In other words, it is ensured that
\begin{eqnarray} \label{feas}
	x_i^* r_{i}(t) \geq \lambda_i, ~\forall i \in \mathcal{P}, ~\forall t \in [T].
\end{eqnarray}  
The above requirement can be compactly expressed as: 
\begin{eqnarray} \label{rate-requirement}
	x_i^* \geq \frac{\lambda_i}{\min_{1\leq t\leq T} r_i(t)},~ \forall i \in \mathcal{P}. 
\end{eqnarray}
Although the constraint in Eq.\ \eqref{rate-requirement} seems strong, it can be relaxed by considering a number of consecutive rounds at a time: our results hold even when the denominator in Eq.\ \eqref{rate-requirement} is replaced with the minimum reward rate averaged over any window of constant size $w \geq 1.$ The proposed \texttt{BanditQ} policy is oblivious to the value of $w$.
The set of all feasible offline prediction distributions, given by the intersection of the constraints \eqref{rate-requirement} with the standard simplex $\Delta_N$, is denoted by $\Omega.$ It is easily verified that $\Omega$ is closed and convex, with diameter at most $D=\sqrt{2}.$ Our goal is to design an online user sampling policy $\{\bm{x}(t)\}_{t\geq 1}$ that achieves a sublinear regret against any benchmark in $\Omega$ over a horizon of a given length $T,$ \emph{i.e.,}
\begin{eqnarray} \label{regret_def}
	\textrm{Regret}_T \equiv \max_{\bm{x}^* \in \Omega}\sum_{t=1}^T \langle \bm{x}^*, \bm{r}(t)\rangle - \sum_{t=1}^T \langle \bm{x}(t), \bm{r}(t)\rangle,
\end{eqnarray}
while satisfying the long-term rate constraints. Formally, for any time interval $\mathcal{I} \subseteq [T]$, the rate constraint requires:
\begin{eqnarray*}
	\liminf_{|\mathcal{I}| \to \infty} |\mathcal{I}|^{-1}\sum_{t \in \mathcal{I}} r_i(t)x_i(t)  \geq \lambda_i, ~\forall i \in \mathcal{P}.  
\end{eqnarray*}
%In this paper, we will give a stronger \emph{uniform} guarantee of the reward rate achieved by our proposed policy.
We note two fundamental differences between the above problem and the standard Online Convex Optimization (OCO) framework \cite{orabona2019modern}. First, contrary to the OCO setting, where the benchmark set $\Omega$ is specified a priori and independently of the rewards, in this problem the set $\Omega$ depends on the rewards and remains unknown until the end of the horizon. Second, unlike the OCO setting, the action taken by the online policy on a round need not lie in the set $\Omega$, provided that the long-term reward rate requirements of the protected classes are met.
%Clearly, for the constraint \eqref{rate-requirement} to be feasible, it is 
The following conditions on the rates are necessary and sufficient for the feasibility condition \eqref{feas} to hold:
\begin{eqnarray} \label{nec-suff}
	\sum_{i \in \mathcal{P}} \frac{\lambda_i}{\min_{1\leq t\leq T} r_i(t)} \leq 1, ~\textrm{and} 
	~\lambda_i \leq \min_{1\leq t\leq T} r_i(t), \forall i\in \mathcal{P}.
\end{eqnarray}
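To see where these conditions come from, the following short argument and numerical illustration may help. The shorthand $m_i \equiv \min_{1\leq t\leq T} r_i(t)$ is introduced here for brevity, and the specific numbers below ($N=3$, $\mathcal{P}=\{1,2\}$, and the values of $\lambda_i$ and $m_i$) are hypothetical, chosen only for illustration. Any $\bm{x}^* \in \Delta_N$ satisfying \eqref{rate-requirement} obeys
\begin{eqnarray*}
	1 = \sum_{i=1}^N x_i^* \geq \sum_{i \in \mathcal{P}} x_i^* \geq \sum_{i \in \mathcal{P}} \frac{\lambda_i}{m_i},
\end{eqnarray*}
which gives the first condition, while $x_i^* \leq 1$ gives $\lambda_i \leq m_i$. Conversely, if the first condition holds, setting $x_i^* = \lambda_i/m_i$ for $i \in \mathcal{P}$ and spreading the remaining mass $1-\sum_{i\in \mathcal{P}} \lambda_i/m_i$ arbitrarily over the users yields a feasible allocation. For instance, with $m_1 = 0.5$, $m_2 = 0.8$, $\lambda_1 = 0.2$, and $\lambda_2 = 0.3$,
\begin{eqnarray*}
	\frac{\lambda_1}{m_1} + \frac{\lambda_2}{m_2} = 0.4 + 0.375 = 0.775 \leq 1,
\end{eqnarray*}
so that $\bm{x}^* = (0.4,\, 0.375,\, 0.225)$ is one feasible static allocation.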
In practice, the feasibility of a set of rate requirements can often be verified by assuming stationary rewards and using the historical reward data sequence. 