\section{Introduction}   \label{statement} 
The vast majority of multi-armed bandit (MAB) algorithms deployed in practice are designed to maximize the cumulative reward. Consequently, these algorithms can end up systematically avoiding a subset of arms (which could represent users with certain demographic characteristics or historical activities) that the algorithm finds less rewarding \citep{sweeney2013discrimination}. In a typical case of algorithmic discrimination, Facebook was sued for targeting ads on housing, credit, and employment based on race, gender, and religion, all of which are protected classes under US law \citep{hao2019facebook}. A similar problem of fair allocation of resources arises in wireless settings, where schedulers that maximize the total throughput may fail to serve a subset of users with relatively poor channels. 
%Since the rewards are stochastic, playing an arm may result in random rewards. 
A number of papers have proposed to address the fairness problem by imposing an explicit constraint on the \emph{minimum frequency} of pulls for each arm. However, in many problems of practical interest, the algorithm designer is interested in guaranteeing a minimum rate of \emph{reward accrual} for each arm, not just a minimum frequency with which each arm is pulled. 

\textbf{Examples:} (1) In online ad allocation, the advertisers are primarily interested in maximizing their click-through rate, which fetches them monetary rewards, rather than just the number of times their ads are displayed against a search result. (2) In wireless scheduling problems, the users, who correspond to the bandit's arms in our formulation, are interested in guaranteed data rates rather than in how frequently they are scheduled, which is a low-level metric that is not visible to the users. 
%In online crowd-sourcing platforms, workers are primarily interested in the amount of money they make over a given span of time compared to the number of times they are assigned a job. 
(3) As a final example, consider a crowdsourcing platform (e.g., Amazon Mechanical Turk) where the workers receive payments for performing tasks \citep{fu2021fairness}. Upon completion of each task, the platform receives a fixed percentage of the payment as revenue. The goals of the platform are (a) to allocate the incoming tasks fairly among the workers and (b) to maximize the platform's total revenue. In our formulation, the workers correspond to the arms, and the revenue maximization problem (b) becomes equivalent to the regret minimization problem. However, without the fairness requirement (a), the platform would assign most of the jobs to the best-performing workers, effectively ignoring a vast majority of the registered workers, who may then leave the platform dissatisfied. Hence, the platform may suffer from a high attrition rate. One possible way to improve worker retention and make the platform non-discriminating is to ensure a guaranteed reward rate (equivalent to a minimum wage) for each registered worker. In this paper, we will see that the proposed \textsc{BanditQ} policy gives an efficient solution to each of the above problems. 
%Similar problems also arise in other settings, including user scheduling in wireless networks where the network operator wants to ensure guaranteed bit rates to the users.

Clearly, the rate of reward accrual of each arm depends on its unknown reward distribution, which needs to be learned along the way. In this paper, we solve this fair prediction problem in the stochastic setting via a black-box reduction to an adversarial MAB problem, using natural queueing dynamics to keep track of the target rates. Although we consider i.i.d.\ rewards, we will see that the use of adversarial MAB sub-routines is essential to account for the target reward rate constraints.   

\subsection{Related Works} \label{related}
There is an extensive literature on the classic multi-armed bandit (MAB) problem, where the objective is to sequentially play an arm on each round from a given set of arms with unknown reward distributions so as to maximize the cumulative reward. As the feedback is limited to the observed rewards only, the MAB problem naturally involves an exploration-exploitation trade-off. See \citet{cesa2006prediction, bubeck2012regret, lattimore2020bandit} for textbook treatments of MAB. The fair prediction problem considered in this paper belongs to a class of MAB problems with global constraints. Several authors have considered variants of the fair prediction problem in MAB with widely varying definitions of fairness \citep{joseph2016fairness, gillen2018online, bechavod2020metric, hossain2021fair, huang2022achieving}. Closer to our setting, \citet{patil2021achieving, claure2020multi}, and \citet{li2019combinatorial} considered a stochastic MAB problem in which the \emph{fraction} of pulls of each arm is required to exceed a given threshold. \citet{celis2019controlling} considered a similar problem in the personalized recommendation setting, where both the minimum and the maximum fraction of pulls are constrained in order to avoid the polarization of views. As in our work, \citet{li2019combinatorial} used a virtual queueing recursion to handle the fairness constraints. However, their UCB-based policy yields a regret bound that grows \emph{linearly} with the horizon length \citep[Theorem 2]{li2019combinatorial}. \citet{chen2020fair} considered the above problem in the contextual bandit setting and proposed a no-regret policy with a known context distribution. \citet{cai2018online} considered a related stochastic MAB problem with a long-term constraint on an auxiliary (level-$2$) reward process, which is assumed to be \emph{independent} of the main (level-$1$) rewards of the arms. 
In contrast, in our problem, the corresponding level-$1$ and level-$2$ reward processes are identical, and hence these results do not apply because the independence assumption fails. \citet{badanidiyuru2018bandits, immorlica2022adversarial}, and \citet{xia} considered the Bandits with Knapsacks (BwK) problem in the stochastic and adversarial settings. In this problem, a given resource budget is allocated to the arms at the beginning, and the policy continues until one of the arms exhausts its budget. \citet{immorlica2022adversarial} used a Lagrangian-based technique to design a no-regret policy for the BwK problem. 
%Similar to ours, they also employed an adversarial bandit policy in the stochastic setting. 
A recent paper by \citet{bistritz2022queue} considered a similar multiplayer multi-armed bandit problem with QoS constraints. However, they did not provide any regret bound. In this connection, we also mention a parallel line of work on fair resource allocation policies where, instead of meeting explicit constraints, the objective is to maximize a non-linear concave utility function of the cumulative rewards \citep{nofra}. Our problem is also closely connected to a recent series of works on Online Convex Optimization (OCO) with long-term constraints \citep{neely2017online, yu2017online, yuan2018online, castiglioni2022unifying}. While these papers propose problem-specific policies, we give a black-box reduction that uses an arbitrary adaptive learning policy as a subroutine and achieves state-of-the-art regret and constraint violation bounds. Furthermore, while most of the previous papers consider the full-information setting and/or assume strict feasibility (i.e., Slater's condition), we consider the more general bandit feedback setting \emph{without} making any additional assumptions. The Lyapunov-based technique presented in this paper has recently been extended to solve the problem of OCO with long-term constraints as well \citep{sinha2023playing, sinha2024tight}.

\subsection{Our contributions}
In contrast with a major line of work on fair MABs, which is mainly concerned with guaranteeing a minimum frequency of plays for each arm (\emph{procedural fairness}), in this paper, we initiate the study of a class of problems guaranteeing a minimum \emph{rate} of reward accruals for each arm (\emph{substantive fairness}).
%, which depends on the \emph{unknown} reward distributions of the arms. 
Compared to the standard MAB problem, here, the difficulty stems from the fact that in addition to playing the unknown best arm sufficiently many times, other arms with unknown mean rewards also ought to be played frequently enough so as to satisfy the given fairness constraints. Consequently, the design of our algorithm and its analysis proceed along a different line from that of the prior works. In particular, we claim the following contributions:

\begin{enumerate}
	%\item We use adversarial experts algorithm for solving a constrained prediction problem in the stochastic setting. The technical reason for this approach is that, the constrained problem is reduced to an instance of unconstrained problem with a new set of rewards which could be correlated in a complex fashion.
	\item We propose a fair learning policy for stochastic bandits, called \textsc{BanditQ}, via a \emph{black-box} reduction to the standard adversarial MAB problem. The problem is studied in both the full-information and the bandit-feedback settings. The proposed \BQ policy keeps track of the global reward rate constraints through an auxiliary queueing process, which is then used to recursively define the rewards for the unconstrained MAB problem. 
	
	\item An attractive feature of our policy is that it is completely oblivious to the algorithm used for the unconstrained MAB problem. In particular, \BQ can use \emph{any} existing MAB policy with a data-dependent adaptive regret bound. The key to this attractive separation result is a new \emph{self-bounding} inequality that bounds the sum of the regret and current rate violations in terms of past violations. 
	%This decomposition technique could be useful for studying other constrained sequential learning problems as well.
	\item We introduce a new proof technique that bounds the regret and rate violations by solving certain sequential inequalities. The proof arguments are crisp and utilize off-the-shelf adaptive regret bounds.
	\item We supplement our theoretical results with illustrative numerical experiments.
	%	\item We give a unified treatment for the full-information and bandit-feedback cases with almost identical proofs. 
	\end{enumerate}

\section{Problem formulation}  \label{model}  
We consider a regret minimization problem in the context of multi-armed bandits (MAB) with an additional fairness constraint. The fairness constraint requires that each arm in a given subset $\mathcal{P} \subseteq [N]$ (called the \emph{protected class}) must attain a pre-specified reward accrual rate; the target rates are assumed to be feasible. Formally, we consider an $N$-armed bandit, which on round $t$ receives an unknown reward vector $\bm{r}(t) \in [0,1]^N.$ The reward vectors are generated i.i.d.\ across rounds with an unknown expectation $\bm{\mu}.$ On round $t$, an online policy first decides on a probability distribution $\bm{x}(t) \in \Delta_N,$ where $\Delta_N$ denotes the set of all probability distributions supported on the $N$ arms. The policy then randomly samples an arm $I_t \in [N]$ from the distribution $\bm{x}(t)$\footnote{This protocol includes conditionally deterministic policies, such as \textsc{UCB}, where $\bm{x}(t)$ is supported on only one arm.}. Depending on the feedback structure, either the entire reward vector $\bm{r}(t)$ (in the case of full-information feedback) or only the reward of the sampled arm $r_{I_t}(t)$ (in the case of bandit feedback) is revealed to the policy at the end of round $t$. The above process continues for $T$ rounds.   
%Due to the linearity of the rewards, it is sufficient for the policy to output a sequence of distributions $\{\bm{x}(t)\}_{t\geq 1}.$
%This short paper focuses on the full-information setting where the entire reward vector is revealed to the learner at the end of each round. 
\paragraph{Fairness constraints:}
%If an arm $I_t$ is selected by the policy on round $t$, then this arm 
Due to the action of the policy, the selected arm $I_t$ receives a random reward of value $r_{I_t}(t).$ 
Hence, if on round $t$, the online policy samples arms according to the distribution $\bm{x}(t)$, the $i$\textsuperscript{th} arm receives a (conditional) expected reward of $x_i(t)\mathbb{E}r_i(t)=x_i(t)\mu_i,$ and the online policy receives an overall (conditional) expected reward of $\langle \bm{x}(t), \bm{\mu} \rangle.$ 
%Let $\mathcal{P} \subseteq [N]$ denote the subset of arms belonging to the \emph{protected classes}. 
Let $\bm{\vec{\lambda}}$ be the given vector of target reward rates.
The fairness constraint mandates that the long-term rate of rewards accrued by each arm $i \in \mathcal{P}$ must be at least $\lambda_i$ (see Eq.\ \eqref{rate-constr2}). For notational simplicity, we may assume that $\lambda_i=0, \forall i \in [N]\setminus \mathcal{P}.$ 
%We emphasize that guaranteeing lower bounds to the rate of reward accruals of the individual arms is different and more challenging than just constraining the frequency of sampling each arm without considering their accrued rewards. 
%To make the rate constraints feasible, 
%we assume that there is an offline probability distribution $\bm{x}^*$ over the arms with the above property. 
%we assume that there exists a feasible offline static allocation 
\paragraph{Offline Benchmark and Performance Metric:}
We compare the performance of an online policy against any fixed sampling distribution
$\bm{x}^* \in \Delta_N$ that meets the target reward rates. In other words, our comparator class $\Omega(\bm{\vec{\lambda}})$, indexed by the target vector $\bm{\vec{\lambda}}$, is defined as follows:
\begin{eqnarray} \label{feas}
	\Omega(\bm{\vec{\lambda}}) = \big\{\bm{x}^* \in \Delta_N: x_i^* \mu_i \geq \lambda_i, ~\forall i \in \mathcal{P}\big\}.
 %\nonumber\\
%	&&\textrm{and}~ \sum_i x_i^* =1, ~\bm{x}^* \geq \bm{0}\big\}.
\end{eqnarray}  
Clearly, in order for the target rate vector $\bm{\vec{\lambda}}$ to be feasible (\emph{i.e.,} $\Omega(\bm{\vec{\lambda}}) \neq \emptyset$), it is necessary and sufficient that 
\begin{eqnarray} \label{feas-constr}
	\sum_i \frac{\lambda_i}{\mu_i} \leq 1.
\end{eqnarray}
See Section \ref{feas-sec} in the Appendix for a brief discussion on the feasibility assumption. 
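
As a brief sketch of why condition \eqref{feas-constr} characterizes feasibility, assume that $\mu_i > 0$ for every $i \in \mathcal{P}$ and recall the convention $\lambda_i = 0$ for $i \notin \mathcal{P}$, so that only the protected arms contribute to the sum (both are assumptions made for this illustration). Any $\bm{x}^* \in \Omega(\bm{\vec{\lambda}})$ satisfies $x_i^* \geq \lambda_i/\mu_i$ for all $i \in \mathcal{P}$, and hence
\begin{eqnarray*}
	1 = \sum_{i=1}^N x_i^* \geq \sum_{i \in \mathcal{P}} x_i^* \geq \sum_{i \in \mathcal{P}} \frac{\lambda_i}{\mu_i}.
\end{eqnarray*}
Conversely, if the right-hand side is at most one, then setting $x_i^* = \lambda_i/\mu_i$ for $i \in \mathcal{P}$ and assigning the leftover probability mass $1 - \sum_{i \in \mathcal{P}} \lambda_i/\mu_i$ to any single arm yields a member of $\Omega(\bm{\vec{\lambda}})$. For a purely hypothetical instance (the numbers below are chosen for illustration only), take $N=3$, $\mathcal{P}=\{2,3\}$, $\bm{\mu} = (0.8, 0.5, 0.4)$, and $\bm{\vec{\lambda}} = (0, 0.1, 0.1)$. Then
\begin{eqnarray*}
	\sum_{i \in \mathcal{P}} \frac{\lambda_i}{\mu_i} = \frac{0.1}{0.5} + \frac{0.1}{0.4} = 0.45 \leq 1,
\end{eqnarray*}
so the target rates are feasible. Placing the leftover mass $0.55$ on the arm with the largest mean gives $\bm{x}^* = (0.55, 0.2, 0.25)$, which attains the largest expected per-round reward within $\Omega(\bm{\vec{\lambda}})$:
\begin{eqnarray*}
	\langle \bm{x}^*, \bm{\mu} \rangle = 0.55 \times 0.8 + 0.2 \times 0.5 + 0.25 \times 0.4 = 0.64,
\end{eqnarray*}
compared with $0.8$ for the unconstrained best arm. In this example, the gap of $0.16$ per round is the cost of meeting the reward rate targets.
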
%The above requirement can be compactly expressed as: 
%\begin{eqnarray} \label{rate-requirement}
%	x_i^* \geq \frac{\lambda_i}{\min_{1\leq t\leq T} r_i(t)},~ \forall i \in \mathcal{P}. 
%\end{eqnarray}
%Although the constraint in Eq.\ \eqref{rate-requirement} seems strong, clubbing together $w$ consecutive rounds at a time, our results hold even when the denominator in Eq.\ \eqref{rate-requirement} is replaced with the minimum rate averaged over any constant-sized window of size $w \geq 1.$ Note that the proposed \textsc{BanditQ} policy is oblivious to the value of $w$.
%Throughout the paper, we assume the feasibility of the required rate vector $\bm{\vec \lambda}$. In practice, we can ensure feasibility by estimating the expected rewards from past data and requiring that condition \eqref{feas-constr} is strictly satisfied with a reasonable margin. To put it quantitatively, let $\hat{\bm{\mu}}$ be the estimated expected reward vector where it is known that $||\hat{\bm{\mu}} -\bm{\mu}||_\infty \leq \epsilon,$ for a small error bound $\epsilon \geq 0$. Then, for the required reward rate vector $\bm{\vec \lambda}$ to be feasible, using the first-order Taylor's series expansion, it is sufficient that: 
%\begin{eqnarray} \label{feasibility_test}
%	\sum_i \frac{\lambda_i}{\hat{\mu_i}} + \epsilon \sum_i \frac{\lambda_i}{\hat{\mu_i}^2} \leq 1.
%\end{eqnarray}
%Although the estimated mean rewards can reasonably be used for determining the feasibility of the required reward rates, they cannot possibly be used for the online selection of the arms with no regret, as even a small constant error in the estimated rewards may lead to a linear regret. 
%
The set of all offline benchmarks $\Omega(\bm{\vec{\lambda}})$ is closed and convex, with a Euclidean diameter of at most $D=\sqrt{2}.$ Our goal is to design a sampling policy $\{\bm{x}(t)\}_{t\geq 1}$ that achieves a sublinear regret against any $\bm{x}^* \in \Omega(\bm{\vec{\lambda}}),$ where 
%we use the standard definition of (pseudo-) regret:
\begin{eqnarray} \label{regret_def}
	\textrm{Regret}_T(\bm{x}^*) \equiv 
	%\max_{\bm{x}^* \in \Omega(\bm{\vec{\lambda}})} 
	\langle \bm{x}^*, \bm{\mu}\rangle T - \mathbb{E}\sum_{t=1}^T \sum_{i=1}^N r_i(t)\mathds{1}(I_t=i),
\end{eqnarray}
while meeting the long-term reward rate constraints defined next\footnote{When we refer to the worst-case regret, we drop the argument $\bm{x}^*$ in parentheses in the regret definition \eqref{regret_def}.}. Asymptotically, for any time interval $\mathcal{I} \subseteq [T]$, the long-term rate constraint requires:
\begin{eqnarray} \label{rate-constr2}
	\liminf_{|\mathcal{I}| \to \infty} |\mathcal{I}|^{-1}\mathbb{E}\big[\sum_{t \in \mathcal{I}} r_i(t)\mathds{1}(I_t=i)\big]  \geq \lambda_i, ~\forall i \in \mathcal{P}.  
\end{eqnarray}
Note that Eq.\ \eqref{rate-constr2} requires the minimum reward rate guarantee to hold \emph{uniformly} across the time horizon for any sufficiently long interval of time. In other words, we require that no individual arm is starved for a long period of time, a problem left open by \citet{patil2021achieving}. Furthermore, following \citet{cai2018online}, we work with a fine-grained non-asymptotic metric, the \emph{rate violation penalty}, defined below:
\begin{eqnarray} \label{violation_penalty}
	\mathbb{V}(T) = \max_{i \in \mathcal{P}} \mathbb{E}\big[ \sum_{t=1}^T \big(\lambda_i - r_i(t) \mathds{1}(I_t=i)\big) \big].
\end{eqnarray} 
In brief, we seek to design an online policy for which \emph{both} $\textrm{Regret}_T$ and $\mathbb{V}(T)$ are sublinear in $T$.
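
To illustrate how a bound on the penalty translates into a reward rate guarantee, the following short derivation uses only the definition \eqref{violation_penalty}. For every $i \in \mathcal{P}$, we have $\lambda_i T - \mathbb{E}\big[\sum_{t=1}^T r_i(t)\mathds{1}(I_t=i)\big] \leq \mathbb{V}(T)$, and dividing by $T$ gives
\begin{eqnarray*}
	\frac{1}{T}\mathbb{E}\big[\sum_{t=1}^T r_i(t)\mathds{1}(I_t=i)\big] \geq \lambda_i - \frac{\mathbb{V}(T)}{T}, ~\forall i \in \mathcal{P}.
\end{eqnarray*}
Hence, whenever $\mathbb{V}(T)$ is sublinear in $T$, the time-averaged expected reward of each protected arm over the first $T$ rounds falls short of its target by a vanishing amount. For instance, under a hypothetical bound of the form $\mathbb{V}(T) \leq c\sqrt{T}$ for some constant $c > 0$ (an assumption made here only for illustration), the shortfall is at most $c/\sqrt{T}$. Note that this argument concerns the interval $[1,T]$ only; by itself it does not give the uniform guarantee over arbitrary long intervals required by Eq.\ \eqref{rate-constr2}.
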
%In this paper, we will give a stronger \emph{uniform} guarantee of the reward rate achieved by our proposed policy.
We note two fundamental differences between the above problem and the standard online learning framework \citep{orabona2019modern}. First, in contrast to the online learning setting, where the set of benchmarks $\Omega$ is specified a priori (independent of the rewards), in this problem, the set of benchmarks \eqref{feas} depends on the unknown reward distributions through their expectations $\bm{\mu}$. Second, unlike the online learning setting, the action taken by the policy on a round is not restricted to the set $\Omega(\bm{\vec{\lambda}})$, provided that the long-term target rates are met. Note that upon setting the vector $\bm{\vec{\lambda}}$ to zero, we recover the classic MAB problem as a special case. 
%In Appendix \ref{adapt}, we give a detailed discussion on the difficulty of adapting well-known bandit policies to this problem. 
In the following section, we introduce the \BQ policy in the full-information setting.