\section{Experiments} \label{simulation}
%In this section, we numerically compare the performance of \BQ policy with a set of state-of-the-art benchmarks \cmt{select the benchmarks}. 
\paragraph{Simulation Setup:}
We consider a problem instance with $N=5$ arms and $k=2$ protected classes, consisting of the first and the second arms. We arbitrarily set the mean reward vector of the arms to $\bm{\mu} = \begin{pmatrix} 0.335, 0.203, 0.241, 0.781, 0.617 \end{pmatrix}$, and the target reward rates for the first and the second arms to $\lambda_1= 0.167$ and $\lambda_2= 0.067$, respectively. From Eq.\ \eqref{feas-constr}, it can be verified that these target rates are feasible for this problem. Arm 4 has the highest mean reward among the five arms. We simulate the \BQ policy for $T=2\times 10^6$ rounds with the parameter $V =\sqrt{T} \approx 1414$. We implemented a custom optimizer, described in Appendix \ref{opt_implementation}, to solve the optimization subroutines efficiently. The simulation code is publicly available \citep{BanditQ_code}.
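
As a quick sanity check of the feasibility claim, suppose (this is an assumption about the form of Eq.\ \eqref{feas-constr}, not a restatement of it) that feasibility amounts to the existence of a stationary sampling distribution that pulls each protected arm $i$ with probability at least $\lambda_i/\mu_i$. Under that assumption, the minimum total sampling mass needed by the two protected arms is
\[
\frac{\lambda_1}{\mu_1} + \frac{\lambda_2}{\mu_2} = \frac{0.167}{0.335} + \frac{0.067}{0.203} \approx 0.499 + 0.330 = 0.829 \leq 1,
\]
so the targets can be met, with roughly $17\%$ of the pulls left unconstrained and available for the more rewarding arms.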
% for the \BQ policy in the bandit feedback set up. 
%We now report the performance of the \BQ policy under both full-information and bandit feedback. 
%\cmt{Compare the performance with an oracle policy that knows the rewards and hence, knows the optimal fraction of pulls.} 
\paragraph{Discussion:}
Figures \ref{rew_full}, \ref{q_full}, and \ref{reg_full} show the performance of the \BQ policy in the full-information setting. Figure \ref{rew_full} shows that the protected arms, Arm 1 and Arm 2, asymptotically meet their target rates. Since both Arm 1 and Arm 2 have sub-optimal expected rewards, they would have received asymptotically zero reward rates under an unfair prediction policy such as UCB. Figure \ref{q_full} shows the evolution of the queue length variables, and Figure \ref{reg_full} shows the regret of the \BQ policy. Negative values of the regret indicate that the cumulative reward of the \BQ policy exceeds that of the static benchmark policy. This is possible because the benchmark is forced to take actions from the restricted set $\Omega$ on \emph{all} rounds, a constraint that the \BQ policy does not need to respect on every round. Figures \ref{rew_bf}, \ref{q_bf}, and \ref{reg_bf} show the corresponding plots in the bandit feedback setting. As expected, the variables exhibit greater empirical variance under bandit feedback than their full-information counterparts, owing to the limited information available in each round. Nevertheless, the \BQ policy achieves the target rates in this case as well. See Appendix \ref{addl_sim2} for a similar experiment with $N=1000$ arms.
%In both cases, we find that the observed regret of the \BQ policy is much better than the theoretical bounds. This observation can be justified by the fact that we, in fact, used adversarial MAB policies in the benign i.i.d. stochastic setting which can take actions from a larger set $\Delta_N$. 
%The regret plots suggest that it might be possible to prove sublinear regret bound for the \BQ policy in the case of adversarial rewards as well. 
\paragraph{Additional experiments:} A comparison of the \BQ policy with a UCB-based oracle policy, proposed by \citet{li2019combinatorial}, is given in Appendix \ref{addl_sim1}. The oracle is assumed to know a feasible fraction of pulls that achieves the target rates. Figure \ref{rew-comp} shows that the proposed \BQ policy achieves a higher cumulative reward than the oracle policy because it chooses its actions adaptively.

%\begin{figure*}[h!]
%  \centering
%  \begin{minipage}[b]{0.3\linewidth}
%   \centering
%    \includegraphics[width=0.9\linewidth]{./Figures/Reward_rates_full_info_cropped.pdf}
%   \caption{\small{Reward accrual rates in the full-information setting}}
%   \label{rew_full}
%  \end{minipage}
%   \begin{minipage}[b]{0.3\linewidth}
%   \centering
%    \includegraphics[width=0.9\linewidth]{./Figures/Q_lengths_full_info_cropped.pdf}
%   \caption{\small{Queue lengths in the full-information setting}}
%   \label{q_full}
%  \end{minipage}
%   \begin{minipage}[b]{0.3\linewidth}
%   \centering
%    \includegraphics[width=0.9\linewidth]{./Figures/Regret_full_info_cropped.pdf}
%   \caption{\small{Regret of \BQ in the full-information setting}}
%   \label{reg_full}
%  \end{minipage}
%  \begin{minipage}[b]{0.3\linewidth}
%   \centering
%    \includegraphics[width=0.9\linewidth]{./Figures/Reward_rates_bandit_feedback_cropped.pdf}
%   \caption{\small{Reward accrual rates in the bandit feedback}}
%   \label{rew_bf}
%  \end{minipage}
%  %\hfill
%  \begin{minipage}[b]{0.3\linewidth}
%   \centering
%    \includegraphics[width=0.9\linewidth]{./Figures/Q_lengths_Bandit_feedback_cropped.pdf}
%   \caption{\small{Queue lengths in the bandit feedback setting}}
%   \label{q_bf}
%  \end{minipage}
% %\hfill
%   \begin{minipage}[b]{0.3\linewidth}
%   \centering
%    \includegraphics[width=0.9\linewidth]{./Figures/Regret_bandit_feedback_cropped.pdf}
%   \caption{\small{Regret of \BQ in the bandit feedback setting}}
%   \label{reg_bf}
%  \end{minipage}
  %\caption{Illustrating the performance of the \BQ policy under the full information and bandit feedback settings. The parameter $V$ was set to $\sqrt{T}$ in both cases.}
%\end{figure*}

