% \begin{figure*}[htb]
% \vspace{-3mm}
%   \centering
%   \begin{minipage}[t]{0.32\linewidth}
%     \centering
%     \includegraphics[width=\linewidth]{Img/1.1 a=0.01 eta=0.01 A=5.png}
%     \subcaption{Limited information}
%     \label{fig:subfig1}
%   \end{minipage}
%   \begin{minipage}[t]{0.32\linewidth}
%     \centering
%     \includegraphics[width=\linewidth]{Img/2.1 a=0.01 eta=0.01 A=5 gap=0.219.png}
%     \subcaption{Omniscient follower}
%     \label{fig:subfig2}
%   \end{minipage}
%   \begin{minipage}[t]{0.32\linewidth}
%     \centering
%     \includegraphics[width=\linewidth]{Img/2.2 a=0.01 eta=0.01 A=5 gap=0.377.png}
%     \subcaption{Noisy side information}
%     \label{fig:subfig3}
%   \end{minipage}
%   \caption{Experiments for the limited information and side information settings. Each experiment is run 5 times to calculate the mean and std (standard deviation). The shaded area shows $\pm$std.}
%   \label{fig:subfig}
%   \vspace{-2mm}
% \end{figure*}

\section{Empirical Results}

To validate the theoretical results, we conduct experiments on a synthetic Stackelberg game. Both players have five actions, i.e., $A\!=\!B\!=\!5$. The players' mean rewards $\mu_{l}(a,b)$ and $\mu_{f}(a,b)$ for each action pair $(a,b)$ are drawn i.i.d. from the uniform distribution $\text{U}(0,1)$. %The mean rewards constitute a reward matrix for each player. 
For any action pair $(a,b)$, the noisy rewards in round $t$ are drawn from Bernoulli distributions: $r_{l,t}(a,b)\sim \text{Ber}\left(\mu_{l}(a,b)\right)$ for the leader and $r_{f,t}(a,b)\sim \text{Ber}\left(\mu_{f}(a,b)\right)$ for the follower. All experiments were run on a MacBook Pro laptop with a 1.4 GHz quad-core Intel Core i5 processor, Intel Iris Plus Graphics 645 with 1536 MB of graphics memory, and 8 GB of RAM. 
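
The reward model above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used for the reported experiments: the helper names \texttt{make\_mean\_rewards}, \texttt{bernoulli}, and \texttt{play\_round}, as well as the seed, are hypothetical and chosen for readability.

```python
import random


def make_mean_rewards(num_leader_actions=5, num_follower_actions=5, rng=None):
    """Draw a mean-reward matrix with i.i.d. U(0,1) entries (hypothetical helper)."""
    rng = rng or random.Random()
    return [
        [rng.random() for _ in range(num_follower_actions)]
        for _ in range(num_leader_actions)
    ]


def bernoulli(p, rng):
    """Return 1 with probability p and 0 otherwise."""
    return 1 if rng.random() < p else 0


def play_round(mu_l, mu_f, a, b, rng):
    """Sample the noisy leader and follower rewards for the action pair (a, b)."""
    return bernoulli(mu_l[a][b], rng), bernoulli(mu_f[a][b], rng)


if __name__ == "__main__":
    rng = random.Random(0)  # assumed seed, only for reproducibility of the sketch
    mu_l = make_mean_rewards(rng=rng)
    mu_f = make_mean_rewards(rng=rng)
    r_l, r_f = play_round(mu_l, mu_f, a=0, b=0, rng=rng)
    print(r_l, r_f)
```

Each call to \texttt{play\_round} returns one pair of binary rewards whose expectations are $\mu_{l}(a,b)$ and $\mu_{f}(a,b)$; any learning algorithm for the leader or the follower would interact with the game only through such calls.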















Our first experiment validates the convergence of the no-regret learning algorithms EXP3-UCB and UCBE-UCB (Theorems~\ref{thm:exp3-ucb} and~\ref{thm:ucbe-ucb} in Section~\ref{sec:myopic_follower}). Figure~\ref{fig:subfig1} shows the leader's average regret as a function of the round $t$. The blue and orange curves correspond to EXP3-UCB ($\alpha\!=\!0.01$, $\eta\!=\!0.001$) and UCBE-UCB, respectively. The leader attains no Stackelberg regret under both algorithms, implying that the game converges to the Stackelberg equilibrium in both cases. 


The second experiment validates the intrinsic reward gap between FBM and BR when the follower is omniscient (Proposition~\ref{prop: average follower's reward} in Section~\ref{sec:farsighted_omniscient}). In this experiment, the leader uses EXP3 ($\alpha=0.01$, $\eta=0.001$), and the follower uses either BR (blue curve) or FBM (red curve). The results are shown in Figure~\ref{fig:subfig2}, where the x-axis is the number of rounds $t$ and the y-axis is the average follower reward. After sufficiently many rounds, the follower's average reward converges, and FBM yields a significant intrinsic advantage over BR of approximately 0.22 (the gap between the two dashed lines).
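
As a hedged illustration of how such curves can be computed, the following sketch forms the running average of a reward sequence and the difference between the final averages of two sequences. The helper names \texttt{running\_average} and \texttt{final\_gap} are hypothetical, the input sequences are assumed to be non-empty, and the difference of final empirical averages is only a finite-horizon proxy for the limiting gap indicated by the dashed lines.

```python
from itertools import accumulate


def running_average(rewards):
    """Return the list whose t-th entry is the mean of the first t rewards."""
    return [total / (t + 1) for t, total in enumerate(accumulate(rewards))]


def final_gap(rewards_x, rewards_y):
    """Difference between the final running averages of two non-empty sequences."""
    return running_average(rewards_x)[-1] - running_average(rewards_y)[-1]


if __name__ == "__main__":
    # Toy reward sequences, not experimental data.
    rewards_x = [1, 1, 0, 1]
    rewards_y = [0, 1, 0, 1]
    print(running_average(rewards_x))
    print(final_gap(rewards_x, rewards_y))
```

Plotting \texttt{running\_average} against the round index gives a curve of the same type as the average follower reward described above.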



The third experiment validates the intrinsic reward gap between FMUCB and UCB when the follower learns the leader's reward structure via noisy bandit feedback (Theorems~\ref{thm:EXP3-FMUCB} and~\ref{thm:UCBE-FMUCB} in Section~\ref{sec:farsighted_bandit}). The leader uses either EXP3 ($\alpha=0.1$, $\eta=0.001$) or UCBE, and we compare the average follower reward as a function of the round $t$ when the follower uses FMUCB versus the baseline best response learned by vanilla UCB. This yields four curves: EXP3-UCB, EXP3-FMUCB, UCBE-UCB, and UCBE-FMUCB. Regardless of whether the leader uses EXP3 or UCBE, FMUCB gives the follower a significant reward advantage of about 0.3 over UCB. 

