\section{Preliminaries}
A general-sum Stackelberg game is represented as a tuple $\{\mathcal{A}, \mathcal{B}, \mu_l, \mu_f\}$. Here $\mathcal{A}$ is the action set of the leader $l$ and $\mathcal{B}$ is the action set of the follower $f$, with $|\mathcal{A}|=A$ and $|\mathcal{B}|=B$. We denote by $\mathcal{A}\times\mathcal{B}$ the joint action set of the leader and the follower.
For any joint action $(a,b)\in \mathcal{A}\times \mathcal{B}$, we use $r_{l}(a, b) \in [0,1]$ and $r_{f}(a , b) \in [0,1]$ to denote the noisy rewards of the leader and the follower, respectively, with expectations $\mu_l(a,b) \in [0,1]$ and $\mu_f(a,b) \in [0,1]$. The follower has a response set to the leader's actions, $\mathcal{F}=\{\mathcal{F}(a) \mid a\in\mathcal{A}\}$, where $\mathcal{F}(a)$ is the follower's response to the leader's action $a$. %We also call $\mF$ as the follower's manipulation. 
The follower's \textit{best response} to $a$ is defined as the action that maximizes the follower's expected reward:
\begin{equation}
\mathcal{F}_{br}(a) = \argmax_{b\in\mB} \mu_f(a,b).
\end{equation}
We denote by $\mathcal{F}_{br}=\{\mathcal{F}_{br}(a) \mid a\in\mathcal{A}\}$ the best response set.
For simplicity, we assume that the best response to each action $a$ is unique and that the game has a unique Stackelberg equilibrium $(a_{se}, b_{se})$, given by
\begin{equation}
    a_{se} = \argmax_{a\in\mA} \mu_l(a,\mFb(a)), \quad b_{se}=\mathcal{F}_{br}(a_{se}).
\end{equation}
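To make these definitions concrete, consider a small illustrative game; the payoff values below are hypothetical and chosen only for this example. Let $\mathcal{A}=\{a_1,a_2\}$ and $\mathcal{B}=\{b_1,b_2\}$, with expected rewards
\[
\mu_l:\quad
\begin{array}{c|cc}
 & b_1 & b_2 \\ \hline
a_1 & 0.9 & 0.2 \\
a_2 & 0.6 & 0.5
\end{array}
\qquad\qquad
\mu_f:\quad
\begin{array}{c|cc}
 & b_1 & b_2 \\ \hline
a_1 & 0.3 & 0.7 \\
a_2 & 0.8 & 0.4
\end{array}
\]
Since $\mu_f(a_1,b_2)=0.7>\mu_f(a_1,b_1)=0.3$ and $\mu_f(a_2,b_1)=0.8>\mu_f(a_2,b_2)=0.4$, the best responses are $\mathcal{F}_{br}(a_1)=b_2$ and $\mathcal{F}_{br}(a_2)=b_1$. The leader therefore compares $\mu_l(a_1,\mathcal{F}_{br}(a_1))=0.2$ with $\mu_l(a_2,\mathcal{F}_{br}(a_2))=0.6$, which gives $(a_{se},b_{se})=(a_2,b_1)$. Note that the leader's most favorable joint action $(a_1,b_1)$, with $\mu_l(a_1,b_1)=0.9$, is not an equilibrium in this example because $b_1$ is not the follower's best response to $a_1$.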
A \textit{repeated} general-sum Stackelberg game is played over $T$ rounds. In each round $t$, the interaction proceeds as follows:
\begin{itemize}[leftmargin=*]
\vspace{-2mm}
\setlength\itemsep{0pt}
    \item
    The leader plays an action $a_t \in \mathcal{A}$.
    \item
    The follower observes the leader's action $a_t$ and plays the response $b_t = \mathcal{F}_t(a_t)$, where $\mathcal{F}_t$ denotes the follower's response set in round $t$.
    \item
    The leader receives a noisy reward $r_{l,t}(a_t,b_t)$.
    \item
    The follower receives reward information $r_{t}(a_t,b_t)$, whose form depends on the setting described below.
    \vspace{-2mm}
\end{itemize}

For the last step, motivated by the example in Table~\ref{tab:FM_example}, we study two settings that differ in the reward information available to the follower:
\begin{itemize}[leftmargin=*]\vspace{-2mm}
\setlength\itemsep{0pt}
    \item 
    \textbf{Limited information.} The follower observes only its own (noisy) reward: $r_{t}(a_t,b_t) = r_{f,t}(a_t,b_t)$.
    \item 
    \textbf{Side information.} The follower has extra side information: in addition to its own reward, it also knows the reward of the leader. This setting is further divided into two cases: an \textbf{omniscient follower}, who knows the exact reward functions $\mu_l$ and $\mu_{f}$ directly, and \textbf{noisy side information}, where the follower learns both rewards from noisy bandit feedback in each round, i.e., $r_{t}(a_t,b_t) = (r_{l,t}(a_t,b_t), r_{f,t}(a_t,b_t))$.
    \vspace{-2mm}
\end{itemize}

In the following, we present our results for the limited information setting (Section~\ref{sec:myopic_follower}), the omniscient follower setting (Section~\ref{sec:farsighted_omniscient}), and the noisy side information setting (Section~\ref{sec:farsighted_bandit}).
