
\section{Related Work}
% zero-sum but not Stackelberg~\citep{balduzzi2019open, daskalakis2018last}
\noindent\textbf{Decentralized learning in simultaneous multi-agent games.}
This line of work has broad real-world applications, e.g., when multiple self-interested teams patrol the same targets in security domains~\citep{jiang2013defender} or wildlife conservation~\citep{wwf2015parkistan}, or when different countries independently plan their own actions against illegal fishing in international waters~\citep{klein2017can}. This category has been studied extensively~\citep{blum2007external,wu2022multi,jin2021v,meng2021decentralized,wei2021last}. The seminal work of \citet{blum2007external} shows that decentralized no-regret learning in multi-agent general-sum games leads to a coarse correlated equilibrium (CCE). This result has since been improved using more sophisticated methods such as optimistic hedge~\citep{daskalakis2021near,chen2020hedging,anagnostides2022near}. \citet{jin2021v,liu2021sharp} study a similar question in Markov games, which involve sequential decision-making and reinforcement learning.


\noindent\textbf{Learning in Stackelberg games.}
Our work also concerns decentralized learning, but in Stackelberg games, where the players act sequentially. \citet{lauffer2022no} study a setting different from ours, in which the leader first commits to a \textit{randomized} strategy, and the follower observes this strategy and best responds to it. Such a setting also appears in~\citet{balcan2015commitment}. In our setting, the leader first plays a \textit{deterministic} action, and the follower responds after observing it, which is closer to the setting of~\citet{bai2021sample,kao2022decentralized}. Many past works on learning in Stackelberg games, such as security games~\citep{blum2014learning,peng2019learning,balcan2015commitment,letchford2009learning}, focus only on learning from the leader's perspective, assuming access to an oracle for the follower's best response. More recently, there has been rising interest in Stackelberg games in which both players learn~\citep{goktas2022zero,sun2023zero,kao2022decentralized,zhao2023online}. These works focus on sub-classes of games with specific reward structures. \citet{goktas2022zero,sun2023zero} study zero-sum stochastic Stackelberg games, where the rewards of the two players sum to zero. \citet{kao2022decentralized,zhao2023online} study cooperative Stackelberg games, where the leader and the follower share the same reward. We study general-sum Stackelberg games, a more general setting that assumes no such reward structure. The lack of prior knowledge of the reward structure, however, makes the learning problem much more challenging. As a matter of fact, learning in repeated general-sum Stackelberg games remains an open problem. The setting of \citet{bai2021sample} is closer to ours: they also learn a general-sum Stackelberg equilibrium from noisy bandit feedback.
But their learning process needs a central device that queries each leader-follower action pair sufficiently many times, which makes it essentially a centralized or offline learning model. We study an online learning problem in repeated \textit{general-sum} Stackelberg games, where the players act in a \textit{decentralized} and \textit{strategic} manner. Finally, and importantly, a hidden assumption in these works is that the follower's best strategy is to best respond to the leader's actions. In addition to studying Stackelberg games with a best-responding follower (Section~\ref{sec:myopic_follower}), we also take the new perspective of a manipulative follower (Sections~\ref{sec:farsighted_omniscient}--\ref{sec:farsighted_bandit}), and show that manipulative strategies can indeed yield a higher payoff for the follower in an online learning setting.



\noindent\textbf{Follower manipulation in Stackelberg games.} 
It is known that the leader has the first-mover advantage in Stackelberg games: an optimal commitment in a Stackelberg equilibrium always yields the leader a payoff at least as high as in any Nash equilibrium of the same game~\citep{von2010leadership}. This result holds under the assumption that the leader has full information about the follower's reward, or that it can learn such information by interacting with a truthful follower~\citep{letchford2009learning,blum2014learning,haghtalab2016three,roth2016watch,peng2019learning}. In other words, these works assume that the follower always plays myopic best responses to the leader's strategy. Recent studies show that a follower can in fact manipulate a commitment leader, and induce an equilibrium different from the Stackelberg equilibrium, by misreporting its payoffs~\citep{gan2019manipulating,gan2019imitative,nguyen2019imitative,birmpas2020optimally,chen2022optimal,chen2023learning}. In parallel with these works, we consider an \textit{online} learning setup where \textit{both players} learn their best strategies from noisy bandit feedback. This is in contrast to the works above, which take the learning perspective of only the follower against a commitment leader in an offline setup.

