% !TEX root =  main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Related Work}

We relate our work to four lines of research: {\em Performative Prediction}, {\em Markov Games}, {\em Adversarial Markov Decision Processes}, and {\em Reinforcement Learning}. 
The latter two are discussed in Appendix~\ref{appdx.related-work}.

\textbf{Performative Prediction. }
The study of performative prediction was initiated by~\citet{PZM+20}, who investigate conditions under which repeated retraining converges to a performatively stable point.
This line of work has since been extended in various directions, including stochastic optimization~\citep{MPZH20},
finding performatively optimal points~\citep{MPZ21, izzo2021learn},
multi-agent settings~\citep{narang2023multiplayer, li2022multi}, and using performativity to measure the power of firms~\citep{hardt2022performative}.
\citet{mofakhami2023performative}
adopt a different set of assumptions and provide convergence guarantees even when the loss is not strongly convex in the model parameters.
Most related to our setting are works that consider performative prediction under gradual shifts in the data distribution~\citep{BHK20,LW22,RRL+22,izzo2022learn}, commonly known as \emph{stateful} performative prediction~\citep{BHK20}. All of the above works study performativity in supervised learning; in contrast, we consider reinforcement learning. We emphasize, however, that some of our results extend or are inspired by those of~\citet{BHK20}. Most notably, we extend delayed repeated retraining, an algorithm proposed by~\citet{BHK20}, to our RL setting and analyze its convergence guarantees. Furthermore, we introduce a novel algorithm inspired by delayed repeated retraining.

\textbf{Markov Games.} Our work is also related to the literature on stochastic or Markov games~\citep{shapley1953stochastic} and multi-agent reinforcement learning~\citep{zhang2021multi}.
Much of the focus in multi-agent RL has been on the computational and statistical aspects of learning Nash or correlated equilibria~\citep{daskalakis2023complexity,wei2017online,bai2020near,jin2022v}. Our setting is more closely related to multi-agent RL frameworks that consider Stackelberg or commitment policies~\citep{letchford2012computing,vorobeychik2012computing,dimitrakakis2017multi,zhong2021can}, where a principal agent commits to a policy to which one or more followers best respond. Computing optimal commitment policies is, in general, computationally intractable~\citep{letchford2012computing}. Hence, some restrictions on the followers' response models are needed to make learning computationally efficient~\citep{zhong2021can}.
%
Similarly, no-regret learning in a two-agent principal-follower setting, where the follower independently learns or changes its policy over time, is also computationally intractable in general~\citep{radanovic2019learning,bai2020near}.
However, if the dynamics of the follower's policy updates are not {\em adversarial}, tractable no-regret algorithms exist~\citep{radanovic2019learning}. These restrictions on the follower are similar in spirit to the setting and assumptions we consider in this paper. Our setting is, however, technically quite different: whereas these works focus on no-regret learning, we focus on performative RL and repeated retraining approaches.
