% !TEX root =  main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
When machine learning (ML) models are deployed in practice, they can affect the prediction target itself, causing a distribution shift. This problem has received significant attention in supervised learning, where it is termed \emph{performative prediction}~\citep{PZM+20}. In practice, it is often approached with {\em repeated retraining}: a practical solution for finding a (performatively) stable model, i.e., one that does not suffer from further distribution shift once deployed.

Recently, \cite{MTR23} considered a reinforcement learning (RL) variant of this problem setting. In RL, performativity manifests itself as a shift in the environment that depends on the policy deployed by the learner. For example, the environment can model users of an online platform (e.g., a recommender system or a chatbot), who adapt to changes in the policy of the RL agent that controls the platform.

\cite{MTR23} formalize this setting with a framework called \emph{performative RL}, where the dynamics of a Markov decision process (MDP) $M_t$ depend on the current policy $\pi_t$.
To find an approximately stable policy, they propose repeated retraining over the space of occupancy measures $\mathcal C(M_t)$ and the regularized objective 
\[\max_{d \in \mathcal C(M_t)} \sum_{s, a} 
r_t(s, a) \cdot d (s,a) - \lambda \|d\|_2, \]
where $r_t$ is the reward function of $M_t$, the sum goes over all possible states $s$ and actions $a$, and $\lambda$ is a regularization factor. 
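As an illustrative sketch only, one round of this procedure can be written as follows; the iterate $d_{t+1}$ and the normalization step that recovers a policy from an occupancy measure are notation assumed here for exposition and are not taken from \cite{MTR23}:
\[
d_{t+1} \in \operatorname*{arg\,max}_{d \in \mathcal C(M_t)} \; \sum_{s, a} r_t(s, a) \cdot d(s, a) - \lambda \|d\|_2,
\qquad
\pi_{t+1}(a \mid s) = \frac{d_{t+1}(s, a)}{\sum_{a'} d_{t+1}(s, a')},
\]
where the second expression is defined for states $s$ with $\sum_{a'} d_{t+1}(s, a') > 0$. Deploying $\pi_{t+1}$ then induces the next MDP $M_{t+1}$, and the step is repeated until the iterates stop changing.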

However, this framework assumes that the environment depends only on the deployed policy and is independent of the previous environment. In many practical scenarios, this assumption does not hold. Returning to the examples above, users are likely to exhibit learning behavior when interacting with the platform, and thus adapt their behavioral patterns gradually to changes in the platform rather than immediately after every change.
Thus, we consider an extension of the performative RL framework in which the underlying MDP $M_t$ changes gradually over time.
\input{1.1_main_table}
\paragraph{Contributions}
Following a similar line of work on performative prediction that considers gradual shifts in the distribution~\citep{BHK20,LW22,RRL+22,izzo2022learn}, we model this scenario by assuming that the underlying MDP $M_t$ depends on both the deployed policy $\pi_t$ and the MDP from the previous round, i.e., $M_{t-1}$. Our overall goal is to analyze different repeated retraining approaches and provide characterization results that compare these approaches along the following three measures: a)~attainable approximation quality (i.e., the minimum value of $\lambda$ for which convergence is guaranteed), b)~the number of retrainings that guarantees convergence (signifying the compute needed to converge), and c)~the sample complexity per deployment (signifying the number of data points that need to be collected).
Our main contributions are as follows:
\begin{itemize}
    \item \emph{Framework:} An extension of the performative RL framework that can model gradual environment shifts, and an extension of the DRR algorithm from \cite{BHK20}, suitable for our framework.
    \item \emph{Algorithm:} A novel repeated retraining algorithm, called MDRR, which, compared to repeated retraining (RR) and DRR, uses samples from multiple rounds of deployment, thereby reducing the number of samples needed per round.
    \item \emph{Characterization results:}
     A characterization of three repeated retraining approaches: canonical RR, DRR, and MDRR.
    Our analysis is a non-trivial combination of the proof techniques used by \cite{MTR23} and \cite{BHK20} and brings additional insights about regularization in performative RL.
    The overview of the results can be found in Table~\ref{table:overview}. 
    At a high level, our theoretical results suggest that DRR and MDRR fare better than RR in terms of the number of retrainings and sample complexity, as well as in terms of attainable approximation quality when the environment depends weakly on the current policy. When the environment depends strongly on the previous environment, MDRR fares better than RR and DRR in terms of samples per round.
    These results shed light on the role of regularization in performative RL and on the importance of utilizing historical data to reduce the required regularization, thus obtaining better approximation quality.
    \item \emph{Experiments:} Finally, we compare the algorithms in an experimental evaluation. In our experiments, MDRR outperforms RR and DRR in terms of both convergence speed and the quality of the solution obtained.
\end{itemize}