% !TEX root =  main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Repeated Retraining (RR)}\label{sec.rr}
One common approach in performative prediction is \emph{repeated retraining (RR)}, where the learner updates its policy in every round by best responding to the current environment. In this section, we establish conditions under which this approach converges to a stable occupancy measure. 

In RR we assume that the learner updates its policy every round in such a way that it is optimal for the regularized objective of the current MDP~$M_t$.
In particular, we define $d_{t+1}$ to be a solution to the following optimization problem.
\begin{align}
    \label{eq:regularized-rl-discounting}
       \max_{d\ge 0}&\   \sum_{s,a} d(s,a) r_t(s,a) - \frac{\lambda}{2}\norm{d}_2^2\\
       \textrm{s.t. } & \sum_a d(s,a) = \rho(s) + \gamma \cdot \sum_{s',a} d(s',a) P_t(s',a,s)\ \forall s
       \nonumber
\end{align}

We go on to show that RR converges to a stable occupancy measure.
\begin{theorem}[informal, details in Appendix~\ref{appdx.rr-exact}]\label{thm:standard-rr-simple}
    Assume that Assumption~\ref{assumption_sensitivity} holds
    and
    $\lambda=\calO\left(\frac{\sizeS^{5/2}}{(1-\epsilon)(1-\gamma)^4}\right)$.
    %\label{cond_lambda_r}
    Then for any $\delta > 0$ we have
    \begin{equation*}
    \norm{d_t - d_S}_2 \leq \delta
    \text{\quad for all } t \geq 
    \frac{\ln\left(\left(
        \frac{2}{1-\gamma} + \left(1+\sqrt{2}\right)\sqrt{\sizeS \sizeA}
        \right)/\delta\right)}{
        \ln\left(2/\left(1+\epsilon\right)\right)
        }\ .
    \end{equation*}
\end{theorem}
The bound on $\lambda$ in Theorem~\ref{thm:standard-rr-simple} is comparable to the one required in standard Performative RL; the only differences are in the constants and the $\epsilon$ factors.
The bound on the number of rounds $t$ in standard Performative RL is
$2\ln\left(\frac{2}{\delta(1-\gamma)}\right)/\left(1-\frac{\sizeS^{5/2}\epsilon}{\lambda(1-\gamma)^4}\right)$, which is comparable to the bound here, when only considering the $\delta$ parameter. 
We note that the bound on $t$ in Theorem~\ref{thm:standard-rr-simple} does depend on $\lambda$, but for simplicity of exposition we have substituted the lower bound on $\lambda$ for $\lambda$ itself. The full theorem can be found in Appendix~\ref{appdx.rr-exact}.
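
As a rough numerical illustration of the bound on $t$ (the parameter values here are hypothetical and chosen only for this example), take $\gamma=0.9$, $\sizeS\sizeA=100$, $\delta=0.1$ and $\epsilon=0.5$. Then
\begin{equation*}
\frac{\ln\left(\left(\frac{2}{1-\gamma} + \left(1+\sqrt{2}\right)\sqrt{\sizeS\sizeA}\right)/\delta\right)}{\ln\left(2/\left(1+\epsilon\right)\right)}
\approx \frac{\ln(441.4)}{\ln(4/3)}
\approx \frac{6.09}{0.288}
\approx 21.2\ ,
\end{equation*}
so the guarantee applies from round $t=22$ onwards. With $\epsilon=0.9$ instead, the denominator shrinks to $\ln(2/1.9)\approx 0.051$ and the bound grows to roughly $119$ rounds, which shows that the dependence on $\epsilon$ dominates the logarithmic dependence on $\delta$ in this example.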

The proofs of this paper are found in the appendix.
In general, the proofs rely on a non-trivial combination of adapting arguments from~\cite{BHK20} to the RL setting and using results from~\cite{MTR23}. 
Additionally, we extend the analysis by introducing a distinction between the parameter $\iota$ indicating how the environment adapts to a deployed policy, and $\epsilon$ indicating how strongly the environment depends on the previous environment. 
We therefore view our main contribution in this section and Section~\ref{sec.drr} as bridging the gap between the theoretical findings of~\cite{MTR23} and the often more realistic assumptions made by history-dependence, as in~\cite{BHK20}.
In Section~\ref{sec.mdrr} we introduce a novel algorithm.

\subsection{Finite Sample Guarantees}
Theorem~\ref{thm:standard-rr-simple} assumes that the learner knows the exact environment when updating its policy. In practice, this is usually too strong an assumption, since the learner typically has access to only a finite number of samples drawn via the deployed policy on the adapted environment. In this subsection, we first discuss some general considerations for this new setting and then show that RR also converges here.

\paragraph{Update rule for RR}
The learner has access to a set $F_t$ of i.i.d. samples in each round $t$.
In round $t$, let $m_t := \abs{F_t}$ be the number of samples.

As in prior work, we use the following \emph{empirical Lagrangian} to devise an optimization problem in the finite sample setting~\citep{MTR23}.
\begin{align}\label{eq:empirical-Lagrangian}
    \begin{split}
    \hat{\calL}(d,h; t) =  - \frac{\lambda}{2} \norm{d}_2^2 + \sum_s h(s) \rho(s) 
    \\+
    \sum_{(s,a,r,s')\in F_t} \frac{d(s,a)}{\bar{d}_t(s,a)} \cdot \frac{ r - h(s) + \gamma h(s')  }{m_t(1-\gamma)}
    \end{split}
\end{align}
Here $h$ is the Lagrange multiplier with one entry for each $s\in S$.

The empirical Lagrangian is defined in such a way that when we take its expectation over samples, we obtain the exact Lagrangian $\calL$ of optimization~\eqref{eq:regularized-rl-discounting}. One can show that the empirical Lagrangian~$\hat{\calL}$ lies in a neighborhood of the true Lagrangian $\calL$ with high probability.
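
The following sketch indicates why the expectation recovers the exact Lagrangian. For illustration only, we assume here that each sample $(s,a,r,s')\in F_t$ has $(s,a)$ drawn from the normalized occupancy measure $(1-\gamma)\bar{d}_t$, that $r = r_t(s,a)$, that $s'\sim P_t(s,a,\cdot)$, and that $\bar{d}_t(s,a)>0$ wherever $d(s,a)>0$. Under these assumptions, a single sample contributes
\begin{align*}
\mathbb{E}\left[\frac{d(s,a)}{\bar{d}_t(s,a)}\cdot\frac{r - h(s) + \gamma h(s')}{1-\gamma}\right]
&= \sum_{s,a} (1-\gamma)\,\bar{d}_t(s,a)\,\frac{d(s,a)}{\bar{d}_t(s,a)}\cdot
\frac{r_t(s,a) - h(s) + \gamma \sum_{s'} P_t(s,a,s')h(s')}{1-\gamma}\\
&= \sum_{s,a} d(s,a)\Big(r_t(s,a) - h(s) + \gamma \sum_{s'} P_t(s,a,s')h(s')\Big)\ .
\end{align*}
Averaging over the $m_t$ samples leaves this value unchanged. Adding the two deterministic terms of~\eqref{eq:empirical-Lagrangian} and exchanging the roles of $s$ and $s'$ in the last sum gives
\begin{equation*}
\mathbb{E}\left[\hat{\calL}(d,h;t)\right]
= \sum_{s,a} d(s,a) r_t(s,a) - \frac{\lambda}{2}\norm{d}_2^2
+ \sum_s h(s)\Big(\rho(s) + \gamma \sum_{s',a} d(s',a)P_t(s',a,s) - \sum_a d(s,a)\Big)\ ,
\end{equation*}
which is the objective of~\eqref{eq:regularized-rl-discounting} plus the multiplier $h(s)$ times the constraint for each state $s$.
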
The learner repeatedly solves
\begin{align}\label{eq:repeated-optim-finite}
    (d_{t+1}, h_{t+1}) = \argmax_{d} \ \argmin_{h} \hat{\calL}(d,h; t)\ .
\end{align}

We need a further assumption, which ensures an overlap in the occupancy measure between the behavioral policy and the target policy space.
This assumption is standard in offline RL~\citep{munos2008finite, zhan2022offline, MTR23}.
Without such an overlap, it is unclear how the learner would compute an optimal policy.
\begin{assumption}\label{assumption-offline-rl} 
    Assume we are given an integer $k$.
    Given occupancy measure $d$, initial transition probability function $P_0$ and 
    initial reward function $r_0$, let 
    $P_t$ and $r_t$ be the result after the learner deploys $\pi^d$ for 
    $t$
    rounds.
    Let $d_t^*$ be the solution to optimization problem~\eqref{eq:regularized-rl-discounting}.
    Let $\bar{d}_t$ be the occupancy measure of $\pi^d$ in 
    $P_t$. Then there exists $B>0$ such that
    for all $d$ and $t\leq k$ it holds that
    \begin{equation*}
    \max_{s,a}\left|\frac{d_{k}^*(s,a)}{\bar{d}_t(s,a)}\right|\leq B\ .
    \end{equation*}
\end{assumption}
Note that we only need overlap for state-action pairs where the optimal occupancy measure $d^*_k$ is non-zero: $\bar{d}_t(s,a)=0$ is allowed for some $t\leq k$ only if $d_k^*(s,a)=0$ as well. 

% We claim that this is a realistic assumption because the learner can control the policy used. Thus, they can ensure that the probability of choosing any action $a$ in any state $s$ is lower bounded by some positive value. This would lead to a non-zero probability of reaching any reachable state-action pair and thus ensure that Assumption~\ref{assumption-offline-rl} is satisfied. (This is achieved e.g. in $\epsilon$-greedy exploration.)

We can then show the following guarantee for RR.
\begin{theorem}[informal, details in Appendix~\ref{appdx.rr-finite}]\label{thm:finite-samples-RR-simple}
    Suppose that overlap Assumption~\ref{assumption-offline-rl} holds for $k=1$
    with parameter $B$ and 
    Assumption~\ref{assumption_sensitivity} holds.
    Let $p>0$.
    Then for 
    $\lambda= \mathcal{O}\left(\frac{\epsilon(\sizeS+\gamma \sizeS^{5/2})}{(1-\epsilon)(1-\gamma)^4} 
        \right)$,
    $m_t = \rrEmpNumSamples$\footnote{\label{ftn.tildeO}Here we ignore all terms which are logarithmic in $\sizeS$, $\sizeA$ and $1/\delta$.}
        and 
    for any $\delta > 0$, with probability at least $1-p$,
    \begin{equation*}
    \norm{d_t - d_S}_2 \leq \delta
    \text{\ \ for all }  t\geq 
    \frac{\ln\left(\frac{\frac{2}{1-\gamma}+\left(1+\sqrt{2}\right)\sqrt{\sizeS\sizeA}}{\delta}\right)}
        {\ln\left(4/\left(3+\epsilon\right)\right)}
    + 1\ .
    \end{equation*}
\end{theorem}
The bounds here are similar to the bounds in standard Performative RL.
For $\lambda$, there is no $\gamma$ in the numerator and the $\epsilon$ parameters are a bit different in the standard setting. The number of retrainings also has a factor of  $\ln(1/((1-\gamma)\delta))$ in standard Performative RL.

\section{Delayed Repeated Retraining}\label{sec.drr}
\newcommand{\subround}{g}
A different approach, inspired by~\cite{BHK20}, is to update the policy not in every round, but only after every $k$ rounds.
Then the policy is updated using only the environment from the last round of the $k$ deployments.
Algorithm~\ref{algo:delayed-rr-simple} illustrates this approach, called 
\emph{Delayed Repeated Retraining} (DRR).

The advantage of DRR is that while the same policy is deployed repeatedly, the MDP can partially stabilize, so the learner might need fewer retrainings and therefore less compute.

For the result, we use the following definition.
\begin{definition}\label{def.distPr}
Let $\distpr$ be the maximal distance between any environment and its successive environment, i.e.
\begin{equation*}
\distpr := \max_{P, r, d}\left(\|\Pc(d,P,r) - P\|_2 + \|\Rc(d,P,r)-r\|_2 \right).
\end{equation*}
\end{definition}
\begin{theorem}[informal, details in Appendix~\ref{appdx.sec.drr-exact}]\label{thm:delayed_rr_standard-simple}
    Let $d_i$ be computed by DRR with $k=\ln^{-1}\left(\frac{1}{\epsilon}\right)\ln\left(\frac{\distpr}{\delta\iota}\right)$.
    Suppose Assumption~\ref{assumption_sensitivity} holds 
    and $\lambda = \calO\left(\frac{\iota\cdot \sizeS^{5/2}}{(1-\epsilon)(1-\gamma)^4}\right)$.
    Then for any $\delta>0$, we have
    \begin{equation*}
    \norm{d_i - d_S}_2 \leq \delta
    \text{\quad for all } i\geq \ln\left(\left(\frac{2}{1-\gamma}\right)/\delta\right)\ .
    \end{equation*}
\end{theorem}
The regularization parameter $\lambda$ has an $\iota$ factor in DRR, but not in RR (see Theorem~\ref{thm:standard-rr-simple}).
The factor $\iota$ is close to $0$ if the MDP does not react strongly to the current policy.
In such settings, the conditions for $\lambda$ in DRR are substantially relaxed.
In addition, the number of retrainings required for DRR is much smaller than for RR.
However, RR may require fewer total rounds than DRR.
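
To illustrate the last two statements, consider the following hypothetical values, chosen only for this example: $\gamma=0.9$, $\delta=0.1$, $\epsilon=0.5$, $\iota=0.1$ and $\distpr=1$. Theorem~\ref{thm:delayed_rr_standard-simple} then gives
\begin{equation*}
k = \frac{\ln\left(\frac{\distpr}{\delta\iota}\right)}{\ln\left(\frac{1}{\epsilon}\right)}
= \frac{\ln(100)}{\ln(2)} \approx 6.64
\qquad\text{and}\qquad
i \geq \ln\left(\left(\frac{2}{1-\gamma}\right)/\delta\right) = \ln(200) \approx 5.30\ .
\end{equation*}
Rounding up, DRR deploys each policy for $k=7$ rounds and needs $6$ retrainings, i.e. $6\cdot 7 = 42$ rounds in total. For the same $\gamma$, $\delta$ and $\epsilon$, and assuming additionally $\sizeS\sizeA=100$, the bound of Theorem~\ref{thm:standard-rr-simple} evaluates to roughly $21.2$, so RR needs $22$ rounds, but retrains in each of them. These numbers only compare the informal bounds and ignore the different requirements on $\lambda$.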

\begin{algorithm}
    \caption{Delayed Repeated Retraining}
    \label{algo:delayed-rr-simple}
    \begin{algorithmic}[1]
    \STATE {\bfseries Input:} radius $\delta$, 
    initial transition probability $P_0$ and reward function $r_0$, initial occupancy measure ${d_0}$, number of deployments~$k$

    \FOR{$i=0,1,2,\dots$}
        \FOR{$\subround =1, \dots, k$}
            \STATE \COMMENT{deploy $\pi_{d_i}$:}
            %\STATE deploy $\pi_{d_i}$

            \STATE $P_{i\cdot k+\subround }\leftarrow \Pc(d_i, P_{i\cdot k+\subround -1}, r_{i\cdot k+\subround -1})$

            \STATE $r_{i\cdot k+\subround }\leftarrow \Rc(d_i, P_{i\cdot k+\subround -1}, r_{i\cdot k+\subround -1})$
            
        \ENDFOR
        
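        \STATE Update policy to $\pi_{d_{i+1}}$, where $d_{i+1}$ solves~\eqref{eq:regularized-rl-discounting} for $P_{(i+1)\cdot k}$ and $r_{(i+1)\cdot k}$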
        \label{delayed_update_d-simple}
    \ENDFOR
    \end{algorithmic}
\end{algorithm}

\subsection{Finite Sample Guarantees}
In DRR with finite samples, the learner again applies the 
same policy for several rounds. After that it updates its policy using samples drawn from the most recent environment. 
For this, the learner uses optimization problem~\eqref{eq:repeated-optim-finite}.
\begin{theorem}[informal, details in Appendix~\ref{appdx.sec.drr-finite}]
    Let $d_i$ be computed by finite sample DRR with $k = 
    \ln^{-1}\left(\frac{1}{\epsilon}\right)\ln\left(\frac{5\cdot\distpr}{
    \delta \iota }\right)$.
    Suppose that Assumption~\ref{assumption_sensitivity} holds and Assumption~\ref{assumption-offline-rl} holds for $k$ and parameter $B$.
    Let $p>0$.
    Furthermore assume $\lambda= \mathcal{O}\left(\frac{\iota(\sizeS+\gamma \sizeS^{5/2})}{
        (1-\epsilon)(1-\gamma)^4}\right)$.
    %$\xi =\frac{36\sizeS^{1.5}(B+\sqrt{\sizeA})}{\delta^2(1-\gamma)^3}$   
    Then for $m_i =\drrEmpNumSamples$\footref{ftn.tildeO}, and 
    any $\delta > 0$, with probability at least $1-p$,
    \begin{equation*}
    \norm{d_i - d_S}_2 \leq \delta 
    \text{\quad for all } i\geq 
   \frac{\ln\left(\frac{2}{1-\gamma}/\delta\right)}{
   \ln\left(4/\left(3+\epsilon\right)\right)
        } 
    + 1.
    \end{equation*}
    \label{thm:finite-samples-drr-standard-simple}
\end{theorem}
In this result, $\lambda$ has a factor of $\iota$, whereas RR has a factor of $\epsilon$ (see Theorem~\ref{thm:finite-samples-RR-simple}).
In prior work, the difference between $\epsilon$ and $\iota$ was ignored and the two were assumed to be the same~\citep{BHK20}.
As we see here, however, interesting properties emerge when we explicitly allow them to differ. In settings where the environment does not respond strongly to the current policy, but strongly depends on the previous environment, $\epsilon$ is larger than $\iota$, substantially relaxing the conditions on $\lambda$ for DRR. 
DRR also requires fewer samples, by a factor of $(1-\epsilon)^4$ when assuming equal $\lambda$, and fewer retrainings.
Still, RR may need fewer rounds overall, because DRR only retrains every $k$th round. Assumption~\ref{assumption-offline-rl} is stricter for DRR, because it has a larger $k$-parameter than RR.
The $k$ parameter in Assumption~\ref{assumption-offline-rl} indicates how far into future rounds the overlap of occupancy measures has to reach.
\section{Mixed Delayed Repeated Retraining (MDRR)}\label{sec.mdrr}
Consider a scenario where in each round the learner gets a limited number of samples from the MDP.
In this scenario, in each training step DRR would use samples from one round only.
But using samples from multiple rounds would allow the learner to use more samples overall, reducing variance and potentially improving convergence.

However, it is challenging to determine how the learner should combine samples from multiple rounds.
Should they optimize using all available samples collectively, or should they use more samples from recent rounds and fewer from older ones? Additionally, it is uncertain whether such a method would converge and, if so, whether it would offer any benefits. 
To address these questions, we present a novel algorithm that:
\begin{itemize}
\item Uses samples from multiple rounds.
\item Allows for prioritizing recent samples while still incorporating older ones.
\item If the response of the environment depends strongly on the previous MDP, achieves convergence with fewer samples per round. If additionally the number of provided samples per deployment is low, it provides better approximation guarantees.
\end{itemize}

The algorithm uses a new optimization problem, 
which can be viewed as an extension of the previous empirical Lagrangian~\eqref{eq:empirical-Lagrangian} to multiple rounds:
\newcommand{\numsam}{U}
\begin{align}
\begin{split}
    &\hat{\calL}^M(d,h,i) = - \frac{\lambda}{2} \norm{d}_2^2 
    + \sum_s h(s) \rho(s) 
    \\ &
    + \sum_{\subround =1}^{k} \sum_{\genfrac{}{}{0pt}{}{(s,a,r,s')}{\in F_{i\cdot k+\subround }}}
    \frac{1}{\numsam_i}\frac{d(s,a)}{\bar{d}_{i\cdot k+\subround }(s,a)}
    \frac{r - h(s) + \gamma h(s')}{1-\gamma}
\end{split}
\label{eq:lagrangian-finite-mdrr}
\end{align}
Here we define by $\bar{d}_{i\cdot k+\subround }$ the occupancy measure of policy $\pi_{d_i}$ under dynamics~$P_{i\cdot k+\subround }$.
$\numsam_i$ denotes the total number of samples, i.e. $\numsam_i\defeq \sum_{\subround =1}^{k}\abs{F_{i\cdot k+\subround }}$.
The learner thus optimizes over samples from multiple rounds of deployment.

But there is an inherent trade-off: recent samples contain more information about the current environment, but using earlier 
samples allows the total set of samples to be larger.

To balance this trade-off, the approach here is to use more samples from recent rounds and fewer samples from early rounds.
For illustration, assume that the learner has not updated its policy since
MDP $M_{i\cdot k}=(\setS, \setA, P_{i\cdot k}, r_{i\cdot k}, \rho)$ and updates every $k$ rounds.
Then they might take $m$ samples from $M_{i\cdot k+1}$, $mv$ samples from $M_{i\cdot k+2}$ (for $v>1$), 
$mv^2$ samples from $M_{i\cdot k+3}$,
\dots, and $mv^{k-1}$ samples from $M_{i\cdot k+k}$.
If $v$ is close to $1$, the learner takes an approximately equal number of samples from all rounds. 
If $v$ is large and $m$ is small, the learner focuses more on recent rounds.
The pseudocode for this approach is shown in Algorithm~\ref{algo:mdrr}; we call it
\emph{Mixed Delayed Repeated Retraining} (MDRR).

\begin{algorithm}
    \caption{Mixed DRR (MDRR)}
    \label{algo:mdrr}
    \begin{algorithmic}[1]
    \STATE {\bfseries Input:} radius $\delta$,
    initial $P_0$ and $r_0$, initial occupancy measure ${d_0}$, hyperparameters $v$ and $k$,
    total number of samples $\numsam_i$ for each retraining round $i$

    \FOR{$i=0,1,2,\dots$}
        \FOR{$\subround  = 1, \dots, k$}
            %\STATE \COMMENT{deploy $\pi_{d_i}$:}

            \STATE $P_{i\cdot k+\subround }\leftarrow \Pc(d_i, P_{i\cdot k+\subround -1}, r_{i\cdot k+\subround -1})$

            \STATE $r_{i\cdot k+\subround }\leftarrow \Rc(d_i, P_{i\cdot k+\subround -1}, r_{i\cdot k+\subround -1})$

            \STATE $F_{i\cdot k + \subround }\leftarrow$ draw $\frac{v-1}{v^k-1}v^{\subround -1}\numsam_i$ samples from $(P_{i\cdot k+\subround }, r_{i\cdot k+\subround })$\label{line.mdrr-draw}
        \ENDFOR
        
        %\STATE \COMMENT{update the policy} 
        
        \STATE Update occupancy measure $d_{i+1}\leftarrow 
        \on{arg}\max_d \min_h \hat{\calL}^M(d,h,i)
        %\what{\MR}_{\boldsymbol{w}}^k\left(d_i, P_{i\cdot k}, r_{i\cdot k}\right)
        $\label{eq:minmaxMdrrAlgo}
    \ENDFOR
    \end{algorithmic}
\end{algorithm}

In MDRR the learner uses $m_{i\cdot k+\subround }=\frac{v-1}{v^k-1}v^{\subround -1} \numsam_i$ samples from environment $(P_{i\cdot k+\subround }, r_{i\cdot k+\subround })$ (for each $\subround =1,\dots, k$), where $\numsam_i$ denotes the total number of samples used to compute $d_{i+1}$.
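
As a quick sanity check, the geometric series shows that these per-round sample sizes indeed sum to $\numsam_i$:
\begin{equation*}
\sum_{\subround=1}^{k} \frac{v-1}{v^k-1}\,v^{\subround-1}\,\numsam_i
= \frac{v-1}{v^k-1}\cdot\frac{v^k-1}{v-1}\cdot\numsam_i
= \numsam_i\ .
\end{equation*}
For instance, with the hypothetical values $v=2$, $k=3$ and $\numsam_i=700$ (chosen only for illustration), the weights are $\frac{1}{7}$, $\frac{2}{7}$ and $\frac{4}{7}$, so the learner draws $100$, $200$ and $400$ samples in the three rounds. In general the expression need not be an integer, so an implementation would have to round the per-round sample sizes.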

\begin{theorem}[informal, details in Appendix~\ref{appdx.sec.mdrr-theorem-sec}]
    Let $d_i$ be computed by MDRR with
    $k\geq 
    \frac{\ln\left(\frac{\epsilon(v-1)}{v\epsilon-1}\right)+\ln\left(
    \frac{5(1-\epsilon)\distpr}{\iota\delta}\right)
    }{\ln\left(1/\epsilon\right)}$.
    Suppose that Assumption~\ref{assumption_sensitivity} holds and the overlap Assumption~\ref{assumption-offline-rl} holds for $k$ and parameter $B$.
    Let $p>0$.
    Also assume that $\lambda = \mathcal{O}\left(\frac{\iota(\sizeS+\gamma \sizeS^{5/2})}{
        (1-\epsilon)(1-\gamma)^4}\right)$.
    %$\xi =\frac{36\sizeS^{1.5}(B+\sqrt{\sizeA})}{\delta^2(1-\gamma)^3}$.
    Further let
    $\numsam_i = \drrEmpNumSamples$\footref{ftn.tildeO} be the total number of samples in retraining-round $i$
    and $v>\frac{1}{\epsilon}$.
    Then for any $\delta > 0$, with probability at least $1-p$,
    \begin{equation*}
    \norm{d_i - d_S}_2 \leq \delta
    \text{\quad for all } i \geq 
    \frac{\ln\left(\frac{2}{1-\gamma}/\delta\right)}{
    \ln\left(4/\left(3+\epsilon\right)\right)
        } +1\ .
    \end{equation*}
    %`Algorithm \ref{algo:delayed-finite-samples-mixture} converges'
    \label{thm:edrr-mixed-response-simple}
\end{theorem}
% The main challenge of showing this result is to analyze how to combine and weigh samples from previous rounds such that good convergence properties can be shown. 
The proof of this result involves showing that the empirical Lagrangian~\eqref{eq:lagrangian-finite-mdrr} approximates an exact Lagrangian of the optimization problem where the MDP is a mixture of MDPs from different rounds. 
We then show that the solution to this optimization problem approximates the solution of an exact one-step update with the limiting MDP (i.e. the MDP which the environment converges to if the learner repeatedly applies the current policy).
In a last step we apply arguments similar to the proof of convergence for DRR.

To compare MDRR to RR and DRR, let's first consider the case when $\epsilon$ is close to $1$. This holds when the environment responds strongly to the old environment, for example when the new environment after one step is a slight alteration of the old environment.  We expect this property to hold in many applications, because the environment shift typically happens only slowly over time. We anticipate that MDRR performs particularly well in those settings, because it uses samples from old environments, and if those environments are close to the current environment, those samples are more informative. And indeed, this is what we observe. The number of samples required in line~\ref{line.mdrr-draw} of MDRR is smaller by a factor of $\frac{v^k-v^{k-1}}{v^k-1}$, which converges to $(v-1)/v$ for large $k$. When $\epsilon$ is close to $1$, we can set $v$ close to $1$, resulting in a significant decrease in the required number of samples. 
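
To give a concrete sense of this factor, consider the hypothetical values $\epsilon=0.95$, $v=1.1$ and $k=20$ (chosen only for illustration; they satisfy $v>1/\epsilon\approx 1.053$, and we ignore the lower bound on $k$ here). Then
\begin{equation*}
\frac{v^k-v^{k-1}}{v^k-1} = \frac{(v-1)\,v^{k-1}}{v^k-1}
\approx \frac{0.1\cdot 6.116}{6.727-1} \approx 0.107\ ,
\end{equation*}
which is already close to the limit $(v-1)/v\approx 0.091$. In this example, the largest number of samples drawn in any single round, namely in the last round $\subround=k$, is thus about $11\%$ of the total $\numsam_i$.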

\input{6.1_figures}

The regularization parameter $\lambda$ is the same as for DRR and has a factor of $\iota$, whereas for RR it has a factor of $\epsilon$. Note, however, that the number of samples has a factor of $1/\lambda^2$ in all three algorithms; therefore, in settings where there are few samples, one needs larger $\lambda$ to guarantee convergence.
Because MDRR requires fewer samples per round than RR and DRR in those settings, it can guarantee convergence with smaller values of $\lambda$.
The number of retrainings is similar to DRR and significantly lower than for RR.

In general, we see that MDRR performs particularly well in settings where the environment in a given round responds strongly to the previous environment, a scenario that is likely common in practice.
