% !TEX root =  main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Experiments}\label{sec.experiments}
\paragraph{Environment} In order to compare the three algorithms in a fair and tractable experimental setup, we use a 
variation of the experimental testbed from \cite{MTR23}, with two agents controlling an actor in a grid-world. In our testbed, agent $A_1$ proposes a control policy for the actor and $A_2$ responds by overriding some of the actions taken by the control policy. Hence, $A_1$'s effective environment is performative.
More information about this experimental setup can be found in Appendix~\ref{appdx.explanation-env}.

To simulate a slow response, $A_2$
plays a weighted combination of its last policy and a softmax of its optimal $Q$-values.
Specifically, the policy of $A_2$ in round $i$ is
\begin{equation}\label{eq.agent2}
\pi_i^2(a | s) = w\cdot \frac{e^{Q_2^{*|\pi_1}(s,a)}}{\sum_{a'\in A}e^{Q_2^{*|\pi_1}(s,a')}} +  (1-w) \cdot \pi_{i-1}^2(a | s).
\end{equation}
Here $Q_2^{*|\pi_1}(s,a)$ are the optimal $Q$-values for $A_2$ given $\pi_1$,
while $w$ describes the responsiveness of the environment towards the deployed policy of $A_1$.
For large $w$, the environment responds strongly to the current policy,
while for small $w$ the environment is less responsive to the current policy.
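
To illustrate the role of $w$, consider the simplified case (an assumption made only for this illustration) in which $A_1$ keeps its policy fixed, so that the softmax term in~\eqref{eq.agent2}, abbreviated here as $\sigma(a | s)$, is the same in every round. Unrolling the recursion from an initial policy $\pi_0^2$ then gives
\[
\pi_i^2(a | s) = \bigl(1-(1-w)^i\bigr)\cdot \sigma(a | s) + (1-w)^i \cdot \pi_{0}^2(a | s),
\]
so the weight on the initial policy decays geometrically at rate $1-w$. For example, after $10$ rounds this weight is $0.5^{10}\approx 0.001$ for $w=0.5$, but $0.85^{10}\approx 0.197$ for $w=0.15$.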

\paragraph{Implementation}
We study the finite-sample setting and sample trajectories instead of drawing single samples from the occupancy measures.
The learner solves the min-max problem~\eqref{eq:repeated-optim-finite} using a follow-the-regularized-leader algorithm described in Appendix~\ref{appdx.ftrl}.
To evaluate the speed at which the algorithms reach a stable occupancy measure, we compare the occupancy measure at each round to the average of the last $10$ occupancy measures, which we denote by~$d_{\on{last}}$.\footnote{Code available at \url{https://github.com/rank-and-files/performative-rl-gradually-shifting-envs}}
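
One possible way to write this measure down, in notation that is assumed here for illustration: let $d_i$ denote the occupancy measure in round $i$ and let $T$ denote the final round. Then
\[
d_{\on{last}} = \frac{1}{10}\sum_{j=T-9}^{T} d_j,
\]
and the quantity tracked in round $i$ is a distance of the form $\| d_i - d_{\on{last}} \|$, where the choice of norm is left open in this sketch.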

\paragraph{Performance}
In Figures~\ref{fig:samples1000-to-last-iteration} and~\ref{fig:samples1000-to-last-iteration-w15} we see that
MDRR converges the fastest to~$d_{\on{last}}$. This holds both for the setting where the environment
changes faster ($w=0.5$, Figure~\ref{fig:samples1000-to-last-iteration})
and for the setting where it changes more slowly ($w=0.15$, Figure~\ref{fig:samples1000-to-last-iteration-w15}).
This is the case even though MDRR uses fewer retrainings than RR.
However, MDRR uses more samples per retraining,
and this seems to lead to better convergence properties
in the settings considered.
MDRR also has lower variance, as indicated by the smaller confidence intervals.
In Appendix~\ref{appdx.large-w}, we additionally study settings with larger values of $w$, where the environment is more dynamic. Here too, MDRR significantly outperforms RR and DRR.

\paragraph{Choice of Hyperparameters}
As Figure~\ref{fig:ks-to-last-iteration} shows, the convergence properties of MDRR are similar across different values of $k$.
Likewise, Figure~\ref{fig:vs-to-last-iteration} shows little difference in the speed of convergence for $v$ in the range from $1.1$ to $1.8$. These results indicate that MDRR is robust to the choice of its hyperparameters.

\paragraph{Compute details}
The experiments in Figure~\ref{fig:main_plots} were conducted on a compute cluster in which each machine has 4 Intel Xeon E7-8857 v2 CPUs and 1.5 TB of RAM. Each experiment took approximately 80 to 100 hours per algorithm to complete.