\section{Scope of reproducibility}
\label{sec:intro-scope}

\citet{shafiullah2022behavior} train BeT on expert trajectories exhibiting multiple modes in four different environments of varying complexity. 
For each environment, they compute two types of metrics on the rollouts generated by BeT:

\begin{enumerate}
    \item \emph{Performance} metrics, which quantify how well the model solves the tasks during the rollouts, e.g., the number of completed tasks.
    These metrics directly support their first claim (\ref{claim:author-1}).
    \item \emph{Diversity} metrics, which quantify how varied the rollouts are, e.g., the entropy of the completed tasks.
    These metrics directly support their second claim (\ref{claim:author-2}).
\end{enumerate}
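
To make the diversity example concrete, the following is a minimal sketch that assumes the entropy is the Shannon entropy of the empirical distribution over completed tasks; the symbols $T$, $n_i$, and $\hat{p}_i$ are our own illustrative notation and are not taken from the original paper:
\[
    H = -\sum_{i=1}^{T} \hat{p}_i \log \hat{p}_i,
    \qquad
    \hat{p}_i = \frac{n_i}{\sum_{j=1}^{T} n_j},
\]
where $T$ is the number of distinct tasks, $n_i$ is the number of rollouts that complete task $i$, and $0 \log 0$ is taken to be $0$.
Under this assumption, $H = 0$ when every rollout completes the same task, and $H = \log T$ when completions are spread uniformly over all $T$ tasks.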

To assess the claims above, we reproduce the experiments performed in three of the four environments using BeT with the hyperparameters specified by the authors.
We find that, although we cannot reproduce the authors' results for all experiments, BeT is a generally robust method that achieves the stated goals of the paper.

Beyond reproducing the previous work, we assess critical design choices and claims by addressing the following questions:
\begin{enumerate}
\setcounter{enumi}{2}
    \item Are all proposed components of the BeT architecture relevant to \textit{both} the performance and the diversity of its rollouts?\label{claim:ablation}
    \item How sensitive is BeT to critical hyperparameters such as $k$ and $h$?\label{claim:hyperparameters}
    \item Are the design choices made in the evaluation metrics empirically justified?\label{claim:evaluation-metrics}
\end{enumerate}