\section{Point mass figures}\label{app:point-mass-figures}

This section presents results obtained on the Point mass environment.
Figure~\ref{fig:pointmass-results} shows rollouts for BeT (hyperparameters as specified in the original paper~\cite{shafiullah2022behavior}) and for the LSTM trunk ablation (hyperparameters tuned manually over the values listed in Table~\ref{tab:hyper_lstm}).
Figure~\ref{fig:pointmass-limitation} illustrates the limitation of BeT discussed in Section~\ref{sec:results_beyond_paper}.

\begin{table}[htb]
\centering
\caption{In contrast to the original work, where the LSTM suffered from training collapse,
we find that a two-layer LSTM trunk is a robust alternative to a transformer decoder.
Hyperparameters for the LSTM model were tuned manually on the Point mass environment
rather than with an automated search algorithm. The best values are in bold.}
\label{tab:hyper_lstm}
\resizebox{0.42\textwidth}{!}{
\begin{tabular}{lc}
Hyperparameter                          & Values tried   \\ \hline
\multicolumn{1}{l|}{Optimizer}          & \textbf{Adam}, AdamW  \\
\multicolumn{1}{l|}{Adam(W) $\beta_2$}  & \textbf{0.95}, 0.999  \\
\multicolumn{1}{l|}{Hidden Width}       & 512, \textbf{1024}    \\
\end{tabular}
}
\end{table}

\begin{figure}[htb]
\includegraphics[width=0.3\textwidth]{figures/bet_gpt_pm2_snapped.png}
\includegraphics[width=0.3\textwidth]{figures/bet_lstm_pm2_snapped.png}
\includegraphics[width=0.3\textwidth]{figures/pointmass_dataset2.png}
\centering
\caption{Point mass 2 rollouts with state snapping for models trained on 20,000 noise-free training samples.
The models are BeT (left) and an LSTM (middle), both trained on the same dataset (right).
BeT hyperparameters are as specified by \citet{shafiullah2022behavior}.
BeT performs well during rollouts, but it often deviates from the trajectories in the dataset.
The LSTM's rollouts are almost perfect, with slight deviations around the last cell.
The different modes are clearly distinguishable in the rollouts generated by both models.
}
\label{fig:pointmass-results}
\end{figure}

\begin{figure}[htb]
\includegraphics[width=0.4\textwidth]{figures/pointmass_dataset1.png}
\includegraphics[width=0.4\textwidth]{figures/pointmass_rollout.png}
\centering
\caption{Left: Point mass 1 dataset without noise.
The demonstrated behaviors contain two modes.
Right: Rollout exhibiting a limitation of BeT in a simple scenario.
BeT fails to capture both modes even though it uses $k=2$ action bin centers, which matches the number of modes.
This happens when $k$-means converges to centers that do not allow the model to distinguish the modes with $k$ bins.
}
\label{fig:pointmass-limitation}
\end{figure}
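
As a minimal sketch of why the failure in Figure~\ref{fig:pointmass-limitation} can occur, assume that the action bins are obtained with standard $k$-means on the dataset actions $a_1,\dots,a_N$ (the symbols $a_i$, $c_j$, $N$, and $J$ are introduced here for illustration only and do not appear elsewhere in this work).
Under this assumption, the bin centers $c_1,\dots,c_k$ are chosen to reduce the within-cluster squared distance
\[
J(c_1,\dots,c_k) = \sum_{i=1}^{N} \min_{1 \le j \le k} \| a_i - c_j \|^2 .
\]
This objective depends only on distances in action space, and the usual iterative procedure is only guaranteed to reach a local minimum of $J$.
The resulting centers therefore need not separate the two behaviors, even when $k$ equals the number of modes.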
