\section{Experimental Evaluation}
\label{app:eval}

In this appendix, we provide additional details on the datasets and baselines presented in the main text.
In addition, the supplementary material includes an archive containing a small Python library and Jupyter notebooks that reproduce the experiments of Sections~\ref{sec:eval} and~\ref{app:convergence} from raw data.

\subsection{Datasets}

We provide additional information on the datasets studied in Section~\ref{sec:eval} of the main text.
Summary statistics, including the number of states $N$, the number of admissible transitions $\lvert \mathcal{E} \rvert$, and the number of sequences $M$, are provided in Table~\ref{tab:realdata}.

\begin{table*}[t]
  \caption{%
Summary statistics of the four datasets.}
  \vspace{1mm}
  \label{tab:realdata}
  \centering
  \input{tab/realdata}
\end{table*}


\paragraph{\textsc{sleep}.} The dataset is studied by \citet{kneib2008bayesian} and is available on Thomas Kneib's webpage.\footnote{%
See: \url{https://www.uni-goettingen.de/de/551628.html}.}
Each sequence captures the sleep patterns of an individual.
The three states represent rapid eye movement (REM) sleep, non-REM sleep, and wakefulness.

\paragraph{\textsc{venticu}.} The dataset is studied by \citet{grundmann2005many} and is available on Richard J. Cook's webpage.\footnotemark[2]
Each sequence represents a patient in an intensive care unit.
The four states are ventilation on, ventilation off, discharge, and death.

\paragraph{\textsc{ebmt}.} The dataset is studied by \citet{fiocco2008reduced} and is available on Richard J. Cook's webpage.\footnotemark[2]
Each sequence captures patient outcomes after blood and marrow transplantation.
The six states represent outcomes such as remission, adverse events, relapse, death, and combinations thereof.

\paragraph{\textsc{customers}.} This dataset is not publicly available at this time.
Each sequence represents a customer and their relationship to a business over time.
The three states represent using the free service, subscribing to the paid service, and not using the service.

%\subsection{Baselines}
%
%We now give additional details on finite mixtures and a variant of the sequence model of \citet{mackay1995hierarchical}, which we denote MBP in the main text.

\subsection{Finite Mixtures of Markov Chains}

In order to train finite mixture models, we follow \citet{cadez2003model}.
We stop the EM algorithm as soon as the log-likelihood increases by less than $0.1 \%$ during one iteration.
In order to select the number of mixture components, we perform a search over $L \in \{2, 3, 5, 10, 20\}$.
We report the results corresponding to the value of $L$ that maximizes the log-likelihood on the hold-out set.
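
As a sketch, the stopping and selection rules can be written out as follows; the notation is introduced here for illustration only and is an assumption, not taken from the main text.
Let $\ell^{(t)}$ denote the training log-likelihood after EM iteration $t$, and let $\ell_{\mathrm{ho}}(L)$ denote the hold-out log-likelihood of the fitted mixture with $L$ components.
Reading the $0.1 \%$ threshold as a relative increase (one plausible interpretation), EM stops at the first iteration $t$ such that
\begin{align*}
\ell^{(t)} - \ell^{(t-1)} < 10^{-3} \, \lvert \ell^{(t-1)} \rvert.
\end{align*}
Selecting the number of components then amounts to minimizing the negative hold-out log-likelihood over the search grid,
\begin{align*}
\hat{L} \in \operatorname*{arg\,min}_{L \in \{2, 3, 5, 10, 20\}} \bigl[ -\ell_{\mathrm{ho}}(L) \bigr].
\end{align*}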

\subsection{Computational Setup}

Our experiments are run on a Google Cloud \texttt{n1-standard-32} instance with 32 vCPUs and 120 GB of RAM.
Our code relies on the following versions of popular Python packages:
\begin{itemize}
\item \texttt{jax==0.2.13}
\item \texttt{jaxlib==0.1.67}
\item \texttt{numpy==1.19.5}
\item \texttt{scipy==1.6.3}
\end{itemize}

%\subsubsection{\citeauthor{mackay1995hierarchical}}

%\citet{mackay1995hierarchical} propose a sequence model based on a DTMC with a Dirichlet prior, with applications to language modeling.
%The differences with the discrete-time model we introduce in our paper are twofold.
%\begin{enumerate}
%\item \label{itm:mbp1} We use a conjugate distribution as a \emph{mixture} distribution that captures the heterogeneity across sequences.
%\citeauthor{mackay1995hierarchical}, instead, use it as Bayesian prior for a single Markov chain model explaining all sequences (sentences) in the data.

%\item \label{itm:mbp2} We use a generalized Dirichlet distribution instead of a (standard) Dirichlet distribution.
%Furthermore, in \citeauthor{mackay1995hierarchical}'s model, the parameters of the Dirichlet distribution are shared across all rows.
%\end{enumerate}
%Disregarding point~\ref{itm:mbp1}, a reasonable question to ask is: Do the modeling choices in point~\ref{itm:mbp2} also lead to a an effective mixture model?
%We investigate this by instantiating variants of our infinite mixture models.
%Specifically, we change the mixture distributions to
%\begin{align*}
%p(\bm{\Theta} \mid \bm{\eta}) = \prod_i \mathrm{Dir}(\bm{\theta}_i \mid \bm{\eta})
%\end{align*}
%in the discrete-time case, and
%\begin{align*}
%p(\bm{\Lambda} \mid \bm{\alpha}, \beta) = \prod_{i \ne j} \mathrm{Gamma}(\lambda_{ij} \mid \alpha_j, \beta)
%\end{align*}
%in the continuous-time case, with $\bm{\eta}, \bm{\alpha} \in \mathbf{R}^N_{>0}$ and $\beta > 0$.
%Note that these models have $N$ and $N+1$ free parameters, respectively, and are thus more parsimonious than our mixture models.
