\section{Experimental Evaluation}
\label{sec:eval}

\begin{figure*}[t]
  \centering
  \includegraphics{fig/modelfit}
  \caption{%
Negative log-likelihood of test sequences under various models on four datasets (lower is better).
The three left-most datasets contain continuous-time sequences; the right-most dataset contains discrete-time sequences.
}
  \label{fig:modelfit}
\end{figure*}

In this section, we evaluate the performance of our models empirically on four real-world datasets.
%We outline the experimental procedure in Section~\ref{sec:procedure}.
First, we investigate model fit (Section~\ref{sec:modelfit}) and running time (Section~\ref{sec:runningtime}) on all four datasets.
Then, we focus on two applications and evaluate our models on state prediction tasks (Section~\ref{sec:prediction}).

\paragraph{Datasets.}
The datasets we study contain sequences describing
sleep patterns (\textsc{sleep} \citep{kneib2008bayesian}),
two types of clinical treatments and outcomes (\textsc{venticu} \citep{grundmann2005many}, \textsc{ebmt} \citep{fiocco2008reduced}),
and customers' relationship with the Spotify audio streaming service (\textsc{customers}).
The first three datasets contain continuous-time sequences, whereas the last dataset contains discrete-time sequences.
The number of states $N$ ranges between $2$ and $6$, and, for all but the last dataset, the set of admissible transitions $\mathcal{E}$ is a strict subset of all possible transitions.
The transition graphs of \textsc{ebmt} and \textsc{customers} are illustrated in Figure~\ref{fig:chains}.
A more comprehensive description of each dataset is given in Appendix~\ref{app:eval}.

\paragraph{Experimental Procedure.}
Taking the discrete-time case as an example, we proceed as follows.
We train our models by estimating the parameter matrices $\bm{A}, \bm{B}$ of the generalized Dirichlet mixture distributions.
We do so by minimizing the negative (marginal) log-likelihood~\eqref{eq:dtmixnll} on a training set.
At test time, we make use of the parameters estimated during training to make predictions about each sequence in an independent test set.

\paragraph{Competing Approaches.}
We compare our infinite mixture models against \begin{enuminline}
\item plain Markov chains (denoted by CTMC or DTMC),
\item finite mixture models trained using EM, and
\item variants of RNNLM \citep{mikolov2010recurrent}.
\end{enuminline}
For finite mixtures, we choose the number of components $L$ by cross-validation.
For the RNN baseline, we note that our goal is not to find the optimal architecture but rather to anchor our results against a well-known representative of this class of models.
While discrete-time RNNs are well established, continuous-time variants are still under active research~\citep[see discussion in][]{rubanova2019latent}.
For our purposes, we extend the RNNLM to continuous-time sequences as follows.
At each step, in addition to transition probabilities, we output a transition rate that is a (learned) function of the RNN's hidden state.

\paragraph{Features.}
The \textsc{ebmt} and \textsc{customers} datasets contain, in addition to the sequences themselves, feature vectors that describe characteristics of patients and customers, respectively.
In this case, we can combine sequence models with a regression model.
We do so by replacing the fixed parameters of a sequence model (e.g., the Markov chain transition matrix $\bm{\Theta}$ or the $\mathrm{GDir}$ parameters $\bm{A}, \bm{B}$) with a learned function of the sequence features.
For simplicity, we only consider Markov chains, finite mixtures, and our infinite mixtures in combination with an independent log-linear regression model for each parameter (see Section~\ref{sec:dtmodel}).
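
As an illustration of this construction, a log-linear model for a single positive parameter can be sketched as follows.
The feature vector $\bm{x}$, the weight vector $\bm{w}_{ij}$, the intercept $c_{ij}$ and the entry-wise indexing $a_{ij}$ of $\bm{A}$ are notation assumed here for illustration; the exact parametrization is the one given in Section~\ref{sec:dtmodel}.
\[
  a_{ij}(\bm{x}) = \exp\bigl( c_{ij} + \bm{w}_{ij}^\top \bm{x} \bigr).
\]
Under this sketch, each parameter has its own pair $(c_{ij}, \bm{w}_{ij})$, and the exponential keeps the parameter positive for any value of the features.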

\paragraph{Reproducibility.}
A software library implementing the models and computational notebooks enabling the reproduction of the results presented in this section are provided online.\footnote{%
See: \url{https://github.com/spotify-research/mixmarkov}.}
All datasets except \textsc{customers} are publicly available online.
Links to the datasets and additional details on the experimental procedure are provided in Appendix~\ref{app:eval}.


\subsection{Model Fit}
\label{sec:modelfit}

We start by reporting the average negative log-likelihood of various models on held-out sequences using $10$-fold cross-validation.
The NLL provides a consistent and meaningful goodness-of-fit measure for all datasets, irrespective of the application domain.
It evaluates the models' ability to jointly predict the identity of the next state and the time until the transition occurs; a lower value corresponds to a better model.
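
To make the metric concrete, the quantity computed on one fold can be sketched as follows.
Here, $\mathcal{D}_{\mathrm{test}}$ denotes the held-out sequences of that fold (states and, in continuous time, transition times), and $\hat{\bm{\phi}}$ denotes the parameters fitted on the remaining folds; both symbols, as well as the per-sequence normalization, are assumptions made for illustration.
\[
  \mathrm{NLL} = -\frac{1}{|\mathcal{D}_{\mathrm{test}}|} \sum_{x \in \mathcal{D}_{\mathrm{test}}} \log p(x \mid \hat{\bm{\phi}}).
\]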
%Informally, the NLL quantifies how ``surprising'' a sequence is to a model.
%Informally, a model with lower NLL captures the sequences ``better'', in a probabilistic sense.
%That is, we partition each dataset into ten subsets of equal size, and, for each partition $p = 1, \ldots, 10$, we evaluate the NLL on partition $p$ by using a model trained on the $9$ other partitions.

We present results in Figure~\ref{fig:modelfit}.
Our models, highlighted in dark blue, outperform competing approaches on all datasets.
Plain Markov chains perform poorly, suggesting that, in all the datasets that we consider, the entire past of a sequence is useful to predict its future (we will revisit this observation in Section~\ref{sec:prediction}).
At the other end of the expressivity spectrum, our results also suggest that RNNs underperform other methods, in particular when the dataset is small (\textsc{sleep}) or when sequences are short but the timing of transitions is critical (\textsc{venticu}, \textsc{ebmt}).
Well-tuned finite mixture models perform well, and in some cases they are close to matching the performance of our infinite mixture models (\textsc{venticu}).
%However, we observe slow convergence on the two datasets with features (\textsc{ebmt}, \textsc{customers}).
%We suspect that, when used in conjunction with regression models, finite mixtures trained using EM are particularly prone to get stuck in poor local minima.
%Note that the features help explain part of the heterogeneity observed across sequences;
%This suggests that EM might struggle to separate the heterogeneity that can be explained through the features from the remaining, unexplained heterogeneity that is captured across multiple components.
%Conversely, our infinite mixture models' simpler parametrization appears to couple well with regression models.
%This is consistent with the observations of \citet{hubbard2021beta} for the beta-logistic survival model, a special case of our discrete-time model.

\subsubsection{Visualizing Model Fit on \textsc{customers}}

The \textsc{customers} dataset represents the trajectories of \num{144510} users of the Spotify audio streaming service\footnote{%
See: \url{https://spotify.com}.}
over $N = 3$ states.
Users can use the free version of the service (state \num{1}), subscribe and get unrestricted access to all features (state \num{2}), or stop using the service (state \num{3}).
A transition can occur between any pair of states (see Figure~\ref{fig:chains}, right).
Each sequence starts when the user registers to the service and ends after $T = 20$ steps.

In Figure~\ref{fig:spotistates}, we visualize the fit of a DTMC and an infinite mixture model.
We represent the empirical fraction of paying, free and inactive users over time by using blue, orange and green bars, respectively.
We indicate the predictive state distribution obtained from the mixture model by using solid lines.
Similarly, we use dotted lines to indicate the predictive state distribution obtained from the DTMC.
We observe that the mixture model matches the empirical distribution significantly better than the DTMC.\footnote{%
The finite mixture model is not represented in Figure~\ref{fig:spotistates}, but its fit is also excellent, and almost indistinguishable from that of the infinite mixture model.}
Notice how the number of active users (free and paid) decreases steeply after one time step, but then flattens out rapidly.
This is a concrete example of the Lindy effect.

\begin{figure}[t]
  \centering
  \includegraphics{fig/spotistates}
  \caption{%
Empirical distribution of users over states (blue, orange and green bars) and predicted distributions,
%based on a DTMC (dotted line) and an infinite mixture model (solid line)
as a function of time.}
  \label{fig:spotistates}
\end{figure}

\subsection{Running Time}
\label{sec:runningtime}

Comparing the computational footprint of different models is challenging, as implementation choices can significantly impact results.
However, given that Markov chains, finite mixtures and infinite mixtures share many building blocks, we believe that comparing the relative running time of inference in these three models provides insights that will generalize to implementations beyond ours.
We use the running time of plain Markov chains as a baseline for each dataset.
For finite mixtures of $L$ components, assuming that EM converges in $I$ iterations, parameter inference is dominated by $L \cdot I$ calls to a Markov chain inference subroutine.
For infinite mixture models, inference is similar to a Markov chain in that it consists of solving a single, well-behaved optimization problem that can be outsourced to off-the-shelf software.
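
A back-of-the-envelope version of this argument can be written as follows, where $C_{\mathrm{MC}}$ and $C_{\mathrm{finite}}$ are symbols introduced here for the cost of one call to the Markov chain inference subroutine and the cost of fitting a finite mixture, respectively.
\[
  C_{\mathrm{finite}} \approx L \cdot I \cdot C_{\mathrm{MC}}.
\]
For instance, with the purely hypothetical values $L = 5$ and $I = 30$, fitting a finite mixture would require on the order of $150$ such calls.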

Figure~\ref{fig:runningtime} compares the running time of the two types of mixture models, normalized by the running time of plain Markov chains and aggregated over all the datasets.
We observe that training infinite mixture models takes approximately $27 \times$ less time than training finite mixture models.
Combined with the predictive edge observed in Figure~\ref{fig:modelfit}, we believe that this makes our models a compelling alternative to finite mixtures.
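
The factor quoted above can be checked against the median normalized running times reported in Figure~\ref{fig:runningtime}, taking \num{5.09} as the value for infinite mixtures and \num{139.28} as the value for finite mixtures:
\[
  \frac{139.28}{5.09} \approx 27.4.
\]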

\begin{figure}[t]
  \centering
  \includegraphics{fig/runningtime}
  \caption{%
Median and interquartile range for the running time of infinite and finite mixture models, normalized by the running time of a single Markov chain.
The median normalized running time is \num{5.09} for infinite mixtures and \num{139.28} for finite mixtures.}
  \label{fig:runningtime}
\end{figure}

\subsection{Predictive Tasks}
\label{sec:prediction}

Next, we focus on the \textsc{ebmt} and \textsc{customers} datasets and consider two concrete state prediction tasks.% relevant to practical applications.

\subsubsection{Outcomes in Bone Marrow Transplantations}

The \textsc{ebmt} dataset describes patients undergoing bone marrow transplantation, a standard treatment for acute leukemia.
The dataset contains trajectories tracking clinical outcomes from the moment the transplantation occurs and spanning up to 18 years.
At any time, patients are in one of $N = 6$ states describing the occurrence of adverse events, remission, full recovery, relapse and death.
Most patients only go through two or three state transitions.
In addition to the trajectory itself, patients are also described by a feature vector encoding demographic and treatment information.
The transition graph is depicted in Figure~\ref{fig:chains} (left), and more details on the data can be found in \citet{fiocco2008reduced}.

We restrict our attention to patients followed over at least 5 years and consider the following task.
Given the trajectory of the patient up to day \num{60}, predict the patient's state on day \num{1800}.
Being able to accurately estimate the probability of various future outcomes in a personalized way, by using features and recent history, could help identify and follow at-risk patients.
We train three different models\footnote{%
We also experimented with an RNN, but sampling continuous-time trajectories proved difficult and led to poor results.
} in combination with log-linear regression models and compute the predictive state distribution after $T = 1800$ days on held-out sequences.
For each model, the predictive state distribution is computed in two different ways: by using the state on day \num{60} only, and by using the entire trajectory up to day \num{60}.

We evaluate each prediction by its log-loss with respect to the state actually observed at the end of the five-year horizon, and we present the results in Figure~\ref{fig:prediction} (left).
We observe that using information about the past results in better predictions for models that can take advantage of it, again demonstrating that the Markov assumption is too restrictive.
In addition, we observe that the infinite mixture model provides the most accurate predictive state distribution.
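
For reference, the log-loss of a single prediction can be written as follows.
The notation is introduced here for illustration: $\hat{\bm{\pi}}_T$ is the predictive state distribution for a held-out patient, and $s_T$ is the state actually observed at the horizon.
\[
  \ell(\hat{\bm{\pi}}_T, s_T) = -\log \hat{\pi}_{T, s_T}.
\]
The loss is zero when the model assigns probability one to the observed state and grows without bound as that probability approaches zero.
As a point of comparison, and assuming natural logarithms, a uniform prediction over the $N = 6$ states incurs a loss of $\log 6 \approx 1.79$.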

\begin{figure}[t]
  \centering
  \includegraphics{fig/prediction2}
  \caption{%
Predictive performance on two state prediction tasks.
Predictions are made without and with information about sequences' past (in light and dark green, respectively).}
  \label{fig:prediction}
\end{figure}


\subsubsection{Modeling Customer Relationships}
\label{sec:customers}

We set up a similar task on the \textsc{customers} dataset.
Given the first \num{3} time steps of the sequence, we seek to predict the state at step $T = 20$.
This task is a multistate extension of the popular problem of estimating customer retention \citep{fader2007how, hubbard2021beta}.
Prior work on modeling complex customer relationships has relied on Markov chains \citep{pfeifer2000modeling, schwartz2011children}.

Similarly to the clinical application, we train different sequence models in combination with log-linear regression models, and we compute the predictive state distribution $\bm{\pi}_T$ on held-out sequences.
For each model, we make two predictions: the first one only takes the last observed state $s_3$ into account, whereas the second takes the entire past $(s_1, s_2, s_3)$ into account.
The results are presented in Figure~\ref{fig:prediction} (right).
Our findings mirror those obtained on the \textsc{ebmt} dataset.
Making use of a sequence's past, even if it only consists of three steps, significantly improves the prediction for all models, and our infinite mixture models outperform competing methods on this task.
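
The difference between the two kinds of prediction can be sketched as follows.
The conventions are assumptions made for illustration: $\bm{\Theta}$ is row-stochastic, states are indexed $s_1, \ldots, s_T$, and $\bm{e}_{s}$ is the one-hot column vector of state $s$.
For a fixed transition matrix, the Markov property gives
\[
  \bm{\pi}_T^\top = \bm{e}_{s_3}^\top \bm{\Theta}^{T - 3},
\]
which depends on the observed prefix only through $s_3$.
In a mixture model, $\bm{\Theta}$ is a latent variable, and the prefix informs its posterior distribution:
\[
  \bm{\pi}_T^\top = \int \bm{e}_{s_3}^\top \bm{\Theta}^{T - 3} \, p(\bm{\Theta} \mid s_1, s_2, s_3) \, d\bm{\Theta}.
\]
For a finite mixture, the integral reduces to a weighted sum over the $L$ components.
A prediction that ignores the history would instead average over a distribution that does not condition on $s_1$ and $s_2$.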
