\section{Introduction}
\label{sec:intro}

%Introductory paragraph:
%- modeling discrete-time or continuous-time sequences over multiple states is important in a number of applications.
%- give examples: users in different product tiers, patients at different stages of an illness, individuals over time, etc.
Understanding, modeling and predicting trajectories over multiple states is of central importance in a wide range of applications.
For example, in a clinical setting, patients go through several different stages from illness to recovery \citep{putter2007tutorial}.
In a business setting, customers' relationship with a company evolves over time.
A customer might start with a free service and later move on to a paid subscription or stop using the service altogether \citep{pfeifer2000modeling}.
%In demographic studies, the sequence and timing of an individual's life events (jobs, children, etc.) contains valuable information about society \citep{willekens2014multistate}.
These processes can be viewed as discrete-time or continuous-time sequences over a discrete state space.

% survival analysis & how we differ from it.
%- often, beyond making pointwise prediction, the goal is to have a probabilistic model of trajectories, predict likely trajectories.
In the simplest case, there are only two states and a single transition:
Every sequence starts in the first state and ends in the second state.
For example, we might be interested in modeling the time a patient takes from admission to a hospital (state $1$) to release (state $2$).
This is the setting of survival analysis, the branch of statistics that studies time-to-event data \citep{wang2019machine}.
In this paper, we address a more general setting, where the number of states can be larger than two and the set of admissible transitions can be arbitrary \citep{cook2018multistate}.
% TODO Make sure we talk about recurrent event, competing risks etc in rel work
We focus on developing models that accurately capture both the sequence of states and the timing of the transitions.
In applications, we use these models to make probabilistic predictions about the future of a sequence given its past.

%\subsection{Existing Approaches}

Markov chains \citep{norris1998markov} are a popular class of models used to analyze multistate sequences.
They come in discrete-time and continuous-time variants, are well understood theoretically and are easy to use in practice.
One of their strengths is that most problems of interest (learning, prediction, etc.) are tractable, either in closed form or through simple recursive algorithms.
However, Markov chains rely on a strong assumption, \emph{memorylessness}, which informally states that future transitions are independent of the past given the present.
In practice, this assumption is often too restrictive and can lead to poor predictions.
For example, Markov chains are unable to capture the \emph{Lindy effect} \citep{goldman1964lindy}, according to which the longer a process has been in a given state, the longer it is expected to remain in that state.
This effect has been empirically verified in a number of real-world applications \citep{mandelbrot1982fractal, taleb2012antifragile}.
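
To make this limitation concrete, consider the following sketch for a discrete-time chain; the symbols $T$ and $p_{ii}$ are notation introduced here for illustration only.
Let $p_{ii}$ denote the probability of remaining in state $i$ at each step, and let $T$ be the number of steps spent in state $i$ during one visit.
Then $T$ is geometrically distributed, and
\[
\Pr(T > s + k \mid T > s) = \frac{p_{ii}^{s+k}}{p_{ii}^{s}} = p_{ii}^{k} = \Pr(T > k)
\]
for all integers $s, k \ge 0$.
The expected remaining time in the state, $\mathrm{E}[T - s \mid T > s] = 1 / (1 - p_{ii})$, therefore does not depend on the time $s$ already spent there, whereas the Lindy effect requires it to grow with $s$.
The same argument applies in continuous time, where the holding times are exponentially distributed.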

A common approach to address the limitations of Markov chains is to consider mixtures thereof \citep{frydman1984maximum, poulsen1990mixed, cadez2003model, girolami2003simplicial, frydman2005estimation, gupta2016mixtures}.
In short, finite mixture models assume that each sequence follows one of $L \ge 2$ distinct Markov chains.
Inference requires explicitly learning the parameters of the $L$ Markov chains and mixture weights associated with each sequence, typically using the EM algorithm \citep{dempster1977maximum}.
This approach provides increased modeling flexibility but does so at the expense of tractability and simplicity.
As we show in Section~\ref{sec:runningtime}, running the EM algorithm to convergence requires two orders of magnitude more resources than fitting a single Markov chain.
The likelihood function is prone to having poor local maxima, thus necessitating multiple runs with different seeds \citep{cadez2003model}.
These difficulties are compounded by the fact that $L$ is usually not known a priori and needs to be chosen and validated empirically.

%, and the goal is to jointly learn the parameters of the Markov chains and assignments of observed sequences to 
%These models instantiate $L \ge 2$ Markov chains and jointly learnby considering mixtures 
%ranging from time-inhomogeneous Markov chains to recurrent neural networks (RNNs) \citep{elman1990finding}.

\subsection{Our Contribution}

In this work, we seek to combine the rich dynamics enabled by mixture models with the convenience and computationally friendly nature of plain Markov chains.
To this end, we develop models of discrete-time and continuous-time sequences based on localized Bayesian Markov chains, following the general construction of \citet{wang2018general}.
Informally, we assume that each sequence follows a latent Markov chain whose matrix of transition rates (in the continuous-time case) or transition probabilities (in the discrete-time case) is sampled from an auxiliary mixing distribution with infinite support (Section~\ref{sec:model}).
We refer to these models as infinite mixtures of Markov chains.
The resulting compound process is more expressive than a Markov chain and can capture a wider range of patterns.
Furthermore, when the mixing distribution is chosen appropriately, the likelihood of a trajectory has a simple closed-form expression, and inference becomes significantly easier than for finite mixtures.
We are also able to derive computationally efficient algorithms for the predictive state distribution (Section~\ref{sec:predict}).
%The likelihood of a trajectory has a simple closed-form expression, and the parameters of the mixing distribution can be effectively learned by maximum-likelihood estimation.
Our method can be understood as a generalization of two well-known parametric models used in survival analysis, the beta-logistic and Lomax distributions \citep{heckman1977betalogistic, lomax1954business}, to arbitrary transitions over multiple states.
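
As a point of reference, the Lomax case can be sketched as follows, assuming a gamma mixing distribution over a single transition rate; the symbols $T$, $\lambda$, $\alpha$ and $\beta$ are notation introduced here for illustration and need not match that of Section~\ref{sec:model}.
Suppose that the time $T$ to the single transition is exponentially distributed with rate $\lambda$, and that $\lambda$ follows a gamma distribution with shape $\alpha > 0$ and rate $\beta > 0$.
Marginalizing over $\lambda$ gives
\[
\Pr(T > t) = \int_0^\infty e^{-\lambda t} \, \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda} \, d\lambda = \left( \frac{\beta}{\beta + t} \right)^{\alpha},
\]
which is the survival function of a Lomax distribution with shape $\alpha$ and scale $\beta$.
For $\alpha > 1$, the expected remaining time is $\mathrm{E}[T - t \mid T > t] = (\beta + t) / (\alpha - 1)$, which increases linearly with the time $t$ already elapsed, in contrast to the constant value $1 / \lambda$ obtained for a fixed rate.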

We evaluate our models empirically on four datasets covering physiological signals, clinical treatment outcomes and customer relationships (Section~\ref{sec:eval}).
When, in addition to the sequences themselves, feature vectors are available, our models can be seamlessly combined with regression models.
We find that, in each of these datasets, the Markov assumption is too restrictive, and information about a sequence's past helps predict its future.
Our models' predictions outperform those of finite mixtures and recurrent neural networks (RNNs), suggesting that the inductive biases of our models are well-suited to these domains.
%Furthermore, our method's running time compares favorably to alternatives.
All in all, we believe that our method will be a valuable addition to the practitioner's toolbox.

\paragraph{A Note on Terminology.}
We call our models \emph{infinite} mixtures of Markov chains to emphasize the fact that the (parametric) mixing distribution has infinite support.
Our models are distinctly different from nonparametric models such as the infinite Gaussian mixture model \citep{rasmussen1999infinite}, the infinite HMM \citep{beal2001infinite}, and the model of~\citet{reubold2017infinite}, which use a Dirichlet process to implicitly capture a variable number of mixture components or latent states.
