\section{Related Work}
\label{sec:relwork}

Sequential data is ubiquitous, and unsurprisingly the literature on models and methods for dealing with such data is vast.
Our work addresses applications where the number of states $N$ is finite and typically small with respect to the size of the data, and where accurately modeling the timing of transitions is of particular interest.
Correspondingly, we focus our review on the most relevant subset of the literature.
% This is in contrast with ML literature on sequential models for text, sound, video, etc.

\paragraph{Survival Analysis.}
This field provides the statistical framework for analyzing time-to-event data, i.e., data related to a single transition from one state to another \citep{klein2003survival}.
\citet{wang2019machine} give a recent survey of the field that highlights the connections to machine learning.
A popular non-parametric approach to summarizing time-to-event data is given by the Kaplan-Meier estimator \citep{kaplan1958nonparametric}.
Alternatively, one can postulate a parametric survival distribution and infer the parameters from observed data.
The discrete-time beta-logistic model \citep{heckman1977betalogistic} and the continuous-time Lomax model \citep{lomax1954business} are instances of this approach.
The models we develop in this work can be seen as a generalization of these two distributions to multiple states and arbitrary sequences.
The beta-logistic model was recently revisited by \citet{hubbard2021beta}, who report favorable results when the model is used in conjunction with powerful function approximators.
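To make the mixture structure shared by these two distributions concrete, the following sketch states their survival functions in illustrative notation ($\theta$, $\lambda$, $\alpha$, $\beta$) that is an assumption of this sketch and not necessarily that of the cited works.
In its covariate-free form (the beta-geometric distribution), the beta-logistic model assumes a constant per-period transition probability $\theta$ drawn from a beta distribution with parameters $\alpha, \beta > 0$, which gives
\[
  \Pr(T > t) = \int_0^1 (1 - \theta)^t \, \mathrm{Beta}(\theta \mid \alpha, \beta) \, d\theta = \frac{B(\alpha, \beta + t)}{B(\alpha, \beta)}, \qquad t = 0, 1, 2, \ldots,
\]
where $B(\cdot, \cdot)$ denotes the beta function.
Analogously, the Lomax distribution arises from a constant transition rate $\lambda$ drawn from a gamma distribution with shape $\alpha$ and rate $\beta$:
\[
  \Pr(T > t) = \int_0^\infty e^{-\lambda t} \, \mathrm{Gamma}(\lambda \mid \alpha, \beta) \, d\lambda = \left( 1 + \frac{t}{\beta} \right)^{-\alpha}, \qquad t \ge 0.
\]
In both cases, a memoryless survival model is averaged over a conjugate mixing distribution, which yields a closed-form survival function.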
%\citet{zhong2019survival} survival regression can be cast as a classification problem.
%\citet{fader2019exploring} compares gamma-exponential and beta-geometric and finds that they are for all practical purposes equivalent.
%\citet{chapfuwa2018adversarial} recent paper at ICML that shows that ML researchers work on this.

\paragraph{Multistate Models.}
Some methods developed for survival analysis have been extended to handle transitions between $N > 2$ states \citep{aalen2008survival, putter2007tutorial, cook2018multistate}.
For example, the Aalen-Johansen estimator generalizes the Kaplan-Meier estimator to trajectories over multiple states \citep{aalen1978empirical}.
Most methods discussed in the literature are based on a Markov chain model, i.e., they assume that future transitions depend only on the current state.
Extensions include time-inhomogeneous or semi-Markov variants, where transition rates can also depend on the absolute time or on the time since the last transition occurred.
Fully non-Markovian estimators have recently been proposed \citep{titman2015transition, putter2018nonparametric}, but they are challenging to use in practice, especially in the small-data regime.
Our models are not Markovian---future transitions can depend on the entire history of the process---yet they remain parsimonious, necessitating only twice the number of parameters required to describe a (homogeneous) Markov chain.
%\citet{hougaard1999multistate} an early review on multistate models (318 cites).
%\citet{willekens2014multistate} Book on multistate models with applications in sociology.
%\citet{andersen2002multistate} Good (albeit old) review that is widely cited (560 cites)
%\citet{titman2020general} Test of Markov property in multistate models.

\paragraph{Mixtures of Markov Chains.}
The idea of combining multiple Markov chains into a mixture model in order to capture heterogeneity across or within sequences dates back to the 1950s \citep{blumen1955industrial}.
\citet{frydman1984maximum, frydman2005estimation} studies a two-component \emph{mover-stayer} model and its extension to $L \ge 2$ components, with applications to social and financial processes.
\citet{poulsen1990mixed} and \citet{cadez2003model} use a Markov chain mixture model to cluster customers and users of a website, respectively.
Maximum-likelihood inference relies on the EM algorithm \citep{dempster1977maximum}.
More recently, \citet{gupta2016mixtures} propose an alternative spectral inference algorithm with favorable theoretical properties.
\citet{girolami2003simplicial} present a different type of Markov chain mixture model where components can be interleaved within a sequence.
In contrast to existing work, our approach learns a continuous mixture distribution instead of $L$ discrete components.
We compare our models against finite mixture models in Section~\ref{sec:eval}.

\paragraph{Bayesian Inference for Markov Chains.}
Our models make use of mixture distributions that are conjugate for the likelihood functions of Markov chains.
Some of these relationships are well known and have been used for Bayesian inference of Markov chain parameters, such as in \citet{mackay1995hierarchical} and in \citet[Chapter 23]{barber2012bayesian}.
In that case, the main goal is to account for the epistemic uncertainty over a single set of parameters due to finite data.
Our work is closer in spirit to \citet{wang2018general}, who consider a general framework to transform a classical Bayesian model into a localized one.
In our case, a different set of parameters is associated with each sequence, and the Bayesian prior captures \emph{heterogeneity} across sequences.
To the best of our knowledge, our work is the first to take advantage of conjugate distributions to learn a mixture of Markov chains.

\paragraph{Modeling Customer Relationships.}
Our work is also related to and influenced by the literature on modeling customer retention \citep{fader2009probability}.
\citet{schmittlein1987counting} estimate time-to-churn by means of a (latent) Lomax survival model.
\citet{fader2007how} consider a discrete-time variant and use a beta-logistic model.
Beyond retention, \citet{pfeifer2000modeling, paauwe2007dtmc, schwartz2011children} propose multistate models of customer relationships based on Markov chains.
We apply our models to a customer relationship dataset in Section~\ref{sec:eval}.
%\citet{wang2019deep} ICLR paper that introduces a deep probabilistic model for LTV.
%\citet{ross2018customer} Address freemium application, but don't model multiple states explicitly.
%Look at multiple definitions of retention and how predictive they are of monetization.
