\section{Predictive State Distribution}
\label{app:predict}

In this appendix, we first develop polynomial-time algorithms for computing the exact predictive state distribution in the discrete-time setting.
Then we provide the continuous-time equivalent of Algorithm~\ref{alg:predict} presented in the main text.
We conclude with an empirical study of Algorithm~\ref{alg:predict}.

\subsection{Directed Acyclic Graphs with Self-Loops}

We assume that the transition graph $\mathcal{G} = ([N], \mathcal{E})$ is such that there are no directed cycles, except for self-loops.
Intuitively, this restriction means that once the process leaves a state, it will never return to that state in the future.
This special case is relevant in practice: For example, the \textsc{ebmt} dataset we consider in Section~\ref{sec:eval} of the main text satisfies the assumption.

For conciseness, we consider only the (simple) Dirichlet mixture model introduced in Appendix~\ref{app:dirmix}.
The extension to the generalized Dirichlet distribution is straightforward.
We assume that the parameter matrix $\bm{H} = [\eta_{ij}]$ is such that $\eta_{ij} = 0$ if $(i, j) \notin \mathcal{E}$.
Let $z_i$ be the time at which state $i$ is first reached and let $k_{ii}$ count the number of self-transitions on state $i$.
Then,
\begin{align}
\label{eq:dag1}
\pi^\star_{T,i} = \sum_{t = 0}^T \mathbf{P}[z_i = t] \mathbf{P}[k_{ii} \ge T-t].
\end{align}
Starting with $\mathbf{P}[k_{ii} \ge 0] = 1$ and $\mathbf{P}[z_i = 0] = \pi_{0i}$, we can compute the required quantities recursively as
\begin{align}
\mathbf{P}[k_{ii} \ge t]
    &= \mathbf{P}[k_{ii} \ge t-1] \cdot \frac{\eta_{ii} + t-1}{\sum_\ell \eta_{i\ell} + t-1}, \label{eq:dag2} \\
\begin{split}
\label{eq:dag3}
\mathbf{P}[z_i = t]
    &= \sum_{j \ne i} \sum_{t' = 0}^{t-1} \bigg[ \mathbf{P}[z_j = t'] \cdot \mathbf{P}[k_{jj} \ge t - t' - 1] \\
    &\qquad \cdot \frac{\eta_{ji}}{\sum_\ell \eta_{j\ell} + (t - t' - 1)} \bigg],
\end{split}
\end{align}
for $t = 1, \ldots, T$.
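
As a minimal illustration of~\eqref{eq:dag1} and~\eqref{eq:dag2}, consider a hypothetical chain (chosen for exposition only, not one of the models studied in this paper) with $N = 2$ states, edges $\mathcal{E} = \{(1,1), (1,2), (2,2)\}$, parameters $\eta_{11} = \eta_{12} = \eta_{22} = 1$, initial distribution $\bm{\pi}_0 = [1, 0]$, and horizon $T = 2$.
The recursion~\eqref{eq:dag2} gives
\begin{align*}
\mathbf{P}[k_{11} \ge 1] &= 1 \cdot \frac{1 + 0}{2 + 0} = \frac{1}{2}, \\
\mathbf{P}[k_{11} \ge 2] &= \frac{1}{2} \cdot \frac{1 + 1}{2 + 1} = \frac{1}{3}.
\end{align*}
State $1$ has no incoming edge from another state, so $\mathbf{P}[z_1 = t] = 0$ for $t \ge 1$ and~\eqref{eq:dag1} reduces to $\pi^\star_{2,1} = \mathbf{P}[z_1 = 0] \, \mathbf{P}[k_{11} \ge 2] = 1/3$; by normalization, $\pi^\star_{2,2} = 2/3$.
As a sanity check, let $p_{11}$ denote the (random) self-transition probability of state $1$, a symbol introduced only for this example.
With these parameters, $p_{11}$ is uniformly distributed on $[0, 1]$, and indeed $\mathbf{E}[p_{11}^2] = \int_0^1 p^2 \, dp = 1/3$.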

This explicit decomposition of the predictive state distribution leads to a proof of Proposition~\ref{thm:dagpredict}, which we briefly recall here.
\begin{proposition}
Let $(\bm{A}, \bm{B})$ be any generalized Dirichlet mixture of Markov chains on a graph $\mathcal{G} = ([N], \mathcal{E})$, and let $\bm{\pi}_0$ be an initial state distribution.
If $\mathcal{G}$ has no cycle of length greater than one, then $\bm{\pi}^\star_T$ can be computed exactly in time $O(T^2 N^2)$.
\end{proposition}
\begin{proof}
The predictive state distribution $\bm{\pi}^\star_T$ can be computed exactly using~\eqref{eq:dag1}, \eqref{eq:dag2} and~\eqref{eq:dag3}.
There are $N \cdot T$ distinct quantities to compute for~\eqref{eq:dag2}, each with running time $O(1)$.
Similarly, there are $N \cdot T$ distinct quantities to compute for~\eqref{eq:dag3}, each with running time $O(NT)$.
Finally,~\eqref{eq:dag1} involves $N$ distinct quantities with running time $O(T)$ each.
Adding up the contributions, the total running time is $O(NT \cdot 1 + NT \cdot NT + N \cdot T) = O(T^2N^2)$.
\end{proof}

\subsection{Exact Algorithm for the General Case}

Given a discrete-time mixture model, an initial distribution $\bm{\pi}_0$ and a time horizon $T$, we seek to predict the marginal state distribution after $T$ steps, $\bm{\pi}^\star_T$.
A naive solution consists of enumerating all paths of length $T$, with running time exponential in $T$.
We now introduce an alternative procedure that computes $\bm{\pi}^\star_T$ exactly with running time polynomial in $T$.

Let $\bm{K}_t \in \mathbf{N}^{N \times N}$ be a matrix counting the number of times each transition has occurred up to time $t$.
We write
\begin{align*}
\pi^\star_{T,i} = \sum_{\bm{K}} \mathbf{P}[s_T = i, \bm{K}_T = \bm{K}],
\end{align*}
where $\bm{K}$ ranges over all matrices with nonnegative integer entries that sum up to $T$.
Starting from
\begin{align*}
\mathbf{P}[s_0 = i, \bm{K}_0 = \bm{0}_{N \times N} ] = \pi_{0i},
\end{align*}
we can recursively compute
\begin{align*}
&\mathbf{P}[s_t = i, \bm{K}_t = \bm{K}] = \sum_{j} \mathbf{P}[s_t = i, s_{t-1} = j, \bm{K}_t = \bm{K}] \\
&\qquad = \sum_{j} \mathbf{P}[s_t = i, s_{t-1} = j, \bm{K}_{t-1} = \bm{K} - \bm{\Delta}^{ji}] \\
&\qquad = \sum_{j} \Big( \mathbf{P}[s_t = i \mid s_{t-1} = j, \bm{K}_{t-1} = \bm{K} - \bm{\Delta}^{ji}] \\
&\qquad \qquad\qquad \cdot \mathbf{P}[s_{t-1} = j, \bm{K}_{t-1} = \bm{K} - \bm{\Delta}^{ji}] \Big)
\end{align*}
for $t = 1, \ldots, T$,
where $\bm{\Delta}^{ij}$ is the $N \times N$ indicator matrix whose entry $(i,j)$ is $1$ and all other entries are $0$, and where
\begin{align*}
&\mathbf{P}[s_t = i \mid s_{t-1} = j, \bm{K}_{t-1} = \bm{K}] \\
&\qquad = \left( \frac{\alpha_{ji} + k_{ji}}{\alpha_{ji} + \beta_{ji} + \sum_{o \ge i} k_{jo}} \right)^{\mathbf{1}_{\{i \ne N\}}} \\
&\qquad\qquad\qquad \cdot \prod_{\ell = 1}^{i - 1} \frac{\beta_{j\ell} + \sum_{o > \ell} k_{jo}}{\alpha_{j\ell} + \beta_{j\ell} + \sum_{o \ge \ell} k_{jo}}.
\end{align*}
In the case of the standard Dirichlet distribution (see Appendix~\ref{app:dirmix}), the transition probability simplifies to
\begin{align*}
\mathbf{P}[s_t = i \mid s_{t-1} = j, \bm{K}_{t-1} = \bm{K}]
    &= \frac{\eta_{ji} + k_{ji}}{\sum_\ell (\eta_{j\ell} + k_{j\ell})}.
\end{align*}
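
To see where this simplification comes from, one can substitute the parameters under which the generalized Dirichlet distribution reduces to a standard one.
The following sketch assumes the usual correspondence $\alpha_{j\ell} = \eta_{j\ell}$ and $\beta_{j\ell} = \sum_{o > \ell} \eta_{jo}$; the authoritative convention is that of Appendix~\ref{app:dirmix}.
Writing $S_\ell = \sum_{o \ge \ell} (\eta_{jo} + k_{jo})$, a shorthand introduced only for this illustration, each factor of the product becomes $S_{\ell+1} / S_\ell$, and for $i \ne N$ we obtain
\begin{align*}
\frac{\eta_{ji} + k_{ji}}{S_i} \cdot \prod_{\ell = 1}^{i-1} \frac{S_{\ell+1}}{S_\ell}
    = \frac{\eta_{ji} + k_{ji}}{S_i} \cdot \frac{S_i}{S_1}
    = \frac{\eta_{ji} + k_{ji}}{\sum_\ell (\eta_{j\ell} + k_{j\ell})}.
\end{align*}
The case $i = N$ is analogous: the product telescopes to $S_N / S_1$, and $S_N = \eta_{jN} + k_{jN}$.
For instance, with the hypothetical values $[\eta_{j1}, \eta_{j2}, \eta_{j3}] = [1, 2, 1]$ and $[k_{j1}, k_{j2}, k_{j3}] = [0, 3, 1]$, the next transition out of state $j$ goes to state $2$ with probability $(2 + 3) / (4 + 4) = 5/8$.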

\paragraph{Running-Time Analysis.}
The stars and bars theorem implies that $\bm{K}_t$ can take $\binom{t + N^2 - 1}{N^2 - 1}$ different values \citep{feller1968introduction}.
Thus, the total number of subproblems we need to solve is given by
\begin{align*}
\sum_{t=0}^{T} \binom{t + N^2 - 1}{N^2 - 1} = \binom{T + N^2}{N^2} = O \big( T^{N^2} \big),
\end{align*}
where the first equality follows from the hockey-stick identity \citep{jones1996generalized}, a special case of the Vandermonde identity.
Each subproblem involves a sum over $N$ terms, leading to an overall running time $O(NT^{N^2})$.
In the case where the admissible transitions are restricted to the graph $\mathcal{G} = ([N], \mathcal{E})$, a similar development shows that the running time reduces to $O(d_{\text{avg}} T^{\lvert \mathcal{E} \rvert})$, where $d_{\text{avg}}$ is the average node degree.
Even though this procedure is more efficient than enumerating all paths of length $T$, it remains impractical for all but the smallest problems.
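
To give a sense of scale, consider a hypothetical model with $N = 2$ states (values chosen for illustration only), so that $N^2 = 4$.
The number of subproblems is then
\begin{align*}
\binom{T + 4}{4} = 10\,626 \ \text{ for } T = 20,
\qquad
\binom{T + 4}{4} = 4\,598\,126 \ \text{ for } T = 100,
\end{align*}
whereas the number of paths of length $T$ is $2^{20} \approx 1.0 \cdot 10^{6}$ and $2^{100} \approx 1.3 \cdot 10^{30}$, respectively.
The exponent $N^2$ makes the count grow steeply with the number of states, however: for $N = 3$ and $T = 10$, there are already $\binom{19}{9} = 92\,378$ subproblems.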

\subsection{Convergence of Algorithm~\ref{alg:predict}}

We start by proving Proposition~\ref{thm:predict} in the main text, which we recall here for convenience.
\begin{proposition}
For any $\bm{A}, \bm{B}$, horizon $T$, and initial distribution $\bm{\pi}_0$, let $\hat{\bm{\pi}}_T$ be the output of Algorithm~\ref{alg:predict}.
Then, for any $\varepsilon, \delta > 0$, we have
\begin{align*}
\mathbf{P}[\lVert \hat{\bm{\pi}}_T - \bm{\pi}^\star_T \rVert < \varepsilon] > 1 - \delta,
\end{align*}
as long as $L > \frac{11}{\varepsilon^2} \log \frac{N+1}{\delta}$.
\end{proposition}

\begin{proof}
% Note: Spectral norm is the same as frobenius norm if the matrix has rank one.
% we can upper bound epsilon by 2 (since this is the max distance between prob. vecs)
% Also useful: <https://math.stackexchange.com/questions/1959487>
The result follows from the matrix Bernstein inequality \citep[Thm. 1.6.2]{tropp2015introduction} applied to the random vectors $\{\bm{z}_1, \ldots, \bm{z}_L\}$, where $\bm{z}_\ell = \bm{\pi}_{\ell, T} - \bm{\pi}^\star_T$.
By construction, $\{ \bm{z}_\ell \}$ are jointly independent and $\mathbf{E}[\bm{z}_{\ell}] = 0$ for all $\ell$.
Furthermore, since $\bm{z}_\ell$ is a difference of two probability vectors, $\lVert \bm{z}_\ell \rVert \le 2$ for all $\ell$, and therefore $\max \big\{ \lVert \sum_\ell \mathbf{E}[\bm{z}_\ell^\Tr \bm{z}_\ell] \rVert, \lVert \sum_\ell \mathbf{E}[\bm{z}_\ell \bm{z}_\ell^\Tr] \rVert \big\} \le 4L$.
As a consequence, the matrix Bernstein inequality yields
\begin{align*}
\mathbf{P} \Big[ \Big\lVert \sum_\ell \bm{z}_\ell \Big\rVert \ge L \varepsilon \Big]
    \le (N\!+\!1) \cdot \exp \left( -\frac{L^2 \varepsilon^2 / 2}{4L + 2 L \varepsilon / 3} \right),
\end{align*}
and, since $\sum_\ell \bm{z}_\ell = L (\hat{\bm{\pi}}_T - \bm{\pi}^\star_T)$, bounding the right-hand side by $\delta$ and solving for $L$ yields the result as formulated in the proposition.
\end{proof}
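
One possible way to carry out the final step of the proof and to recover the constant is sketched below.
We may assume $\varepsilon \le 2$, because two probability vectors are at distance at most $2$ and larger values of $\varepsilon$ make the claim trivial.
Then $4L + 2L\varepsilon/3 \le 16L/3$, and the tail bound is at most $\delta$ as soon as
\begin{align*}
(N + 1) \cdot \exp \left( -\frac{3 L \varepsilon^2}{32} \right) \le \delta
\quad \Longleftrightarrow \quad
L \ge \frac{32}{3 \varepsilon^2} \log \frac{N+1}{\delta},
\end{align*}
where $32/3 \approx 10.7 < 11$ and $\log$ is the natural logarithm.
As a purely illustrative example (these values are not taken from the experiments), $N = 5$, $\varepsilon = 0.1$ and $\delta = 0.05$ give the sufficient condition $L > 1100 \cdot \log 120 \approx 5.3 \cdot 10^3$.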

Note that Proposition~\ref{thm:predict} holds for any algorithm that averages independent samples centered around the true state distribution.
However, we intuitively expect that, for a given budget of samples $L$, Algorithm~\ref{alg:predict} returns a better estimate than one obtained by naively sampling entire trajectories.
This is because Algorithm~\ref{alg:predict} first samples from the mixture distribution, and then averages over \emph{all} possible paths, instead of sampling a single path.
We verify this empirically in the next section.
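
One way to make this intuition precise is the law of total variance.
The sketch below assumes that a sample of Algorithm~\ref{alg:predict} is the conditional expectation of a naive sample given the sampled chain, and it writes $\bm{P}$ for a transition matrix drawn from the mixture distribution (a symbol introduced only for this illustration).
For every state $i$,
\begin{align*}
\mathbf{Var} \big[ \mathbf{1}_{\{s_T = i\}} \big]
    &= \mathbf{E} \big[ \mathbf{Var}[ \mathbf{1}_{\{s_T = i\}} \mid \bm{P} ] \big]
     + \mathbf{Var} \big[ \mathbf{P}[s_T = i \mid \bm{P}] \big] \\
    &\ge \mathbf{Var} \big[ \mathbf{P}[s_T = i \mid \bm{P}] \big].
\end{align*}
The left-hand side is the per-coordinate variance of a naive sample, which equals $\pi^\star_{T,i} (1 - \pi^\star_{T,i})$ regardless of how concentrated the mixture distribution is.
The right-hand side is the variance of a sample that averages over all paths of the sampled chain, and it vanishes as the mixture distribution concentrates around a single transition matrix.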

\paragraph{Continuous-Time Algorithm.}
For completeness, we briefly review the continuous-time variant of the sampling procedure introduced in Section~\ref{sec:predict} of the main text.
We present the procedure in Algorithm~\ref{alg:predictct}.
The predictive state distribution of a CTMC sampled from the mixture distribution is computed on line~\ref{line:ctmcpred}.
In practice, the matrix exponential often cannot be computed exactly, but it can be approximated effectively \citep{almohy2010new}.
Most numerical libraries and machine-learning frameworks provide the matrix exponential as a primitive.\footnote{%
For example, \texttt{scipy.linalg.expm} in SciPy and \texttt{tf.linalg.expm} in TensorFlow.}

\begin{algorithm}[h]
  \caption{Predictive state distribution (continuous time).}
  \label{alg:predictct}
  \begin{algorithmic}[1]
    \Require $\bm{A}, \bm{B}$, horizon $T$, init. dist. $\bm{\pi}_0$, num. samples $L$
    \For{$\ell = 1, \ldots, L$}
      \State $\bm{\Lambda} \gets$ sample from $\prod_{i \ne j} \Gamma(\bm{\lambda}_{ij} \mid \bm{\alpha}_{ij}, \bm{\beta}_{ij})$
      \State $\bm{\pi}_{\ell,T} \gets \bm{\pi}_0^\Tr e^{T \bm{\Lambda}}$ \label{line:ctmcpred}
    \EndFor
    \State \Return $\hat{\bm{\pi}}_T = \frac{1}{L} \sum_\ell \bm{\pi}_{\ell,T}$
  \end{algorithmic}
\end{algorithm}
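
As a small worked instance of line~\ref{line:ctmcpred}, consider a hypothetical two-state chain and assume, as is standard for a CTMC generator although not spelled out in Algorithm~\ref{alg:predictct}, that the diagonal of $\bm{\Lambda}$ is set such that every row sums to zero.
For sampled rates $\lambda_{12}$ and $\lambda_{21}$ and $\bm{\pi}_0 = [1, 0]$, the matrix exponential admits a closed form, and
\begin{align*}
\bm{\pi}_{\ell,T}
    = \frac{1}{\lambda_{12} + \lambda_{21}} \Big[
        \lambda_{21} + \lambda_{12} e^{-(\lambda_{12} + \lambda_{21}) T}, \
        \lambda_{12} \big( 1 - e^{-(\lambda_{12} + \lambda_{21}) T} \big)
      \Big].
\end{align*}
For example, $\lambda_{12} = 2$, $\lambda_{21} = 1$ and $T = 1$ give $\bm{\pi}_{\ell,T} \approx [0.367, 0.633]$.
For larger $N$, no such closed form is available in general, and the exponential is evaluated numerically.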


\subsection{Empirical Convergence of Algorithm~\ref{alg:predict}}
\label{app:convergence}

A practical approach to computing the predictive state distribution for \emph{any} model is to sample a small set of trajectories and estimate the distribution empirically by using the samples.
We refer to this as the \emph{naive sampling} scheme.
In this section, we compare Algorithm~\ref{alg:predict} to the naive scheme in terms of the quality of the estimated distribution $\hat{\bm{\pi}}_T$, for a given budget of samples $L$.

We generate a synthetic problem instance as follows.
Setting the number of states to $N = 5$, we sample a matrix $\bm{H} \in [0, \rho]^{N \times N}$ uniformly at random, for $\rho \in \{1, 10, 100\}$.
We interpret this matrix as the parameters of a product of $N$ Dirichlet distributions, a special case of the $\mathrm{GDir}$ distribution (see Appendix~\ref{app:dirmix}).
Informally, the larger $\rho$ is, the more the mixture distribution is concentrated around a single DTMC transition matrix.
We then sample an initial state $i_0$ uniformly at random, set $\bm{\pi}_0 = [\mathbf{1}_{\{i = i_0\}}]$, and estimate the predictive state distribution at horizon $T = 10$.
Even though we report results only on a specific experimental setting, our findings appear to be robust to different choices of $N$, $T$, and $\bm{\pi}_0$.
%We define the ground-truth distribution as that obtained by running Algorithm~\ref{alg:predict} with $10^5$ samples.
We compare the empirical estimate obtained by using $L = 1, \ldots, 10^3$ samples (collected through naive sampling or Algorithm~\ref{alg:predict}) to the ground truth by computing the $\ell_2$-norm of the difference vector.
For each value of $\rho$, we average the performance obtained on $M = 20$ instances and present the results in Figure~\ref{fig:convergence}.

\begin{figure}[h]
  \centering
  \includegraphics{fig/convergence}
  \caption{%
Mean $\pm$ std. (20 instances) of the distance $\lVert \hat{\bm{\pi}}_T - \bm{\pi}^\star_T \rVert$ as a function of $L$ for the naive sampling scheme and Algorithm~\ref{alg:predict}.
%We represent the mean and standard deviation computed over $20$ instances.
For naive sampling, the performance is nearly identical for every value of $\rho$ and we thus draw a single line.}
  \label{fig:convergence}
\end{figure}

In all cases, the $\ell_2$ distance to the ground truth appears to decrease as $1 / \sqrt{L}$.
This is expected: both our proposed approach and naive sampling rely on averaging independent samples centered around $\bm{\pi}^\star_T$.
However, we observe that Algorithm~\ref{alg:predict}, which samples from the mixing distribution but then averages over all paths, results in samples with lower variance.
This, in turn, leads to better estimates for any given sampling budget $L$.
In fact, Algorithm~\ref{alg:predict} requires $10$--$1000 \times$ fewer samples than naive sampling in order to reach a given accuracy.
The gains depend on the shape of the mixture distribution:
for $\rho = 1$ (strongly multimodal mixture distribution) the advantage is relatively modest, whereas for larger values of $\rho$ it becomes substantial.
