\section{Introduction}
\label{sec:intro}
The principal component analysis (PCA) model postulates that $N$ independent realisations of $K$-dimensional data ${\bf x}_i \in \mathbb{R}^K$ ($i=1,\ldots,N$) are described as
\begin{equation}
\label{eqn:mfa}
	{\bf x}_i = v_{i1} {\bf a}_1 + \cdots + v_{iJ} {\bf a}_J + \bm{\epsilon}_i = \left(\sum_{j=1}^J v_{ij} {\bf a}_j\right) + \bm{\epsilon}_i, \quad \bm{\epsilon} \sim {\rm N}_K ({\bf 0}_K, \sigma^2 {\bf I}_K),
\end{equation}
where $\{{\bf a}_1, \ldots, {\bf a}_J\}$ are the $J (< K)$ latent (unobserved) factor loadings with each factor loading ${\bf a}_j \in \mathbb{R}^K$, and ${\bf v}_j \sim {\rm N}({\bf 0}_N, {\bf I}_N)$ are the factor scores distributed as per the standard normal distribution. It is assumed that the residuals follow an isotropic zero mean normal distribution with the variance-covariance matrix $\sigma^2 {\bf I}_K$. Without loss of generality, we assume that the data has been centered to the mean. We can write this PCA model in matrix notation as follows
\begin{equation}
	{\bf x}_i = {\bf A} {\bf v}_i + \bm{\varepsilon}, \quad {\bf A} \in \mathbb{R}^{K \times J}, \quad {\bf v}_i \in \mathbb{R}^J, \quad \bm{\varepsilon} \sim N_K({\bf 0}, \sigma^2 {\bf I}_K),
\end{equation}
where $i = (1,\ldots,N)$, ${\bf A} = ({\bf a}_1, \ldots, {\bf a}_J)$ and ${\bf V} = ({\bf v}_1, \ldots, {\bf v}_N) \in \mathbb{R}^{J \times N}$. Integrating out the factor scores yields the multivariate Gaussian marginal distribution of the data 
\begin{equation}
\label{eqn:y:marginal}
    {\bf x}_i \sim N_K({\bf 0}_K, \bm{\Sigma}_0), \quad \bm{\Sigma} = {\bf A} {\bf A}^\prime + \sigma^2 {\bf I}_K .
\end{equation}
The probabilistic principal component model as described suffers from identifiability constraints~\citep{AndersonRubin56,LawleyMaxwell71}. A key reason for this is that the latent factors affect the likelihood function only through their outer product ${\bf A} {\bf A}^\prime$, which implies that an estimate of the factors can only be determined up to a rotation. To ensure the model is not over parametarised, the maximum number of latent factors to be estimated cannot exceed
\begin{equation}
    J_\text{MAX} \leq K + \frac{1}{2} \left(1-\sqrt{8k + 1}\right) ,
\end{equation}
see \citep{Beal03} (pp. 108) for details. For example, when $K = 4,5,6$ we have $J_\text{MAX} = 1, 2, 3$, respectively. \cite{TippingBishop99} showed how to interpret standard PCA model in a  probabilistic framework and obtained maximum likelihood estimates of the latent factors and residual variance. 

This manuscript examines estimation of the probabilistic PCA model under the Bayesian minimum message length (MML) inductive inference framework. Although single and multiple factor analysis has been examined within the MML framework by \cite{WallaceFreeman92} and \cite{Wallace98} respectively, this manuscript departs from the earlier work in the following:
\begin{itemize}
    \item We consider the marginal distribution of the data (\ref{eqn:y:marginal}) rather than the model (\ref{eqn:mfa}) analysed by \cite{Wallace98}. 
    %
    \item Using polar decomposition, we write the factor load matrix ${\bf A}$ as a product of an orthogonal matrix and a diagonal matrix representing the direction and length of the loadings, respectively. Unlike the earlier MML approaches, we parameterize the orthogonal matrix via Givens rotations to explicitly capture orthogonality constraints.
    %
    \item We use matrix polar decomposition to develop  a prior distribution for the latent factors ${\bf A}$ that is a product of a matrix variate Cauchy distribution and a uniform distribution over the corresponding Stiefel manifold.
    %
    \item We obtain analytic MML estimates of the parameters and find a  polynomial whose roots yield the MML estimate of the residual variance.
\end{itemize}
\section{Maximum likelihood estimation}
This section summarises the results of \cite{TippingBishop99}. The negative log-likelihood of the data under the probabilistic PCA model is 
\begin{equation}
\ell (\bm{\theta} ) = \frac{N K}{2} \log (2\pi) + \frac{N}{2} \log |\bm{\Sigma} | + \frac{N}{2} {\rm tr} \left(\bm{\Sigma}^{-1} {\bf S}_x\right)
\end{equation}
where ${\bf S}_x = \frac{1}{N}\sum_i {\bf x}_i {\bf x}_i^\prime$ is the sample variance-covariance matrix. We have the observed data ${\bf X}$ and wish to estimate the number of latent factors $J$ and all parameters $\bm{\theta} = \left\{{\bf A},\sigma^2\right\}$. 
Differentiating the negative log-likelihood with respect to the factor loads
\begin{align*}
 \partial \ell(\bm{\theta}) 
&= N {\rm tr} \, {\bf A}^\prime \bm{\Sigma}^{-1}(\partial {\bf A}) - N {\rm tr} \left( {\bf A}^\prime  \bm{\Sigma}^{-1} {\bf S}_x \bm{\Sigma}^{-1} (\partial {\bf A}) \right) 
\end{align*} 
and setting the derivatives to zero we get
\begin{align*}
    {\bf S}_x {\bf \Sigma}^{-1} {\bf A} &= {\bf A}
\end{align*}
Consider the singular value decomposition ${\bf A} = {\bf U} {\bf L} {\bf V}^\prime$, where ${\bf U} \in \mathbb{R}^{K\times J}$, ${\bf L} = {\rm diag}(l_1,\ldots,l_j)$ and ${\bf V} \in \mathbb{R}^{J\times J}$ is an orthogonal matrix. Noting that
    $\bm{\Sigma}^{-1} {\bf A} = {\bf U} {\bf L} ({\bf L}^2 + \sigma^2 {\bf I}_J )^{-1} {\bf V}^\prime$,
we have
\begin{align*}
{\bf S}_x {\bf U}   &= {\bf U}  ({\bf L}^2 + \sigma^2 {\bf I}_J ) \\
{\bf S}_x {\bf u}_j   &=   (l_j^2 + \sigma^2 ) {\bf u}_j, \quad (j=1,\ldots,J),
\end{align*}
which is an example of the eigenvalue problem. That is, ${\bf U}$ is a $(K\times J)$ matrix whose columns are the top
$J$ eigenvectors of the sample covariance matrix ${\bf S}_x$ corresponding to the $J$ largest eigenvalues 
\begin{eqnarray}
    \delta_j = \hat{l}^2_j + \sigma^2, \quad j=1,\ldots, J,
\end{eqnarray}
where $\hat{l}_j = (\delta_j - \sigma^2)^{\frac{1}{2}}$ is the $j$-th largest singular value of ${\bf A}$. Without loss of generality we assume that $\delta_1 > \delta_2 > \ldots > \delta_K > 0$ throughout the manuscript.
This implies that the maximum likelihood estimate is
\begin{eqnarray}
\hat{\bf A}_\text{ML} = {\bf U} (\bm{\Delta} - \sigma^2 {\bf I}_J )^\frac{1}{2} {\bf O}, \quad \bm{\Delta} = {\rm diag}(\delta_1, \ldots, \delta_J)
\end{eqnarray}
where ${\bf O}$ is an arbitrary (orthogonal) rotation matrix and $\bm{\Delta}$ is a diagonal matrix with the $J$-th largest eigenvalues of ${\bf S}_x$. Substituting the maximum likelihood estimate of the factor loads into the negative log-likelihood we have
\begin{align}
\ell (\sigma, \hat{\bf A}_{\text{ML}} ) = \frac{N K}{2} \log (2\pi) + \frac{N}{2} \sum_{i=1}^J \log \delta_j + \frac{N (K-J)}{2} \log \sigma^2 + \frac{N J}{2} + \frac{N}{2 \sigma^2} \sum_{j=J+1}^K \delta_j .
\end{align}
The concentrated negative log-likelihood is minimised by 
\begin{align}
    \hat{\sigma}^2 = \frac{1}{K-J} \sum_{j=J+1}^K \delta_j
\end{align}
which is the sum of the (K-J) smallest eigenvalues of the sample variance-covariance matrix. \cite{TippingBishop99} show that these estimates minimise the negative log-likelihood and discuss other saddle points of the log-likelihood function.
\section{Minimum message length analysis of the PCA model}
Under suitable regularity conditions~\citep{Wallace05}) (pp. 226), the MML87 codelength approximation for data ${\bf x}$ is
\begin{equation}
\label{eqn:mml87:codelength}
	\mathcal{I}_{87}({\bf x}, \bm{\theta}) = \underbrace{-\log \pi(\bm{\theta}) + \frac{1}{2} \log \abs{{\bf J}_{\bm{\theta}}(\bm{\theta})} + \frac{P}{2} \log \kappa_P}_{\rm assertion} + \underbrace{\frac{P}{2} - \log p({\bf x}|\bm{\theta})}_{\rm detail}
\end{equation}
where $\pi_{\bm{\theta}}(\bm{\theta})$ is the prior distribution for the parameters $\bm{\theta}$, $\abs{{\bf J}_{\bm{\theta}}(\bm{\theta})}$ is the determinant of the expected Fisher information matrix, $p({\bf x}|\bm{\theta})$ is the likelihood function of the model and $\kappa_P$ is a quantization constant~\citep{ConwaySloane98,AgrellEriksson98}; for small $P$ we have 
\begin{equation}
\kappa_1 = \frac{1}{12}, \quad \kappa_2 = \frac{5}{36 \sqrt{3}}, \quad \kappa_3 = \frac{19}{192 \times  2^{1/3}},
\end{equation}
while, for large $P$, $\kappa_P$ is well-approximated by~\citep{Wallace05}:
\begin{equation}
\frac{p}{2} (\log \kappa_p + 1) \approx -\frac{p}{2} \log 2\pi + \frac{1}{2} \log p \pi - \gamma,
\end{equation}
where $\gamma \approx 0.5772$ is the Euler--Mascheroni constant. 


\subsection{Orthogonality constraints}
As discussed in Section~\ref{sec:intro}, it is well-known that the PCA model is not identifiable given the data. A key reason for this is that the latent vectors affect the likelihood only through their outer product
$	{\bf A} {\bf A}^\prime = \sum_{j=1}^J {\bf a}_j {\bf a}^\prime_j. $
However, there are infinitely many sets of vectors that could generate the same matrix. To resolve this ambiguity, it is a convention to estimate the factor load vectors to be mutually orthogonal; that is,
\begin{equation}
    {\bf A}^\prime {\bf A} = \bm{\alpha} = {\rm diag}(\alpha^2_1, \ldots, \alpha^2_J), \quad  \alpha_j = ({\bf a}_j^\prime {\bf a}_j)^{\frac{1}{2}}, \quad (j=1,\ldots, J),
\end{equation}
where $\alpha_j$ denote the length of the $j$-th load vector. We enforce orthogonality constraints by parameterizing the matrix ${\bf A}$ in terms of Givens rotations ~\cite{PourzanjaniEtAl21}. Specifically, we can write ${\bf A}$ as 
\begin{align}
    {\bf A} &= \left[ R_{12}(\phi_{1,2}) \cdots R_{1,K}(\phi_{1,K}) R_{2,3}(\phi_{2,3}) \cdots R_{2,K}(\phi_{2,K}) \cdots R_{J,J+1}(\phi_{J,J+1}) \cdots R_{J,K}(\phi_{J,K}) {\bf I}_{K,J} \right] \bm{\alpha} \\
    %
    &= {\bf R} \, \bm{\alpha}
\end{align}
where ${\bf I}_{K,J}$ is the first $J$ columns of a $K \times K$ identity matrix and $R_{i,j}(\phi_{i,j})$ is a $(K \times K)$ rotation matrix that is equal to the identity matrix except for the $(i, i)$ and $(j, j)$ positions which are replaced by $\cos (\phi_{i,j})$, and the $(i, j)$ and $(j, i)$ positions which are replaced by $-\sin (\phi_{i,j})$ and
$\sin (\phi_{i,j})$ respectively. Thus ${\bf R} \in \mathbb{R}^{K \times J}$ and $\bm{\alpha} \in \mathbb{R}^{J \times J}$ denote the orientations and lengths of the factor load vectors, respectively. For example, when $K=J=2$, we have
\begin{equation}
    {\bf A} =
\left(
\begin{array}{cc}
 \cos \left(\phi_{1,2}\right) & -\sin \left(\phi_{1,2}\right) \\
 \sin \left(\phi_{1,2}\right) & \cos \left(\phi_{1,2}\right) \\
\end{array}
\right) 
\left(
\begin{array}{cc}
 1 & 0 \\
 0 & 1 \\
\end{array}
\right)
\left(
\begin{array}{c}
 \alpha _1 \\
 \alpha _2 \\
\end{array}
\right) .
\end{equation}
This parametarization explicitly takes into account that the estimated factor loads are pairwise orthogonal. The model parameters are now
\begin{itemize}
    \item the lengths of the latent factors $\bm{\alpha}=(\alpha_1, \ldots, \alpha_J) \in \mathbb{R}_+^J$,
    \item the orientation of the factor load vectors as captured by the $D = J K- J(J+1)/2$ angles 
\begin{equation*}
    \bm{\phi} = (\phi_{1,2},\ldots,\phi_{1,K}, \phi_{2,3}, \ldots, \phi_{2,K},\ldots,\phi_{J,J+1},\ldots,\phi_{J,K})
\end{equation*}
\item and the residual variance $\sigma^2 > 0$.
\end{itemize}

\subsection{Fisher information}
Following lengthy and tedious algebra, the expected Fisher information matrix is seen to be block diagonal with determinant
\begin{align}
| {\bf J}(\bm{\alpha},\sigma,\bm{\phi}) | &= N^{P}
    | {\bf J}(\bm{\alpha},\sigma) | \, | {\bf J}(\bm{\phi}) | \\
| {\bf J}(\bm{\alpha},\sigma) | &= \frac{2^{J+1} (K-J)}{{\sigma^2}} \prod _{j=1}^J \frac{\alpha_j^2}{\left(\alpha_j^2+\sigma ^2\right)^2} \\
| {\bf J}(\bm{\phi}) |
&=  |J_{{\bf A} \to \bm{\phi}}|^2 \left(\prod _{i=1}^J \left(\frac{\alpha_i^4}{\sigma ^2}\right){}^{K-J} \frac{1}{\left(\alpha_i^2+\sigma ^2\right){}^{K-1}}\right) \prod _{j<k} \left(\alpha_j^2-\alpha_k^2\right)^2 
\end{align}
where $|J_{{\bf A} \to \bm{\phi}}|$ is the transformation of measure under the Givens representation~\citep{PourzanjaniEtAl21}
\begin{equation*}
    |J_{{\bf A} \to \bm{\phi}}| = \prod_{i=1}^J \prod_{j=i+1}^K (\cos \phi_{i,j})^{j-i-1} ,
\end{equation*}
and $P = (D + J + 1)$ is the total number of free parameters. Combining all the terms we have
\begin{align}
    | {\bf J}(\bm{\alpha},\sigma,\bm{\phi}) | &=  N^{P} \frac{\left(2^{J+1} (K-J)\right)}{{\sigma^2}} |J_{{\bf A} \to \bm{\phi}}|^2 \left(\prod _{i=1}^J  \frac{\alpha_i^2}{\left(\alpha_i^2+\sigma ^2\right)^2} \left(\frac{\alpha_i^4}{\sigma ^2}\right){}^{K-J} \frac{1}{\left(\alpha_i^2+\sigma ^2\right){}^{K-1}}\right) \prod _{j<k} \left(\alpha_j^2-\alpha_k^2\right)^2 \nonumber \\
&=  \frac{N^{P} 2^{J+1} (K-J) |J_{{\bf A} \to \bm{\phi}}|^2}{{\sigma^{2(J(K-J)+1)}}} \prod_{i=1}^J  \frac{\alpha_i^{4(K-J)+2}}{\left(\alpha_i^2+\sigma ^2\right)^{K+1}} \prod _{j<k}^J \left(\alpha_j^2-\alpha_k^2\right)^2
\end{align}

\subsection{Prior information}
\label{sec:mfa:priors}
The prior distribution for the residual standard deviation $\bm{\Sigma}^{\frac{1}{2}}$ is chosen to be the scale-invariant density 
\begin{equation}
\label{eqn:prior:sigma}
	\pi_\sigma(\sigma) \propto \sigma^{-1},
\end{equation}
defined over some suitable range.

The prior distribution for the matrix of factor loads ${\bf A} \in \mathbb{R}^{K \times J}$ is not immediately obvious as the estimates of the factor loads are enforced to be mutually orthogonal. Ideally, we would like a prior distribution that is uniform over the direction of the $J$ factors, while the distribution of the lengths of these vectors should be heavy tailed to allow for a wide range of lengths. We follow a similar approach to \cite{Wallace98} and assume a prior distribution over the unknown latent vectors that is then transformed to account for the estimated factors being mutally orthogonal. Further, as in~\cite{Wallace98}, we shall consider a prior distribution for the scaled factors
\begin{align*}
{\bf b}_j = \left( \frac{{\bf a}_j}{\sigma} \right), \quad \beta_j = ({\bf b}_j^\prime {\bf b}_j)^{\frac{1}{2}}, \quad (j=1,\ldots,J)
\end{align*}
where the residual variance is used as a default scale. Let $\tilde{\bf B} \in \mathbb{R}^{K \times J}$ denote the matrix containing the $J$ true (unknown) scaled factors. We assume $\tilde{\bf B}$ to follow a matrix variate Cauchy distribution~\citep{BandekarDaya03} with probability density function
\begin{equation}
	\pi_{\tilde{A}}(\tilde{\bf B}) = \frac{\Gamma_K((K+J)/2)}{\pi^{K J / 2} \Gamma_K(K/2)} {\rm det}({\bf I}_K + \tilde{\bf B} \tilde{\bf B}^\prime)^{-(K+J)/2}.
\end{equation}
This is a reasonable choice as the matrix variate Cauchy is spherically symmetric and has appropriately heavy tails. Further, our choice of the prior distribution implies that $\tilde{\bf B}^\prime \in \mathbb{R}^{J \times K}$ follows a matrix variate Cauchy distribution with density
\begin{equation}
	\pi_{\tilde{B}^\prime}(\tilde{\bf B}^\prime) = \frac{\Gamma_J((K+J)/2)}{\pi^{K J / 2} \Gamma_J(J/2)} {\rm det}({\bf I}_J + \tilde{\bf B}^\prime \tilde{\bf B})^{-(K+J)/2} .
\end{equation}
Consider the unique matrix polar decomposition
\begin{equation}
	\tilde{\bf B}^\prime = {\bf W}_B^{\frac{1}{2}} \, {\bf H}_B, \quad {\bf W}_B = \tilde{\bf B}^\prime \tilde{\bf B}, \quad {\bf H}_B = (\tilde{\bf B}^\prime \tilde{\bf B})^{-\frac{1}{2}} \tilde{\bf B}^\prime
\end{equation}
where ${\bf H}_B$ is defined over the Stiefel manifold $\mathcal{V}_J (\mathbb{R}^K )$ and ${\bf W}_B$ is a symmetric positive definite matrix. We may think of the matrix ${\bf H}_B$ as the orientation matrix, while the matrix ${\bf W}_B$ determines the squared lengths of the true scaled latent vectors. If $\tilde{\bf B}^\prime$ follows a matrix variate Cauchy distribution, it is known that ${\bf H}_B$ is distributed uniformly over the Stiefel manifold with density function~\citep{BandekarDaya03}:
\begin{equation}
	\pi_H({\bf H}_B) = \frac{1}{{\rm Vol}( \mathcal{V}_{J} (\mathbb{R}^K ) )}, \quad {\rm Vol}(\mathcal{V}_J (\mathbb{R}^K  )  ) = \frac{ 2^J \pi^{K J / 2} } { \Gamma_J (K/2)} ,
\end{equation}
where $\Gamma_p( y )$ is the multivariate Gamma function
\begin{align*}
    \Gamma_J(y) = \pi^{J(J-1)/4} \prod_{j=1}^J \Gamma(y + (1-j)/2) .
\end{align*}
Further, the random variable ${\bf W}_B$ representing the squared lengths of the true scaled factors is independent of ${\bf H}_B$ with probability density function~\cite{BandekarDaya03}
\begin{eqnarray}
	\pi_W({\bf W}_B) 
&\propto&  {\rm det}( {\bf W}_B )^{(K - J  - 1)/2}  {\rm det}({\bf I}_K + {\bf W}_B)^{-(K+J)/2}
\end{eqnarray}
which is a matrix variate beta type II distribution ${\bf W}_B \sim B_J^{II}(K/2, J/2)$ with parameters $(K/2, J/2)$ (see~\cite{GuptaNagar99}, pp. 166, for further details); this is also known as the matrix variate $F$ distribution. Recall that the estimated (scaled) factor load vectors obey
\begin{equation}
	{\bf S}_B = \tilde{\bf B}\tilde{\bf B}^\prime  = \sum_{j=1}^J \tilde{\bm{\beta}}_j \tilde{\bm{\beta}}_j^\prime = \sum_{j=1}^J \bm{\beta}_j \bm{\beta}_j^\prime = {\bf B} {\bf B}^\prime , \quad \bm{\beta}_j^\prime \bm{\beta}_{k \neq j} = 0 . 
\end{equation}
where ${\bf S}_B$ is a $(K \times K)$ symmetric matrix of rank $J$. The distribution of the squared scaled lengths $\beta_j^2$ of the estimated latent vectors can then be taken as the joint distribution of the $J$ eigenvalues of ${\bf S}_B$ which is  (see Appendix~F) 
\begin{align}
	\pi_{\bm{\beta}^2}(\beta_1^2,\ldots,\alpha_J^2) &= \frac{\pi^{J^2/2}}{\Gamma_J(J/2) \mathcal{B}_J(K/2,J/2)}\prod_{j=1}^J \beta_j^{(K-J-1)} (1 + \beta_j^2)^{-(K+J)/2} \prod_{j<k}^J | \beta_j^2 - \beta_k^2| , \nonumber 
\end{align}
where $\mathcal{B}_p(a,b)$ denote the multivariate beta function
\begin{align*}
	\mathcal{B}_J(a,b) = \frac{\Gamma_J(a) \Gamma_J(b)}{\Gamma_J(a+b)}.
\end{align*}
This implies that the prior distribution of the lengths of the scaled latent factors is
\begin{align}
	\pi_{\bm{\beta}}(\beta_1,\ldots,\beta_J) &= 	\frac{2^J \pi^{J^2/2}}{\Gamma_J(J/2) \mathcal{B}_J(K/2,J/2)}\prod_{j=1}^J \beta_j^{(K-J)} (1 + \beta_j^2)^{-(K+J)/2} \prod_{j<k}^J | \beta_j^2 - \beta_k^2| .
\end{align}
Finally, the prior distribution for the lengths of the (unscaled) latent factors is
\begin{align}
	\pi_{\bm{\alpha}}(\alpha_1,\ldots,\alpha_J) &= 	\frac{2^J \pi^{J^2/2} \sigma ^{J^2}}{\Gamma_J(J/2) \mathcal{B}_J(K/2,J/2) }\prod_{j=1}^J \alpha_j^{(K-J)} (\sigma^2 + \alpha_j^2)^{-(K+J)/2} \prod_{j<k}^J | \alpha_j^2 - \alpha_k^2| .
\label{eqn:prior:len:b}
\end{align}
The complete prior distribution over all model parameters is 
\begin{align}
     \pi(\bm{\alpha},\sigma,\bm{\phi}) =  \pi_\sigma(\bm{\Sigma}^{\frac{1}{2}}) \, \pi_{\bm{\alpha}}(\alpha_1,\ldots,\alpha_J)   |J_{{\bf A} \to \bm{\phi}}| \, J! ,
\end{align}
where the term $J!$ is included because the labelling of the latent factors is arbitrary and $|J_{{\bf A} \to \bm{\phi}}|$ is the transformation of measure from the matrix parametrization ${\bf A}$ to the orthogonality-preserving parameterization based on Givens rotations. 


\subsection{Codelength}
Omitting constants, the \cite{WallaceFreeman87} codelength for the probabilistic PCA model is
\begin{align*}
\mathcal{I} &\propto \frac{N}{2} \log |\bm{\Sigma}| + \frac{N}{2} {\rm tr} \left( \bm{\Sigma}^{-1} {\bf S}_x\right) - K J  \log (\sigma ) + \frac{1}{2} \sum_{j=1}^J \log \left[\alpha_j ^{2 (K-J+1)} \left(\alpha_j ^2+\sigma^2\right)^{(J-1)} \right] 
\end{align*}
where ${\bf S}_x = \frac{1}{N}\sum_i {\bf x}_i {\bf x}_i^\prime$ is the sample variance-covariance matrix. To obtain MML estimates, we start with the Langrangian of the factor orientations 
\begin{align*}
\psi({\bf R}) &=
\log |\bm{\Sigma} | + {\rm tr} \left(\bm{\Sigma}^{-1} {\bf S}_x\right) - {\rm tr} {\bf L} ({\bf R}^\prime {\bf R} - I) ,
\end{align*}
where ${\bf L}$ is a $J \times J$ symmetric matrix of Lagrange multipliers. Clearly, minimising $\psi({\bf R})$ is equivalent to minimising the codelength with respect to ${\bf R}$. The first differential of the Lagrangian is
\begin{align*}
 \partial \psi({\bf R})
&= 2 \text{tr} \left[\bm{\alpha} {\bf A}^\prime \left( \bm{\Sigma}^{-1} - \bm{\Sigma}^{-1} {\bf S}_x \bm{\Sigma}^{-1}\right) (d{\bf R})\right] - 2 \text{tr} \left( {\bf L} {\bf R}^\prime (d{\bf R}) \right) ,
\end{align*} 
which implies the following first order conditions
\begin{align}
\bm{\alpha} {\bf A}^\prime \left( \bm{\Sigma}^{-1} - \bm{\Sigma}^{-1} {\bf S}_x \bm{\Sigma}^{-1}\right) &= {\bf 0} \label{eqn:lag:cond1}\\
{\bf L} {\bf R}^\prime &= {\bf 0} \label{eqn:lag:cond2} \\
    {\bf R}^\prime {\bf R} &= {\bf I}_J \label{eqn:lag:cond3}
\end{align}
From (\ref{eqn:lag:cond2}) we have that ${\bf L} = {\bf 0}$ and from (\ref{eqn:lag:cond1}) 
\begin{align*}
     {\bf S}_x {\bf R} &= {\bf R} \, \text{diag} \left( \sigma^2 + \alpha_1^2, \ldots, \sigma^2 + \alpha_J^2\right) \\
     {\bf S}_x {\bf r}_j &= {\bf r}_j (\sigma^2 + \alpha_j^2), \quad (j=1,\ldots,J) .
\end{align*}
We see that, at the codelength minimum, the MML estimate of the factor orientations is the matrix ${\bf R}$ whose columns are the top $J$ eigenvectors of the variance--covariance matrix ${\bf S}_x$ with eigenvalues $\delta_j = (\sigma^2 + \alpha_j^2)$, for $j = 1,\ldots J$. This is identical to the corresponding maximum likelihood estimate. 
The concentrated codelength, as a function of $\sigma^2$ is 
\begin{align}
\mathcal{I}(\sigma)  
&\propto
    \frac{N}{2} \log \left( (\sigma^2)^{K-J} \prod_{j=1}^J (\alpha_j^2 + \sigma^2) \right) + \frac{N}{2 \sigma^2} \left(\sum_{j=1}^K \delta_j\right) - \frac{N}{2 \sigma^2} \sum_{j=1}^J \alpha_j^2 \nonumber \\
 & \quad  - K J  \log (\sigma ) + \frac{1}{2} \sum_{j=1}^J \log \left[\alpha_j ^{2 (K-J+1)} \left(\alpha_j ^2+\sigma^2\right)^{(J-1)} \right]   \nonumber  \\
 %
 &= \frac{N (K-J) - K J}{2} \log \left( \sigma^2 \right) + \frac{N}{2 \sigma^2} \left(\sum_{j=1}^K \delta_j\right) - \frac{N}{2 \sigma^2} \sum_{j=1}^J (\delta_j - \sigma^2) + \frac{(K-J+1)}{2} \sum_{j=1}^J \log \left(\delta_j - \sigma^2 \right) \label{eqn:msglen:conc}
\end{align}
The following theorem shows how to obtain the MML estimate of the residual variance from the concentrated message length.
\begin{thm}
\label{thm:mmlest}
Let $\tau = \sigma^2$. The concentrated codelength (\ref{eqn:msglen:conc}) has $(J+1)$ stationary points whose location are the roots of the $n=(J+1)$-degree polynomial
\begin{equation*}
    a_n \tau^n + a_{n-1} \tau^{n-1} + \cdots + a_1 \tau + a_0, \quad (0 < \tau < \delta_J)
\end{equation*}
with coefficients
\begin{align*}
a_0 &= -  \hat{\tau}_{{\rm ML}} e_J, \\   
a_j &= (-1)^{j+1}  \left[  \hat{\tau}_{{\rm ML}} \, e_{J-j} + \left(1-\frac{(j-1) (J-1)}{N (K-J)}-\frac{K (J-j+1)}{N (K-J)}\right) e_{J-j+1}  \right], \quad (1 \leq j \leq J) \\
    %
a_n &= (-1)^J \left[1 - \frac{J(J-1)}{N(K-J)}\right] \\    
\end{align*}
where the terms $e_t$ refer to elementary symmetric polynomials $e_t(\delta_1,\ldots,\delta_J)$ in $J$ variables $(\delta_1, \ldots, \delta_J)$. For example, for $J=3$, we have the following four elementary symmetric polynomials
\begin{align*}
    e_0(\delta_1,\delta_2,\delta_3) &= 1 \\
    e_1(\delta_1,\delta_2,\delta_3) &= \delta_1 + \delta_2 + \delta_3 \\
    e_2(\delta_1,\delta_2,\delta_3) &= \delta_1 \delta_2 + \delta_1 \delta_3 + \delta_2 \delta_3 \\
    e_3(\delta_1,\delta_2,\delta_3) &= \delta_1 \delta_2 \delta_3 .
\end{align*}
MML estimate of the residual variance is the stationary  point in the domain $0 < \tau < \delta_J$ that yields the shortest codelength. MML estimates of the factor lengths can be obtained from  $\hat{\alpha}_j = (\delta_j - \hat{\sigma}^2)^{\frac{1}{2}}$ for all $j = 1,\ldots,J$. %
\end{thm}

Importantly, all $(J+1)$ elementary symmetric polynomials can be efficiently computed in $O(J \log^2 J)$ time by Fast Fourier Transform (FFT) polynomial multiplication, following an algorithm widely attributed to Ben-Or. 

{\bf Example 1:} For a single latent factor ($J=1$), the stationary points of the concentrated codelength are the roots of the quadratic polynomial in $\tau$:
\begin{align}
- \delta_1 \hat{\tau}_{\text{ML}} + \left(\hat{\tau}_{\text{ML}} + c \delta_1 \right) \tau - \tau ^2=0, \quad c= 1-\frac{K}{N (K-1) },
\end{align}
given by
\begin{align*}
\frac{1}{2}\left( \hat{\tau}_{\text{ML}} + c \, \delta_1 \pm \Delta^{\frac{1}{2}}\right), \quad \Delta = c^2 \delta _1^2+2 (c-2) \delta _1 \hat{\tau}_{\text{ML}}+\hat{\tau} _{\text{ML}}^2 .
\end{align*}
The quadratic polynomial has no real roots if:
\begin{align}
-\frac{c+2 \sqrt{1-c}-2}{c^2}<\frac{\delta_1}{\hat{\tau}_{\text{ML}}}<\frac{-c+2 \sqrt{1-c}+2}{c^2} \label{eqn:mml:1pc}
\end{align}
in which case the MML solution is the uncorrelated multivariate Gaussian model. For example, when $N = 25$ and $K = 4$, the quadratic will have no real roots if
\begin{align*}
    0.219 < \frac{\delta_1}{\delta_2 + \delta_3 + \delta_4} < 0.564 .
\end{align*}
In the limit as $N \to \infty$, we have
\begin{align*}
    \lim_{N \to \infty} \pm \frac{c+2 \sqrt{1-c}-2}{c^2} = 1 .
\end{align*}
so that both the lower and upper bound in (\ref{eqn:mml:1pc}) approach 1.

{\bf Example 2:} For a PCA model with two latent factors ($J=2$), the stationary points of the concentrated codelength are the roots of the cubic polynomial in $\tau$:
\begin{align*}
    - \delta_1 \delta_2 \hat{\tau}_{\text{ML}} +  \left( (\delta_1 + \delta_2) \hat{\tau}_{\text{ML}} + c_1 \delta_1 \delta_2 \right)\tau- \left(\hat{\tau}_{\text{ML}} + \left(c_0 + c_1\right) (\delta_1 + \delta_2)\right)\tau ^2 + (2 c_0+c_1) \tau ^3
\end{align*}
where the constants
\begin{align*}
    c_0 = \frac{K-1}{N(K-2) }, \quad c_1 = 1-\frac{2 K}{N(K-2) } ,
\end{align*}
depend only on $N$ and $K$.


\section{Experiments}
\subsection{Parameter estimation}
\label{sec:exp:pest}
\begin{table*}[tbph]
\scriptsize
\begin{center}
\begin{tabular}{ccccccccccccccc} 
\toprule
$N$ & $K$ & $J$ & \multicolumn{2}{c}{$S_1$} & & \multicolumn{2}{c}{$S_2$} & & \multicolumn{2}{c}{KL Divergence} \\
    &     &     & MLE & MML87 & ~ & MLE & MML87 & ~ & MLE & MML87 \\
\cmidrule{1-11}
\multirow{9}{*}{25} & \multirow{3}{*}{5} & 1 & -0.072 & {\bf -0.023}  &  &  0.011 & {\bf  0.007}  &  &  0.167 & {\bf  0.130} \\ 
 & & 2 & -0.093 & {\bf  0.001}  &  &  0.021 & {\bf  0.009}  &  &  0.279 & {\bf  0.179} \\ 
 & \multirow{3}{*}{8} & 1 & -0.069 & {\bf -0.030}  &  &  0.008 & {\bf  0.004}  &  &  0.244 & {\bf  0.194} \\ 
 & & 2 & -0.120 & {\bf -0.021}  &  &  0.018 & {\bf  0.005}  &  &  0.463 & {\bf  0.285} \\ 
 & & 4 & -0.103 & {\bf  0.038}  &  &  0.020 & {\bf  0.007}  &  &  0.647 & {\bf  0.353} \\ 
 & \multirow{3}{*}{16} & 1 & -0.066 & {\bf -0.034}  &  &  0.006 & {\bf  0.003}  &  &  0.407 & {\bf  0.332} \\ 
 & & 2 & -0.111 & {\bf -0.040}  &  &  0.014 & {\bf  0.003}  &  &  0.818 & {\bf  0.557} \\ 
 & & 4 & -0.207 & {\bf -0.012}  &  &  0.045 & {\bf  0.003}  &  &  1.809 & {\bf  0.764} \\ 
\vspace{-2mm} \\ 
\cmidrule{2-11}
\vspace{-2mm} \\ 
\multirow{9}{*}{50} & \multirow{3}{*}{5} & 1 & -0.035 & {\bf -0.009}  &  &  0.004 & {\bf  0.003}  &  &  0.074 & {\bf  0.065} \\ 
 & & 2 & -0.056 & {\bf  0.006}  &  &  0.008 & {\bf  0.004}  &  &  0.127 & {\bf  0.096} \\ 
 & \multirow{3}{*}{8} & 1 & -0.034 & {\bf -0.012}  &  &  0.003 & {\bf  0.002}  &  &  0.113 & {\bf  0.100} \\ 
 & & 2 & -0.060 & {\bf -0.008}  &  &  0.005 & {\bf  0.002}  &  &  0.208 & {\bf  0.159} \\ 
 & & 4 & -0.069 & {\bf  0.034}  &  &  0.009 & {\bf  0.004}  &  &  0.324 & {\bf  0.209} \\ 
 & \multirow{3}{*}{16} & 1 & -0.033 & {\bf -0.015}  &  &  0.002 & {\bf  0.001}  &  &  0.209 & {\bf  0.186} \\ 
 & & 2 & -0.056 & {\bf -0.018}  &  &  0.004 & {\bf  0.001}  &  &  0.401 & {\bf  0.323} \\ 
 & & 4 & -0.107 & {\bf -0.010}  &  &  0.012 & {\bf  0.001}  &  &  0.785 & {\bf  0.489} \\ 
\vspace{-2mm} \\ 
\cmidrule{2-11}
\vspace{-2mm} \\ 
\multirow{9}{*}{100} & \multirow{3}{*}{5} & 1 & -0.017 & {\bf -0.004}  &  &  0.002 & {\bf  0.001}  &  &  0.033 & {\bf  0.031} \\ 
 & & 2 & -0.031 & {\bf  0.004}  &  &  0.003 & {\bf  0.002}  &  &  0.058 & {\bf  0.049} \\ 
 & \multirow{3}{*}{8} & 1 & -0.016 & {\bf -0.005}  &  &  0.001 & {\bf  0.001}  &  &  0.051 & {\bf  0.048} \\ 
 & & 2 & -0.029 & {\bf -0.002}  &  &  0.002 & {\bf  0.001}  &  &  0.095 & {\bf  0.082} \\ 
 & & 4 & -0.049 & {\bf  0.021}  &  &  0.004 & {\bf  0.002}  &  &  0.159 & {\bf  0.118} \\ 
 & \multirow{3}{*}{16} & 1 & -0.016 & {\bf -0.006}  &  &  0.001 & {\bf  0.000}  &  &  0.100 & {\bf  0.094} \\ 
 & & 2 & -0.027 & {\bf -0.007}  &  &  0.001 & {\bf  0.000}  &  &  0.193 & {\bf  0.171} \\ 
 & & 4 & -0.054 & {\bf -0.004}  &  &  0.003 & {\bf  0.000}  &  &  0.367 & {\bf  0.283} \\ 
\vspace{-3mm} \\ 
\bottomrule
\vspace{+1mm}
\end{tabular}
\caption{Performance metrics for maximum likelihood (MLE) and MML87 estimates of residual variance $\sigma^2$ computed over $10^5$ simulations. \label{tab:results:pest}}
\end{center}
\end{table*}
This section compares the newly derived MML parameter estimates for the probabilistic PCA model to the standard approach based on the maximum likelihood estimator. Since MML and maximum likelihood estimates of the the factor lengths (for a given $\sigma^2$) and factor orientations are identical, the key difference between to two approaches is in the estimation of the residual variance. Our simulation experiments are loosely based on Section~6 in \citep{Wallace98}. We conducted $10^5$ simulations for each combination of the sample size $N \in \{25,50,100\}$, the dimensionality of the data $K \in \{5, 8, 16\}$ and the number of estimated latent factors $J \in \{1,2,4\}$. As both maximum likelihood and MML are scale invariant, the residual variance was set to $\sigma^2 = 1$ and the factors lengths were $\alpha_j = 1$ ($j=1,\ldots,J$) for each simulation run, without loss of generality. The factor directions were randomly sampled from a unit $K$-sphere. 

We used the three performance metrics discussed in \citep{Wallace98} to evaluate the estimators:
\begin{align*}
    S_1 = \log \hat{\sigma}_i , \quad S_2 = (\log \hat{\sigma}_i)^2 ,
\end{align*}
and the Kullback--Leibler (KL) divergence~\citep{KullbackLeibler51} between two multivariate Gaussian distributions
\begin{equation*}
    \text{KL}(\bm{\Sigma}_0, \bm{\Sigma}_1) = \frac{1}{2} \left(\text{tr}\left( \bm{\Sigma}^{-1}_1 \bm{\Sigma}_0\right) + \log \left( \frac{ \bm{|\Sigma}_1|}{|\bm{\Sigma}_0|}\right) - K \right) .
\end{equation*}
The first metric $S_1$ is a measure of bias, while $S_2$ measures error in any direction. Both $S_1$ and $S_2$ are zero for exact estimates as the true residual variance was $\sigma^2 = 1$ in all experiments. The error measures were specifically chosen as they do not depend on the number of estimated latent vectors $J$. Simulation results averaged over $10^5$ iterations are shown in Table~\ref{tab:results:pest}. 

The MML estimate of the residual variance was found to be superior to the usual maximum likelihood estimate for all tested combinations of sample sizes, data dimensionality and the number of latent vectors. Maximum likelihood appeared to underestimate the residual variance more strongly compared to the minimum message length estimate. The differences in the performances of the two estimates were most pronounced when the sample size and data dimensionality was small ($N \leq 50$, $K = 5$). 
\subsection{Model selection}
We have also compared the performance of MML model selection against the highly popular  Bayesian information criterion (BIC) and Laplace's method for approximating the marginal distribution of the data~\citep{Minka00}, referred to as `Bayes' henceforth. Using numerical experiments, \cite{Minka00} demonstrated that approximating Bayesian evidence is superior to methods like cross validation. The simulation setup was identical to Section~\ref{sec:exp:pest} except the sample size  was $N \in \{50,100\}$, the dimensionality of the data $K = 10$ and the number of estimated latent factors $J \in \{1,2,4\}$. Simulation results, averaged over $10^5$ iterations, are shown in Table~\ref{tab:results:msel}. As expected, both MML and the Bayes method have similar performance and both improve significantly over the popular BIC criterion.
\begin{table*}[tbph]
\scriptsize
\begin{center}
\begin{tabular}{cccccccc} 
\toprule
$N$ & $J$ & Method & KL Divergence & & \multicolumn{3}{c}{Model Selection (\%)} \\
    &     &     &        &~ & $<J$ & $=J$ & $>J$ \\
\cmidrule{1-8}
\multirow{9}{*}{50} & \multirow{3}{*}{1} & MML &  0.060   &  &  -- & 99.55 &  0.45\\ 
 & & BIC &  0.063   &  &  -- & 100.00 &  0.00\\ 
 & & Bayes &  0.066   &  &  -- & 97.19 &  2.81\\ 
 & \multirow{3}{*}{2} & MML &  0.126   &  & 70.40 & 28.32 &  1.27\\ 
 & & BIC &  0.145   &  & 96.40 &  3.60 &  0.00\\ 
 & & Bayes &  0.129   &  & 52.52 & 45.17 &  2.31\\ 
 & \multirow{3}{*}{4} & MML &  0.198   &  & 81.00 &  5.78 & 13.22\\ 
 & & BIC &  0.257   &  & 99.99 &  0.01 &  0.00\\ 
 & & Bayes &  0.209   &  & 91.09 &  7.92 &  0.98\\ 
\vspace{-2mm} \\ 
\cmidrule{2-8}
\vspace{-2mm} \\ 
\multirow{9}{*}{100} & \multirow{3}{*}{1} & MML &  0.028   &  &  -- & 99.74 &  0.26\\ 
 & & BIC &  0.029   &  &  -- & 100.00 &  0.00\\ 
 & & Bayes &  0.030   &  &  -- & 97.86 &  2.14\\ 
 & \multirow{3}{*}{2} & MML &  0.062   &  & 30.47 & 68.49 &  1.05\\ 
 & & BIC &  0.094   &  & 74.75 & 25.25 &  0.00\\ 
 & & Bayes &  0.060   &  & 17.96 & 79.84 &  2.20\\ 
 & \multirow{3}{*}{4} & MML &  0.096   &  & 66.32 & 22.70 & 10.98\\ 
 & & BIC &  0.166   &  & 99.48 &  0.52 &  0.00\\ 
 & & Bayes &  0.103   &  & 72.71 & 25.93 &  1.36\\ 
\vspace{-3mm} \\ 
\bottomrule
\vspace{+1mm}
\end{tabular}
\caption{Model selection simulation results for minimum message length (MML), Bayesian information criterion (BIC) and Laplace's method for estimating Bayesian evidence averaged over $10^5$ simulations. In all experiments, data dimensionality was $K=10$. \label{tab:results:msel}}
\end{center}
\end{table*}


