

\section{MISSING PROOFS}

We provide proofs for all statements in \S3.

\subsection{CAVI rule}

Lemma~3.1 follows from the coordinate-ascent VI update rule for F-VI \cite[Eq. 17]{Blei:2017}, which tells us how to choose $q(z_n \s \nu_n)$ to minimize the KL-divergence while holding the other factors of the approximating distribution fixed.
Specifically, suppose $\nu_0$ and $\boldsymbol \nu_{-n}$ are fixed.
Then the optimal variational parameter $\nu^\star_n$ for the $n^\text{th}$ factor satisfies
%
\begin{equation}
  q(z_n \s \nu_n^\star) \propto \exp \left \{
    \EE{q(\theta \s \nu_0)}{
      \EE{q(\mbz_{-n} \s \nu)}{
        \log p(\theta, \mbz, \mbx)}} \right \}.
\end{equation}
%
We now apply this rule at the optimal solution, i.e., we set $\nu_0 = \nu_0^*$ and $\boldsymbol \nu_{-n} = \boldsymbol \nu^*_{-n}$.
Since $\boldsymbol \nu^*$ minimizes the KL-divergence, we must have $\nu_n^\star = \nu^*_n$, and the desired result follows.
\qed

\subsection{Existence of an ideal inference function and simple hierarchical models}

Theorem 3.4 states that the existence of an ideal inference function for a standard latent variable model (Definition~3.2) is, in general, equivalent to $p(\theta, \mbz, \mbx)$ being a simple hierarchical model (Eq.~1).

We first prove item (1). Suppose $p(\theta, \mbz, \mbx)$ is a simple hierarchical model.
Applying the CAVI rule (Lemma 3.1) to Eq. 1,
  \begin{align*}
    q(z_n \s \nu^*) & \propto  \exp \left \{ \mathbb E_{q(\theta \s \nu_0^*)} \left [ \mathbb E_{q(\mbz_{-n} \s \nu^*)} \left [ \log p({\theta}) + \sum_{j = 1}^N \left ( \log p(z_j \mid \theta) + \log p(x_j \mid z_j, \theta) \right ) \right ] \right ] \right \} \\
             & \propto  \exp \left \{ \mathbb E_{q(\theta \s \nu_0^*)} \left [ \mathbb E_{q(\mbz_{-n} \s \nu^*)} \left [ \log p(z_n \mid \theta) + \log p(x_n \mid z_n, \theta) \right ] \right ]  \right \}  \\
             & \propto \exp \left \{ \mathbb E_{q(\theta \s \nu_0^*)} \left [\log p(z_n \mid \theta) + \log p(x_n \mid z_n, \theta) \right ] \right \}.
  \end{align*}
%
Then
\begin{equation}  \label{eq:q-simple-hier}
  q(z_n \s \nu^*) = k_{\mbx} (x_n) \exp \left \{ \int_\Theta q(\theta \s \nu^*_0(\mbx)) \left [ \log p(z_n \mid \theta) + \log p(x_n \mid z_n, \theta) \right ] \text d \theta \right \},
\end{equation}
%
where $k_{\mbx} (x_n) = \left [\int_{\mathcal Z} \exp \left \{ \int_\Theta q(\theta \s \nu^*_0(\mbx)) \left [ \log p(z_n \mid \theta) + \log p(x_n \mid z_n, \theta) \right ] \text d \theta \right \} \text d z_n \right]^{-1}$ is a normalizing constant.
The R.H.S of \Cref{eq:q-simple-hier} defines an ideal inference function $f_{\mbx} (x_n)$, in the sense that, given $\mbx$, we have $x_n = x_m \implies f_{\mbx}(x_n) = f_{\mbx}(x_m)$.

Next we prove the converse, which is item (2) of Theorem~3.4.
Applying the CAVI rule to a standard latent variable model,
%
\begin{align}  \label{eq:cavi-zn}
  q(z_n \s \nu^*) & \propto \exp \left \{ \mathbb E_{q(\theta, \mbz_{-n} \s \boldsymbol \nu^*_{-n})} \log p(\theta, \mbz, \mbx) \right \} \nonumber \\
  & \propto \exp \left \{ \mathbb E_{q(\theta, \mbz_{-n} \s \boldsymbol \nu^*_{-n})} \left [ \log p(z_n \mid \mbz_{-n}, \theta) + \log p(x_n \mid z_n, \mbz_{-n}, \theta) + \sum_{i \neq n} \log p(x_i \mid z_n, \mbz_{-n}, \theta) \right ] \right \}.
\end{align}
%
The last line keeps only the terms in which $z_n$ appears; all other terms are constant in $z_n$ and are absorbed into the proportionality constant.
Furthermore, we used the property of \textit{conditional independence} (Definition~3.2 (ii)) to break up the log likelihood $\log p(\mbx \mid \mbz, \theta)$ into a sum.

Suppose now that there exists a graph $\mathcal G$, such that for \textit{any} standard latent variable model supported by this graph, there exists an ideal inference function, that is $\nu^*_n = f_{\mbx} (x_n)$.
Because $q$ is parametric, we have that the R.H.S of \Cref{eq:cavi-zn} is also a (dataset dependent) function of $x_n$.
For this assumption to hold \textit{for any choice of distribution}, any contribution of $x_{i \neq n}$ that is not common to all the variational factors of $q(\mbz)$ must be absorbed into the normalizing constant and effectively vanish.
We will complete the proof by removing unique contributions of $x_i$ and severing offending edges in $\mathcal G$ (Figure~\ref{fig:graphG}).

\begin{figure}
      \centering
      \begin{tikzpicture}
    [
      Empty/.style={circle, draw=white!, fill=green!0, thick, minimum size=1mm},
      Round/.style={circle, draw=black!, fill=green!0, thick, minimum size=10mm},
    ]

    % Simple hierarchical
    \node[Round] (z_1) at (-1, 0){$z_{n - 1}$};
    \node[Round] (z_2) at (3, 0){$z_n$};
    \node[Round] (x_1) at (-1, -1.5){$x_{n - 1}$};
    \node[Round] (x_2) at (3, -1.5){$x_n$};
    \node[Round] (theta) at (1, -3) {$\theta$};

    \path[->, draw] (z_1) -- (x_1);
    \path[->, draw] (z_2) -- (x_2);
    \path[->, draw] (theta) -- (x_1);
    \path[->, draw] (theta) -- (x_2);
    \path[->, draw] (theta) -- (z_1);
    \path[->, draw] (theta) -- (z_2);
    
    \path[->, draw, dashed] (z_1) -- (z_2);
    \path[->, draw, dashed] (z_1) -- (x_2);
    \path[->, draw, dashed] (z_2) -- (x_1);

    \end{tikzpicture}
    \caption{
    \textit{Graphical representation of a standard latent variable model.
    If present, the dashed edges preclude the existence of an ideal inference function $f_{\mbx} (x_n) = \nu^*_n$ and the amortization gap cannot be closed.
    }}
    \label{fig:graphG}
  \end{figure}

The most obvious contribution of $x_i$ appears in the likelihood terms and is removed if and only if we exclude non-local dependence, that is for $i \neq n$, $p(x_i \mid z_n, \mbz_{-n}, \theta) = p(x_i \mid \mbz_{-n}, \theta)$.
%
Doing so for every $n$, we have
%
\begin{equation} \label{eq:nonlocal-independence}
  p(x_i \mid z_n, \mbz_{-n}, \theta) = p(x_i \mid z_i, \theta).
\end{equation}

\begin{remark}
  Here the assumption of \textit{local dependence} (Definition~3.2 (i)) is critical.
  Without it, we cannot exclude the possibility that $x_i$ does not depend on $z_i$, or any $z_j$'s other than $z_n$, and hence that $p(x_i \mid z_n, \mbz_{-n}, \theta) = p(x_i \mid z_n, \theta)$, $i \neq n$.
  Then an edge between $z_n$ and $x_i$ would not contradict the existence of an ideal inference function.
\end{remark}

Next, we have by assumption that $\nu^*_i = f_\mbx(x_i)$.
Then
%
\begin{equation}
  q(z_n \s \nu^*) \propto \exp \left \{ \int_{\Theta, {\bf \mathcal Z}_{-n}} q(\text d \theta \s \nu^*_0 (\mbx)) \prod_{i \neq n} q(\text d z_i \s f_\mbx(x_i)) \left [ \log p(z_n \mid \mbz_{-n}, \theta) + \log p(x_n \mid z_n, \theta) \right ] \right \}.
\end{equation}
%
The offending terms are now the variational factors $q(\text d z_i \s f_\mbx(x_i))$ in the integral.
To remove them, we must get rid of any term that couples $z_n$ and $z_i$, and so $z_n$ must be a priori independent of $z_i$, that is
%
\begin{equation} \label{eq:apriori-independence}
  p(z_n \mid \mbz_{-n}, \theta) = p(z_n \mid \theta).
\end{equation} 
%
A standard latent variable model that satisfies \Cref{eq:nonlocal-independence} and \Cref{eq:apriori-independence} must also satisfy Eq.~1 and is therefore a simple hierarchical model. \qed
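
\begin{remark}
  To make this last step explicit, here is a short sketch of how the two conditions combine, assuming Eq.~1 has the factorized form used in the proof of item (1).
  By the chain rule and conditional independence (Definition~3.2 (ii)),
  \begin{equation*}
    p(\theta, \mbz, \mbx) = p(\theta) \, p(\mbz \mid \theta) \prod_{n = 1}^N p(x_n \mid \mbz, \theta).
  \end{equation*}
  \Cref{eq:nonlocal-independence} reduces each likelihood factor to $p(x_n \mid z_n, \theta)$, and \Cref{eq:apriori-independence}, applied for every $n$, makes the $z_n$'s mutually independent given $\theta$, so that $p(\mbz \mid \theta) = \prod_{n = 1}^N p(z_n \mid \theta)$.
  Hence
  \begin{equation*}
    p(\theta, \mbz, \mbx) = p(\theta) \prod_{n = 1}^N p(z_n \mid \theta) \, p(x_n \mid z_n, \theta).
  \end{equation*}
\end{remark}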

\subsection{Example of a latent variable model, which is not a simple hierarchical model and admits an ideal inference function}

The statement of Theorem~3.4, item (ii) is carefully written for all distributions supported on a graph.
To see why a simple ``if and only if'' version of item (i) is not true, consider a dense hierarchical model, with edges between all elements of $\mbx$ and $\mbz$.
If we choose a likelihood which is symmetric in ${\bf z}$, e.g. $p(x_n \mid \mbz, \theta) = p(x_n \mid \sum_j z_j, \theta)$, then there exists a (constant) ideal inference function and moreover, all factors $q(z_n \s \nu^*_n)$ are identical.

This case is of course trivial: with such a symmetry, the notion of a local latent variable is unjustified.
To our knowledge, all examples of models that are not simple hierarchical models and still admit an ideal inference function rely on similar trivialities.
These however constitute edge cases we must be mindful of when writing formal statements.

\subsection{Analytical results for the linear probabilistic model}

We prove Proposition~3.6, which provides an exact expression for the mean and variance of $q(z_n \s \nu^*)$, the optimal solution returned by F-VI when applied to the linear generative model.
In the model of interest, $\theta$ is a scalar random variable, and we introduce the fixed standard deviations, $\tau \in \mathbb R$ and $\sigma \in \mathbb R$.
The model is
\begin{equation}
    p(\theta) \propto 1; \ \ p(z_n) = \mathcal N(0, 1); \ \ p(x_n \mid z_n, \theta) = \mathcal N(\theta + \tau z_n, \sigma).
\end{equation}
%
Since the posterior distribution $p(\theta, {\bf z} \mid {\bf x})$ is normal, $q(z_n \s \nu^*)$ can be worked out analytically \citep[e.g.,][]{Turner:2011, Margossian:2023}.
Specifically,
%
\begin{equation}
    q(z_n \s \nu_n^*) = \mathcal N \left(\mu_n, \frac{1}{[\Sigma^{-1}]_{nn}} \right),
\end{equation}
%
where $\mu_n$ is the correct posterior mean for $z_n$ and $\Sigma$ is the correct posterior covariance matrix.
Note that F-VI always underestimates the posterior marginal variance unless $\Sigma$ is diagonal \citep[][Theorem 3.1]{Margossian:2023}.
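%
\begin{remark}
  As a minimal illustration of this underestimation (a sketch with a hypothetical two-dimensional covariance and correlation $\rho$, not the model studied here), take
  \begin{equation*}
    \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad
    \Sigma^{-1} = \frac{1}{1 - \rho^2} \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}, \qquad |\rho| < 1.
  \end{equation*}
  Then $1 / [\Sigma^{-1}]_{nn} = 1 - \rho^2 \le 1 = \Sigma_{nn}$, with equality if and only if $\rho = 0$, i.e. when $\Sigma$ is diagonal.
\end{remark}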
%
It remains to find an analytical expression for the posterior distribution.
%
\begin{lemma}
  The marginal posterior distribution is given by
  \begin{equation}
      p(z_n \mid {\bf x}) = \mathcal N \left (\frac{\tau}{\sigma^2 + \tau^2} (x_n - \bar x), s \right),
  \end{equation}
  %
  for some $s$, constant with respect to ${\bf x}$.
\end{lemma}
\begin{proof}
  From Bayes' rule
  \begin{eqnarray}  \label{eq:joint_quadratic}
    \log p({\bf z}, \theta \mid {\bf x}) & = & k - \frac{1}{2} \sum_{n = 1}^N z_n^2 - \frac{1}{2 \sigma^2} \sum_{n = 1}^N (x_n - \theta - \tau z_n)^2 \nonumber \\
    & = & k - \frac{1}{2} \sum_{n = 1}^N z_n^2 - \frac{1}{2 \sigma^2} \sum_{n = 1}^N \left [ \theta^2 + (x_n - \tau z_n)^2 - 2 \theta (x_n - \tau z_n) \right ] \nonumber \\
    & = & k - \frac{1}{2} \sum_{n = 1}^N z_n^2 - \frac{1}{2 \sigma^2} \left ( N \theta^2 + \sum_{n = 1}^N (x_n - \tau z_n)^2 - 2 \theta \sum_{n = 1}^N (x_n - \tau z_n) \right),
  \end{eqnarray}
  %
  where $k$ is a constant with respect to ${\bf z}$ and $\theta$.
  Moving forward, we overload the notation for $k$ to designate any such constant.
  %
  As expected, Eq.~\ref{eq:joint_quadratic} is quadratic in $\theta$ and ${\bf z}$.

  \begin{remark}
  At this point, the proof may take two directions: in one, we work out the precision matrix $\Phi$ (i.e., the inverse of the covariance matrix $\Sigma$) for $p({\bf z}, \theta \mid {\bf x})$ and invert it to obtain the posterior mean for each $z_n$.
  Constructing $\Phi$ is straightforward and necessary to show the variance of $q(z_n \s \nu^*_n)$ is constant with respect to ${\bf x}$.
  However, inverting $\Phi$ requires recursively applying the Sherman-Morrison formula three times, which is algebraically tedious.
  The other direction is to marginalize out $\theta$.
  We can then construct the precision matrix $\Psi$ for $p({\bf z} \mid {\bf x})$, which only requires a single application of the Sherman-Morrison formula to invert.
  We opt for the second direction, noting both options are rather involved.
  \end{remark}
  
  To marginalize out $\theta$, we complete the square and perform a Gaussian integral,
  %
  \begin{eqnarray}
    \log p({\bf z}, \theta \mid {\bf x})& = &
    k - \frac{1}{2} \sum_{n = 1}^N z_n^2 - \frac{N}{2 \sigma^2} \left [ \theta^2 + \frac{1}{N} \sum_{n = 1}^N (x_n - \tau z_n)^2 - \frac{2 \theta}{N} \sum_{n = 1}^N (x_n - \tau z_n) \right . \nonumber \\
    & & + \left . \left (\frac{1}{N} \sum_{n = 1}^N (x_n - \tau z_n) \right)^2 - \left (\frac{1}{N} \sum_{n = 1}^N (x_n - \tau z_n) \right)^2 \right] \nonumber \\
    & = &  k - \frac{1}{2} \sum_{n = 1}^N z_n^2 - \frac{N}{2 \sigma^2} \left [ \left (\theta - \frac{1}{N} \sum_{n = 1}^N (x_n - \tau z_n) \right)^2 + \frac{1}{N} \sum_{n = 1}^N (x_n - \tau z_n)^2 \right . \nonumber \\
    & & - \left . \left (\frac{1}{N} \sum_{n = 1}^N (x_n - \tau z_n) \right)^2 \right ].
  \end{eqnarray}
  %
  Then
  \begin{eqnarray}
    \log p({\bf z} \mid {\bf x}) = k - \frac{1}{2} \sum_{n = 1}^N z_n^2 - \frac{1}{2 \sigma^2} \left [ \sum_{n = 1}^N (x_n - \tau z_n)^2 - \frac{1}{N} \left ( \sum_{n = 1}^N (x_n - \tau z_n) \right)^2 \right].
  \end{eqnarray}
  %
  % At this point, we may recognize the quadratic form in the exponential, expected from a Gaussian distribution.
  % The squared sum term couples the elements of ${\bf z}$, meaning the $z_n$'s are not \textit{a posteriori} independent.
  Expanding the square,
  \begin{equation}
    \left ( \sum_{n = 1}^N (x_n - \tau z_n) \right)^2 = \sum_{n = 1}^N (x_n - \tau z_n)^2 + 2 \sum_{j < n} (x_n - \tau z_n) (x_j - \tau z_j).
  \end{equation}
  %
  Plugging this in and factoring out $\tau$, we get
  \begin{equation}
      \log p({\bf z} \mid {\bf x}) = k - \frac{1}{2} \sum_{n = 1}^N z_n^2 - \frac{\tau^2}{2 \sigma^2} \left [ \sum_{n = 1}^N \left (1 - \frac{1}{N} \right) \left (\frac{x_n}{\tau} - z_n \right)^2 - \frac{2}{N} \sum_{j < n} \left( \frac{x_n}{\tau} - z_n \right) \left (\frac{x_j}{\tau} - z_j \right) \right ].
  \end{equation}
  %
  Now the standard expression for a multivariate Gaussian is
  %
  \begin{equation}
    \log p({\bf z} \mid {\bf x}) = k - \frac{1}{2} ({\bf z} - \boldsymbol \mu)^T \Psi ({\bf z} - \boldsymbol \mu) = k -\frac{1}{2} \left (\sum_{n = 1}^N \Psi_{nn} (z_n - \mu_n)^2 + 2 \sum_{j < n} \Psi_{jn} (z_n - \mu_n)(z_j - \mu_j) \right),
  \end{equation}
  %
  where $\boldsymbol \mu$ is the mean and $\Psi$ the precision matrix.
  We solve for the mean and precision matrix by matching the coefficients in the above two expressions for $z_n$, $z_n z_j$, and $z_n^2$, which respectively produce the following equations:
  %
  \begin{align}
    \sum_{j = 1}^N \Psi_{nj} \mu_j & = \frac{\tau}{\sigma^2} (x_n - \bar x) \label{eq:posterior_mean} \\
    \Psi_{nj} & = - \frac{\tau^2}{N \sigma^2}, \ \ \forall n \neq j  \\
    \Psi_{nn} & = 1 + \frac{\tau^2}{\sigma^2} \left (1 - \frac{1}{N} \right).
  \end{align}
  %
  This immediately gives us the precision matrix.
  Eq.~\ref{eq:posterior_mean} may be rewritten in matrix form as
  %
  \begin{equation}
      \boldsymbol \mu = \frac{\tau}{\sigma^2} \Psi^{-1} [{\bf x} - \bar x {\bf 1}],
  \end{equation}
  %
  where ${\bf 1}$ is the $N$-vector of 1's.
  %
  Let $\alpha = \Psi_{nj}$, for any $n \neq j$, and $\beta = \Psi_{nn} - \alpha$.
  Then
  \begin{equation}
      \Psi = \beta I + \alpha {\bf 1 1}^T.
  \end{equation}
  %
  Applying the Sherman-Morrison formula, we obtain the covariance matrix,
  \begin{eqnarray}
      \Psi^{-1} & = & (\beta I + \alpha {\bf 1 1}^T)^{-1} \nonumber \\
        & = & \beta^{-1} I - \frac{\beta^{-1} I \alpha {\bf 1 1}^T \beta^{-1} I}{1 + \alpha {\bf 1}^T \beta^{-1} I {\bf 1}} \nonumber \\
        & = & \beta^{-1} I - \frac{\alpha \beta^{-1}}{\beta + N \alpha} {\bf 11}^T.
  \end{eqnarray}
  %
  Notice that $\Psi^{-1}$ does not depend on ${\bf x}$ and that its diagonal elements are all equal.
  Moreover $(\Psi^{-1})_{nn}$ gives us the constant, $s$.
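  %
  \begin{remark}
  As a worked check on these expressions (a sketch that follows directly from the definitions of $\alpha$ and $\beta$ above), note that
  \begin{equation*}
    \beta = \Psi_{nn} - \alpha = 1 + \frac{\tau^2}{\sigma^2}, \qquad
    \beta + N \alpha = 1 + \frac{\tau^2}{\sigma^2} - \frac{\tau^2}{\sigma^2} = 1,
  \end{equation*}
  so that $\Psi^{-1} = \beta^{-1} (I - \alpha {\bf 1 1}^T)$ and
  \begin{equation*}
    s = (\Psi^{-1})_{nn} = \frac{\sigma^2}{\sigma^2 + \tau^2} \left (1 + \frac{\tau^2}{N \sigma^2} \right) = \frac{N \sigma^2 + \tau^2}{N (\sigma^2 + \tau^2)}.
  \end{equation*}
  For instance, with $N = 2$ and $\sigma = \tau = 1$, we get $\beta = 2$, $\alpha = -1/2$, and $s = 3/4$.
  \end{remark}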
  %
  Next let
  \begin{equation}
      a = \beta^{-1} \frac{\tau}{\sigma^2}; \ \ \ \ b = - \frac{\alpha \beta^{-1}}{\beta + N \alpha} \frac{\tau}{\sigma^2}.
  \end{equation}
  %
  Then $\boldsymbol \mu = (a I + b {\bf 1 1}^T) [{\bf x} - \bar x {\bf 1}]$ and moreover
  \begin{eqnarray*}
      \mu_n & = & a (x_n - \bar x) + b \sum_{j = 1}^N (x_j - \bar x) \\
      & = & a (x_n - \bar x) \\
      & = & \frac{\tau}{\sigma^2}\left (\frac{\tau^2 + \sigma^2}{\sigma^2} \right)^{-1} (x_n - \bar x) \\
      & = & \frac{\tau}{\sigma^2 + \tau^2}(x_n - \bar x),
  \end{eqnarray*}
  as desired.
  
\end{proof}

To complete the proof of Proposition~3.6, we need to show that the variance of $q(z_n \s \nu^*)$ is constant with respect to ${\bf x}$; that the variances are equal across the $z_n$'s follows from the symmetry of the problem.
We already constructed the precision matrix $\Psi$ for $p({\bf z} \mid {\bf x})$, but we actually need to study the full precision matrix $\Phi$ of $p(\theta, {\bf z} \mid {\bf x})$.
We use the index $0$ to denote the columns (or rows) corresponding to $\theta$.

\begin{lemma}
    The posterior precision matrix $\Phi$ of $p(\theta, {\bf z} \mid {\bf x})$ satisfies
    \begin{equation}
        \Phi_{00} = \frac{N}{\sigma^2}; \ \
        \Phi_{0j} = \frac{\tau}{\sigma^2} \ \mathrm{if} \ j > 0; \ \
        \Phi_{nn} = 1 + \frac{\tau^2}{\sigma^2} \ \mathrm{if} \ n > 0; \ \
        \Phi_{nj} = 0, \ \mathrm{if} \ n \neq j \ \mathrm{and} \ n, j > 0.
    \end{equation}
    Crucially, $\Phi$ is constant with respect to ${\bf x}$.
\end{lemma}
%
\begin{proof}
    Consider Eq.~\ref{eq:joint_quadratic}, rewritten here for convenience,
    \begin{equation*}
        \log p({\bf z}, \theta \mid {\bf x}) = k - \frac{1}{2} \sum_{n = 1}^N z_n^2 - \frac{1}{2 \sigma^2} \left ( N \theta^2 + \sum_{n = 1}^N (x_n - \tau z_n)^2 - 2 \theta \sum_{n = 1}^N (x_n - \tau z_n) \right).
    \end{equation*}
    The standard Gaussian form is
    \begin{eqnarray} \label{eq:Gaussian}
      \log p({\bf z}, \theta \mid {\bf x}) & = & k - \frac{1}{2} \left [\Phi_{00} (\theta - \nu)^2 + \sum_{n = 1}^N \Phi_{nn} (z_n - \mu_n)^2 \right.  \nonumber \\
      & &\left . + 2 \left ( \sum_{j = 1}^N \Phi_{0j} (\theta - \nu)(z_j - \mu_j) + \sum_{j < n} \Phi_{nj} (z_n - \mu_n)(z_j - \mu_j) \right) \right].
    \end{eqnarray}
    %
    Matching coefficients for $\theta^2$, $\theta z_j$, $z_n z_j$ and $z_n^2$, we obtain respectively
        \begin{equation*}
        \Phi_{00} = \frac{N}{\sigma^2}; \ \
        \Phi_{0j} = \frac{\tau}{\sigma^2} \ \mathrm{if} \ j > 0; \ \
        \Phi_{nn} = 1 + \frac{\tau^2}{\sigma^2} \ \mathrm{if} \ n > 0; \ \
        \Phi_{nj} = 0, \ \mathrm{if} \ n \neq j \ \mathrm{and} \ n, j > 0.
    \end{equation*}
    For instance, the cross term $- \frac{\tau}{\sigma^2} \theta z_j$ in the first expression must equal $- \Phi_{0j} \theta z_j$ in the second.
\end{proof}

    The variance of $q(z_n \s \nu^*)$ is obtained by inverting the diagonal elements of $\Phi$.
    By symmetry, \mbox{$\text{Var}_{q^*}(z_n) = \xi \ \ \forall n$}, where $\xi$ is a constant which does not depend on ${\bf x}$.
    This completes the proof of Proposition~3.6. \qed
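
    \begin{remark}
    For concreteness (a worked instance using only $\Phi_{nn}$ from the lemma above), inverting the diagonal entry gives
    \begin{equation*}
        \xi = \frac{1}{\Phi_{nn}} = \left (1 + \frac{\tau^2}{\sigma^2} \right)^{-1} = \frac{\sigma^2}{\sigma^2 + \tau^2},
    \end{equation*}
    which indeed involves only the fixed scales $\tau$ and $\sigma$; for example, $\sigma = \tau = 1$ yields $\xi = 1/2$.
    \end{remark}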


  \subsection{Non-existence of an ideal inference function for hidden Markov models}

%  Before stating the proof, let us quickly examine why the proof used for the simple hierarchical model does not work here. This time when applying the CAVI rule (Lemma 3.1) to the hidden Markov model, we get
%    {\small
%    \begin{equation}
%        q(z_n \s \nu^*_n) \propto \exp \left \{ \mathbb E_{q(\theta \s \nu_0^*)} \log p(x_n \mid z_n, \theta) + \mathbb E_{q(z_{n - 1} \s \nu^*)} \log p(z_n \mid z_{n - 1}) + \mathbb E_{q(z_{n + 1} \s \nu^*)} \log p(z_{n + 1} \mid z_n) \right\}.
%    \end{equation}
%    }
%    %
%    Because the prior on $\mbz$ does not factorize, we pick up two additional terms through $p(z_n \mid z_{n - 1})$ and $p(z_{n + 1} \mid z_n)$, which prevents us from finishing the proof as we did in Section~\ref{app:simple-hier}. Nonetheless we should be mindful that the above does not immediately give us counter-example in which a learnable inference function does not exist.
%
%    Now to the proof. 
    To prove Proposition~3.8, we construct an example for which the optimal F-VI solution, using a factorized Gaussian approximation, can be written in a nearly closed form, and show that the optimal variational parameters $\nu^*_n$ take different values even when all the values of $\mbx$ are equal.
    Then for any strict subset ${\bf w}_n \subset \mbx$, we have ${\bf w}_n = {\bf w}_m$ but $\nu^*_n \neq \nu^*_m$.
    This provides our counter-example.

    Consider the model
    \begin{equation} \label{eq:hmm-simple}
      p(z_0) \propto 1 \s p(z_n \mid z_{n - 1}) = \mathcal N(z_{n - 1}, 1) \s p(x_n \mid z_n) = \mathcal N(z_n, 1),
    \end{equation}
    %
    where $\theta$ is held fixed, say to a point estimate $\hat \theta$, and ignored for the rest of this analysis.
    Applying Bayes' rule and expanding
    \begin{eqnarray*}
      \log p(\mbz \mid \mbx) & = & k - \frac{1}{2} \sum_{n = 1}^N \left [ (z_n - z_{n - 1})^2 + (x_n - z_n)^2 \right ] \\
      & = & k - \frac{1}{2} \sum_{n = 1}^N \left [ 2 z_n^2 + z_{n - 1}^2 - 2 x_n z_n - 2 z_n z_{n - 1} \right ],
    \end{eqnarray*}
    %
    which is a quadratic form in $\mbz$ and hence a Gaussian.
    Matching the coefficients for $z_n$, $z_n z_j$ and $z_n^2$ to the standard expression for a multivariate Gaussian (Eq.~\ref{eq:Gaussian}), we get
    %
    \begin{eqnarray}
        \sum_{j = 1}^N \Psi_{nj} \mu_j & = - 2 x_n  \\
        \Psi_{n j} & = - 2 & \text{if} \ j = n - 1 \ \text{or} \ j = n + 1 \\
        \Psi_{nn} & = 3 & \text{if} \ n \ge 1 \\
        \Psi_{00} & = 1.
    \end{eqnarray}
    %
    All non-specified elements of $\Psi$ go to 0.
    Moreover the precision matrix $\Psi$ is tri-diagonal.
    The posterior mean solves the linear problem,
    \begin{equation}
      \boldsymbol \mu = - 2 \Psi^{-1} \mbx.
    \end{equation}
    %
    Since the variational family and the target are both Gaussian, the optimal variational mean is simply the posterior mean and $\nu^* = \boldsymbol \mu$. Even though the elements of $\mbx$ are all equal, it is in general not the case that the elements of $\nu^*$ are constant.
    %
    To see this explicitly, we take $N = 100$ and $x_1 = x_2 = \cdots = x_N = 1$, and find that the elements of $\nu^*$ are indeed distinct (Figure~\ref{fig:nu-hmm}).
    This shows that there exists a hidden Markov model and a realization of the data ${\bf x}$ such that no learnable inference function exists. \qed 

    \begin{figure}
      \centering
      \includegraphics[width=5in]{figures/nu_hmm.pdf}
      \caption{\textit{Optimal variational means when using a Gaussian F-VI on a hidden Markov model (\Cref{eq:hmm-simple}). Even though the elements of ${\bf x}$ are all equal, the optimal variational means take on different values, and so no inference function $f_\phi: {\bf w}_n \to \nu^*_n$ can be constructed for any subset ${\bf w}_n \subset {\bf x}$.}}
      \label{fig:nu-hmm}
    \end{figure}

    \section{ADDITIONAL EXPERIMENTAL RESULTS}

    {\bf Hardware.} All experiments are conducted in \texttt{Python} 3.9.15 with \texttt{PyTorch} 1.13.1 and \texttt{CUDA} 12.0 using an NVIDIA RTX A6000 GPU.

    {\bf Reconstruction error on test set for Bayesian neural network.} We consider the reconstruction error on a test set of 10,000 images (\Cref{fig:mse-test}).
    The reconstructed image is obtained by (i) computing $q(z' \mid x')$ using the inference function $f_\phi$ and (ii) feeding $\mathbb E_q(z' \mid x')$ into the likelihood neural network $\Omega$ (in the VAE context, the ``decoder'') to obtain $\hat x'$.
    $\Omega$ is evaluated at the Bayes estimator $\hat \theta = \mathbb E_q(\theta \mid \mbx)$.
    F-VI provides no automatic way of doing step (i) (one would need to learn $q(z' \s \nu')$ by running F-VI from scratch), and so we do not evaluate it on the test set.
    Overall, we find the model generalizes well, and the test error is very close to the training error.

    \begin{figure}
        \centering
        \includegraphics[width=3in]{figures/mse_test_BNN_1797.pdf}
        \caption{\textit{Reconstruction MSE on a test set.}}
        \label{fig:mse-test}
    \end{figure}






