% !TEX root = ../main.tex
CVGP is, to the best of our knowledge,
the first variational $\gp$ inference method that leverages
a coreset-based posterior for efficiency and scalability.
It diverges from alternative sparse $\gp$ inference techniques in that
its posterior is based on a coreset triplet $\{\XbC, \ybC, \betabC\}$:
\begin{itemize}[leftmargin=2ex]
    \item CVGP does not select a sparse subset of the observed inputs:
    $\XbC$ are free parameters that lie within the data domain
    but are \textbf{not restricted to the empirical data}.
    
    \item CVGP does not learn inducing variables $\mb = \Ex{\q{\fbZ}}{\fbZ} $,
    \ie  posterior $\gp$ mean function values evaluated at inducing points $\XbZ$.
    Instead, it \textbf{learns pseudo-observations $\ybC$} that encapsulate (\ie capture the characteristics of) the observed data (\eg Figure~\ref{fig:exp_coresets_histograms}).
    
    \item CVGP is the only existing $\gp$ method that \textbf{reweights the pseudo-observations} with learnable parameters $\betabC$, which adds flexibility and explainability to its coreset-based posterior:
    it learns which pseudo-points are important
    for accurate $\gp$ posterior approximation
    (Figures~\ref{fig:exp_coresets_predictive} and~\ref{fig:exp_coresets_histograms}).
\end{itemize}

\paragraph{Comparison to non-variational sparse $\gp$s.} 
Selecting $\gp$ inputs from within the training data is a prohibitive combinatorial optimization problem
that may call for greedy optimization~\citep{csato2002sparse},
based on posterior maximization~\citep{smola2000},
maximum information gain~\citep{seeger2003fast},
matching pursuit~\citep{keerthi2005},
or other techniques~\citep{quinonero2005unifying}.
%
In contrast, CVGP leverages \emph{stochastic optimization}
to find a weighted set of pseudo-points
that efficiently approximates the $\gp$ posterior,
resembling the pioneering work of~\citet{snelson2005sparse}.
To circumvent the overestimation of the marginal likelihood and the underestimation of the noise variance reported by~\citet{titsias2009variational,bauer2016},
CVGP resorts to variational inference.
Hence, CVGP shares the variational formulation of \citet{titsias2009variational} and \citet{hensman2013gaussian},
yet is distinct in several important aspects.

\paragraph{Comparison to variational sparse $\gp$s.}
CVGP follows the approach of \citet{titsias2009variational} in using a variational lower bound on the marginal log-likelihood that leverages the $\gp$ prior’s conditional dependency, \ie $\q{\fb, \fbC} = \cp{\fb}{\fbC} \q{\fbC}$, and analytically marginalizes $\q{\fbC}$. In contrast, SVGP does not marginalize this distribution and devises a different lower bound for stochastic optimization. As a result, the SparseGP and CVGP posteriors directly incorporate the $\gp$ prior’s inductive biases and the likelihood model. The main difference between SparseGP and CVGP lies in the choice of $\q{\fbC}$:

\begin{itemize}[leftmargin=2ex]
    \item SparseGP derives \textit{the optimal distribution} over function values $\fbZ$ at inducing inputs $\XbZ$, given observed data $\yb$:
    {
    \begin{align}
	& q^\star(\fbZ)=\N{\fbZ; \mbstar{\fbZ}, \Kbstar{\fbZ}{\fbZ}} \label{eq:sparseGP_optimal_q}, \; \text{with } \\
    &  \begin{cases}
        \mbstar{\fbZ} = \KbZZ \red{\left(\sigma^2 \KbZZ + \KbZX \KbXZ\right)^{-1}} \green{\KbZX \yb}\\ 
        \Kbstar{\fbZ}{\fbZ} = \KbZZ  \blue{\left( \KbZZ + \frac{1}{\sigma^2} \KbZX \KbXZ\right)^{-1}} \KbZZ
    \end{cases} \nonumber %\\
    \end{align}
    }
    
    \item CVGP defines \textit{a learnable distribution} $\q{\fbC}$ with free coreset parameter triplet $\{\XbC, \ybC, \betabC\}$:
    {
    \begin{align}
	& \q{\fbC}=\N{\fbC; \mb_{\fbC|\ybC}, \Kb_{\fbC|\ybC}}  \label{eq:cvtgp_q}, \;\text{with} \\
        &  \begin{cases}
		 \mb_{\fbC|\ybC} = \KbCC \red{\left( \KbCC + \SigmabetaC \right)^{-1}} \green{\ybC} \\
		\Kb_{\fbC|\ybC} = 
       \KbCC \blue{\left[\KbCC^{-1} -  \left( \KbCC + \SigmabetaC \right)^{-1} \right]} \KbCC  \nonumber\\
	\end{cases}
    \end{align}
    }
\end{itemize}

We note that the building blocks of CVGP's coreset-based posterior
are analogous to those of SparseGP's optimal posterior:
CVGP's learned pseudo-observations \green{$\ybC$}
play the role of a weighted combination of observed datapoints,
\ie the \green{$\KbZX \yb$} term in SparseGP's posterior mean.
In addition, CVGP pseudo-observations \green{$\ybC$} are modulated
by the \red{$\left( \KbCC + \SigmabetaC \right)^{-1}$} term in its posterior mean;
in SparseGP, the \red{$\left(\sigma^2 \KbZZ + \KbZX \KbXZ\right)^{-1}$} term
similarly weights the transformed observations \green{$\KbZX \yb$}.
In both posterior distributions,
these terms in red are responsible for balancing
the prior inductive biases with the information provided by observed data:
\ie the posterior means interpolate between the prior and observations.
%
A similar dependency between the prior and the information provided by data
is observed in the posterior covariances:
\ie the blue terms in both posteriors adapt the prior covariance to account for the uncertainty reduction due to observations.
In CVGP, this balance is adjusted through the learnable matrix \blue{$\SigmabetaC$},
whereas in SparseGP, it is determined by the fixed dependency set by the prior covariance and the likelihood noise, \ie \blue{$\frac{1}{\sigma^2}\KbZX\KbXZ$}.
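
The interpolation described above can be made explicit by examining the limits of Equation~\eqref{eq:cvtgp_q}.
The following is only a sketch, under the assumptions (ours, for illustration) of a zero-mean $\gp$ prior and an invertible $\KbCC$.
Expanding the bracket in the covariance gives the familiar $\gp$ regression form
\begin{equation*}
\Kb_{\fbC|\ybC} = \KbCC - \KbCC \left( \KbCC + \SigmabetaC \right)^{-1} \KbCC ,
\end{equation*}
from which the two extremes follow:
\begin{align*}
\SigmabetaC \to \mathbf{0}: \quad & \mb_{\fbC|\ybC} \to \ybC, & \Kb_{\fbC|\ybC} & \to \mathbf{0}, \\
\SigmabetaC^{-1} \to \mathbf{0}: \quad & \mb_{\fbC|\ybC} \to \mathbf{0}, & \Kb_{\fbC|\ybC} & \to \KbCC .
\end{align*}
That is, a vanishing $\SigmabetaC$ makes the posterior collapse onto the pseudo-observations,
while an arbitrarily large $\SigmabetaC$ returns the prior at $\XbC$.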

Notably, as shown in Appendix Section~\ref{assec:cvgp_lower_bound_optimum},
when CVGP's matrix $\SigmabetaC$ matches the appropriate weighting,
the optima of SparseGP's and CVGP's loss functions coincide.
Hence, the learned solutions match
with $\ybC = \sigma^{-2}\SigmabetaC^* \KbCC^{-1} \KbCX \yb$
and $\SigmabetaC^* = \sigma^{2}\KbCC \left(\KbCX \KbXC\right)^{-1} \KbCC$,
recovering~\citet{titsias2009variational}'s optimal solution.
We empirically showcase CVGP's ability to quickly and
\textbf{efficiently close the gap to ExactGP's marginal log-likelihood}
in Section~\ref{ssec:exp_inference}.
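
The match at the optimum can be checked directly from Equations~\eqref{eq:sparseGP_optimal_q} and~\eqref{eq:cvtgp_q}.
The sketch below assumes that the coreset inputs coincide with the inducing inputs ($\XbC = \XbZ$)
and that $\KbCC$ and $\KbCX \KbXC$ are invertible;
the shorthand $\mathbf{A} = \sigma^{-2} \KbCX \KbXC$ is our own notation, introduced only for this check,
so that $\SigmabetaC^* = \KbCC \mathbf{A}^{-1} \KbCC$.
By the Woodbury identity,
\begin{align*}
\Kb_{\fbC|\ybC}
&= \left( \KbCC^{-1} + (\SigmabetaC^*)^{-1} \right)^{-1} \\
&= \left( \KbCC^{-1} + \KbCC^{-1} \mathbf{A} \KbCC^{-1} \right)^{-1} \\
&= \KbCC \left( \KbCC + \mathbf{A} \right)^{-1} \KbCC , \\
\mb_{\fbC|\ybC}
&= \Kb_{\fbC|\ybC} \, (\SigmabetaC^*)^{-1} \ybC \\
&= \KbCC \left( \KbCC + \mathbf{A} \right)^{-1} \mathbf{A} \, \KbCC^{-1} \ybC .
\end{align*}
The covariance is exactly that of Equation~\eqref{eq:sparseGP_optimal_q}.
The mean agrees with SparseGP's, $\sigma^{-2} \KbCC \left( \KbCC + \mathbf{A} \right)^{-1} \KbCX \yb$,
whenever the pseudo-observations satisfy $\mathbf{A} \KbCC^{-1} \ybC = \sigma^{-2} \KbCX \yb$.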

Unlike SparseGP's loss,
CVGP's loss in Equation~\eqref{eq:loss_coreset_posterior_gp_analytical}
is amenable to stochastic optimization,
making sparse $\gp$ regression scalable at reduced complexity.
CVGP matches SVGP’s scalability \citep{hensman2013gaussian},
yet offers two key advantages:
linear parameter complexity of order $\bigO{M}$,
and a distinct optimization landscape.
These arise from different design choices over $q(\fbC)$:
SVGP’s free-form $q(\fbZ) = \N{\fbZ; \mb, \Sb}$ requires $\bigO{M^2}$ parameters and yields statistics ($\mb, \Sb$) that are not directly tied to the model or data likelihood,
whereas CVGP’s posterior in Equation~\eqref{eq:coreset_gp_posterior_C_f}
leverages the model’s inductive biases,
acting as a \textbf{natural interpolation between the $\gp$ prior and the data likelihood.}\footnote{
We analyze CVGP’s prior to posterior noise adaptation as a function of observation noise levels
in Appendix~\ref{assec:app_exp_noisy}.
}
These structural differences produce distinct loss landscapes, with SVGP’s higher-dimensional optimization often struggling to converge, as shown in Section~\ref{ssec:exp_inference}.
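
As a rough illustration of the parameter counts, consider the following sketch.
It rests on our own assumptions: scalar outputs, a full $\Sb$ parameterized through its Cholesky factor,
and one weight per coreset point in $\betabC$.
Both methods additionally learn $M$ input locations and the kernel hyperparameters, which are omitted here.
\begin{align*}
\text{SVGP:} \quad & \underbrace{M}_{\mb} + \underbrace{M(M+1)/2}_{\Sb} = \bigO{M^2}, \\
\text{CVGP:} \quad & \underbrace{M}_{\ybC} + \underbrace{M}_{\betabC} = \bigO{M}.
\end{align*}
For instance, under these assumptions $M = 500$ gives $125{,}750$ variational parameters for SVGP versus $1{,}000$ for CVGP.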