% !TEX root = ../main.tex
Training $\gp$s efficiently on large datasets has been a long-standing challenge,
as exact inference scales as $O(N^3)$ in time and $O(N^2)$ in memory with the number of observations $N$.
%
Successful state-of-the-art (SOTA) methods for scaling $\gp$s
(see~\citet{liu2020gaussian} for a detailed review)
are based on sparse and low-rank approximations~\citep{williams2000using, snelson2005sparse, quinonero2005unifying},
often using inducing random variables~\citep{naish2007generalized, titsias2009variational, hensman2013gaussian, wilson2015kernel}.

Amongst these techniques,
variational learning of inducing variables by~\citet{titsias2009variational} allows for time and space complexities of $O(NM^2)$ and $O(NM)$,
with clear benefits when the number of inducing points $M$ is small, \ie $M \ll N$.
However, in real-world applications, $N$ can be on the order of millions, so that even this linear dependence on $N$ makes model learning impractical.
%
More recently, \citet{hensman2013gaussian} introduced stochastic variational inference for Gaussian processes (SVGP), which reduces the per-iteration time and space complexities to $\bigO{M^3}$ and $\bigO{M^2}$, respectively. This method has become the standard for training $\gp$ models on large datasets. However, SVGP's scalability comes at a cost: it requires learning an additional $\bigO{M^2}$ variational parameters, resulting in an optimization problem whose size grows quadratically with the number of inducing points.
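%
To make this count concrete, recall that SVGP posits a free-form Gaussian variational distribution over the $M$ inducing variables $\mathbf{u}$. Writing its mean as $\mathbf{m}$ and its covariance as $\mathbf{S}$ (notation assumed here for illustration only), the number of variational parameters is
\[
q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}, \mathbf{S})
\quad \Longrightarrow \quad
\underbrace{M}_{\mathbf{m}} + \underbrace{M(M+1)/2}_{\mbox{\scriptsize Cholesky factor of } \mathbf{S}} = \bigO{M^2},
\]
in addition to the inducing inputs and the kernel hyperparameters.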

In this work,
we propose a coreset-based variational $\gp$ (CVGP) technique
that is amenable to stochastic optimization (\ie scalable to big datasets)
at reduced $\bigO{M}$ parameter complexity (see Table~\ref{table:bigo}),
and demonstrate its accurate inference and predictive performance on a wide range of real-world datasets (see results in Section~\ref{sec:experiments}).

We take inspiration from~\citet{titsias2009variational}'s optimal variational posterior,
and ensure that CVGP's variational family also obeys 
(1) the $\gp$s' prior-conditional structure,
and (2) the $\gp$ prior's dependencies in its posterior,
all achieved via Bayesian coreset principles~\citep{huggins2016coresets, zhang2021bayesian}.
Specifically, we design and learn a variational distribution for a $\gp$-based probabilistic model,
defined through a subset of learnable pseudo-points and a weighted likelihood function,
in line with the Black-Box Bayesian coreset framework~\citep{manousakas2020bayesian, manousakas2022black}.
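%
As a rough sketch of this construction, under notation assumed here for illustration (pseudo-inputs $\tilde{\mathbf{x}}_m$, pseudo-outputs $\tilde{y}_m$, and non-negative weights $\beta_m$, which may differ from the symbols used in later sections), a coreset-based posterior over the latent function $f$ takes the generic form
\[
q(f) \propto p(f) \prod_{m=1}^{M} p\left(\tilde{y}_m \mid f(\tilde{\mathbf{x}}_m)\right)^{\beta_m},
\]
so that the $\gp$ prior $p(f)$ is retained exactly, while the influence of each pseudo-point is scaled by a single weight.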

CVGP's coreset-based variational $\gp$ posterior,
learnable via stochastic maximization of
a lower bound on the log-marginal data likelihood,
enables not only a more accurate approximation to the true $\gp$ regression posterior,
but also a more efficient optimization process.

In summary, our contribution is a novel, coreset-based stochastic variational $\gp$ inference (CVGP) algorithm that:
\begin{enumerate}[leftmargin=2.5ex]
    \item %(1) 
    Finds a coreset-based, sparse variational posterior to faithfully approximate the true $\gp$ posterior,
    enabling up- and down-weighting of the influence of pseudo-points during learning (Section~\ref{ssec:exp_coresets} and Appendices~\ref{asssec:app_exp_posterior_predictive} and~\ref{asssec:app_exp_coreset_weights});

    \item %(2) 
    Maximizes a lower bound on the marginal log-likelihood that is amenable to efficient stochastic optimization (Section~\ref{ssec:cvtgp_lowerbound});
    
    \item %(3) 
    Provides a numerically stable algorithm requiring only $\bigO{M}$ parameters to be learned, at computational and memory complexities of $\bigO{M^3}$ and $\bigO{M^2}$ (Table~\ref{table:bigo}); %and
    
    \item %(4) 
    Outperforms SOTA stochastic variational $\gp$ inference alternatives on real-world regression datasets (Section \ref{sec:experiments}):
    CVGP not only provides improved predictive performance (Section~\ref{ssec:exp_predictive}),
    but also achieves a tighter variational lower bound than alternatives (Section~\ref{ssec:exp_inference}).

\end{enumerate}