\documentclass[accepted]{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{subfigure}
\usepackage{algorithm}

% \usepackage{hyperref}
% \usepackage{nameref}
% \usepackage{zref-xr}
% \zxrsetup{toltxlabel}
% % uai_submission/
% \zexternaldocument*{achituve_219-supp}

\usepackage{xr}
\makeatletter

\newcommand*{\addFileDependency}[1]{% argument=file name and extension
\typeout{(#1)}% latexmk will find this if $recorder=0
% however, in that case, it will ignore #1 if it is a .aux or 
% .pdf file etc and it exists! If it doesn't exist, it will appear 
% in the list of dependents regardless)
%
% Write the following if you want it to appear in \listfiles 
% --- although not really necessary and latexmk doesn't use this
%
\@addtofilelist{#1}
%
% latexmk will find this message if #1 doesn't exist (yet)
\IfFileExists{#1}{}{\typeout{No file #1.}}
}\makeatother

\newcommand*{\myexternaldocument}[1]{%
\externaldocument{#1}%
\addFileDependency{#1.tex}%
\addFileDependency{#1.aux}%
}
%------------End of helper code--------------

% put all the external documents here!
\myexternaldocument{achituve_219-supp}

\input{math_commands}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)


\title{Guided Deep Kernel Learning}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<Idan.Achituve@biu.ac.il>?Subject=GDKL on UAI2023}{Idan Achituve}}
\author[2, 3]{Gal Chechik}
\author[1]{Ethan Fetaya}
% Add affiliations after the authors
\affil[1]{%
    Faculty of Engineering\\
    Bar-Ilan University\\
    Israel
}
\affil[2]{%
    Computer Science Dept.\\
    Bar-Ilan University\\
    Israel
}
\affil[3]{%
    NVIDIA\\
    Israel
  }


\begin{document}
\maketitle
\begin{abstract}
Combining Gaussian processes with the expressive power of deep neural networks is commonly done nowadays through deep kernel learning (DKL). Unfortunately, due to the kernel optimization process, this often results in losing the Bayesian benefits of Gaussian processes.
In this study, we present a novel approach for learning deep kernels by utilizing infinite-width neural networks. We propose to use the Neural Network Gaussian Process (NNGP) model as a guide to the DKL model in the optimization process. Our approach harnesses the reliable uncertainty estimation of NNGPs to adapt the DKL target confidence when it encounters novel data points. As a result, we get the best of both worlds: we leverage the Bayesian behavior of the NNGP, namely its robustness to overfitting and accurate uncertainty estimation, while maintaining the generalization abilities, scalability, and flexibility of deep kernels. Empirically, we show on multiple benchmark datasets of varying sizes and dimensionality that our method is robust to overfitting, has good predictive performance, and provides reliable uncertainty estimates.

%Unfortunately, this often results in the loss of the Gaussian processes' Bayesian benefits due to the predictive optimization of the kernel on the training data. In this work, we present a novel approach to deep kernel learning by using  infinite-width neural network, modeled by a Neural Network Gaussian Process (NNGP) as guidance. This approach uses the NNGP reliable uncertainty estimation to adapt our target confidence on novel data points. This allows us to get the best of both worlds, we maintain the Bayesian behavior of the NNGP - robustness to overfitting, and accurate uncertainty estimation while keeping the accuracy, scalability, and flexibility of deep kernel learning.  Empirically we show on multiple benchmarks that our method consistently works well on both performance and uncertainty estimation metrics.

\end{abstract}
%Deep kernel learning is a common approach to enhance Gaussian processes with the expressive power of deep neural networks. Unfortunately, as the network is optimized on the training data, the overall behavior is similar to a standard neural network and we lose the benefits of the  Bayesian approach.

\section{Introduction}
Gaussian processes (GPs) are an effective Bayesian non-parametric family of models. They have several appealing features, such as tractable inference, accurate uncertainty estimation, and the ability to generalize well from small datasets \citep{gp_book, snell2020bayesian, achituve2021personalized}. In GPs, the kernel function is the crucial factor that determines performance, as it measures the similarity between data points and significantly impacts which functions the model considers probable. Standard kernels, such as the RBF kernel, perform well on certain learning problems, but they are inadequate for complex data modalities, such as images and text, where they fail to capture the desired semantic similarity. One appealing solution is to combine GPs with the expressive power of neural networks (NNs). There are two popular ways to achieve this. The first is through learning deep kernels, and the second is through kernels that correspond to infinite-width networks. In what follows, we present both approaches, their limitations, and our proposed approach that combines the two.

% We will now describe two standard approaches to combine GPs and NNs and their limitations. Then, we will describe and our approach that tries to overcome these limitations.


One popular way to combine GPs and NNs is through \textit{deep kernel learning} (DKL) \citep{calandra2016manifold, gordon16_DKL}. DKL uses a standard kernel over an embedding learned by a neural network, combining the tractable inference of GPs with the expressive power of deep neural networks (DNNs). Unfortunately, despite appearing to be a natural way to combine the benefits of GPs and DNNs, DKL often falls short of expectations in practice. A recent study found that deep kernels can severely overfit, sometimes even worse than standard NNs \citep{ober2021promises}. That study suggests that the overfitting of DKL is caused by the optimization process ``over-correlating'' the data points.
%The reason is that gaining the expressive power of deep networks with the DKL optimization process, which can ``over-correlate'' the data points \citep{ober2021promises}.
%It was found that combining the expressive power of deep networks with the DKL optimization process can ``over-correlate'' the data points \citep{ober2021promises}.

%DKL holds great promise, it can be used with any powerful NN architectures



%it has the flexibility to utilize vast NN architectures specialized for various data types it also allows it to easily inherit powerful mechanisms of standard GPs, such as marginal-likelihood based parameter optimization and scaling to large datasets via inducing points \citep{silverman1985some, quinonero2005unifying, sneldon_Gharamani_IP}. All of that with the added benefit of calibrated uncertainty estimation. However, in practice, DKL doesn't meet the expectation. It was found that deep kernels can severely overfit, sometimes even worse than standard NNs \citep{ober2021promises}. Thus, eliminating the benefits that the GPs were suppose to bring.

An alternative way to link DNNs and GPs, without relying on DKL, is through the equivalence between GPs and \textit{infinite-width} deep neural networks  \citep{neal2012bayesian, LeeBNSPS18, MatthewsHRTG18,Garriga-AlonsoR19, NovakXBLYHAPS19,Yang2019scaling}.  
Specifically, consider the distribution over DNN weights when they are initialized i.i.d. As the width of the DNN layers increases to infinity, the distribution of functions represented by the NN converges to a Gaussian process. Importantly, that GP has a kernel function that can be computed efficiently despite the infinite width. The main advantage of this approach is clear: it allows us to apply tractable Bayesian inference to highly expressive neural networks of infinite width. Moreover, as the structure of DNNs provides valuable inductive biases for many data modalities, the corresponding kernels are better suited to those modalities.
%e.g. Convolutional NNs for images,
%Indeed, this approach has been extended to other architectures as well, including CNNs \citep{Garriga-AlonsoR19, NovakXBLYHAPS19}, and RNNs \citep{Yang2019scaling}.
%\gal{IS any of this new? our contribution?}\ef{no}



%An alternative approach to DKL is the class of models that considers the equivalence between GPs and infinite width deep NNs \cite{neal2012bayesian, LeeBNSPS18, MatthewsHRTG18}. Given a deep NN with random i.i.d initialization, as the layers tend to infinity, the central-limit theorem (CLT)
%implies that the function represented by the NN corresponds to a function drawn from a GP. Concretely, this means that the (infinite) NN structure defines a specific kernel. As some structures encode better the inductive biases of different data modalities (e.g., Convolutional NNs for images), the corresponding kernel will be more suitable for measuring similarity on the data they were designed to model. Indeed, this approach has been extended to other architectures as well, including CNNs \citep{Garriga-AlonsoR19, NovakXBLYHAPS19}, and RNNs \citep{Yang2019scaling}. The appeal of this approach is clear, it allows to obtain a highly expressive Bayesian NN without the need to learn any of its parameters. 
This approach, however, also has several drawbacks that hinder its widespread adoption.
%\gal{Too long for me. at this point we should start describing our knight, not a second dragon. Perhaps cut and merge the first two parargaphs?}\ef{I think this is unavoidable. Our sales pitch is that we get the "best of both worlds", so we need to explain what they are before we get to our point.}
First and foremost, in many cases, these models underperform standard NNs that were optimized for a specific task \citep{NovakXBLYHAPS19}. One possible explanation is that the success of DNNs is connected to the implicit bias of the optimization process \citep[e.g.,][]{VardiS21}, which these models cannot capture. Second, evaluating the kernel at training and inference time can be costly. This is partially because the NN kernel needs to be computed for every pair of data points, whereas in DKL the network is run once on each datum before a standard kernel is applied. Finally, it is challenging to incorporate established mechanisms, such as inducing point techniques, into these models. This is in contrast to DKL models, which are more flexible and easier to scale.
%In contrast, the simplicity of the DKL approach gives flexibility that can be used with inducing points and also to modify it for various learning settings, e.g. \citep{achituve2021personalized}.
%First, the evaluation of the kernel on the data-points can be costly due to several factors, such as the dataset size, the depth of the network, the type of the layers (e.g., invariant layers such as pooling), and the dimension of the input. Importantly, we often want machine-learning systems that generate fast predictions, and 
%these factors effect the prediction time as well.
%Second, in many cases these models substantially under-perform finite NNs that were optimized for a specific task \citep{NovakXBLYHAPS19}. And lastly, it is challenging to incorporate established mechanisms such as inducing point techniques with these types of models \ia{Ethan - please verify the last line}.
%
%Considering the advantages and drawbacks of each method it is only natural to ask, can we combine them to get the benefits of both approaches? Specifically, can we leverage infinite NNs to remedy the overfitting and better optimize deep kernels? 
%
The limitations of current solutions raise the question: \emph{{How can we combine GPs with NNs without compromising performance or uncertainty estimation?}} 

This paper proposes a solution to the above question, which we call \textit{Guided Deep Kernel Learning} (GDKL). GDKL combines the benefits of DKL with those of the GPs that correspond to infinite-width networks, known as Neural Network Gaussian Processes (NNGPs), by leveraging the uncertainty estimation of NNGPs to guide the DKL optimization process. To this end, we propose a novel procedure to optimize deep kernels by having them match the distribution of the NNGP's latent function given the target value. For example, in a regression task, the DKL model tries to match a Gaussian centered near the target with an adaptive level of certainty that depends on the NNGP. We show that this approach achieves the best of both worlds. It enjoys the flexibility, scalability, and predictive capabilities of DKL, while retaining the Bayesian benefits of GPs. Namely, our method can estimate uncertainty more reliably and is drastically less prone to overfitting than DKL, without sacrificing performance. The experiments show that our method outperforms natural baseline methods on several benchmark datasets in terms of both performance and uncertainty quantification.

%\gal{We need a sharper "our solution" paragraph. Here is a skeleton: }
%\gal{This paper offers a solution, called Guided Deep Kernel Learning (GDKL) to the above problem. GDKL combines the benefits of DKL with XXX, by leverging the tractability of GPs with the expressive power of DNNS. TO this end, we improve DKL and show how to match a Gaussian centered near a true label with an adaptive level of certainty that depends on the NNGP.}


%In this work, we show this is indeed the case. We define an objective \gal{Need something stronger} that tries to match the output distribution of the DKL model with that of the infinite-width NN \emph{conditioned on the true label}. As a result, the goal is to match a Gaussian centered near the true label, but with an adaptive level of certainty that depends on the  NNGP. We name our model \textit{Guided Deep Kernel Learning (GDKL)} as the NNGP guides the DKL optimization towards solutions that have Bayesian characteristics. 
%Importantly as the infinite-width NN serve as prior in our model, we are free to chose architectures with varying complexity according to one needs and computational budget.
%As a result, we get the best of both worlds, we can enjoy the flexibility, scalability and predictive capabilities of DKL, while retaining the Bayesian benefits of GPs. Namely, our method can estimate uncertainty more reliably and is drastically less prone to overfitting than DKL, without sacrificing performance.  The experiments show the superiority of our method against natural baseline methods on several benchmark datasets in terms of both performance and uncertainty quantification.



%Following the intuition of the seminal work of \cite{titsias2009variational}, we would like to make the NN parameters, variational parameters which will render them protected from overfitting \cite{matthews2016sparse}. To do so, we build an extremely simple method inspired by  the fELBO objective \citep{sunfunctional}, which lower-bounds the KL-divergence between stochastic processes. We use infinite architectures as strong prior for training the DKL parameters based the predictive distribution. Importantly as these architectures serve as prior only, we are free to chose architectures with varying complexity according to one needs and computational budget. As a result, we get the best of both worlds, we are able to train deep kernels even on extremely small detests without overfitting while maintaining the Bayesian benefits of GPs using a model that is as fast as standard deep kernels when making predictions. The experiments show the superiority of our method against baseline methods on several benchmark datasets. 

%Considering the advantages and drawbacks of each method it is only natural to ask, can we use one approach to reconcile the limitations of the other approach? more specifically, can we leverage infinite NNs to remedy the overfitting and optimization issues associated with deep kernels? In this study we answer affirmative to this question. Following the intuition of the seminal work of \cite{titsias2009variational}, we would like to make the NN parameters, variational parameters which will render them protected from overfitting \cite{matthews2016sparse}. To do so, we build an extremely simple method inspired by  the fELBO objective \citep{sunfunctional}, which lower-bounds the KL-divergence between stochastic processes. We use infinite architectures as strong prior for training the DKL parameters based the predictive distribution. Importantly as these architectures serve as prior only, we are free to chose architectures with varying complexity according to one needs and computational budget. As a result, we get the best of both worlds, we are able to train deep kernels even on extremely small detests without overfitting while maintaining the Bayesian benefits of GPs using a model that is as fast as standard deep kernels when making predictions. The experiments show the superiority of our method against baseline methods on several benchmark datasets. 

%To summarize, in t
This paper makes the following novel contributions: (i) we propose GDKL, a novel method for training deep kernels whose uncertainty is calibrated by infinite-width networks;
(ii) GDKL supports either exact inference or approximate inference using common inducing point techniques; (iii) we demonstrate the benefits of GDKL over baseline methods on small- to mid-sized benchmark datasets with low and high data dimensionality. We conclude that GDKL generalizes well, estimates uncertainty more reliably, and is significantly more robust against overfitting than standard deep kernels and competing methods.

\section{Background}
\paragraph{Notation.}
We denote scalars with lower-case letters (e.g., $x$), vectors with bold lower-case letters (e.g., $\rvx$), and matrices with bold capital letters (e.g., $\rmX$). Given a dataset $\gD = \{(\rvx_1,\rvy_1),\ldots,(\rvx_n,\rvy_n)\}$, we denote by $\rmX\in\sR^{n\times d}$ and $\rmY\in\sR^{n\times c}$ the design and label matrices whose $i^{th}$ rows are $\rvx_i$ and $\rvy_i$, respectively.
%We denote vectors with bold lower-case font, e.g. $\rvx$, and matrices with capital bold font, e.g. $\rvx$.

\paragraph{Gaussian Processes.}
Gaussian processes are a family of Bayesian non-parametric models. GPs assume that the mapping from input points to the target values is via latent functions $\gF = \{f^1, \ldots, f^c\}$. In this study, we assume that the latent functions are independent of one another. Consider a single-output process $f(\cdot)$. A GP is fully specified by the mean function $m(\rvx)$ and the covariance function $k(\rvx,\rvx')$, and we denote it by $f(\rvx)\sim\mathcal{GP}(m(\rvx),~k(\rvx,\rvx'))$. The mean $m(\rvx)$ is commonly taken to be the constant zero function, and the kernel $k(\rvx,\rvx')$ is a positive semi-definite function. The kernel defines the covariance between function values at different input locations. Thus, it is the main contributing factor when predicting on novel inputs. One of the major benefits of GPs is that in regression tasks with Gaussian noise, $p(y_i|f(\rvx_i))=\mathcal{N}(f(\rvx_i),\sigma^2_n)$, inference has a closed-form Gaussian solution. Specifically, we have analytical expressions for the posterior $p(f_*|\rvx_*,\gD)$ and the marginal $p(y_*|\rvx_*,\gD)$, where $\gD$ is the training data and $\rvx_*$ is a test data point. The hyper-parameters of the GP, which we will refer to as parameters in this study, are commonly optimized using the marginal likelihood. Here, we promote the use of the predictive distribution to learn them. Several studies have considered this approach \citep[e.g.,][]{jankowiak2020parametric, snell2020bayesian, achituve2021personalized, lotfi2022bayesian}. Usually, this objective leads to better predictive abilities, yet as we will show, it is not robust against overfitting when training deep kernels.
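
For concreteness, the closed-form expressions referred to above are the standard ones for a zero-mean GP with Gaussian noise. The shorthand below is introduced here for illustration only: $\rmK$ is the kernel matrix on the training inputs, $\rvk_*$ is the vector of kernel values between $\rvx_*$ and the training inputs, $k_{**}=k(\rvx_*,\rvx_*)$, and $\rvy$ is the vector of training targets of a single output.
% Assumes the \rvk, \rmK, \rmI, and \rvy macros from math_commands (same family as \rvx and \rmX).
\begin{equation*}
    \begin{aligned}
    p(f_*|\rvx_*,\gD) &= \mathcal{N}(\mu_*,~\sigma^2_*),\\
    \mu_* &= \rvk_*^T(\rmK+\sigma^2_n\rmI)^{-1}\rvy,\\
    \sigma^2_* &= k_{**} - \rvk_*^T(\rmK+\sigma^2_n\rmI)^{-1}\rvk_*,\\
    p(y_*|\rvx_*,\gD) &= \mathcal{N}(\mu_*,~\sigma^2_*+\sigma^2_n).
    \end{aligned}
\end{equation*}
Both distributions require solving a linear system with an $n\times n$ matrix, and both are differentiable with respect to the kernel parameters.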

% \ef{do we need the explicit formula?} \ia{No. usually in papers people don't give the explicit formulas.}
% \begin{equation}
%     \begin{aligned} 
%     &p(f_*|\rvx_*, \by, \rvx)=\mathcal{N}(\mu_*,~\sigma^2_*),\\
%     &\mu_*=\bk_{*}^T(\bK+\sigma^2_n \bld{I})^{-1}\by,\\
%     &\sigma^2_* = k_{**} - \bk_*^T(\bK+\sigma^2_n\bld{I})^{-1}\bk_*.
%     \end{aligned} 
% \end{equation}
% Where, $\rvx,\by$ are the training data, $\rvx_*$ is the test input, $K_{ij}=k(\rvx_i,\rvx_j)$ $k_{**} = k(x_*, x_*)$, and $\bk_*[i]=k(\rvx_i, \rvx_*)$.

%The target values are assumed to be independent when conditioned on the latent function, i.e., $p(\by|\rvx,f)=\prod_{i=1}^np(y_i|f(\rvx_i))$.

%where the evaluation vector of $f$ on $\rvx$,  $\bff=[f(\rvx_1),...,f(\rvx_n)]^T$, has a Gaussian distribution $\bff\sim\mathcal{N}(\boldsymbol{\mu},~\bK)$, where $\boldsymbol{\mu}_i=m(\rvx_i)$ and $\bK_{ij}=k(\rvx_i,\rvx_j)$. The mean $m(\rvx)$ is commonly taken to be the constant zero function, and the kernel $k(\rvx,\rvx')$ is a positive semi-definite function. %The kernel plays a critical role as it defines the correlation between function values at different input locations and it is the main factor in predicting on novel inputs.
%For regression tasks with a Gaussian noise, $p(y|f(\rvx_i))=\mathcal{N}(f(\rvx_i),\sigma)$, and the inference has a close-form solution. 

%Let $\rvx,\by$ be the training data, and let $f_*$ be the evaluation of $f$ on a novel point $\rvx_*$. In the regression case, we assume $p(y|\rvx,f)=\normal(f(\rvx),~\sigma^2)$. Therefore, the predictive distributions, $p(f_*|\rvx_*, \by, \rvx)$ and
%$p(y_*|\rvx_*, \by, \rvx)=\int p(f_*|\rvx_*, \by, \rvx)p(y_*|f_*)df_*$, are Gaussians with known parameters. Specifically,
%\ia{I think the formulas are not necessary, so I put them under remark.}\ef{I wanted to show the $K^{-1}$ term, can always remove later if we are short on space}
%and their parameters have a closed form solution:
% \iffalse
% \begin{equation}
%     \begin{aligned} 
%     &p(f_*|\rvx_*, \by, \rvx)=\mathcal{N}(\mu_*,~\sigma_*),\\
%     &\mu_*=\bk_{*}^T(\bK+\sigma^2 \bld{I})^{-1}\by,\\
%     &\sigma_* = k_{**} - \bk_*^T(\bK+\sigma^2\bld{I})^{-1}\bk_*.
%     \end{aligned} 
% \end{equation}
% Where, $k_{**} = k(x_*, x_*)$, and $\bk_*[i]=k(\rvx_i, \rvx_*)$.
% This closed-form solution allows us to avoid the costly marginalization step; however, it entails the inversion of an $n\times n$ matrix which can be expensive to compute for large datasets.

% DKL \citep{calandra2016manifold, gordon16_DKL} is a popular choice to apply a kernel on structured data such as images. The kernel over the input data points is commonly in the form of a fixed kernel on an embedding learned by a deep neural network $g_\theta$, e.g., $k_\theta(\rvx,\rvx')=\exp(-||g_\theta(\rvx)-g_\theta(\rvx')||^2/{2\ell^2})$. Therefore, the closed-form inference is of even greater importance as it allows to easily backpropagate through the GP inference. % to update the network parameters.
% \fi
\paragraph{Deep Kernel Learning.}
\citet{gordon16_DKL} proposed to combine deep neural networks with GPs by applying a GP to the representation learned by a NN. For example, starting from the RBF kernel $k(\rvx,\rvx')=\exp(-\|\rvx-\rvx'\|^2/(2\ell^2))$ (although any other kernel can be used), they proposed the kernel $k_\theta(\rvx,\rvx')=\exp(-\|g_\theta(\rvx)-g_\theta(\rvx')\|^2/(2\ell^2))$, where $g_\theta$ is a NN with parameters $\theta$. They then trained $\theta$ to maximize the log marginal likelihood $\log p(\rvy|\rmX)$ using the closed-form expression for regression problems. Later works extended this approach to classification \citep{linderman2015dependent, wilson2016stochastic, milios2018dirichlet, achituve2021gp}.
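
As a reminder of what this training objective looks like, the standard closed form of the log marginal likelihood for a zero-mean GP with Gaussian noise of variance $\sigma_n^2$ is given below. Here $\rmK_\theta$ denotes the $n\times n$ matrix with entries $k_\theta(\rvx_i,\rvx_j)$ and $\rvy$ the vector of training targets of a single output (shorthand introduced here for illustration).
% Assumes the \rmK, \rmI, and \rvy macros from math_commands.
\begin{equation*}
    \log p(\rvy|\rmX) = -\frac{1}{2}\rvy^T(\rmK_\theta+\sigma_n^2\rmI)^{-1}\rvy - \frac{1}{2}\log\left|\rmK_\theta+\sigma_n^2\rmI\right| - \frac{n}{2}\log 2\pi.
\end{equation*}
Every term is differentiable in $\theta$, so gradients can be backpropagated through the GP into the network $g_\theta$.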

\begin{figure*}[!t]
    \centering
    \includegraphics[width=0.40\textwidth]{figures/p_fs_D1.png}
    \includegraphics[width=0.40\textwidth]{figures/p_fs_D1_ys.png}
\caption{Illustrative example: (Left) Points in $\gD_1$, the target function, and the GP prediction. The data contain a gap in $[4,8]$ to demonstrate a low-confidence region.
(Right) Points in $\gD_2$ and the Gaussian objective $p(f_{*} | x_*, y_*, \gD_1)$ for each point.}
\label{fig:ill_example}
\end{figure*}

\paragraph{Infinite-width networks.}
Studying the behavior of NNs in the infinite-width limit has its roots in the seminal work of \citet{neal2012bayesian}. It was shown that at initialization (with appropriately scaled weight variances), the distribution over functions represented by a single hidden layer NN converges to a GP as the width increases to infinity. This result was later extended to deep NNs of infinite width as well \citep{LeeBNSPS18, MatthewsHRTG18}. This means that at the infinite-width limit, the Bayesian neural network inference problem is reduced to GP inference with a kernel defined by the neural network limit. In this study, we will refer to instances of this approach as the Neural Network Gaussian Process (NNGP). The kernel for a fully-connected network can be computed using the following recursive formula:
\begin{equation}
    \begin{aligned} 
    k^{(1)}(\rvx, \rvx') &= \sigma_b^2 + \sigma_w^2 \cdot \frac{\rvx^T \rvx'}{d}\\
    k^{(l+1)}(\rvx, \rvx') &= \sigma_b^2 + \sigma_w^2 \E_{f \sim \gN(0, \rmK^{(l)})}[ \phi(f(\rvx)) \phi(f(\rvx'))],
    \end{aligned} 
    \label{eq:NNGP}
\end{equation}
where $\sigma_b^2, \sigma_w^2$ are hyper-parameters that control the variances of the biases and weights respectively, $d$ and $l$ are the input dimension and layer index respectively (e.g., $\rmK^{(l)}$ denotes the kernel of the $l^{th}$ layer), and $\phi(\cdot)$ is the point-wise non-linearity of the layer. For some non-linear activations, such as sigmoidal, Gaussian, and ReLU, the formula can be computed analytically \citep{williams1996computing, cho2009kernel}. In other cases, it can be approximated efficiently using Monte-Carlo methods, as the expectation is over a two-dimensional Gaussian random variable \citep{NovakXHLASS20}. Similarly, a kernel can be derived for other NN architectures, such as CNNs and RNNs \citep{Garriga-AlonsoR19, NovakXBLYHAPS19, Yang2019scaling}.
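
As an example of the analytic case, for the ReLU activation $\phi(z)=\max(0,z)$ the expectation in \Eqref{eq:NNGP} reduces to the well-known arc-cosine form. The angle $\theta^{(l)}$ is shorthand introduced here for illustration and is unrelated to the network parameters $\theta$.
\begin{equation*}
    \begin{aligned}
    k^{(l+1)}(\rvx, \rvx') &= \sigma_b^2 + \frac{\sigma_w^2}{2\pi}\sqrt{k^{(l)}(\rvx, \rvx)\,k^{(l)}(\rvx', \rvx')}\left(\sin\theta^{(l)} + (\pi-\theta^{(l)})\cos\theta^{(l)}\right),\\
    \theta^{(l)} &= \arccos\left(\frac{k^{(l)}(\rvx, \rvx')}{\sqrt{k^{(l)}(\rvx, \rvx)\,k^{(l)}(\rvx', \rvx')}}\right).
    \end{aligned}
\end{equation*}
As a sanity check, setting $\rvx'=\rvx$ gives $\theta^{(l)}=0$ and $k^{(l+1)}(\rvx,\rvx)=\sigma_b^2+\sigma_w^2\,k^{(l)}(\rvx,\rvx)/2$, which matches the second moment of a rectified zero-mean Gaussian.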

% Similarly, one can derive the limiting kernel of the Neural Tangent Kernel (NTK), \citep{jacot2018neural}. In our experiments we found that both approaches worked well with a slight advantage to the former, so the NNGP was our preferred approach.
\paragraph{Inducing points.}
One prominent limitation of GPs is the difficulty of performing exact inference on large datasets. For a dataset with $n$ points, exact inference requires storing and inverting an $n\times n$ matrix, which costs $O(n^2)$ memory and, with standard methods, $O(n^3)$ run-time. A commonly used way to improve the scalability of GPs is through inducing points \citep[e.g.,][]{titsias2009variational}. Inducing point methods define a set of $m \ll n$ pseudo-observations at inputs termed inducing locations. These locations may or may not be learned as part of the optimization process. Importantly, this mechanism allows us to control the size of the matrix to invert, since making predictions requires the costly operations only on these points rather than on the actual dataset.
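
As a rough illustration of where the savings come from, consider the generic low-rank view of inducing points; individual methods differ in their details, so this is a sketch rather than a description of a specific algorithm. Let $\rmK_{mm}$ be the kernel matrix on the $m$ inducing locations and $\rmK_{nm}$ the cross-kernel matrix between the $n$ training inputs and the inducing locations (notation introduced here for illustration). The full kernel matrix $\rmK$ is then effectively replaced by
% Assumes the \rmK macro from math_commands.
\begin{equation*}
    \rmK \approx \rmK_{nm}\rmK_{mm}^{-1}\rmK_{nm}^T,
\end{equation*}
so that only $m\times m$ systems need to be solved. This typically brings the run-time cost to $O(nm^2)$ and the memory cost to $O(nm)$.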
% In this approach, we also optimize the location and target distribution of $m$ points which are used to make predictions instead of the original data. This allows us to invert an $m\times m$ matrix instead of the original $N\times N$ matrix.

\section{Method}
We now present and explain our method. We first describe our approach in the setting of exact GP inference (i.e., without an inducing point approximation), and then show how our framework can be easily generalized to include inducing points, i.e., the sparse GP case. Incorporating inducing points allows our approach to handle a wide range of problems, from small datasets, where the overfitting of DKL is most severe, to large-scale problems, where NNGPs are too computationally demanding to run. It is important to stress that we do not place any constraints on the NN architecture, unlike other existing solutions \citep{liu2021deep, mallick2021deep, ober2021promises, van2021feature}. For the sake of clarity, we describe our method for the case of a single output. The generalization to the multi-output case is immediate and will be discussed afterward.



% \begin{figure}[ht]\label{fig:ill_example}
%     \centering
%     \subfigure[]{\includegraphics[scale=0.42]{figures/p_fs_D1.png}}
%     \subfigure[]{\includegraphics[scale=0.42]{figures/p_fs_D1_ys.png}}
% \caption{Illustrative example: (left) Points in $\gD_1$ (gap was artificially made to show low confidence area), the target function and the GP prediction. (right) Points in $\gD_2$ and the Gaussian objective $p(f_{*} |\gD_1,\rvx_*, y_*)$ for each point.}
% \end{figure}

\subsection{Guided Deep Kernel Learning}
\label{sec:GDKL}
Assume we are given a dataset $\mathcal{D}$. We split it into a training set $\mathcal{D}_1$ and a validation set $\mathcal{D}_2$. Denote by $p$ the NNGP model defined by an infinite-width neural network, and by $q_\theta$ the DKL model with parameters $\theta$. Given a point $\rvx_*$ we denote by $f_*$ the value of the latent GP function at $\rvx_*$, and denote by $D_{KL}$ the Kullback--Leibler (KL) divergence.

To motivate our proposed approach, we first describe two possible objectives: a Bayesian distillation objective \citep[e.g.,][]{pensofunctional} and a predictive objective. Then, we present our final objective, which can be viewed as a combination of the two.

%We first consider two objectives, a predictive objective, and a Bayesian distillation objective, and then define our final objective which combines the two. 

In distillation, we wish to train $q_\theta$ to mimic $p$. A natural way to achieve this goal is with the following objective:
\begin{equation}
    \label{eq:dist_obj}
    \ell_{dist}(\theta)=\E_{\rvx_*\sim\gD_2}D_{KL}[q_\theta(f_{*} | \rvx_*, \gD_1) || p(f_{*} |\rvx_*, \gD_1)].
\end{equation}
The objective in \Eqref{eq:dist_obj} tries to match the latent distributions of the two models on unseen data points from $\gD_2$. Training $q_\theta$ with this objective may produce a model that behaves like the Bayesian NNGP model, but unfortunately, it will also inherit the subpar predictive performance of the NNGP model.
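
In the regression setting, both latent distributions in \Eqref{eq:dist_obj} are univariate Gaussians, so the KL term can be evaluated in closed form. Writing $q_\theta(f_*|\rvx_*,\gD_1)=\mathcal{N}(\mu_q,\sigma_q^2)$ and $p(f_*|\rvx_*,\gD_1)=\mathcal{N}(\mu_p,\sigma_p^2)$ (shorthand introduced here for illustration), the standard identity reads
\begin{equation*}
    D_{KL}[\mathcal{N}(\mu_q,\sigma_q^2) || \mathcal{N}(\mu_p,\sigma_p^2)] = \log\frac{\sigma_p}{\sigma_q} + \frac{\sigma_q^2 + (\mu_q-\mu_p)^2}{2\sigma_p^2} - \frac{1}{2}.
\end{equation*}
The expression is zero exactly when the two means and the two variances coincide, and it depends on $\theta$ only through $\mu_q$ and $\sigma_q^2$.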

Alternatively, we can try to optimize the predictive distribution of the DKL model: 
\begin{equation}
\label{eq:pred_obj}
    \ell_{pred}(\theta)=\E_{(\rvx_*, y_*)\sim\gD_2}[-\log~q_\theta(y_*| \rvx_*, \gD_1)].
\end{equation}
This objective tends to produce accurate predictions as it is directly optimized to predict $y_*$. However, the resulting model will not behave like a Bayesian model. Namely, it will suffer from the same overfitting issues as standard DKL training \citep{lotfi2022bayesian}, which results in overestimated confidence. Appendix \ref{app_sec:obj_fun_analysis} empirically demonstrates this claim and the previous one: we evaluated both loss terms on the UCI datasets Boston, Concrete, and Energy, and found that these losses indeed behave as we anticipated.

A possible middle ground between these two approaches is to optimize $q_\theta(f_*| \rvx_*, \gD_1)$ to match the distribution $p(f_{*} | \rvx_*, y_*, \gD_1)$. The key difference from \Eqref{eq:dist_obj} is that in this case the target distribution over the latent variable $f_*$ is also conditioned on the observed $y_*$:
%A possible middle ground between optimizing to match the distribution $p(f_{*} |\gD_1, \rvx_*)$ and optimizing $q_\theta(y_*| \gD_1,\rvx_*)$ is to match the distribution of $p(f_{*} |\gD_1, \rvx_*,y_*)$. The difference is that in this case the latent variable $f_*$ is also conditioned on the true label $y_*$: 
%To further understand and motivate this objective we will consider two equivalent variations, proof of their equivalence is in the supplementary material. First is
\begin{equation}
\label{eq:hyb_obj}
    \E_{(\rvx_*, y_*)\sim\gD_2}D_{KL}[q_\theta(f_{*} | \rvx_*, \gD_1) || p(f_{*} | \rvx_*, y_*, \gD_1)].
\end{equation}
As the target latent distribution is conditioned on $y_*$, it will, for regression, take the form of a Gaussian centered near it, and the variance will be dependent on how confident $p(f_{*} |\rvx_*, y_*, \gD_1)$ is. In \Figref{fig:ill_example} we illustrate the usefulness of this objective on a toy problem. On the left panel we show the points in $\gD_1$, the ground truth function, as well as the posterior $p(f_{*} | x_*, \gD_1)$. We intentionally omitted points in the $[4,8]$ domain from $\gD_1$ to highlight areas where the GP is not confident. 
%While the GP prediction in that area is not accurate, it appropriately assigns high uncertainty to its prediction. On the right panel, we show the points in $\gD_2$ and for each point we show the target objective $p(f_{*} |x_*,y_*, \gD_1)$. We highlight two desired properties of our objective seen from this plot: When the GP is confident, the target is tight around  and not centered around the noisy $y_*$ samples. However, when the GP is not confident our target is centered around the noisy $y_*$ samples but with a much larger variance.
While the GP prediction in that area is not accurate, it appropriately assigns high uncertainty to its prediction. On the right panel, we show the points in $\gD_2$ and, for each point, the target distribution $p(f_{*} |x_*,y_*, \gD_1)$. We highlight two desired properties of our objective seen from this plot: When the GP is confident, the GP prediction is tight around the ground truth, and not centered around the noisy $y_*$ samples. However, when the GP is not confident, $p(f_{*} |x_*,y_*, \gD_1)$ is centered around the noisy $y_*$ samples with a much larger variance.
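To make these two properties concrete, consider a minimal sketch for regression under assumptions introduced only for this illustration: the NNGP posterior at $x_*$ is $p(f_* | x_*, \gD_1) = \mathcal{N}(\mu_p, \sigma_p^2)$ and the likelihood is $p(y_* | f_*) = \mathcal{N}(f_*, \sigma_n^2)$, where $\mu_p$, $\sigma_p^2$, and $\sigma_n^2$ are our own shorthand. Standard Gaussian conjugacy then gives
\begin{equation*}
    \begin{aligned}
    p(f_* | x_*, y_*, \gD_1) &= \mathcal{N}(\tilde{\mu}, \tilde{\sigma}^2), \\
    \tilde{\sigma}^2 &= \left(\sigma_p^{-2} + \sigma_n^{-2}\right)^{-1}, \\
    \tilde{\mu} &= \tilde{\sigma}^2\left(\sigma_p^{-2}\mu_p + \sigma_n^{-2}y_*\right).
    \end{aligned}
\end{equation*}
When $\sigma_p^2 \ll \sigma_n^2$ (a confident GP), the target is approximately $\mathcal{N}(\mu_p, \sigma_p^2)$ and largely ignores the noisy label. When $\sigma_p^2 \gg \sigma_n^2$, the target approaches $\mathcal{N}(y_*, \sigma_n^2)$, which is centered on $y_*$ and wider than in the confident case.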

In Appendix \ref{app_sec:gdkl_objective} we show that the objective in \Eqref{eq:hyb_obj} is also equivalent to the following:
\begin{equation}
    \begin{aligned} 
     &\E_{(\rvx_*, y_*)\sim\gD_2} \E_{q_\theta(f_*| \rvx_*, \gD_1)}[-\log~p(y_* | f_*)] \\ &+D_{KL}[q_\theta(f_*| \rvx_*, \gD_1) || p(f_*|\rvx_*, \gD_1)].
    \end{aligned} 
    \label{eq:pred_with_reg}
\end{equation}
This representation makes it clear that our objective is in fact a combination of \Eqref{eq:dist_obj} and \Eqref{eq:pred_obj}. Specifically, \Eqref{eq:pred_obj} can be connected to the first term in \Eqref{eq:pred_with_reg} by marginalizing over $f_*$ and using Jensen's inequality. Conceptually, one can think of \Eqref{eq:pred_with_reg} as having a data term, which is comparable to the DKL marginal likelihood objective, plus a regularizer that prevents the model from over-fitting the training points.
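The connection through Jensen's inequality can be sketched as follows, assuming (for illustration only) that the DKL predictive distribution is obtained by marginalizing the same likelihood $p(y_*|f_*)$ over $q_\theta$:
\begin{equation*}
    \begin{aligned}
    -\log~q_\theta(y_* | \rvx_*, \gD_1) &= -\log~\E_{q_\theta(f_*| \rvx_*, \gD_1)}[p(y_* | f_*)] \\
    &\leq \E_{q_\theta(f_*| \rvx_*, \gD_1)}[-\log~p(y_* | f_*)].
    \end{aligned}
\end{equation*}
Under this assumption, the first term in \Eqref{eq:pred_with_reg} is an upper bound on the per-point loss in \Eqref{eq:pred_obj}, since $-\log$ is convex.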

To gain further insight into our approach, consider estimating $\log~p(y_*|\rvx_*, \gD_1)$ using variational inference. While it has an analytical solution for a Gaussian likelihood, we can set it aside and derive the following evidence lower bound (ELBO):
\begin{equation}
    \begin{aligned} 
    &\E_{(\rvx_*, y_*)\sim\gD_2} \log~p(y_* | \rvx_*, \gD_1) = \\
    &\E_{(\rvx_*, y_*)\sim\gD_2} \log~\int \frac{q_{\theta}(f_*|\rvx_*, \gD_1)}{q_{\theta}(f_*|\rvx_*, \gD_1)}p(y_* | f_*)p(f_*| \rvx_*, \gD_1)df_* \geq \\
     &\E_{(\rvx_*, y_*)\sim\gD_2} \int q_{\theta}(f_*|\rvx_*, \gD_1)\log~\frac{p(y_* | f_*)p(f_*| \rvx_*, \gD_1)}{q_{\theta}(f_*|\rvx_*, \gD_1)}df_*.
    \end{aligned} 
    \label{eq:cmll}
\end{equation}
It is not hard to see that our objective is equivalent to maximizing the ELBO. Namely, $q_\theta$ is essentially trained as a variational distribution, similar to an encoder network in variational auto-encoders (VAEs) \citep{KingmaW13}. It thus tries to ``encode'' the label, but does not directly predict it. 
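One way to see this equivalence is through the standard identity for the gap of the bound. The following is a sketch for a single pair $(\rvx_*, y_*)$, where the shorthand $q_\theta(f_*)$ for $q_\theta(f_*|\rvx_*, \gD_1)$ is introduced here only for brevity. By Bayes' rule, $p(f_*|\rvx_*, y_*, \gD_1) = p(y_*|f_*)p(f_*|\rvx_*, \gD_1)/p(y_*|\rvx_*, \gD_1)$, and therefore
\begin{equation*}
    \begin{aligned}
    &D_{KL}[q_\theta(f_*) || p(f_*|\rvx_*, y_*, \gD_1)] = \log~p(y_*|\rvx_*, \gD_1) \\
    &\quad - \int q_\theta(f_*)\log~\frac{p(y_* | f_*)p(f_*| \rvx_*, \gD_1)}{q_\theta(f_*)}df_*.
    \end{aligned}
\end{equation*}
Since $\log~p(y_*|\rvx_*, \gD_1)$ does not depend on $\theta$, minimizing \Eqref{eq:hyb_obj} over $\theta$ amounts to maximizing the lower bound in \Eqref{eq:cmll}, and expanding the negated integral recovers the two terms of \Eqref{eq:pred_with_reg}.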
%Namely, it is equivalent to the predictive distribution objective with an added regularizer which tries to make the output distribution match that of the Bayesian NNGP posterior $p$. 
%The first term, the predictive distribution, %is aligned better with the generalization of the model compared to the marginal likelihood \citep{lotfi2022bayesian} \ef{?}. While the second term serves as a regularizer which prevents the model from over-fitting the training points. As we will show empirically, our approach indeed gets the benefits of both approaches.
%This is equivalent to the summation of the two original objectives we considered, and 

\begin{algorithm}[!t] 
    \caption{ Guided Deep Kernel Learning (\textit{GDKL})}
    {\bf Input}: $\gD = (\rmX, \rvy)$ - the dataset; $\rmK$ - a pre-computed kernel of the NNGP on $\gD$; $T$ - number of training iterations; $\beta$ - a hyper-parameter that scales the KL-divergence term.\\
    {\bf Init} $\theta$, the parameters of the DKL.\\ %which include the NN parameters and possibly other hyper-parameters of the kernel or the mean function.\\
    % consider adding sparse case as well.
    {\bf For} $i = 1, \ldots, T$:\\
    \hspace*{3mm} $\bullet$ Randomly split $\gD$ into $\gD_1$ and $\gD_2$, s.t. $\gD = \gD_1 \cup \gD_2$ \\
    \hspace*{3mm} and $\gD_1 \cap \gD_2 = \emptyset$\\
    \hspace*{3mm} $\bullet$ Construct $\rmK_{\gD_1}$ from $\rmK$ by selecting the \\   
    \hspace*{3mm} entries of examples from $\gD_1$\\
    \hspace*{3mm} \textbf{For} all $j \in \gD_2$: \\
    \hspace*{5mm} \textcolor{teal}{\# Exact expressions in the Appendix.}\\
    \hspace*{5mm} $\bullet$ Obtain the predictive posteriors $p(f_j | \rvx_j, \gD_1)$, \\
    \hspace*{5mm} and $q_{\theta} (f_j | \rvx_j, \gD_1)$ \\
    \hspace*{5mm} $\bullet$  $\gL^{D_{KL}}_j \leftarrow$ $  D_{KL}[q_\theta(f_j| \rvx_j, \gD_1) || p(f_j|\rvx_j, \gD_1)]$ \\
    \hspace*{5mm} $\bullet$ $\gL^{ELL}_j \leftarrow$ $  \E_{q_\theta(f_j| \rvx_j, \gD_1)}[-\log~p(y_j | f_j)]$ \\
    \hspace*{3mm} {\bf End for}\\
    %\hspace*{5mm} - Learn $g_{\theta}$ with a classification loss\\
    \hspace*{3mm} $\bullet$ $\gL \leftarrow \frac{1}{|\gD_2|} \sum_{j=1}^{|\gD_2|} \left( \gL^{ELL}_j + \beta\cdot\gL^{D_{KL}}_j \right)$. \\
    \hspace*{3mm} $\bullet$  Compute $\nabla_{\theta} \gL$ and perform update step.\\
    {\bf End for}
    \label{algo:GDKL}
\end{algorithm}

We now introduce two modifications to our objective. First, we add a hyperparameter $\beta$ that multiplies the $D_{KL}$ term in \Eqref{eq:pred_with_reg}. This allows us to control the balance between predictive training ($\beta=0$) and distillation ($\beta\rightarrow\infty$). Second, we utilize the fact that GPs are non-parametric and perform a random split of $\gD$ into different $\gD_1$ and $\gD_2$ sets at each iteration. We found that this approach led to better generalization compared to fixing these datasets. Hence the final objective is the following:
%since our objective is sensitive to the order of the data, it can be made invariant to it by taking an expectation over all possible splits of $\gD_1$ and $\gD_2$:

%Second, we use the fact that GP is non-parametric to use a different random split of $\gD$ into $\gD_1$ and $\gD_2$ at each iteration. We found that this led to better generalization, we note that this increases the training time as we need to recompute the posterior at each iteration. 
\begin{equation}
    \begin{aligned} L(\theta)=&\E_{\gD_1,\gD_2}\E_{(\rvx_*, y_*)\sim\gD_2}\{ \E_{q_\theta(f_*| \rvx_*, \gD_1)}[-\log~p(y_* | f_*)] \\ &+ \beta \cdot D_{KL}[q_\theta(f_*| \rvx_*, \gD_1) || p(f_*|\rvx_*, \gD_1)]\}.
    \end{aligned} 
    \label{eq:final_objective}
\end{equation}

%Note that as this objective can by computationally expensive \ef{?} to evaluate. However, we can approximate it by Monte-Carlo (MC) samples. 
Note that for the regression case, the $D_{KL}$ term and the expected log-likelihood term both have closed-form solutions. In classification tasks, we can use approximations that involve a Gaussian likelihood, such as treating the classification problem as a regression problem, using transformed Dirichlet variables \citep{milios2018dirichlet}, or using the Pólya-Gamma augmentation \citep{polya_gamma, achituve2021gp}. In this study, we used the transformed Dirichlet variables technique. The GDKL training procedure is summarized in \Algref{algo:GDKL}.
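For the regression case, the two closed-form terms can be written out as follows. This is a sketch using notation assumed here rather than taken from the text: $q_\theta(f_*|\rvx_*, \gD_1) = \mathcal{N}(\mu_q, \sigma_q^2)$, $p(f_*|\rvx_*, \gD_1) = \mathcal{N}(\mu_p, \sigma_p^2)$, and $p(y_*|f_*) = \mathcal{N}(f_*, \sigma_n^2)$. These are the standard univariate Gaussian identities; the exact expressions used by the method are those given in the Appendix.
\begin{equation*}
    \begin{aligned}
    D_{KL}[q_\theta || p] &= \log\frac{\sigma_p}{\sigma_q} + \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{2\sigma_p^2} - \frac{1}{2}, \\
    \E_{q_\theta}[-\log~p(y_* | f_*)] &= \frac{1}{2}\log(2\pi\sigma_n^2) + \frac{(y_* - \mu_q)^2 + \sigma_q^2}{2\sigma_n^2}.
    \end{aligned}
\end{equation*}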

% term has a closed-form solution as it is between two (univariate) Gaussian distributions. The expected log-likelihood will also have a closed-form solution when the likelihood is Gaussian, otherwise it can be approximated using monte-Carlo (MC) samples.

% We note that the (for regression) the expected log-likelihood and the KL term have closed-form solution. See algorithm \ef{add algorithm} for details. % Concretely, we utilize the fact that GPs are non-parametric and perform a random split of $\gD$ to roughly equal sized sets $\gD_1$ and $\gD_2$ at each iteration.
%We found that this approach led to better generalization compared to fixing these datasets.%; however it may increases the training time as we need to recompute the posterior at each iteration.

Finally, we would like to note a few technical details. %First, in our objective the NN parameters appear only in one side of the KL divergence in correspondence with \citep{titsias2009variational, matthews2016sparse}. As a result, they are less prone to overfit as the play the role of variational parameters rather then model parameters \ef{?}. 
First, we can evaluate the kernel of the NNGP once and extract at each iteration only the relevant sub-matrices. Second, the extension to multi-output GPs results in an additional summation over the output dimensions. Third, our network and the NNGP network do not have to share the same architecture (besides the obvious difference in width) and we have complete freedom in choosing the architecture. Lastly, to make predictions on novel data points we condition on the full dataset $\gD$ using the standard GP formulas. Namely, when making predictions our model is as fast as standard DKL models.
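As an illustration of the first point, the NNGP posterior for an example $j \in \gD_2$ can be sketched with the standard GP regression formulas, assuming a zero-mean prior and Gaussian observation noise with variance $\sigma_n^2$ (both assumptions of this sketch; the symbols $\mu_p$, $\sigma_p^2$, and the block notation below are introduced here):
\begin{equation*}
    \begin{aligned}
    \mu_p(\rvx_j) &= \rmK_{j,\gD_1}\left(\rmK_{\gD_1} + \sigma_n^2 I\right)^{-1}\rvy_{\gD_1}, \\
    \sigma_p^2(\rvx_j) &= \rmK_{jj} - \rmK_{j,\gD_1}\left(\rmK_{\gD_1} + \sigma_n^2 I\right)^{-1}\rmK_{\gD_1,j},
    \end{aligned}
\end{equation*}
where $\rmK_{j,\gD_1}$ is the row of $\rmK$ for example $j$ restricted to the columns of $\gD_1$, and $\rvy_{\gD_1}$ are the labels in $\gD_1$. Every block is an index selection from the pre-computed $\rmK$, so no NNGP kernel evaluation is needed inside the training loop; only the linear solve is repeated per split.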


\begin{figure*}[!t]
    \centering
    \includegraphics[width=0.90\textwidth]{figures/uci_small_shared.png}
\caption{Results for small UCI datasets. We report the log-likelihood (top; right is better) and RMSE (bottom; left is better) for each method over 10 train/test splits. The log-likelihood of DKL on Boston is $\sim -550$.}
\label{fig:uci_small}
\end{figure*}

\subsection{Guided Deep Kernel Learning with Inducing Points}
Although the overfitting problem of DKL is more acute in cases with limited data, DKL models can still return overconfident predictions on large datasets. As such, we extend our approach to larger datasets by incorporating inducing points.
Denote by $\rmZ$ the set of $m$ inducing locations, and by $\rvu = \rvf(\rmZ)$ the function evaluation at these locations (i.e., the inducing variables). We follow the common practice \citep{hensman2013gaussian}, and define the posterior now as $q_{\theta}(\rvf) = \int p_{\theta}(\rvf|\rvu)q(\rvu)d\rvu$, where $p_{\theta}(\rvf|\rvu)$ is a Gaussian density according to the GP prior of the DKL, and $q(\rvu)$ is a variational Gaussian distribution with learned parameters. We note that while we omit $\rmZ$ for brevity, it plays an important role as the kernel matrix depends on $\rmZ$. Now we can plug this posterior distribution into \Eqref{eq:hyb_obj}:
\begin{equation}
    \begin{aligned} 
     \E_{\gD_1,\gD_2}\E_{(\rvx_*, y_*) \sim \gD_2} D_{KL} [q_{\theta}(f_* | \rvx_*, \rmZ) || p(f_* | \rvx_*, y_*, \gD_1)],
    \end{aligned} 
    \label{eq:hyb_obj_sparse}
\end{equation}
and obtain the objective in \Eqref{eq:final_objective} with the new posterior:
\begin{equation}
    \begin{aligned} 
     &\E_{\gD_1,\gD_2}\E_{(\rvx_*, y_*) \sim \gD_2} \{ \E_{q_\theta(f_*| \rvx_*, \rmZ)}[-\log~p(y_* | f_*)] \\ & + \beta \cdot D_{KL}[q_\theta(f_*| \rvx_*, \rmZ) || p(f_*|\rvx_*, \gD_1)]\}.
    \end{aligned} 
    \label{eq:final_objective_sparse}
\end{equation}
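
For reference, the marginal of $q_\theta$ at a test point can be written in closed form under the standard sparse variational construction \citep{hensman2013gaussian}. The sketch below assumes a zero-mean prior and $q(\rvu) = \mathcal{N}(\mathbf{m}_u, \mathbf{S}_u)$; the symbols $\mathbf{m}_u$, $\mathbf{S}_u$, $\mu_q$, $\sigma_q^2$, and the kernel-block notation are introduced here for illustration, with all blocks computed from the DKL kernel:
\begin{equation*}
    \begin{aligned}
    q_\theta(f_* | \rvx_*, \rmZ) &= \mathcal{N}(\mu_q, \sigma_q^2), \\
    \mu_q &= \rmK_{*\rmZ}\rmK_{\rmZ\rmZ}^{-1}\mathbf{m}_u, \\
    \sigma_q^2 &= k_{**} - \rmK_{*\rmZ}\rmK_{\rmZ\rmZ}^{-1}\left(\rmK_{\rmZ\rmZ} - \mathbf{S}_u\right)\rmK_{\rmZ\rmZ}^{-1}\rmK_{\rmZ *}.
    \end{aligned}
\end{equation*}
Here $k_{**}$ is the prior variance at $\rvx_*$ and $\rmK_{*\rmZ}$ is the cross-covariance between $\rvx_*$ and the inducing locations. Note that $\gD_1$ does not enter this expression, which is why it appears only in the NNGP term of \Eqref{eq:final_objective_sparse}.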

A key part of scaling our objective is to allow for mini-batching. The objective in \Eqref{eq:final_objective_sparse} naturally factorizes over the data points in $\gD_2$, which leaves only the terms that depend on $\gD_1$. As we split our dataset at each iteration into a train and validation set, a simple solution is to split a random batch instead of the entire dataset. Namely, given a batch of examples $\gB$ we split it into two subsets $\gB_1$ and $\gB_2$, similar to the split we did for $\gD$, and compute the following objective:
%The objective in \Eqref{eq:final_objective_sparse} naturally factorizes over the data points in $\gD_2$, but to allow efficient training with mini-batching we need to take care of the terms that $\gD_1$ appears in. Luckily it appears only in the posterior distribution of the NNGP model, which we stress that plays the role of a regularizer. Hence, instead of evaluating this objective on all the data we evaluate it only on a batch of examples at each training iteration. Namely, given a batch of examples $\gB$ we split it to two subsets $\gB_1$ and $\gB_2$ similar to the split we did for $\gD$ and compute the following objective:
\begin{equation}
    \begin{aligned} 
     &\E_{\gB_1,\gB_2}\E_{(\rvx_*, y_*)\sim\gB_2} \{ \E_{q_\theta(f_*| \rvx_*, \rmZ)}[-\log~p(y_* | f_*)] \\ & + \beta \cdot D_{KL}[q_\theta(f_*| \rvx_*, \rmZ) || p(f_*|\rvx_*, \gB_1)]\}.
    \end{aligned} 
    \label{eq:final_objective_sparse_batch}
\end{equation}

%As before we approximate this objective with MC samples. Note, however, that unlike \Eqref{eq:final_objective} where $\gD_1$ appears in all terms of the objective, here the corresponding element $\gB_1$ appears only in the posterior distribution of the NNGP model.  %Thus, to be more data efficient, we use two samples. One with $\gB_1$ as the "observed data" and another one with $\gB_2$ as the "observed data".



In this case, the inducing locations $\rmZ$ and the variational  parameters of $q(\rvu)$ are also learned as part of the optimization process. Nevertheless, we achieve two important goals. First, the DKL model still tries to match the posterior of the label-informed NNGP model. Second, we need to evaluate the NNGP model only on the actual data points which are known in advance and can be computed beforehand. Thus, we are avoiding costly evaluations on the inducing inputs by this model. Furthermore, we can define the inducing inputs in the feature space of the NN which is much more beneficial in terms of optimization compared to the input space of the network \citep{bradshaw2017adversarial, achituve2021gp}. In Appendix \ref{app_sec:comp_cons} we provide additional computational aspects of our method. Specifically, we discuss the scaling limitations imposed by computing the NNGP kernel to train GDKL.
We argue that with GDKL this is not so much of an issue, as we have a large degree of freedom in choosing the NNGP architecture without hurting the performance of GDKL too much. Furthermore, with smart pre-computations GDKL can be as fast as standard DKL during the training time of the NN, and not only when making predictions.


\section{Related Work}
\textbf{Bayesian NNs.} Bayesian NNs (BNNs) model the uncertainty over the true underlying function by assuming a probability distribution over the network parameters \citep{minka2000bayesian}. Instead of solving an optimization problem for a single set of parameters, BNNs attempt to compute the Bayesian model average (BMA) \citep{wilson2020bayesian}. However, for modern NNs solving the BMA integral is computationally intractable and approximations must be used. Notable examples are the Laplace approximation \citep{mackay1992bayesianint, khan2019approximate, daxberger2021laplace}, MCMC-based methods \citep{neal2012bayesian, welling2011bayesian, chen2014stochastic, ZhangLZCW20}, and variational inference \citep{graves2011practical, blundell2015weight, kingma2015variational}. These approximations usually result in either degraded performance, specialized NN architectures, unreliable uncertainty estimation, or computational difficulties in terms of memory and time. One compelling alternative to performing inference in parameter space is to perform it in function space \citep{sunfunctional, WangRZZ19, rudner2021tractable, ma2021functional}. However, these methods suffer from similar issues as standard BNNs do and may involve rough approximations. A different line of work considers the distribution over functions when using infinite-width layers in fully connected networks \citep{neal2012bayesian, LeeBNSPS18, MatthewsHRTG18, jacot2018neural}. This approach allows performing tractable Bayesian inference with NNs while avoiding the optimization difficulties associated with training them. Later,
%It turns out that as the layers tend to infinity the hypothesis classes represented by the network could be replaced with a GP prior over functions with a kernel that is determined by the architecture and the parameters variance. 
this approach was extended to other architectures and layers, such as CNNs, RNNs, attention, and batch normalization \citep{Garriga-AlonsoR19, NovakXBLYHAPS19, Yang2019scaling}. However, these approaches suffer from several drawbacks, such as reduced generalization and costly kernel evaluation. In recent years several studies \citep[e.g.,][]{aitchison2021deep, ober2021variational, yang2021theory} extended this idea to introduce more flexibility to the kernel and the ability to learn representations. Yet, to date, these methods suffer from reduced performance compared to standard NNs on large datasets, and they are not well adjusted to different data modalities and complex architectures. Lastly, there exists some evidence \citep{brosse2020last, kristiadi2020being} that being Bayesian, even only on the last layer, can provide the desirable benefits of the Bayesian paradigm.

\begin{figure*}[!t]
    \centering
    \includegraphics[width=0.90\textwidth]{figures/high_dim_data.png}
    \caption{Model performance on $[50, 100, 200, 400, 800]$ training data points (x-axis in log-scale). We report the log-likelihood (top) and RMSE/accuracy (bottom) on the test set for all datasets. A higher log-likelihood and accuracy, and a lower RMSE are better. All the results are based on ten random seeds. On Buzz and CTSlice we do not report the standard deviation of the log-likelihood for the DKL model, as it was very large and impaired the readability of the figure.}
    \label{fig:high_dim_data}
\end{figure*}

\textbf{Learning representations with GPs.} 
An alternative approach for learning in function spaces is with Gaussian processes.
%Gaussian processes are a great tool to discover rich structures in datasets and to incorporate prior knowledge to the model \citep{gp_book}.
However, Gaussian processes cannot learn a new representation of the data \citep{gordon16_DKL}. Effectively this limits them to data modalities on which standard kernels can capture similarity well. Common solutions for this problem are deep GPs \citep{damianou2013deep, salimbeni2017doubly}, and deep kernel learning \citep{calandra2016manifold, gordon16_DKL}. In this study, we build on the latter approach.
%In principle, DKLs should provide desirable properties, such as the ability and flexibility to utilize the power of any NN to learn representation while enjoying the benefits of a pure Bayesian model. 
Unfortunately, it was found that DKLs can overfit in a particular way \citep{ober2021promises}. The DKL objective will tend to correlate all data points instead of only those that convey information about each other. One way to mitigate this phenomenon is to use a fully Bayesian approach \citep{ober2021promises}. Yet, this direction inherits the challenges of working with BNNs which one would like to avoid when using DKLs. 
Simultaneously, several studies suggested methods to tackle this limitation of DKLs.  
\citet{van2021feature} proposed to apply spectral normalization to the NN parameters in architectures with residual connections, following \citet{liu2020simple}. However, this method is limited to networks with residual connections only and may depend heavily on the estimation quality of the spectral norm. Also, in our experiments we often found this method to be equivalent to standard DKLs. \citet{liu2021deep} proposed to use stochastic NNs to learn the representations of examples. The first method, termed DLVKL, uses an encoder network, similar to that used in VAEs \citep{KingmaW13}. The second method, termed DLVKL-NSDE, uses stochastic differential equation flows. To use flow-based models, the feature and input spaces must have the same dimensionality, which makes this method impractical for high-dimensional data.
Lastly, \citet{mallick2021deep} proposed to map data points to probability distributions using probabilistic NNs and fit a GP in that space. This method builds on particle-based optimization \citep{liu2016stein} and as such it operates on several NNs simultaneously, which may be challenging for even moderate-sized NNs. We would like to note that we expect our method to benefit from similar techniques \citep[e.g.,][]{lakshminarayanan2017simple}. We leave this direction to future research.


\section{Experiments}
We evaluated GDKL on a number of benchmark regression and classification datasets, ranging from small to medium size, and low to high data dimensionality. Unless stated otherwise, in all experiments we report the mean performance (e.g., log-likelihood) along with one standard deviation over random seeds, which may include randomness in the data and the parameters. We stress that we used the same initialization for all compared methods. Note that in our evaluation we aim at showing that GDKL can obtain a strong mean prediction with good uncertainty estimation, while other methods usually fall short in at least one aspect. Indeed, as we will show, when factoring in both (important) aspects GDKL is the best in most cases.
Full implementation details are given in Appendix \ref{app:exp_details}, a comparison to baseline methods in terms of computational complexity is presented in Appendix \ref{app_sec:comp_cons}, and further experiments are presented in Appendix \ref{app_sec:add_exp}.\footnote{Our code is publicly available at \textcolor{magenta}{\url{https://github.com/IdanAchituve/GDKL}}.}

\begin{table*}[!h]
\centering
\caption{Test results on full CIFAR-10 and CIFAR-100 based on three random seeds.}
%\vskip 0.1in
\scalebox{.73}{
    \begin{tabular}{l c cccccccc cccccccc}
    \toprule
    && \multicolumn{7}{c}{CIFAR-10} && \multicolumn{7}{c}{CIFAR-100}\\
    \cmidrule(l){3-9}  \cmidrule(l){11-17}
    && ACC ($\uparrow$) && LL ($\uparrow$) && ECE ($\downarrow$) && MCE ($\downarrow$) && ACC ($\uparrow$) && LL ($\uparrow$) && ECE ($\downarrow$) && MCE ($\downarrow$)\\
    \midrule
    DKL && 95.45 $\pm$ 0.06 && -0.19 $\pm$ 0.00 && 0.03 $\pm$ 0.00 && 0.30 $\pm$ 0.01 && 77.90 $\pm$ 0.45 && -0.94 $\pm$ 0.00 && 0.08 $\pm$ 0.00 && 0.24 $\pm$ 0.02 \\
    DLVKL && \textbf{95.65 $\pm$ 0.12} && -0.18 $\pm$ 0.00 && 0.03 $\pm$ 0.00 && 0.42 $\pm$ 0.20 && 77.42 $\pm$ 0.05 && -0.95 $\pm$ 0.02 && 0.09 $\pm$ 0.00 && 0.26 $\pm$ 0.02\\
    DUE && 95.48 $\pm$ 0.09 && -0.19 $\pm$ 0.00 && 0.03 $\pm$ 0.00 && 0.32 $\pm$ 0.04 && 76.39 $\pm$ 0.23 && -0.98 $\pm$ 0.03 && 0.10 $\pm$ 0.01 && \textbf{0.16 $\pm$ 0.02}\\
    \midrule
    GDKL (Ours) && \textbf{95.67 $\pm$ 0.06} && \textbf{-0.17 $\pm$ 0.00} && \textbf{0.01 $\pm$ 0.00} && \textbf{0.27 $\pm$ 0.01} && \textbf{78.36 $\pm$ 0.19} && \textbf{-0.89 $\pm$ 0.01} && \textbf{0.06 $\pm$ 0.00} && 0.18 $\pm$ 0.01\\
    \bottomrule
    \end{tabular}
}
\label{tab:full_datasets}
\end{table*}


\subsection{Small-Sized Datasets} \label{sec:small_sized_data}
To showcase our claim that GDKL can learn from small datasets while being robust to overfitting, we first evaluated it on three small-sized UCI benchmark datasets: Boston, Energy, and Concrete \citep{asuncion2007uci}. We compared the exact GP variant of our method to (1) \textit{DKL} - standard DKL training \citep{calandra2016manifold, gordon16_DKL}; (2) \textit{NNGP} - a GP with an NNGP kernel \citep{LeeBNSPS18, MatthewsHRTG18}; and (3) \textit{GP-RBF} - a standard GP with an RBF kernel (without DKL). The last simple baseline is considered a strong approach on these datasets \citep{salimbeni2017doubly}, as the semantic similarity is well captured by the RBF kernel. All neural network models (NNGP, DKL, and GDKL) use a fully-connected network with three hidden layers. DKL and GDKL use the same width for each layer. We follow the training protocol suggested by \citet{ober2021promises} with several modifications, which are described in Appendix \ref{app:exp_details}. As customary on these datasets \citep[e.g.,][]{salimbeni2017doubly}, we use $k$-fold cross validation with $90\%$ randomly selected data as training and the remaining $10\%$ as a held-out test set. Here we used $k=10$. We scale the inputs and outputs of each partition of the data to have zero mean and unit standard deviation based on the training part only (the output scaling is restored in evaluation). \Figref{fig:uci_small} shows the log-likelihood (LL) and RMSE of the compared methods.

From the figure, we make several observations. First, standard DKL training indeed produces a model that has the characteristics of a NN and not a Bayesian model, i.e., it has good RMSE values at train and test, but it does not reliably estimate its uncertainty, as seen from its inferior test log-likelihood values. Second, the RMSE performance of NNGP on the Energy dataset is considerably worse than other baselines, confirming our claim that NNGP can have poor predictive performance.  %leads to severe overfitting as indicated by the large gap in LL between the training and test sets. Nevertheless, this model does generalize well in terms of the mean prediction as reflected by the RMSE (which corresponds to standard NN prediction). 
%Second, the NNGP is able to fit the data well as indicated by the LL; however, it usually under-performs other baselines in terms of the mean prediction, which is inline with the literature (e.g., \cite{NovakXBLYHAPS19}).
Finally, GDKL is always comparable to the best of the two on all datasets on both metrics and is substantially less prone to overfitting. %Interestingly, GDKL is less prone to overfitting even compared to the NNGP model.


%comparable or outperforms all methods across all datasets on the two metrics. Namely, it is able to maintain (or improve) the generalization abilities of the DKL mean function, while providing a better fit to the data. Interestingly, as can be observed in the gaps between training and test performance, GDKL is less prone to overfitting even compared to the NNGP model. Also, GDKL is able to bypass the hurdle of full-batch training without resorting to stochastic mini-batching as a form of implicit regularization, as suggested in \citep{ober2021promises}.



\subsection{High-Dimensional Datasets} \label{sec:high_dim_data}
We expect to achieve the most benefit from DKL in settings with high-dimensional data, where standard kernels do not perform as well. In these cases, the NN should be encouraged to find a low-dimensional representation over which a GP will work well. To test this scenario we considered the two regression datasets Buzz and CTSlice from the UCI repository, and the classification dataset CIFAR-10 \citep{krizhevsky2009learning}. Here, we used subsets of the training data with a varied number of training examples, from $50$ to $800$, and recorded the log-likelihood and the RMSE/accuracy of the model on the test set in each experiment. For Buzz and CTSlice we allocated $10\%$ of the data for testing, and for CIFAR-10 we used the default test split.
On the regression datasets, we compared the exact GP variant of our method to the same baselines as in \Secref{sec:small_sized_data}. On CIFAR-10, we use the Dirichlet-based likelihood function suggested in \citep{milios2018dirichlet} for inference. On this dataset, we didn't compare to the GP-RBF baseline as it works poorly on images. However, we did compare to two additional baselines: (1) \textit{DLVKL} \citep{liu2021deep} which learns a stochastic encoder network, reminiscent of VAEs \citep{KingmaW13}, to promote a regularized representation of the data; and (2) \textit{DUE} \citep{van2021feature} which applies spectral normalization on the weights with architectures that contain residual connections. Note that unlike GDKL these baselines require some modification to the NN. Here, we used a variant of the wide residual network (WRN) \citep{ZagoruykoK16} as a feature extractor in DKL models. As for the NNGP baseline (and for modeling $p$ in GDKL), we used a variant of this network without the average pooling layer as it imposes a large computational burden. The results on the three datasets are shown in \Figref{fig:high_dim_data}.

From the figures, we observe again that the DKL model overfits strongly, and in some cases, even its mean prediction is substantially worse than that of the baseline methods. In addition, the NNGP works well on the regression datasets, but less so on real images. Finally, here as well, across all training set sizes, GDKL achieves the best, or comparable, results in both the log-likelihood and RMSE/accuracy. In Appendix \ref{app_sec:reliability_diagrams} we also quantify the uncertainty of the models through calibration on the CIFAR-10 dataset. We compare all methods both visually using reliability diagrams and with common metrics \citep{brier1950verification, guo2017calibration} on all dataset sizes. The figures show that GDKL is best calibrated across all metrics in all cases when $n \geq 200$, and on smaller dataset sizes it is second only to the NNGP model.


\subsection{Medium-Sized, High-Dimensional Datasets}
\label{sec:large_dataset}
Having established that GDKL works well in low-data regimes, we now evaluate its performance on larger datasets in which exact inference is more challenging. We do so on the full CIFAR-10 and CIFAR-100 datasets. We compare GDKL to the standard DKL baseline, and to DLVKL and DUE, which were presented in \Secref{sec:high_dim_data}. In general, we followed the protocol suggested by \citet{van2021feature} for training on the CIFAR-10 dataset with only $10$ inducing points. For CIFAR-100 we used a similar protocol with $200$ inducing points. Exact experimental details are given in Appendix \ref{app:exp_details}. Here, as well, we used a variant of the WRN for both the DKL models and the NNGP model used by GDKL. \tblref{tab:full_datasets} shows the test accuracy, log-likelihood, expected calibration error (ECE), and maximum calibration error (MCE) for both datasets. The ECE measures a weighted average distance between the classifier's confidence and accuracy, and the MCE measures the maximum instead of the average. From the table, GDKL outperforms all baselines in almost all cases. Note how GDKL is able to maintain and even surpass the accuracy of DKL while providing a classifier that is better calibrated.
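The two calibration metrics can be written in the standard binned form of \citet{guo2017calibration}. The number of bins $M$ and the bin notation $B_b$ are assumptions of this sketch; the binning used in the experiments is not specified in this section. Partition the $n$ test predictions into $M$ bins by confidence, and let $\operatorname{acc}(B_b)$ and $\operatorname{conf}(B_b)$ denote the mean accuracy and mean confidence in bin $B_b$. Then
\begin{equation*}
    \begin{aligned}
    \mathrm{ECE} &= \sum_{b=1}^{M}\frac{|B_b|}{n}\left|\operatorname{acc}(B_b) - \operatorname{conf}(B_b)\right|, \\
    \mathrm{MCE} &= \max_{b \in \{1,\ldots,M\}}\left|\operatorname{acc}(B_b) - \operatorname{conf}(B_b)\right|.
    \end{aligned}
\end{equation*}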

\section{Conclusions}
In this study, we put forward a novel method for learning deep kernels. Our goal is to train deep kernels that keep the benefits of Bayesian models without sacrificing performance. To this end, we define a new training procedure that uses an infinite-width NN to guide the DKL optimization, effectively setting adaptive levels of confidence in our predictions. This objective utilizes the reliable uncertainty estimation of NNGPs to allow our model to be as confident as possible without being over-confident. Finally, we also proposed an extension of our model to incorporate inducing points. We evaluated GDKL on small to mid-sized datasets having low and high data dimensionality. We found that our method consistently generalized well to novel data points without sacrificing its Bayesian properties, i.e., it does not overfit. As a possible future research direction, it would be interesting to combine our framework with Bayesian models other than infinite-width NNs.

\begin{acknowledgements} 
This study was funded by a grant to GC from the Israel Science Foundation (ISF 737/2018), and by an equipment grant to GC and Bar-Ilan University from the Israel Science Foundation (ISF 2332/18). IA is supported by a PhD fellowship from Bar-Ilan data science institute (BIU DSI). 
\end{acknowledgements}

\bibliography{achituve_219}

\end{document}
