%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{mathtools}
\usepackage{mathrsfs}
\usepackage{xcolor}
%\hypersetup{draft}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Amortized Inference for Gaussian Process Hyperparameters of Structured Kernels}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author{Matthias Bitzer}
\author{Mona Meister}
\author{Christoph Zimmer}

% Add affiliations after the authors
\affil{%
Bosch Center for Artificial Intelligence\\ Renningen, Germany
}

  
  \begin{document}
\maketitle

\begin{abstract}
Learning the kernel parameters for Gaussian processes is often the computational bottleneck in applications such as online learning, Bayesian optimization, or active learning. Amortizing parameter inference over different datasets is a promising approach to dramatically speed up training. However, existing methods restrict the amortized inference procedure to a fixed kernel structure. The amortization network must be redesigned manually and trained again whenever a different kernel is employed, which leads to a large overhead in design time and training time. We propose amortizing kernel parameter inference over a complete \textit{kernel-structure-family} rather than a fixed kernel structure. We achieve this by defining an amortization network over pairs of datasets and kernel structures. This enables fast kernel inference for each element in the kernel family without retraining the amortization network. As a by-product, our amortization network is able to do fast ensembling over kernel structures. In our experiments, we show drastically reduced inference time combined with competitive test performance for a large set of kernels and datasets.
\end{abstract}
\section{Introduction}
Gaussian processes (GPs) are an important class of models that can be used in a wide range of tasks such as Bayesian optimization \citep{PracticalBO,ChemBO}, active learning \citep{algpsshapecontrol,SafeALTimeSeries,SafeALMOGP,pmlr-v206-bitzer23a}, or regression \citep{CKS}. Introducing an inductive bias for GPs is achieved by specifying the kernel structure. For example, smoothness, nonstationarity or periodicity can be induced very elegantly by configuring the corresponding kernel. 

Learning the kernel parameters is often a major computational bottleneck and is usually done via marginal likelihood maximization, also called \textit{Type-2-ML}, or via maximization of the evidence lower bound (ELBO) in sparse GPs \citep{svgp}. These methods often require hundreds of optimization steps to learn the kernel parameters. In \cite{AHGP}, this problem is circumvented by using amortized inference \citep{amortized_infer1,amortized_infer2} to predict the kernel parameters via a neural network in one step. This leads to a dramatic reduction in inference time for medium-sized datasets.

However, the method of \cite{AHGP} defines the amortization only for a fixed kernel structure. Importantly, specifying the kernel structure is a crucial design choice for GPs and is often used to encode prior knowledge about the task at hand, such as smoothness, nonstationarity, linearity or periodicity.
If a different kernel is to be used, the network in \cite{AHGP} would need to be redesigned and retrained, which is time-consuming and costly considering the vast space of possible kernel structures. We therefore propose amortizing the kernel inference for GPs over the combined space of kernel structures and datasets.

We define an amortization neural network that gets as input a complete dataset and a symbolic description of the kernel, based on the kernel grammar \citep{CKS}, and outputs the learned kernel parameters. We design the neural network explicitly to cope with the natural invariances of the underlying spaces. Here, we make use of the transformer architecture \citep{AttentionIsAllYouNeed} and its equivariance properties \citep{SetTransformer}. We empirically show that our method leads to a drastic decrease in inference time, while delivering competitive predictive results on real-world datasets. Additionally, we illustrate the generality of our method by defining a fast ensembling over kernel structures that explicitly leverages our architecture. In short, our contributions are:
 \begin{enumerate}
 	\item We construct an amortization neural network that is defined on the combined space of kernel structures and datasets. We explicitly incorporate invariances and equivariances of the underlying spaces in the architecture.
 	\item We empirically demonstrate the effectiveness of the amortization over several simulated and real-world datasets and kernel structures.
 	\item We show the generality of our approach by enabling a fast ensembling over kernel structures.
 \end{enumerate}
We provide accompanying code at \url{https://github.com/boschresearch/Amor-Struct-GP}.

\section{Background}
In the following section, we give the necessary background on Gaussian processes, with a focus on the hyperparameter optimization involved. We start by introducing the standard technique for hyperparameter inference in GPs and then consider amortized inference over multiple datasets, as proposed in \cite{AHGP}. Finally, we consider a broad kernel space over which we will define our proposed \textit{combined} amortization scheme.
\paragraph{Gaussian processes.}
Let $\mathcal{X}\subset \mathbb{R}^{d}$ be the input space for some $d\in\mathbb{N}$.
A Gaussian process defines a distribution over mappings $f:\mathcal{X}\to \mathbb{R}$ and is fully specified via a positive-definite kernel function $k: \mathcal{X}\times \mathcal{X}\to \mathbb{R}$ and a mean function $m:\mathcal{X}\to \mathbb{R}$. It is characterized by the property that, for any $n\in\mathbb{N}$ and any finite selection of input points $\mathbf{X}=\{x_{1},\dots,x_{n}\} \subset \mathcal{X}$, the collection of function evaluations $(f(x_{1}),\dots,f(x_{n}))^{\intercal}$ is multivariate Gaussian with mean $m(\mathbf{X}):=(m(x_{1}),\dots,m(x_{n}))^{\intercal}$ and covariance matrix $k(\mathbf{X},\mathbf{X}):=[k(x_{i},x_{j})]_{i,j=1,\dots,n}$. We write $f\sim \mathcal{GP}(m,k)$ to denote that the function $f$ is drawn from a Gaussian process.

Let $\mathcal{D}=\{(x_{i},y_{i})\in\mathbb{R}^{d+1}, i=1,\dots,n\}$ be a dataset for which we want to do regression. The typical modeling assumption for Gaussian process regression presumes a latent function $f\sim \mathcal{GP}(m,k_{\theta})$ with a Gaussian likelihood, thus, $y_{i}=f(x_{i})+\epsilon_{i}$ with $\epsilon_{i} \sim \mathcal{N}(0,\sigma^{2})$. The kernel is parameterized by $\theta\in \Theta \subset \mathbb{R}^{p}$ and the complete parameter vector of the GP, including the likelihood variance $\sigma^{2}$, is given by $\phi=(\theta,\sigma^{2}) \in \Phi \subset \mathbb{R}^{p+1}$. An important property of this model is that the marginal likelihood, marginalized over the latent function $f$, can be computed analytically as
\begin{align}
\label{marg_lik}
p(\mathbf{y}|\mathbf{X},\theta,\sigma^{2})&=\int p(\mathbf{y}|f,\mathbf{X},\sigma^{2})p(f|\mathbf{X},\theta)df \\&= \mathcal{N}(\mathbf{y};m(\mathbf{X}),k_{\theta}(\mathbf{X},\mathbf{X})+\sigma^{2}\mathbf{I}). \nonumber
\end{align}
Inference of the inner parameters, which in this case is the infinite dimensional function $f$, can be done analytically. For the outer hyperparameters $\theta$ and $\sigma^2$ the classical way of training is maximizing the marginal-likelihood (\ref{marg_lik}), also called \textit{type-2 maximum likelihood}. Thus, we want to solve the following optimization problem
\begin{align}
\label{optproblem}
(\theta_{*},\sigma^{2}_{*})=\arg\max_{(\theta,\sigma^{2})\in\Phi }\log p(\mathbf{y}|\mathbf{X},\theta,\sigma^{2})
\end{align}
for a given dataset $\mathcal{D}$. The optimization problem is usually solved via gradient-based optimizers like Adam or L-BFGS. Each step of the optimizer requires an evaluation of the marginal likelihood, which scales cubically in $n$. Furthermore, several hundred optimization steps might be necessary to reach convergence and, depending on the kernel and dataset, multiple restarts are necessary, as the optimization problem is non-convex and the optimizer might end up in a local maximum. In the next section, we will consider an alternative approach to solving (\ref{optproblem}) that only requires one forward pass through an amortization network.
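
As a reminder of where the cubic cost originates, the objective in (\ref{optproblem}) has the standard closed form below, where $\mathbf{K}_{\phi}:=k_{\theta}(\mathbf{X},\mathbf{X})+\sigma^{2}\mathbf{I}$ and $\mathbf{r}:=\mathbf{y}-m(\mathbf{X})$ are shorthands introduced only for this remark:
\begin{align*}
\log p(\mathbf{y}|\mathbf{X},\theta,\sigma^{2})
&=-\tfrac{1}{2}\,\mathbf{r}^{\intercal}\mathbf{K}_{\phi}^{-1}\mathbf{r}
-\tfrac{1}{2}\log\det\mathbf{K}_{\phi}\\
&\quad-\tfrac{n}{2}\log(2\pi).
\end{align*}
Both the linear solve $\mathbf{K}_{\phi}^{-1}\mathbf{r}$ and the log-determinant are typically obtained from a Cholesky factorization of the $n\times n$ matrix $\mathbf{K}_{\phi}$, which costs $\mathcal{O}(n^{3})$ and has to be repeated in every optimizer step.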

\paragraph{Parameter amortization.}
\citet{AHGP} presented an alternative method for learning the GP hyperparameters based on amortizing the inference over multiple datasets. In this method, parameter inference for GPs reduces to a prediction via an amortization neural network. The amortization network $g_{\psi}: \mathscr{D} \to \Theta$ with weights $\psi$ is defined on the set of all datasets $\mathscr{D}$, meaning that for any $n\in \mathbb{N}$ and $d \in \mathbb{N}$ the dataset $\mathcal{D}=\{(x_{i},y_{i})\in\mathbb{R}^{d+1},i=1,\dots,n\}$ is part of the input set of the network, thus $\mathcal{D}\in \mathscr{D}$. The output space is the parameter space $\Theta$ of the respective kernel for which amortized inference should be done. In the case of \citet{AHGP}, the Spectral Mixture Product (SMP) kernel is used, which has parameters $\theta_{j}=\{\{w_{m,j}\}_{m=1}^{M},\{\mu_{m,j}\}_{m=1}^{M},\{\sigma^{2}_{m,j}\}_{m=1}^{M}\}$ in the $j$-th dimension for some fixed $M$. The amortization network is built from consecutive transformer blocks such that it can handle different input sizes and input dimensions of the respective dataset. %Furthermore, the network is designed to be invariant to a permutation of the dataset elements and equivariant to the shuffling of the dataset dimensions. 
The network is trained on a dataset of (synthetic) datasets $\{\mathcal{D}^{(l)}\}_{l=1}^{L}\subset \mathscr{D}$ by minimizing the negative log marginal likelihood, normalized by dataset size and averaged over the datasets. After training, the network is used for one-shot prediction of the kernel parameters $\hat{\theta}=g_{\psi}(\mathcal{D}^{*})$ on an unseen dataset $\mathcal{D}^{*}$.
%\begin{align*}
%-\frac{1}{L}\sum_{l=1}^{L}\frac{1}{|\mathcal{D}_{l}|}\mathrm{log}~ p\bigg(\mathbf{y}^{(l)}\bigg|\mathbf{X}^{(l)},\theta^{(l)}=g_{\psi}(\mathcal{D}^{(l)})\bigg)
%\end{align*}
%where $\theta^{(l)}=\{\theta_{j}^{(l)}\}_{j=1}^{d_{l}}$ and $d_{l}\in\mathbb{N}$ is the input-dimensionality of $\mathcal{D}^{(l)}$. 
In \cite{AHGP}, the kernel structure is fixed to that of the SMP kernel. To use the approach with a different kernel, the network needs to be redesigned and retrained. For example, \citet{rehn2022amortized} changed the architecture to cope with the RBF kernel. Our goal is to do amortized inference over the \textit{combined space of datasets and kernel structures}, which drastically reduces redesign and retraining time and enables fast inference for many existing kernel structures via only one neural network.

Our amortization network consists of a dataset encoder, which is inspired by the architecture of \citet{AHGP}, and a novel kernel encoder-decoder block that enables amortization over the combined input of kernel structure and dataset. Both blocks are designed to capture the natural invariances of the underlying structure.
\paragraph{Kernel space.}
\label{kernel_space}
Our goal is to define an amortization procedure over a family of kernels. To be more precise, we consider a family of \textit{structural forms of kernels}. The basis for this kernel space is the kernel grammar presented in \citet{CKS}. Here, each kernel structure is expressed as a symbolic expression $\mathcal{S}$ made of base symbols $\mathcal{B}$. The base symbols might include simple elementary kernels like the Squared-Exponential kernel represented as the symbol $\textrm{SE}$, the linear kernel as $\textrm{LIN}$ or the periodic kernel as $\textrm{PER}$. More complex expressions can be formed with multiplication and addition of base kernels/symbols. For example, one might construct a more complex structural form of a kernel via the expression $\mathrm{SE}\times \mathrm{LIN} + \mathrm{PER}$. %This expression is well defined as kernels are closed under addition and multiplication. 
Each expression $\mathcal{S}$ describes a structural form of a kernel, that is, the mathematical equation that governs the associated kernel $k$. The base kernels, and therefore the combined expressions, come with parameters $\theta$, and thus each expression has its own associated parameter space $\Theta_{\mathcal{S}}$. %Concretly, when refering to a kernel with expression $\mathcal{S}$ we actually mean the whole kernel family over its parameters $\{k_{\theta}|\theta \in \Theta_{\mathcal{S}}\}$.
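
As a small illustration, on a one-dimensional input the expression $\mathrm{SE}\times \mathrm{LIN} + \mathrm{PER}$ corresponds to the kernel
\begin{align*}
k_{\theta}(x,x')=k_{\mathrm{SE}}(x,x')\,k_{\mathrm{LIN}}(x,x')+k_{\mathrm{PER}}(x,x'),
\end{align*}
with $\theta$ collecting the parameters of the three base kernels. The concrete form of the base kernels follows the kernel grammar. One common parameterization of two of them, stated here only as an assumption for illustration, is
\begin{align*}
k_{\mathrm{SE}}(x,x')&=\sigma_{\mathrm{SE}}^{2}\exp\!\Big(-\frac{(x-x')^{2}}{2\ell_{\mathrm{SE}}^{2}}\Big),\\
k_{\mathrm{PER}}(x,x')&=\sigma_{\mathrm{PER}}^{2}\exp\!\Big(-\frac{2\sin^{2}(\pi|x-x'|/p)}{\ell_{\mathrm{PER}}^{2}}\Big),
\end{align*}
where the variances $\sigma_{\mathrm{SE}}^{2},\sigma_{\mathrm{PER}}^{2}$, the lengthscales $\ell_{\mathrm{SE}},\ell_{\mathrm{PER}}$ and the period $p$ are hypothetical parameter names used only in this example.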

The kernel grammar in \cite{CKS} considers all possible algebraic expressions of the base kernels. We consider a subset of the kernel grammar that still yields a rich kernel space while being easy to represent in a neural network.

First, we define a set of base symbols $\mathcal{B}$, which consists of elementary kernels like $\mathrm{SE}, \mathrm{LIN}$ and $\mathrm{PER}$ and their two-gram multiplications like $\mathrm{SE}\times\mathrm{LIN}$, $\mathrm{SE}\times\mathrm{PER}$ and $\mathrm{LIN}\times\mathrm{PER}$. This symbol set is similar to the one used in \cite{KernelIdentWithTrafo}. All base symbols are defined on single dimensions, and we denote the concrete dimension via an index, e.g. $\mathrm{SE}_{i}$ for the Squared-Exponential kernel on dimension $i$, and collect the indexed base symbols of dimension $i$ in the set $\mathcal{B}^{(i)}$. Our kernel space is then defined as an addition of base symbols within the dimension and a multiplication over dimensions:
$\mathcal{S}=\prod_{i=1}^{d}\sum_{j=1}^{N_{i}} \mathcal{S}_{i,j}$ with $\mathcal{S}_{i,j}\in \mathcal{B}^{(i)}$.
For example, the following kernel would be part of the complete kernel space:
\begin{align}
\label{example_expression}
\overbrace{
	\underbrace{(\underbrace{\mathrm{SE}_{1}\times \mathrm{LIN}_{1}}_{\textcolor{violet}{\text{symbol of }\mathcal{B}}} +~ \mathrm{SE}_{1})}_{\textcolor{blue}{\text{Addition within dimension}}} \times ~(\mathrm{SE}_{2} + \mathrm{PER}_{2})}^{\textcolor{red}{\text{Multiplication over dimensions}}}.
\end{align} 
We denote the complete kernel space with $\mathcal{K}$. This kernel space contains popular kernels like the $\textrm{ARD-RBF}$ kernel with $\prod_{i=1}^{d}\textrm{SE}_{i}$ or the $d$-dimensional periodic kernel with $\prod_{i=1}^{d}\mathrm{PER}_{i}$. Additionally, kernels that act differently on different dimensions are included in the kernel space. %In 1D the kernel space is the same as used in () where base symbols are added to explain more variance of the data.

Depending on the kernel expression, the parameter space $\Theta_{\mathcal{S}}$ can vary significantly in dimensionality. For example, the \textrm{ARD-RBF} kernel $\mathcal{S}=\prod_{i=1}^{d}\textrm{SE}_{i}$ in $d$ dimensions contains one lengthscale and one variance\footnote{We use the parameterization of the base kernels from the kernel grammar \citep{CKS}. Here each base kernel in each dimension has its own variance.} parameter per dimension, such that $\Theta_{\mathcal{S}} \subset \mathbb{R}^{2d}$. The $d$-dimensional periodic kernel $\mathcal{S}=\prod_{i=1}^{d}\mathrm{PER}_{i}$ additionally contains one period parameter per dimension, such that $\Theta_{\mathcal{S}} \subset \mathbb{R}^{3d}$. Thus, being able to deal with different sizes of parameter spaces will be important for our proposed amortization scheme.
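
As a worked count under this parameterization (two parameters per $\mathrm{SE}$ symbol and three per $\mathrm{PER}$ symbol), consider the two-dimensional expression $\mathcal{S}=(\mathrm{SE}_{1}+\mathrm{PER}_{1})\times \mathrm{PER}_{2}$, which is chosen purely for illustration:
\begin{align*}
\Theta_{\mathcal{S}}\subset\mathbb{R}^{2+3+3}=\mathbb{R}^{8},
\end{align*}
where the three summands in the exponent belong to $\mathrm{SE}_{1}$, $\mathrm{PER}_{1}$ and $\mathrm{PER}_{2}$, respectively. Including the likelihood variance $\sigma^{2}$, the full GP parameter vector $\phi$ then has nine entries, compared to five for the two-dimensional \textrm{ARD-RBF} kernel.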
\section{Method}
We propose amortizing the kernel inference for GPs over the combined space of datasets and kernel structures. This enables fast inference for many kernels, as well as fast ensembling. For this, we construct an amortization network $\mathscr{D}\times\mathcal{K} \ni (\mathcal{D},\mathcal{S}) \mapsto g_{\psi}(\mathcal{D},\mathcal{S}) \in \Theta_{\mathcal{S}}\times \mathbb{R}_{+}$ that maps from the combined space of datasets $\mathscr{D}$ and kernel structures $\mathcal{K}$ to the parameter space $\Theta_{\mathcal{S}}\times \mathbb{R}_{+}$ of the GP with the respective kernel. The trained network is then used to one-shot predict GP parameters $(\hat{\theta}_{\mathcal{S}},\hat{\sigma}^{2})=g_{\psi}(\mathcal{D}^{*},\mathcal{S})$ of the specified kernel structure $\mathcal{S}$ for an unseen dataset $\mathcal{D}^{*}$. We denote with $\psi$ the (trainable) parameters of the amortization network. In the following subsections, we describe the architecture of the network and the learning procedure of $g_{\psi}$.
\subsection{Architecture}

\begin{figure*}[t]
	\centering
	\includegraphics[width=0.99\linewidth]{architecture_overview_2}
	\caption{a) Illustration of the full amortization network, which gets as input the dataset $\mathcal{D}$ and a kernel expression $\mathcal{S}$, decomposed into a sequence of sequences of base symbols, and outputs the kernel hyperparameters $\theta$ in the respective parameter space $\Theta_{\mathcal{S}}$. b) The main layer used in the Kernel-Encoder-Decoder. It gets as input a sequence of vectors and a context vector and outputs a transformed sequence of vectors. The context vector enters the MLP layer.}
	\label{fig:architectureoverview}
\end{figure*}

The model gets as input a dataset $\mathcal{D}=\{(x_{i},y_{i}) \in \mathbb{R}^{d+1}| i=1,\dots,n\}$ where $n \in \mathbb{N}$ is the number of datapoints and $d \in \mathbb{N}$ is the number of input dimensions. Additionally, the model receives the kernel expression $\mathcal{S}$ as input. As $\mathcal{S}$ is a multiplication over dimension-wise sub-expressions $\mathcal{S}_{i}$, we can represent the expression $\mathcal{S}$ as a sequence of its sub-expressions $[\mathcal{S}_{1},\dots,\mathcal{S}_{d}]$. Similarly, we can decompose the expressions in each dimension into a sequence of base symbols. Thus, we represent/store the expression as a sequence of sequences of base symbols $\bigg[[B_{1}^{(i)},\dots,B^{(i)}_{N_{i}}]|i=1,\dots,d;B_{j}^{(i)}\in \mathcal{B}\bigg]$. We encode each base symbol via one-hot-encoding such that the sub-expression in each dimension $\mathcal{S}_{i}$ is represented via a sequence of vectors $\mathcal{V}_{i}=[v_{1}^{(i)},\dots,v_{N_{i}}^{(i)}]$ with $v_{j}^{(i)}\in \mathbb{R}^{|\mathcal{B}|}$. The whole expression is then represented via $\mathcal{V}_{\mathcal{S}}=[\mathcal{V}_{1},\dots,\mathcal{V}_{d}]$. 
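
For illustration, assume that $\mathcal{B}$ consists of exactly the six symbols named in the kernel-space paragraph, in the arbitrary order $(\mathrm{SE},\mathrm{LIN},\mathrm{PER},\mathrm{SE}\times\mathrm{LIN},\mathrm{SE}\times\mathrm{PER},\mathrm{LIN}\times\mathrm{PER})$, and let $e_{k}\in\mathbb{R}^{6}$ denote the $k$-th standard basis vector. Both the ordering and the symbol $e_{k}$ are assumptions made only for this example. The expression in (\ref{example_expression}) is then stored as $\big[[\mathrm{SE}\times\mathrm{LIN},\mathrm{SE}],[\mathrm{SE},\mathrm{PER}]\big]$ and encoded as
\begin{align*}
\mathcal{V}_{1}=[e_{4},e_{1}],\qquad \mathcal{V}_{2}=[e_{1},e_{3}],
\end{align*}
so that $N_{1}=N_{2}=2$ and $\mathcal{V}_{\mathcal{S}}=[\mathcal{V}_{1},\mathcal{V}_{2}]$.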

Our architecture consists of three main parts. First, the dataset encoder $g_{D}$ takes as input the dataset $\mathcal{D}$ and returns a sequence of dimension-wise embeddings $\mathbf{h}_{\mathcal{D}}=[\mathbf{h}_{1},\dots,\mathbf{h}_{d}]$. Second, the kernel encoder-decoder $g_{k}(\mathbf{h}_{\mathcal{D}},\mathcal{V}_{\mathcal{S}})$ gets as input the sequence of sequences of encoded base symbols $[\mathcal{V}_{1},\dots,\mathcal{V}_{d}]$ and the dataset embeddings $[\mathbf{h}_{1},\dots,\mathbf{h}_{d}]$ and outputs a transformed sequence of sequences of kernel embeddings, which we again denote by $[\mathcal{V}_{1},\dots,\mathcal{V}_{d}]$. Third, an output layer maps the kernel embeddings to the respective parameter space of the base kernels.

 We design the different parts of the architecture to cope with several symmetries. Similar to \cite{AHGP}, our network is permutation invariant to the shuffling of the dataset elements. Furthermore, the dataset encoder and the kernel encoder-decoder are equivariant to the permutation of input dimensions. Incorporating these symmetries enables generalization to datasets with sizes and input dimensions that were not present in the training phase. Lastly, the final output is invariant to a shuffling of the base symbols in each dimension, which is important as the sequences describe additions. The prediction of the network is thus not dependent on the order in which the additions are represented in $\mathcal{S}$.
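
In symbols, and with notation introduced only for this remark, let $\pi$ be a permutation of the $n$ dataset elements and $\rho$ a permutation of the $d$ input dimensions. We write $\pi\cdot\mathcal{D}$ for the dataset with reordered elements and $\rho\cdot\mathcal{D}$ for the dataset whose $i$-th input dimension is the $\rho(i)$-th dimension of $\mathcal{D}$. With $[\mathbf{h}_{1},\dots,\mathbf{h}_{d}]=g_{D}(\mathcal{D})$, the first two properties then read
\begin{align*}
g_{D}(\pi\cdot\mathcal{D})&=[\mathbf{h}_{1},\dots,\mathbf{h}_{d}],\\
g_{D}(\rho\cdot\mathcal{D})&=[\mathbf{h}_{\rho(1)},\dots,\mathbf{h}_{\rho(d)}].
\end{align*}
Likewise, a reordering of the summands within one dimension is meant to leave the parameters assigned to each base symbol, and hence the resulting kernel, unchanged.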
 
We present the single parts of the architecture in the following paragraphs and an overview in Figure \ref{fig:architectureoverview} a).
\paragraph{Dataset-Encoder.}
The dataset encoder takes as input the dataset $\mathcal{D}$ and returns a sequence of dimension embeddings  $[\mathbf{h}_{1},\dots,\mathbf{h}_{d}]$. We utilize the encoder part of the Transformer architecture without positional encoding \citep{AttentionIsAllYouNeed,SetTransformer} in multiple parts of our architecture and refer to it as a $\textbf{Transformer}$ block. 
Each block maps a sequence of vectors to a sequence of transformed vectors $[a_{1},\dots,a_{m}]\leftarrow \textbf{Transformer}([a_{1},\dots,a_{m}])$ using multiple multi-head-self-attention layers \citep{AttentionIsAllYouNeed}. We consecutively apply $\textbf{Transformer}$ blocks to different hidden embeddings to construct a sequence of dimension embeddings  $[\mathbf{h}_{1},\dots,\mathbf{h}_{d}]$ with $\mathbf{h}_{i}\in\mathbb{R}^{2h}$. The encoder involves the following steps:
\begin{enumerate}
	\item The dataset is divided into dimension-wise sequences $[(x_{j}^{(i)},y_{j})]_{j=1}^{n}$ where $x_{j}^{(i)}$ is the $i$-th dimension of point $x_{j}$.
	\item Each sequence $[(x_{j}^{(i)},y_{j})]_{j=1}^{n}$ is mapped element-wise via a linear layer to construct a sequence of embeddings per dimension $[h_{1}^{(i)},\dots,h_{n}^{(i)}]$ with $h_{j}^{(i)}\in\mathbb{R}^{h}$.
	\item Each sequence $[h_{j}^{(i)}]_{j=1}^{n}, i=1,\dots,d$ is given to a $\textbf{Transformer}$ block (shared over the $d$ sequences) that outputs a transformed sequence $[h_{j}^{(i)}]_{j=1}^{n}$.
	\item So far, each datapoint was only able to attend to other datapoints inside its dimension. In the next step, we create per-datapoint embeddings via mean aggregation $h_{j}=\textbf{MeanAGG}([h_{j}^{(1)},\dots,h_{j}^{(d)}])$ leading to a sequence of datapoint embeddings $[h_{j}]_{j=1}^{n}$ with $h_{j}\in\mathbb{R}^{h}$.
	\item The datapoint embeddings are put into a \textbf{Transformer} block to form a transformed sequence of datapoint embeddings $[h_{1},\dots,h_{n}]$.
	\item In order to construct embeddings per dimension again, we append the datapoint embedding to the sequences of embeddings of step 3, thus $h_{j}^{(i)}\leftarrow\mathbf{Concat}(h_{j},h_{j}^{(i)})$ which results in sequences $[h_{1}^{(i)},\dots,h_{n}^{(i)}]$ with $h_{j}^{(i)}\in\mathbb{R}^{2h}$. 
	\item Each updated sequence $[h_{1}^{(i)},\dots,h_{n}^{(i)}], i=1,\dots,d$ is again given to a (shared) $\textbf{Transformer}$ block that outputs a transformed sequence $[h_{1}^{(i)},\dots,h_{n}^{(i)}]$.
	\item  In order to get dimension embeddings, we aggregate the sequence via mean aggregation to $\mathbf{h}_{i}=\textbf{MeanAGG}([h_{1}^{(i)},\dots,h_{n}^{(i)}])$ leading to a sequence of dimension embeddings $[\mathbf{h}_{i}]_{i=1}^{d}$ with $\mathbf{h}_{i}\in \mathbb{R}^{2h}$.
	\item The sequence of dimension embeddings $[\mathbf{h}_{i}]_{i=1}^{d}$ is again put through a $\textbf{Transformer}$ block to get a sequence of dimension embeddings $[\mathbf{h}_{i}]_{i=1}^{d}$ that contains shared information across dimensions.
\end{enumerate}
The encoder is very similar to the one in \citet{AHGP}. The only difference lies in steps 3 to 6. We incorporate these steps to prevent permutation invariance to shuffling within the separated sequences $[(x_{j}^{(i)},y_{j})]_{j=1}^{n}$, which can lead to pathologies (see Appendix C). Our encoder is still invariant to a shuffling of the dataset elements and equivariant to a shuffling of the input dimensions. We give rigorous proofs in Appendix C.
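
To make the bookkeeping of the steps above explicit, the shapes of the intermediate embeddings can be traced as follows. The tensor-style shape notation is used only for this summary, and batch dimensions are omitted:
\begin{align*}
\mathbb{R}^{d\times n\times 2}
&\xrightarrow{\text{steps 2, 3}}\mathbb{R}^{d\times n\times h}
\xrightarrow{\text{steps 4, 5}}\mathbb{R}^{n\times h}\\
&\xrightarrow{\text{steps 6, 7}}\mathbb{R}^{d\times n\times 2h}
\xrightarrow{\text{steps 8, 9}}\mathbb{R}^{d\times 2h}.
\end{align*}
The first entry corresponds to the $d$ dimension-wise sequences of pairs $(x_{j}^{(i)},y_{j})$ from step 1, and the last entry to the dimension embeddings $[\mathbf{h}_{1},\dots,\mathbf{h}_{d}]$. In this description, neither $n$ nor $d$ enters the shape of a weight matrix, which is consistent with the same network processing datasets of varying size and dimensionality.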

\paragraph{Kernel-Encoder-Decoder.} The kernel encoder-decoder block is meant to translate the structure of the kernel given through $\mathcal{V}_{\mathcal{S}}=[\mathcal{V}_{1},\dots,\mathcal{V}_{d}]$ into transformed embeddings $\mathcal{V}_{\mathcal{S}}=[\mathcal{V}_{1},\dots,\mathcal{V}_{d}]$ that incorporate the information of the dataset and the global information about the kernel structure. These embeddings can be used to predict kernel parameters of the base symbols that are associated with each embedding element $v_{j}^{(i)}$. We call the block encoder-decoder, as the global structure of the expression $\mathcal{S}$ needs to be encoded and then, using the information of the global structure and the dataset, each embedding of base symbols $v_{j}^{(i)}$ needs to be decoded into a vector that contains information about the kernel parameters of the base symbol/kernel.

The main building block is the \textbf{Kernel-Encoder-Block} as shown in Figure \ref{fig:architectureoverview} b). This block maps a context vector $c\in\mathbb{R}^{l}$ and a sequence of vectors $[v_{1},\dots,v_{M}]$ with $v_{j}\in\mathbb{R}^{h}$ to a transformed sequence of vectors $[v_{1},\dots,v_{M}]$ with $v_{j}\in\mathbb{R}^{h}$. It first applies self-attention to the sequence, followed by a concatenation of the context vector to the input of the element-wise multi-layer-perceptron (MLP) layer. Given a context vector $c$, this layer is permutation-equivariant to a shuffling of the sequence.
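
Schematically, and omitting details such as residual connections or normalization layers that are not specified in this description, the block can be written as
\begin{align*}
[u_{1},\dots,u_{M}]&=\textbf{SelfAttn}([v_{1},\dots,v_{M}]),\\
v_{j}&\leftarrow\textbf{MLP}(\mathbf{Concat}(u_{j},c)),\quad j=1,\dots,M,
\end{align*}
where the intermediate vectors $u_{j}$ and the name $\textbf{SelfAttn}$ are notation introduced only for this sketch. Self-attention without positional encoding is permutation-equivariant, and the $\textbf{MLP}$ acts on each element separately with the same context $c$, which gives the stated equivariance.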

The \textbf{Kernel Encoder-Decoder} consists of the following steps:
\begin{enumerate}
	\item Each sequence of base kernel embeddings per dimension $[v_{1}^{(i)},\dots,v_{N_{i}}^{(i)}]$ is given to a (shared) stack of \textbf{Kernel-Encoder-Block} layers with the dataset embedding of the respective dimension $\mathbf{h}_{i}$ as context vector. The output is a transformed sequence $[v_{1}^{(i)},\dots,v_{N_{i}}^{(i)}]$. 
	\item So far, only the information of the base kernels along one dimension is shared. Thus, we form dimension-wise kernel embeddings via mean aggregation, $\mathbf{v}_{i}=\textbf{MeanAGG}([v_{1}^{(i)},\dots,v_{N_{i}}^{(i)}])$, leading to the sequence $[\mathbf{v}_{1},\dots,\mathbf{v}_{d}]$ with $\mathbf{v}_{i}\in\mathbb{R}^{h}$. Each element in the sequence is an embedding of the kernel inside dimension $i$.
	\item To form a shared (global) representation of the kernel, we pass the sequence of dimension-wise kernel embeddings through a \textbf{Transformer} block and obtain a transformed sequence of dimension-wise kernel embeddings $[\mathbf{v}_{1},\dots,\mathbf{v}_{d}]$.
	\item Finally, we apply again a (shared) stack of \textbf{Kernel-Encoder-Block} layers to the base kernel embeddings per dimension $[v_{1}^{(i)},\dots,v_{N_{i}}^{(i)}]$ with extended context vector $c_{i}=\mathbf{Concat}(\mathbf{h}_{i},\mathbf{v}_{i})$ such that the shared kernel representation as well as the dataset encoding are part of the context. This gives us the final sequence of kernel embeddings per dimension  $[v_{1}^{(i)},\dots,v_{N_{i}}^{(i)}]$.
\end{enumerate}
%\begin{figure*}[t]
%	\centering
%	\includegraphics[width=0.80\linewidth]{toy_data2_smaller}
%	\caption{Illustration of the predictive distributions. Each plot contains a ground truth function (red-line) drawn from a GP with kernel $\mathcal{S}$ and hyperparameter $\phi^{*}_{\mathcal{S}}$. Noisy datapoints from the ground truth function are shown in green. We show the resulting predictive distribution with predicted GP parameters $\phi_{\mathcal{S}}=g_{\psi}(\mathcal{D},\mathcal{S})$.}
%	\label{fig:toydata}
%\end{figure*}
\begin{figure*}[t]
	\centering
	\includegraphics[width=0.99\linewidth]{toy_data_and_simulated_3}
	\caption{In a) each column contains a ground truth function (red line) drawn from a GP with kernel $\mathcal{S}$ and hyperparameter $\phi^{*}_{\mathcal{S}}$. Noisy datapoints from the ground truth function are shown in green. The upper row shows the resulting predictive distribution with predicted GP parameters $\hat{\phi}_{\mathcal{S}}=g_{\psi}(\mathcal{D},\mathcal{S})$ and the lower row shows the predictive distribution with ground truth hyperparameter $\phi^{*}_{\mathcal{S}}$. In b) we show boxplots of the RMSE and NLL scores measured on 200 unseen, simulated dataset-kernel pairs $(\tilde{\mathcal{D}}_{l},\mathcal{S}_{l})$ for our method and for a GP with Type-2-ML inference. The datasets are sampled from the same distribution as used for training.}
	\label{fig:toydata}
\end{figure*}
\paragraph{Output-Layer.}
The final part of the architecture is the prediction head for the kernel parameters. For a given expression $\mathcal{S}$, this layer has as output space the corresponding parameter space $\Theta_{\mathcal{S}}$. It gets as input the kernel embeddings per dimension  $[v_{1}^{(i)},\dots,v_{N_{i}}^{(i)}]$ of the \textbf{Kernel Encoder-Decoder}. Each embedding $v_{j}^{(i)}$ is associated with one base symbol $B_{j}^{(i)}$ and each base symbol has its own, fixed parameter space $\Theta_{B_{j}^{(i)}}$, for example $\Theta_{\mathrm{SE}}\subset\mathbb{R}^{2}$. We therefore realize the final mapping to $\Theta_{\mathcal{S}}$ by mapping each symbol-related embedding $v_{j}^{(i)}\in\mathbb{R}^{h}$ to the respective parameter space of the base symbol. We do this via separate \textbf{MLP} blocks for each base symbol (more details in Appendix C).

In order to get an end-to-end amortization network, we also need a prediction for the likelihood variance. The variance depends on the kernel choice and the dataset. Thus, we form global embeddings of the kernel, via mean aggregation of all kernel embeddings per dimension  $[v_{1}^{(i)},\dots,v_{N_{i}}^{(i)}]$, and of the dataset via mean aggregation of the dimension embeddings  $[\mathbf{h}_{1},\dots,\mathbf{h}_{d}]$. We concatenate both global embeddings and use an \textbf{MLP} block to predict the noise variance.

In summary, our network predicts the kernel parameters and noise variance for a given dataset $\mathcal{D}$ and expression $\mathcal{S}$:
\begin{align}
(\hat{\theta}_{\mathcal{S}},\hat{\sigma}^{2})=g_{\psi}(\mathcal{D},\mathcal{S}).
\end{align}
It accounts for the natural invariances/equivariances of the respective spaces, which we elaborate on in Appendix C.

\paragraph{Computational complexity of one forward pass.} One prediction of the kernel parameters via one forward pass scales with $\mathcal{O}(n^{2}+d^{2}+l^{2})$ where $n$ is the number of datapoints and $d$ the number of dimensions in the input dataset $\mathcal{D}$ and $l=\max(N_{1},\dots,N_{d})$ is the maximum number of symbols of the kernel sub-expressions $\mathcal{S}_{i}$ in the dimensions $i=1,\dots,d$. This follows directly from the quadratic complexity (in the sequence length) of the multi-head-self-attention layer. This complexity could be reduced via the usage of sparse attention layers \citep{sparseTransformer}.
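
For comparison, let $T$ and $R$ denote the number of optimizer steps and restarts, respectively (symbols introduced only for this remark). Type-2-ML inference as described in the background section then costs
\begin{align*}
\mathcal{O}(R\,T\,n^{3})\quad\text{versus}\quad\mathcal{O}(n^{2}+d^{2}+l^{2})
\end{align*}
for a single forward pass of the amortization network. Both expressions suppress constants such as the hidden dimension $h$, so they should be read as a rough scaling argument rather than a runtime prediction.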
\subsection{Training Procedure}
Our objective is to train a general-purpose prediction network $g(\mathcal{D},\mathcal{S})$ that can act on, in principle, all (medium-sized) datasets and all expressions $\mathcal{S}$ in the described kernel space. To meet this challenge with enough data, we train our network purely on simulated datasets. We reflect the variety of inputs by sampling pairs from a broad distribution $(\mathcal{D}_{l},\mathcal{S}_{l}) \sim p(\mathcal{D},\mathcal{S})$. Given a set of sampled dataset-kernel pairs $\{(\mathcal{D}_{l},\mathcal{S}_{l})\}_{l=1}^{L}$, we utilize the mean negative log marginal likelihood, normalized per dataset by the number of datapoints,
\begin{align}
\label{main_loss} 
&\mathcal{L}(\psi,\{(\mathcal{D}_{l},\mathcal{S}_{l})\}_{l=1}^{L})\\&=-\frac{1}{L}\sum_{l=1}^{L} \frac{1}{|\mathcal{D}_{l}|} \mathrm{log}~p\bigg(\mathbf{y}_{l}\bigg|\mathbf{X}_{l},(\theta_{l},\sigma_{l}^{2})=g_{\psi}(\mathcal{D}_{l},\mathcal{S}_{l})\bigg)\nonumber
\end{align}
as loss-function. This reflects our goal to train a network that resembles the marginal-likelihood optimization of the kernel hyperparameters for a given kernel structure and a given dataset.
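
For reference, each summand of (\ref{main_loss}) has the standard closed form of a zero-mean GP with Gaussian likelihood. The shorthand $\mathbf{K}_{l}$ and $n_{l}$ below is introduced only for this display and is not part of the notation used elsewhere.
\begin{align*}
% shorthand (assumption for this display only): K_l is the noisy kernel matrix, n_l = |D_l|
\mathbf{K}_{l} &= k_{\mathcal{S}_{l},\theta_{l}}(\mathbf{X}_{l},\mathbf{X}_{l})+\sigma_{l}^{2}\mathbf{I}, \qquad n_{l}=|\mathcal{D}_{l}|,\\
\log p(\mathbf{y}_{l}|\mathbf{X}_{l},\theta_{l},\sigma_{l}^{2}) &= -\frac{1}{2}\mathbf{y}_{l}^{\top}\mathbf{K}_{l}^{-1}\mathbf{y}_{l}-\frac{1}{2}\log\det\mathbf{K}_{l}\\
&\quad-\frac{n_{l}}{2}\log(2\pi).
\end{align*}
Assuming the base kernels are differentiable in their parameters, every term is differentiable in $(\theta_{l},\sigma_{l}^{2})$, so the loss can be backpropagated through the predictions of $g_{\psi}$ to the network weights $\psi$.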

\paragraph{Sampling distribution.} We sample $(\mathcal{D},\mathcal{S})$ using the following scheme (we give a sketch here, details on the utilized distributions/priors can be found in Appendix A). First, we draw the number of input-dimensions $d$ and datapoints $n$. Given $d$ we draw a kernel expression $\mathcal{S}$, where we draw the subexpressions $\mathcal{S}_{i}$ independently of each other. Each subexpression can have a different number of base symbols. Each base-symbol/base-kernel comes with a prior on its hyperparameters. In order to generate a dataset $\mathcal{D}$ that stems from the induced prior in function space of $\mathcal{S}$, we sample from the hyperparameter prior $\theta\sim p_{\mathcal{S}}(\theta)$ with $\theta \in \Theta_{\mathcal{S}}$. We use broad Gamma priors for the kernel parameters. Next, we draw the input set $\mathbf{X}=\{x_{1},\dots,x_{n}\}$ uniformly from $[0,1]^{d}$. Finally, we draw the observations from the GP via $\mathbf{y} \sim \mathcal{N}(\mathbf{0},k_{\mathcal{S},\theta}(\mathbf{X},\mathbf{X})+\sigma^{2}\mathbf{I})$, where $\sigma^{2}\sim p(\sigma^{2})$.

 When constructing the pair $(\mathcal{D},\mathcal{S})$, we distinguish two modes. The first mode is that $\mathcal{D}$ is sampled from the induced prior of  $\mathcal{S}$ - we refer to this mode as the \textit{positive} sample. For the second mode, we sample $\mathcal{D}$ using a different expression $\tilde{\mathcal{S}}$ - we refer to this mode as the \textit{negative} sample. The reason for these two modes is that we cannot assume that only datasets from the induced prior of $\mathcal{S}$ will be used as input to the prediction network. There will always be a misspecification of the kernel. %This is considered with the \textit{negative} samples.%, where $\mathcal{D}$ is not a sample from the function space distribution that is induced by $\mathcal{S}$. We sample each mode with probability $0.5$.
%\begin{figure*}[t]
%	\centering
%	\includegraphics[width=1.0\linewidth]{../main_results_and_training2}
%	\caption{Training curves and main results. In a) the training loss is shown for the first phase of the training procedure. In b) the average difference to the prediction of a type-2 ML GP on unseen simulated dataset-kernel pairs is shown over the training procedure. In c) our main evaluation is illustred. The left plot shows the RMSE scores of each method for held-out test datapoints for each dataset. For each dataset several kernels $\{\mathcal{S}^{(1)},\dots,\mathcal{S}^{(m)}\}$ are evaluated and the error-bars shows the $0.2$ and $0.8$ percentiles over the RMSE scores of the different kernels. On the right, the ratios of inference times to our method is shown in log-scale. The error-bars are again the percentiles of the inference times for the different kernels. }
%	\label{fig:mainresults}
%\end{figure*}

\paragraph{Training parameters.} During training, we sample each batch $\{(\mathcal{D}_{l},\mathcal{S}_{l})\}_{l=1}^{L}$ of size $L$ on-the-fly from the sampling distribution $p(\mathcal{D},\mathcal{S})$. This enables processing a huge corpus of dataset-kernel pairs. We employ $\textrm{RAdam}$ \citep{RADAM} as optimizer with a constant learning rate.
\begin{figure*}[t]
	\centering
	\includegraphics[width=0.99\linewidth]{final_results2}
	\caption{In a) the RMSE scores of each method are shown for held-out test datapoints. For each dataset several kernels $\{\mathcal{S}^{(1)},\dots,\mathcal{S}^{(m)}\}$ are evaluated, where each bar shows the median RMSE value and the error-bars show the 20th and 80th percentiles of the RMSE scores of the different kernels. In b) the corresponding ratios of inference times to our method are shown in log-scale.}
	\label{fig:mainresultssingle}
\end{figure*}
\paragraph{Noise variance fine-tuning.}
The noise level is a crucial property of a dataset and determines %the values of almost all GP parameters and the 
the predictive performance significantly. %Thus, predicting the noise variance corretly is very important. 
We therefore do a dedicated fine-tuning phase after the initial training phase of minimizing the negative marginal-likelihood $\mathcal{L}(\psi,\{(\mathcal{D}_{l},\mathcal{S}_{l})\}_{l=1}^{L})$ in (\ref{main_loss}). We do the fine-tuning via minimizing the extended loss
\begin{align}
%\label{main_loss}
\alpha\mathcal{L}(\psi,\{(\mathcal{D}_{l},\mathcal{S}_{l})\}_{l=1}^{L})+\beta \frac{1}{L}\Vert \sigma^{*}_{1:L} - \hat{\sigma}_{1:L} \Vert_{2}^{2},
\end{align}
where we additionally regularize the prediction of the noise-variances $\hat{\sigma}_{1:L}\in\mathbb{R}^{L}_{+}$ to be close to the known ground-truth noise-variances $\sigma^{*}_{1:L}\in\mathbb{R}^{L}_{+}$.  Importantly, we only draw \textit{positive} samples $(\mathcal{D}_{l},\mathcal{S}_{l})$. We call this step a fine-tuning step, as we only do it on significantly fewer datasets than in the first phase. We observe a major increase in robustness of the noise-prediction on real-world datasets. In Appendix B we show the impact of the fine-tuning.
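
Written out, the regularizer in the extended loss is the mean squared error between the known and the predicted noise values over the batch:
\begin{align*}
% element-wise expansion of the squared Euclidean norm over the L batch entries
\frac{1}{L}\Vert \sigma^{*}_{1:L} - \hat{\sigma}_{1:L} \Vert_{2}^{2} = \frac{1}{L}\sum_{l=1}^{L}\big(\sigma^{*}_{l}-\hat{\sigma}_{l}\big)^{2}.
\end{align*}
The coefficients $\alpha$ and $\beta$ weight the marginal-likelihood term against this regression term. The regression term requires the ground-truth value $\sigma^{*}_{l}$ for each pair, which is available here because all fine-tuning datasets are simulated.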

\section{Related Work}
\paragraph{Amortized inference with fixed kernel structure.}
Our method extends the work of \cite{AHGP} to enable amortization over the combined space of datasets and kernel structures. Compared to \cite{AHGP} our network is not restricted to a single kernel, meaning that practitioners can insert any kernel structure $\mathcal{S}$ from our space and can utilize the amortization network out of the box for parameter inference via $\hat{\phi}_{\mathcal{S}}=g_{\psi}(\mathcal{D},\mathcal{S})$. %This means, main advantage then is    does not need to be redesigned and retrained to be used with a different kernel. 
From a software perspective, we can view the work of \cite{AHGP} as an emulation of the inference functionality of a typical GP framework for a fixed kernel through a large neural network. Our method enlarges this emulation further via rendering the kernel configurable directly in the neural network. 

\paragraph{Amortized model selection.} In \citet{KernelIdentWithTrafo} an amortized structure selection is proposed. Here, the kernel structure $\mathcal{S}$ itself is predicted as a sequence of tokens via an amortization network $g(\mathcal{D})$. After selecting $\mathcal{S}$, the hyperparameters of the kernel need to be trained via maximum likelihood. Our method complements theirs: once the kernel spaces are made identical, one might use our method to predict the kernel parameters of the selected kernel structure. This would amortize the full pipeline of kernel selection and hyperparameter optimization.

\paragraph{Kernel grammar.}
Our input space is based on the kernel grammar in \cite{CKS}. The kernel grammar is part of a greater research line called the \textit{Automatic Statistician} \citep{CKS,AutomaticStatistician,bitzer2022structural}, which tries to infer interpretable GP models and dataset descriptions from data in an automatic way. Our work can be used to enhance the GP parameter inference for each GP representation in the search procedure.

\paragraph{Hypernetworks.}
Our method can be seen as a hypernetwork \citep{hypernetworks} for Gaussian process models. Usually, a hypernetwork predicts neural network weights from some sort of input. The input can be a hyperparameter \citep{hypernetworkHyperparamInp} or a latent representation of a layer \citep{hypernetworks}. Notably, \cite{hypernetworkUnknownArch} proposed a hypernetwork to predict the neural network weights for a fixed dataset with a description of the architecture as input, which can be seen as a related task compared to ours in the neural network world. We note that our method predicts Gaussian process hyperparameters rather than neural network weights and amortizes over the combined space of datasets and kernel descriptions.

\paragraph{Prior-Fitted-Networks.}
In \cite{PFN} a method called \textit{Prior-Fitted-Networks (PFN)} is proposed. Here a transformer is used to form an end-to-end prediction from a dataset $\mathcal{D}$ and a test point $x^{*}$ to a predictive distribution $p(y^{*}|x^{*},\mathcal{D})$ of a given prior. The difference to our approach is two-fold. Firstly, we predict a full GP via inferring its parameters. A PFN predicts only slices of the predictive distribution. Secondly, we render the prior configurable via making the kernel configurable. In this way, practitioners can include prior knowledge of the task at hand.
\begin{table*}[t]
\begin{center}
\caption[RMSE values on real-world datasets (vs AHGP-SE-ARD)]{Average RMSE values over 20 train/test splits for each dataset using our method (GP-Amortized) equipped with an RBF kernel and using AHGP-SE-ARD. Marked values (*) are significantly smaller measured via a two-sample t-test ($\alpha=0.025$).}	
\label{tab:ahgp}
\begin{tabular}{lrrrrrrr}\toprule \textbf{RMSE} &    Energy &  Concrete &  Airfoil &  Airline &  PowerPlant &     Yacht &      Wine \\\midrule 
	GP-Amortized &  0.0830 &  0.3635 &  *0.4484 &  0.2866 &    0.2467 &  0.0606 &  ~0.9287 \\
	AHGP-SE-ARD &  0.0800 &  0.3755 &  ~0.5965 &  0.2722 &    0.2459 &  0.0602 &  *0.7962 \\\bottomrule
	\end{tabular}
\end{center}	
	
\end{table*}
\section{Experiments}
In the following section, we empirically analyze our amortization scheme on regression benchmarks and compare against common methods to do GP inference. First, we illustrate the prediction capabilities on toy datasets. Subsequently, we analyze the learning behavior and the performance on real-world datasets. In the last subsection, we analyze ensembling as a possible plug-and-play extension of our method.

\paragraph{Experimental setting.}
We train our amortization network on a stream of mini-batches of size $L=128$ for a total of 9 million dataset-kernel pairs $(\mathcal{D}_{l},\mathcal{S}_{l})$ in the initial training phase. %This phase is executed on a Tesla A100 GPU and lasts approximately 4 days. 
 We continue the training with the noise-variance fine-tuning phase, which is performed over 200{,}000 dataset-kernel pairs. In both phases, we utilize $\mathrm{SE}$, $\mathrm{LIN}$ and $\mathrm{PER}$ and their two-gram products, e.g.\ $\mathrm{SE}\times\mathrm{LIN}$, as base-symbols and simulate datasets of size $n\in[10,250]$ and input dimension $d\in[1,8]$. %Furthermore, we sample the number of addends in each subexpression $\mathcal{S}_{i}$ from a geometric distribution with $p=0.6$. 
 Further training details can be found in Appendix A. %For the training procedure we use $\mathrm{RAdam}$ as optimizer with a learning rate of $\mathrm{lr}=\mathrm{2\times 10^{-5}}$. Further training details can be found in the Appendix.

\paragraph{Performance on simulated data.}
We evaluate the final amortization network $g_{\psi}$ via its inference capabilities on unseen datasets $\tilde{\mathcal{D}}$. Each unseen dataset $\tilde{\mathcal{D}}$ is split into a training dataset $\tilde{\mathcal{D}}_{train}$ and a test dataset $\tilde{\mathcal{D}}_{test}$, and we evaluate for a given kernel expression $\mathcal{S}$ the predictive performance of the final GP with predicted hyperparameters $\hat{\phi}_{\mathcal{S}}=g_{\psi}(\tilde{\mathcal{D}}_{train},\mathcal{S})$ on $\tilde{\mathcal{D}}_{test}$. We give example predictive distributions on simulated datasets in Figure \ref{fig:toydata} a). %Here, the ground-truth function is drawn from the same kernel that is given as input and we compare against the predictive distribution given via the ground-truth hyperparameters. 
We observe that our amortization network leads to accurate predictive distributions - notably only via evaluating a neural network to predict the kernel parameters. We add several more prediction plots in Appendix B, including plots with misspecified kernels, small datasets and more complex kernels.

Furthermore, we show test-RMSE and test-NLL scores of 200 unseen, simulated dataset-kernel pairs $(\tilde{\mathcal{D}}_{l},\mathcal{S}_{l})$ in Figure \ref{fig:toydata} b) for our method and for a GP with Type-2-ML hyperparameter inference. The datasets are sampled from the same distribution as used for the initial training phase. We observe that our method leads to very similar RMSE and NLL scores compared to Type-2-ML inference. This illustrates the quality of the predicted hyperparameters. In Appendix B, we analyze the predictive performance of both approaches for different numbers of training datapoints.
\paragraph{Regression benchmarks.}
Our main evaluation considers seven real-world datasets. We split each dataset into a training and test set (we set $\mathrm{n_{train}}=500$ for all datasets except the smaller datasets $\mathrm{Airline}$ and $\mathrm{Yacht}$, details in Appendix A) and evaluate the predictive performance over a set of $m=24$ kernels $\{\mathcal{S}^{(1)},\dots,\mathcal{S}^{(m)}\}$ that are drawn randomly. %where we apply the same subexpression $\mathcal{S}_{i}$ over all dimensions e.g. $(\mathrm{SE}_{1}+\mathrm{LIN}_{1})\times (\mathrm{SE}_{2}+\mathrm{LIN}_{2})$ applies $\mathcal{S}_{i}=\mathrm{SE}_{i}+\mathrm{LIN}_{i}$ to each dimension. 
We compare against the standard way of GP hyperparameter inference via Type-2-ML. We consider three versions of Type-2-ML: two with a single run from initial parameters, optimized via Adam [GP-ML] and via L-BFGS [GP-ML (L-BFGS)], and one with 10 randomized restarts optimized with Adam [GP-ML (multi-start)]. For GP-ML and GP-ML (multi-start), we maximize the marginal likelihood via \textrm{Adam} with $\mathrm{lr}=0.1$ for 150 iterations and stop early once the loss has converged. Furthermore, we compare against Sparse-Variational GP (SVGP) \citep{svgp}, where we use $I=0.5n$ inducing points. For more details see Appendix A.

 In Figure \ref{fig:mainresultssingle} a), we show the resulting test-RMSE scores of the respective methods. Here, the black bars correspond to percentiles of the RMSE scores over the set of different kernels $\{\mathcal{S}^{(i)}\}_{i=1}^{m}$. We observe that our method leads to predictive performance comparable to Type-2-ML. Importantly, we observe in Figure \ref{fig:mainresultssingle} b) that our proposed method leads to a \textit{major decrease in inference time for a diverse set of kernels $\{\mathcal{S}^{(i)}\}_{i=1}^{m}$}. For certain kernel structures, our method is 800 times faster than Type-2-ML with restarts. We show NLL scores in Appendix A.  %At the same time it reaches a comparable predictive performance to  ML-Type 2.
\begin{figure*}[t]
	\centering
	\includegraphics[width=0.99\linewidth]{ensemble_plots}
	\caption{Fast ensembling. a) Predictive distribution resulting from a fast ensemble of five kernel structures (in blue). In red, we show the predictive means of the single GPs. b) RMSE and inference time ratios for the fast ensembling based on our amortization model (error bars of standard amortization are over the same 24 kernels that are in the ensemble). c) Ratios of inference time between batch and non-batch ensembling shown over different dataset sizes.}
	\label{fig:ensembleplots}
\end{figure*}
\paragraph{Comparison to AHGP.} We further compare against AHGP on a fixed kernel. We note that for any new kernel structure the method of \cite{AHGP} needs to be redesigned and retrained, rendering it less flexible than our method. We investigate the performance differences on the ARD-RBF kernel, which is part of our considered kernel space. Here, we use the adaptation of the AHGP architecture to the RBF kernel presented in \cite{rehn2022amortized} (AHGP-SE-ARD). We trained AHGP-SE-ARD on the same data distribution as our method for 9 million datasets and configured the architecture to have approximately the same capacity (see Appendix A for details). For evaluation/inference on a dataset $\tilde{\mathcal{D}}$ we equip our method with the ARD-RBF kernel as input, thus $\hat{\phi}=g_{\psi}(\tilde{\mathcal{D}},\prod_{i=1}^{d}\textrm{SE}_{i})$. We show mean test-RMSE scores on the real-world datasets for both methods in Table \ref{tab:ahgp}. On five out of seven datasets the performance of both methods is very similar, indicating that amortization over kernel structures does not induce significant performance drops compared to using a fixed kernel. On \textit{Airfoil} our method is significantly better, and on \textit{Wine} the AHGP-SE-ARD method is significantly better. We think that this difference might eventually vanish with more datasets in the training phases.

\paragraph{Fast ensembling.} Our method offers a general inference machine for GP hyperparameters over a structured kernel space. This can be utilized to construct ensembles in a computationally efficient way. We construct a Bayesian-Model-Average (BMA) over a set of kernel structures $\{\mathcal{S}^{(1)},\dots,\mathcal{S}^{(m)}\}$, where we use the predicted marginal likelihood values as ensemble weights (see Appendix A). In Figure \ref{fig:ensembleplots} a) we show an example of the fast ensembling over five kernel structures. Importantly, we observe a high diversity in the predictive mean functions, which is a desirable property for an ensemble. In Figure \ref{fig:ensembleplots} b) we show the predictive performance of the fast ensemble over the set of 24 kernels and compare it to the range of predictive scores of the non-ensemble predictions (we show the three datasets that had the biggest diversity of RMSE scores over kernels). Furthermore, we compare against an ensemble with Type-2-ML inferred GP parameters. Firstly, we observe that ensembling results in the expected performance gain over standard predictions, and secondly, we see that our method offers a drastic speed-up compared to the conventional method. 
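
As a sketch of such an ensemble, and assuming a uniform prior over the $m$ structures (the exact weighting is specified in Appendix A and may differ), the BMA predictive distribution at a test point $x^{*}$ can be written as follows. The weights $w_{j}$ and the per-structure predictive means $\mu_{j}$ are notation introduced only for this display.
\begin{align*}
% predicted hyperparameters for structure j from one forward pass of the amortization network
(\hat{\theta}_{j},\hat{\sigma}_{j}^{2}) &= g_{\psi}(\mathcal{D},\mathcal{S}^{(j)}),\\
% assumption: uniform prior over structures, so weights are normalized marginal likelihoods
w_{j} &= \frac{p(\mathbf{y}|\mathbf{X},\hat{\theta}_{j},\hat{\sigma}_{j}^{2})}{\sum_{k=1}^{m}p(\mathbf{y}|\mathbf{X},\hat{\theta}_{k},\hat{\sigma}_{k}^{2})},\\
p(y^{*}|x^{*},\mathcal{D}) &\approx \sum_{j=1}^{m} w_{j}\,p(y^{*}|x^{*},\mathcal{D},\mathcal{S}^{(j)},\hat{\theta}_{j},\hat{\sigma}_{j}^{2}).
\end{align*}
Each component is a Gaussian GP predictive distribution, so under this form the ensemble is a Gaussian mixture with mean $\sum_{j=1}^{m}w_{j}\mu_{j}(x^{*})$, where $\mu_{j}$ denotes the predictive mean of the $j$-th GP. In practice the weights would typically be computed from log marginal likelihoods for numerical stability.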

Importantly, our architecture is particularly well suited to processing this kind of ensemble over kernel structures, as it only needs to run the dataset-encoder once and can process multiple kernels in a batch through the kernel-encoder-decoder. This leads to an additional speed-up for inferring ensembles that can be seen in Figure \ref{fig:ensembleplots} c).% where we show the ratio of inference times of batch vs. naive non-batch inference over different dataset sizes. %We observe an increase of efficiency gain for bigger datasets, which results from the dataset encoder consuming a bigger fraction of computation for bigger datasets. We give more experimental details in the Appendix.

\paragraph{Limitations.}
We note that predicting kernel parameters instead of optimizing them also has limitations. In particular, our method tends to favor conservative predictive distributions with broader prediction intervals out-of-data (see Appendix B). While this is often beneficial, it might not be desired, for example for extrapolation tasks with the periodic kernel where our method favors explaining the data via the lengthscale of the periodic kernel rather than the periodicity (see qualitative analysis in Appendix B). %We provide a detailed discussion of potential limitations in Appendix.

\section{Conclusion}
In this work, we proposed an amortization scheme for the hyperparameters of Gaussian process models. The main novelty is to amortize over the combined space of datasets and kernel structures. Our proposed amortization network is explicitly designed to cope with the respective symmetries of this task. In our experiments, we show a drastic speed-up in inference time for diverse kernel structures. At the same time, we show that our method can predict kernel parameters that lead to competitive predictions on real-world data.

\bibliography{bitzer_142}
\end{document}
