% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}



% \doublespacing
% Set page size and margins
% Replace `letterpaper' with `a4paper' for UK/EU standard size
% \usepackage[letterpaper,top=2cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm]{geometry}

% Useful packages
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage{float} %% to make figures in SM be correctly placed.
% \usepackage[colorlinks=true, allcolors=black]{hyperref}

\usepackage{algorithm}
\usepackage{algpseudocode}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
% \usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{subcaption}
% \RequirePackage[hyphens]{url}
%% as per the requirement new theorem styles can be included as shown below
% \theoremstyle{thmstyleone}%
\newtheorem{theorem}{Theorem}
%  meant for continuous numbers
\newtheorem*{theorem*}{Theorem}
%%\newtheorem{theorem}{Theorem}[section]% meant for sectionwise numbers
%% optional argument [theorem] produces theorem numbering sequence instead of independent numbers for Proposition
%% \newtheorem{proposition}[theorem]{Proposition}% 
\newtheorem{corollary}{Corollary}
\newtheorem{proposition}{Proposition}% to get separate numbers for theorem and proposition etc.

\newcommand{\JL}[1]{\textcolor{red}{\textsf{#1}}}
\newcommand{\AB}[1]{\textcolor{orange}{\textsf{#1}}}

%\theoremstyle{thmstyletwo}%
\newtheorem{example}{Example}%
\newtheorem{remark}{Remark}%
\newtheorem{lemma}{Lemma}%

% \usepackage{natbib}

% following submission instructions
\usepackage[round]{natbib}
\renewcommand{\bibname}{References}
\renewcommand{\bibsection}{\subsubsection*{\bibname}}
\bibliographystyle{abbrvnat}


\allowdisplaybreaks

% to prevent urls to run over in the references:
\usepackage{url}
\def\UrlBreaks{\do\/\do-} 

\renewcommand\harvardurl[1]{\textbf{URL:} \url{#1}}
\usepackage{hyperref}       % hyperlinks
\hypersetup{
    colorlinks,
    linkcolor={blue!50!black},
    citecolor={blue!50!black},
    urlcolor={blue!50!black}
}

% \theoremstyle{thmstylethree}%
\newtheorem{definition}{Definition}%


\begin{document}

\title{Posterior Inference on Shallow Infinitely Wide Bayesian Neural Networks under Weights with Unbounded Variance}

\author[1,2]{\href{mailto:jorge.loria@aalto.fi}{Jorge Lor\'ia}}{}
\author[1]{Anindya Bhadra}

\affil[1]{%
    Department of Statistics\\
    Purdue University\\
    West Lafayette, Indiana, USA
}
\affil[2]{%
    Department of Computer Science\\
    Aalto University\\
    Finland
}
%Unnecessary comments removed below
 \maketitle
 \begin{abstract}
From the classical and influential work of \citet{neal1996priors}, it is known that the infinite width scaling limit of a Bayesian neural network with one hidden layer is a Gaussian process, \emph{when the network weights have bounded prior variance}. Neal's result has been extended to networks with multiple hidden layers and to convolutional neural networks, also with Gaussian process scaling limits. The tractable properties of Gaussian processes then allow straightforward posterior inference and uncertainty quantification, considerably simplifying the study of the limit process compared to a network of finite width. Neural network weights with unbounded variance, however, pose unique challenges. In this case, the classical central limit theorem breaks down and it is well known that the scaling limit is an $\alpha$-stable process under suitable conditions. However, the current literature is primarily limited to forward simulations under these processes, and the problem of posterior inference under such a scaling limit remains largely unaddressed, unlike in the Gaussian process case. To this end, our contribution is an interpretable and computationally feasible procedure for posterior inference, using a \emph{conditionally Gaussian} representation, that then allows full use of the Gaussian process machinery for tractable posterior inference and uncertainty quantification in the non-Gaussian regime.
\end{abstract}
	
\section{INTRODUCTION}
Gaussian processes (GPs) have been studied as the infinite width limit of Bayesian neural networks with priors on network weights that have finite variance \citep{neal1996priors,williams1996}. This presents some key advantages over Bayesian neural networks with finite widths, which usually require computationally intensive Markov chain Monte Carlo (MCMC) posterior calculations \citep{neal1996priors} or variational approximations \citep[Chapter 19]{Goodfellow}, in contrast to the straightforward posterior inference and probabilistic uncertainty quantification afforded by the GP machinery \citep{williams2006gaussian}. In this sense, the work of \citet{neal1996priors} is foundational. The technical reason for this convergence to a GP is the central limit theorem, which applies under the bounded second moment condition. More specifically, given an $I$-dimensional input $\mathbf{x}$ and a one-dimensional output $y(\mathbf{x})$, a $K$-layer feedforward deep neural network (DNN) with $K-1$ hidden layers is defined by the recursion:
\small
\begin{align}
z_j^{(l+1)}(\mathbf{x}) &= g\left(b_j^{(l)} + \sum_{i=1}^{p_{l}} w_{ij}^{(l)} z_i^{(l)}(\mathbf{x})\right), \quad l=1,\ldots, K-1, \label{eq:nn1}\\
y(\mathbf{x}) &= \sum_{j=1}^{p_K} w_{j}^{(K)} z_j^{(K)}(\mathbf{x}), \label{eq:nn2}
\end{align}
\normalsize
where $z^{(1)}\equiv \mathbf{x},\; p_1=I,\; p_K=D$ and $g(\cdot)$ is a nonlinear activation function. Thus, the network repeatedly applies a linear transformation to the inputs at each layer, before passing them through a nonlinear activation function. Sometimes a nonlinear transformation is also applied between the final hidden layer and the output layer, but in this paper it is assumed the output is a linear function of the last hidden layer. \citet{neal1996priors} considers the case of a Bayesian neural network with a single hidden layer, i.e., $K=2$. So long as the hidden-to-output weights $w^{(2)}$ are independent and identically distributed Gaussian, or at least have a common bounded variance given by $c/p_2$ for some $c>0$, and $g(\cdot)$ is bounded, an application of the classical central limit theorem shows the network converges to a GP as the number of hidden nodes $p_2\to\infty$.
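
As a minimal illustration of this scaling, consider $K=2$, for which Equations~\eqref{eq:nn1} and \eqref{eq:nn2} reduce to
\begin{align*}
z_j^{(2)}(\mathbf{x}) &= g\left(b_j^{(1)} + \sum_{i=1}^{I} w_{ij}^{(1)} x_i\right),\\
y(\mathbf{x}) &= \sum_{j=1}^{p_2} w_{j}^{(2)} z_j^{(2)}(\mathbf{x}).
\end{align*}
Assuming, for this sketch only, that the $w_j^{(2)}$ are independent with mean zero and variance $c/p_2$, and are independent of the identically distributed hidden units, the cross terms vanish and
\begin{equation*}
\mathrm{Var}\{y(\mathbf{x})\} = \sum_{j=1}^{p_2}\frac{c}{p_2}\,\mathbb{E}\left[\{z_j^{(2)}(\mathbf{x})\}^2\right] = c\,\mathbb{E}\left[\{z_1^{(2)}(\mathbf{x})\}^2\right].
\end{equation*}
This quantity does not depend on $p_2$ and is finite because $g(\cdot)$ is bounded, which is the normalization under which the central limit theorem yields a non-trivial Gaussian limit.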

\subsection{Related Works}
The foundational work of \citet{neal1996priors} was followed by an explicit computation of some of the kernels obtained from this limiting process \citep{williams1996}. 
Recently, Neal's result has been extended to prove fully connected multi-layer feedforward networks \citep{lee2018deep, matthews2018} and convolutional neural networks \citep{garriga-alonso2018deep,novak2019bayesian} also converge to GPs. The Tensor Program of \citet{yang2021tensor} has successfully extended these results to feedforward and recurrent networks of \emph{any architecture.} This is useful for uncertainty quantification by designing emulators for deep neural networks (DNNs) based on GPs, since the behavior of finite-dimensional DNNs for direct uncertainty quantification is much harder to characterize. In contrast, once a convergence to GP can be ensured, well established tools from the GP literature \citep[see, e.g.,][]{williams2006gaussian} can be brought to the fore to allow straightforward posterior inference. The induced covariance function depends on the choice of the nonlinear activation function $g(\cdot)$, and is in general anisotropic. However, it can be worked out in explicit form under a variety of activation functions for both shallow \citep{neal1996priors,williams1996, cho2009kernel} and deep \citep{lee2018deep,matthews2018} feedforward neural networks, where for deep networks usually a recursive formula is available that expresses the covariance function of a given layer conditional on the previous layer. The benefit of depth is that it allows a potentially very rich covariance function at the level of the observed data, even if the covariances in each layer conditional on the layer below are simple. Viewing a GP as a prior on the function space, this allows for a rich class of prior structures. However, the process is still Gaussian in all these cases and our intention in this paper is a departure from the Gaussian world.

For finite width neural networks, non-Gaussian weights have recently been considered by \citet{Fortuin2022bayesian} and \citet{Fortuin2022Review}. Departures from i.i.d. weights have also recently received attention \citep{caron2023overparameterised,lee2023deep}. Theoretical results with infinite variance were hinted at by \citet{neal1996priors}, and first proved by \citet{DerLee2005}. Follow-up theoretical results have been obtained in varied architectures for bounded activation functions \citep{Peluchetti,bracale2022infinitechannel} and with unbounded activation functions \citep{Bordino2022infinitely}. However, posterior inference remains challenging in the infinite-width limit, for the reasons made clear in the next subsection.

\subsection{Challenges Posed by Network Weights with Unbounded Prior Variance}
Although the GP literature has been immensely influential for uncertainty quantification in DNNs, it is clear that a DNN does not converge to a GP if the final hidden-to-output layer weights are allowed to have unbounded variance, e.g., heavy-tailed $t$ or non-Gaussian stable distributions, such that the scaling limit distribution is non-Gaussian \citep{gnedenko1954limit}. This was already observed by \citet{neal1996priors}, who admits: \emph{``in contrast to the situation for Gaussian process priors, whose properties are captured by their covariance functions, I know of no simple way to characterize the distributions over functions produced by the priors based on non-Gaussian stable distributions.''} Faced with this difficulty, \citet{neal1996priors} confines himself to forward simulations from DNNs with $t$ weights, and yet, observes that the network realizations under these weights demonstrate very different behavior (e.g., large jumps) compared to normal priors on the weights. This is not surprising, since Gaussian processes, with their almost surely continuous sample paths, are not necessarily good candidate models for functions containing sharp jumps, perhaps explaining their lack of popularity in certain application domains, e.g., finance, where jumps and changepoints need to be modeled \citep[see, e.g., Chapter 7 of][]{ContTankov}. Another key benefit of priors with polynomial tails, pointed out by \citet{neal1996priors}, is that they allow a few hidden nodes to make a large contribution to the output, while drowning out the others, akin to feature selection. In contrast, in the GP limit, the contributions of individual nodes are averaged out. Thus, there are clear motivations for developing computationally feasible posterior inference machinery under these non-Gaussian limits.

\citet{neal1996priors} further hints that it may be possible to prove an analogous result using priors that do not have finite variance. Specifically, suppose the network weights are given symmetric $\alpha$-stable priors, which have unbounded variance for all $\alpha \in (0,2)$, and the $\alpha=2$ case coincides with a Gaussian random variable. If $X$ is an $\alpha$-stable random variable, the density does not in general have a closed form, but the characteristic function is:
\begin{equation*}
\phi_X(t) = \exp[it\mu-\nu^\alpha\lvert t\rvert^{\alpha} \{1 -  i\beta \mathrm{sign}(t)\omega(t;\alpha) \}],    
\end{equation*}
where $\omega(t;\alpha)=\tan(\alpha\pi/2),$ for $\alpha \neq 1$ and $\omega(t;\alpha)=-(2/\pi)\log\lvert t\rvert$, for $\alpha=1$. Here $\mu\in \mathbb{R}$ is called the shift parameter, $\alpha\in(0,2]$ is the index parameter, $\beta\in[-1,1]$ is the symmetry parameter, and $\nu>0$ is the scale parameter \citep[][p. 5]{SamorodnitskyTaqqu}. Throughout, we use a zero shift ($\mu=0$) stable variable, and denote it by $X\sim S(\alpha,\nu,\beta)$. Here $\beta=0$ corresponds to the symmetric case, and when ${\beta=1},{\alpha<1},{\nu=1}$, the random variable is strictly positive, which we denote by $S^{+}(\alpha)$. We refer the reader to Supplementary Material~\ref{sup:alpha_st} for some further properties of $\alpha$-stable random variables, as relevant for the present work. \citet{DerLee2005} confirm Neal's conjecture by establishing that the scaling limit of a shallow neural network under $\alpha$-stable priors on the weights is an $\alpha$-stable process. Proceeding further, \citet{Peluchetti} show that the limit process for infinitely wide DNNs with infinite-variance priors is also an $\alpha$-stable process. However, both \citet{DerLee2005} and \citet{Peluchetti} only consider the forward process and neither considers posterior inference. Inference using $\alpha$-stable densities is not straightforward, and some relevant studies are by \citet{SamorodnitskyTaqqu}, \citet{Lemke2015}, and more recently by \citet{Nolan2020}. The main challenge is that a covariance function is not necessarily defined, precluding posterior inference analogous to the GP case, for example, using the \emph{kriging} \citep{Stein1999} machinery. To this end, our contribution lies in using a representation of the characteristic function of symmetric $\alpha$-stable variables as a normal scale mixture, that then allows a \emph{conditionally Gaussian} representation.
This makes it possible to develop posterior inference and prediction techniques under stable priors on network weights using a latent Gaussian framework.
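
To illustrate the scale mixture idea in one dimension, take $\beta=0$ and $\mu=0$, so that $\phi_X(t)=\exp(-\nu^\alpha\lvert t\rvert^\alpha)$, and let $\alpha\in(0,2)$. The following sketch assumes a positive random variable $S$ normalized so that $\mathbb{E}[\exp(-\lambda S)]=\exp(-\lambda^{\alpha/2})$ for $\lambda\geq 0$; this normalization is chosen here for convenience and may differ by a constant factor from the parametrization of $S^{+}(\alpha/2)$. If $Z\sim\mathcal{N}(0,2\nu^2)$ is independent of $S$, then conditioning on $S$ gives
\begin{align*}
\mathbb{E}\left[\exp\left(it\sqrt{S}Z\right)\right] &= \mathbb{E}\left[\exp\left(-\nu^2t^2S\right)\right]\\
&= \exp\left\{-(\nu^2t^2)^{\alpha/2}\right\} = \exp\left(-\nu^\alpha\lvert t\rvert^\alpha\right).
\end{align*}
Hence $\sqrt{S}Z$ is symmetric $\alpha$-stable with scale $\nu$, while conditional on $S$ it is Gaussian with variance $2\nu^2S$. The symbols $S$ and $Z$ are introduced for this illustration only.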

\subsection{Summary of Main Contributions}
Our main contributions consist of:
\begin{enumerate}
    \item An explicit characterization of the posterior predictive density function under infinite width scaling limits for shallow (one hidden layer) Bayesian neural networks under stable priors on the network weights, using a latent Gaussian representation.

    \item An MCMC algorithm for posterior inference and prediction, with publicly available code. 
    %on \texttt{Github} at: \href{https://github.com/loriaJ/alphastableNNet}{https://github.com/loriaJ/alphastableNNet}.

    \item Numerical experiments in one and two dimensions that validate our procedure by obtaining better posterior predictive properties for functions with jumps and discontinuities, compared to both Gaussian processes and Bayesian neural networks of finite width. 

    \item A real world application on a benchmark real estate data set from the UCI repository.
\end{enumerate}
	
	
\section{INFINITE WIDTH LIMITS OF BAYESIAN NEURAL NETWORKS UNDER WEIGHTS WITH UNBOUNDED VARIANCE}
Consider the case of a shallow, one hidden layer network, with the weights of the last layer being independent and identically distributed with symmetric $\alpha$-stable priors. Our results are derived under this setting using the following proposition of~\citet{DerLee2005}. 
\begin{proposition}\citep{DerLee2005}.
\label{prop:derlee}
    Let the network specified by Equations \eqref{eq:nn1} and \eqref{eq:nn2}, with a single hidden layer ($K=2$), have i.i.d. hidden-to-output weights $w_{j}^{(2)}$ distributed as a symmetric $\alpha$-stable with scale parameter $(\nu/2)^{1/2}{p}_{2}^{-1/\alpha}$. Then $y(\mathbf{x})$ converges in distribution to a symmetric $\alpha$-stable process $f(\mathbf{x})$ as $p_2 \to \infty$ for random input-to-hidden weights. The finite dimensional distribution of $f(\mathbf{x})$, denoted as $(f(\mathbf{x}_1),\dots,f(\mathbf{x}_n))$ for all $n$, where $\mathbf{x}_i\in\mathbb{R}^I$, is multivariate stable with a characteristic function:
    \begin{align}
        \phi(\mathbf{t})&=\mathbb{E}\left[\exp\{i\langle \mathbf{t},f(\mathbf{x})\rangle\}\right] \nonumber \\
        &= \exp\left\{-(\nu/2)^{\alpha/2}\mathbb{E}[\lvert\langle\mathbf{t},\mathbf{g}\rangle\rvert^{\alpha}]\right\},\label{eq:char_fn}
    \end{align}
    where angle brackets denote the inner product, $\mathbf{t} = (t_1,\ldots,t_n)$ is the argument of the characteristic function, and $\mathbf{g}=(g(\mathbf{x}_1),\dots,g(\mathbf{x}_n))$ is a random vector with the common distribution (across~$j$) of $(z_j^{(2)}(\mathbf{x}_1),\dots,z_j^{(2)}(\mathbf{x}_n))$.
\end{proposition}
Following \citet{neal1996priors}, assume for the rest of the paper that the activation function $g(\cdot)$ corresponds to the sign function: $\mathrm{sign}(\xi)=1$, if $\xi>0$; $\mathrm{sign}(\xi)=-1$, if $\xi<0$; and $\mathrm{sign}(0)=0$. For $\boldsymbol{\xi}\in\mathbb{R}^I$ we define ${g(\boldsymbol{\xi})=\mathrm{sign}\left(b_0 + \sum_{i=1}^Iw_i\xi_i\right)}$, where $b_0$ and $w_i$ are i.i.d. standard Gaussian variables. The next challenge is to compute the expectation within the exponential in Equation~\eqref{eq:char_fn}.
To resolve this, we break it into simpler cases. Define $\Lambda$ as the set of all possible functions $\tau:\{\mathbf{x}_1,\dots,\mathbf{x}_n\}\to\{-1,+1\}$. Since each $\mathbf{x}_j$ can be mapped to either $+1$ or $-1$, there are $2^n$ elements in $\Lambda$. For each $\ell =1,\dots,2^n$, consider $\tau_\ell\in\Lambda$, the event ${A_{\ell}=\{\tau_\ell(\mathbf{x}_j) = g(\mathbf{x}_j)\}_{j=1}^n}$, and the probability ${q_\ell=\mathbb{P}(A_{\ell})}$. By definition, $\{A_{\ell}\}_{\ell =1}^{2^n}$ is a collection of pairwise disjoint events, and since $g(\mathbf{x}_j)=0$ occurs with probability zero under Gaussian weights, their probabilities sum to one. Conditioning on these events, we obtain:
\begin{equation}
    \mathbb{E}[\lvert \langle \mathbf{t},\mathbf{g}\rangle \rvert^{\alpha}] = \sum_{\ell = 1}^{2^n} q_\ell \left\lvert\sum_{j=1}^n t_j\tau_\ell(\mathbf{x}_j)\right\rvert^{\alpha} \label{eq:expectation},
\end{equation}
where the expectation is over input-to-hidden weights. A na\"ive enumeration sums over a number of terms that is exponential in $n$, and is impractical. Details of the computation of $q_\ell$ and $\tau_\ell$ are given in Supplementary Section~\ref{sup:comp_q_s_general}, where we show how to reduce the enumeration over $2^n$ terms in Equation~\eqref{eq:expectation} to $L=\mathcal{O}(n^I)$ terms using the algorithm of \citet{GoodmanPollack}, by identifying only those configurations with $q_\ell >0$. This circumvents the exponential enumeration in $n$, resulting in an algorithm whose complexity is polynomial in $n$, with a degree that depends on the input dimension $I$. Although this computational complexity still appears rather high at first glance, especially for high-dimensional inputs, for two or three-dimensional problems (e.g., in spatial or spatio-temporal models), the computation is both manageable and practical, and the complexity is similar to the usual GP regression.
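
As a small worked case, take $n=2$. The four sign patterns are $(+1,+1)$, $(-1,-1)$, $(+1,-1)$ and $(-1,+1)$, and grouping the patterns that agree on both inputs turns Equation~\eqref{eq:expectation} into
\begin{equation*}
\mathbb{E}[\lvert \langle \mathbf{t},\mathbf{g}\rangle \rvert^{\alpha}] = p_{12}\lvert t_1+t_2\rvert^{\alpha} + (1-p_{12})\lvert t_1-t_2\rvert^{\alpha},
\end{equation*}
where $p_{12}=\mathbb{P}\{g(\mathbf{x}_1)=g(\mathbf{x}_2)\}$. Because $(b_0,w_1,\dots,w_I)$ is a standard Gaussian vector, a standard geometric argument gives $p_{12}=1-\theta_{12}/\pi$, with $\theta_{12}\in[0,\pi]$ the angle between the augmented inputs $(1,\mathbf{x}_1)$ and $(1,\mathbf{x}_2)$ in $\mathbb{R}^{I+1}$; the symbol $\theta_{12}$ is introduced here for illustration only. The reduction in the number of terms can also be seen directly when $I=1$: for $n$ distinct inputs on the line, a pattern has positive probability only if it is constant on each side of a single threshold, which leaves $2n$ patterns instead of $2^n$, in line with the $\mathcal{O}(n^I)$ count.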
\subsection{A Characterization of the Posterior Predictive Density under Stable Network Weights using a Conditionally Gaussian Representation}
While the previous section demonstrated that the characteristic function in Equation~\eqref{eq:char_fn} can be computed, the resulting density, obtained via its inverse Fourier transform, does not necessarily have a closed form, apart from specific values of $\alpha$, such as $\alpha=2$ (Gaussian), $\alpha=1$ (Cauchy) or $\alpha=0.5$ (L\'evy, in the totally skewed case). In this section we show that a \emph{conditionally Gaussian} characterization of the density function is still possible for the entire domain of $\alpha \in (0,2]$, facilitating posterior inference. First, note that the result of \citet{DerLee2005} is obtained assuming that there is no intrinsic error in the observation model, i.e., they assume the observations are obtained as $y_i=f(\mathbf{x}_i)$, and the only source of randomness is the network weights. We generalize this to more realistic scenarios and consider an additive error term. That is, we consider the observation model $y_i=f(\mathbf{x}_i)+\varepsilon_i$, where the error terms $\varepsilon_i$ are independent and identically distributed normal random variables with constant variance $\sigma^2$. Using the expression for the expectation in the characteristic function from Proposition~\ref{prop:derlee}, we derive the full probability density function, as specified in the following theorem, with a proof in Section~\ref{pf:d_dim_pdf}.
\begin{theorem}\label{th:d_dim_pdf}
    For real-valued observations $\mathbf{y}=(y_1,\dots,y_n)$ under the model $y_i=f(\mathbf{x}_i) + \varepsilon_i$, where $\varepsilon_i\overset{i.i.d.}{\sim}\mathcal{N}(0,\sigma^2)$ and $f(\cdot)$ is as specified in Proposition~\ref{prop:derlee}, denote the matrix $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_n]^T$. The probability density function of $(\mathbf{y}\mid \mathbf{X})$ is:
	\begin{align*}
        p(\mathbf{y}\mid \mathbf{X}) =& (2\pi)^{-n/2}\int_{(\mathbb{R}^+)^{L}} \exp\left(-\frac{1}{2}\mathbf{y}^T\mathbf{Q}_\mathbf{s}^{-1}\mathbf{y}\right)\\
        &\times \det(\mathbf{Q}_\mathbf{s})^{-1/2}\prod_{\ell =1 }^Lp_{S^+}(s_\ell)ds_\ell,
	\end{align*}
  where $p_{S^+}$ is the density for a positive $\alpha/2$-stable random variable, and $\mathbf{Q}_\mathbf{s}$ is a positive definite matrix with probability one that depends on $\mathbf{s}=\{s_\ell\}_{\ell =1 }^L$. Specifically, $\mathbf{Q}_\mathbf{s}=\sum_{\ell=1}^Ls_\ell q_\ell^{2/\alpha} \boldsymbol{\tau}_\ell\boldsymbol{\tau}_\ell^T + \sigma^2 \mathbf{I}$, denoting by $\boldsymbol{\tau}_\ell\in\mathbb{R}^n$ the vector with entries $(\tau_\ell(\mathbf{x}_1),\dots,\tau_\ell(\mathbf{x}_n))$.
\end{theorem}
Theorem~\ref{th:d_dim_pdf} is the main machinery we need for posterior inference. We emphasize that $\mathbf{Q}_\mathbf{s}$ is a matrix with random entries, whose randomness enters only through $\{s_\ell\}_{\ell=1}^L$, whereas $\{q_\ell\}_{\ell=1}^L$ and $\{\tau_\ell\}_{\ell=1}^L$ are deterministic. Further, the input variables $\mathbf{X}$ are also deterministic. Theorem~\ref{th:d_dim_pdf} implies the hierarchical Gaussian model:
	\begin{align*}
		\mathbf{y}\mid \mathbf{X}, \{s_\ell\}_{\ell=1}^L &\sim \mathcal{N}_n(0,\mathbf{Q}_\mathbf{s}), \; 
		s_\ell \overset{i.i.d.}{\sim} S^+(\alpha/2).
	\end{align*}
Each $\tau_\ell$ separates the points that lie on the $+1$ side from those that lie on the $-1$ side. A forward simulation of this model is a weighted sum of the $\tau_\ell$, with a corresponding positive weight for the points that lie closer and a negative weight for the points that lie farther. We further present the following corollaries to interpret the distribution of $\mathbf{Q}_\mathbf{s}$.
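
One construction consistent with this hierarchy, given as an illustrative sketch and not necessarily the scheme used in the released implementation, is
\begin{equation*}
\mathbf{y} = \sum_{\ell=1}^{L} \sqrt{s_\ell}\, q_\ell^{1/\alpha}\, \zeta_\ell\, \boldsymbol{\tau}_\ell + \boldsymbol{\varepsilon},
\end{equation*}
with $\zeta_\ell \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$ and $\boldsymbol{\varepsilon}\sim\mathcal{N}_n(0,\sigma^2\mathbf{I})$, both independent of $\{s_\ell\}_{\ell=1}^L$, where $\boldsymbol{\tau}_\ell=(\tau_\ell(\mathbf{x}_1),\dots,\tau_\ell(\mathbf{x}_n))$. Conditional on $\{s_\ell\}_{\ell=1}^L$, the vector $\mathbf{y}$ is zero-mean Gaussian, and since $\mathbb{E}[\zeta_\ell\zeta_k]=0$ for $\ell\neq k$ its covariance is
\begin{equation*}
\sum_{\ell=1}^{L} s_\ell\, q_\ell^{2/\alpha}\, \boldsymbol{\tau}_\ell\boldsymbol{\tau}_\ell^T + \sigma^2\mathbf{I} = \mathbf{Q}_\mathbf{s}.
\end{equation*}
The auxiliary variables $\zeta_\ell$ are introduced for this illustration only.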
\begin{corollary}\label{cor:random_Q}
The matrix $\mathbf{Q}_\mathbf{s}$ in Theorem~\ref{th:d_dim_pdf} is stochastic for all $\alpha\in (0,2)$ and is deterministic when $\alpha=2$.
\end{corollary}
\begin{proof}
Recall, $\mathbf{Q}_\mathbf{s}=\sum_{\ell=1}^Ls_\ell q_\ell^{2/\alpha} \boldsymbol{\tau}_\ell\boldsymbol{\tau}_\ell^T + \sigma^2 \mathbf{I}, \; s_\ell \overset{i.i.d.}{\sim} S^+(\alpha/2)$. Thus, $ s_\ell\to1$, w.p. $1$, as $\alpha\to2$, since an $S^+(1)$ variable is a degenerate point mass at $1$.
\end{proof}
Noting the $\alpha=2$ case is Gaussian, Corollary~\ref{cor:random_Q} indicates $\mathbf{Q}_\mathbf{s}$ is stochastic in the stable limit, but deterministic in the GP limit, which is a key difference. The lack of representation learning in the GP limit, due to the kernel converging to a degenerate point mass, is a major criticism of the GP limit framework; see, for example, \citet{pmlr-v139-aitchison21a,pmlr-v202-yang23k}. A useful implication of Corollary~\ref{cor:random_Q} is that the posterior of ${\mathbf{Q}_\mathbf{s} \mid \mathbf{y}}$ is non-degenerate in the stable limit. Numerical results supporting this claim are in Supplementary Section~\ref{sup:simulations}. Specifically, when $\alpha=2$, the limiting process of Proposition~\ref{prop:derlee} is a GP, which has been established to have a deterministic covariance kernel \citep{cho2009kernel}. When $\alpha<2$, which is our main interest, Corollary~\ref{cor:random_Q} ensures that the \emph{conditional} covariance kernel is stochastic, thereby enabling learning a non-degenerate posterior of this quantity given the data. This is in contrast to the degenerate posterior in the Gaussian case, for both shallow and deep infinite-width limits of BNNs, as discussed by \citet{pmlr-v139-aitchison21a}. We further remark here that although the current work only considers shallow networks, this property of a non-degenerate posterior should still hold for deep networks under $\alpha$-stable weights. However, the challenge of relating the \emph{conditional} covariance kernel of each layer to the layer below, analogous to the deep GP case \citep[e.g.,][]{matthews2018}, is beyond the scope of the current work.
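
For concreteness, consider the Gaussian case $\alpha=2$, in which $s_\ell=1$ for all $\ell$. The $(i,j)$ entry of $\sum_{\ell=1}^L q_\ell\boldsymbol{\tau}_\ell\boldsymbol{\tau}_\ell^T$, with $\boldsymbol{\tau}_\ell=(\tau_\ell(\mathbf{x}_1),\dots,\tau_\ell(\mathbf{x}_n))$, is then
\begin{equation*}
\sum_{\ell=1}^{L} q_\ell\,\tau_\ell(\mathbf{x}_i)\tau_\ell(\mathbf{x}_j) = \mathbb{E}[g(\mathbf{x}_i)g(\mathbf{x}_j)] = 2\,\mathbb{P}\{g(\mathbf{x}_i)=g(\mathbf{x}_j)\}-1,
\end{equation*}
which is a fixed function of the inputs, so the data $\mathbf{y}$ carry no information about the kernel. For $\alpha<2$ the corresponding entry is $\sum_{\ell=1}^{L} s_\ell\, q_\ell^{2/\alpha}\,\tau_\ell(\mathbf{x}_i)\tau_\ell(\mathbf{x}_j)$, which remains random through $\{s_\ell\}_{\ell=1}^L$ and can therefore be updated by the data.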
\begin{corollary}\label{cor:q_dist_margin}
The marginal distribution of the diagonal entries of the matrix $\mathbf{Q}_\mathbf{s}$ is $\sigma^2+\nu S^+(\alpha/2)$, where the $\sigma^2$ acts as a shift parameter, and the marginal distribution of the off-diagonal entry $(i,j)$ of the $\mathbf{Q}_\mathbf{s}$ matrix is $S(\alpha/2,\nu , 2p_{ij}-1)$, where ${p_{ij}=\sum_{\ell:\tau_\ell(\mathbf{x}_i)=\tau_\ell(\mathbf{x}_j)}q_\ell}$ is the probability that $\mathbf{x}_i$ and $\mathbf{x}_j$ lie on the same side of the hyperplane partition. Further, the entries of $\mathbf{Q}_\mathbf{s}$ are not independent.
\end{corollary}
\begin{proof}
For $\{\mathbf{Q}_\mathbf{s}\}_{ii}-\sigma^2$, we apply Property 1.2.1 of \citet{SamorodnitskyTaqqu}, which we refer to as the closure property, to obtain $\nu S^+(\alpha/2)$. Next, for $\{\mathbf{Q}_\mathbf{s}\}_{ij}$, we split the summation into two cases: $\tau_\ell(\mathbf{x}_i)=\tau_\ell(\mathbf{x}_j)$, and $\tau_\ell(\mathbf{x}_i)\neq\tau_\ell(\mathbf{x}_j)$. Using the closure property in the separate splits, we obtain ${\{\mathbf{Q}_\mathbf{s}\}_{ij}\sim S(\alpha/2,\nu p_{ij}^{2/\alpha},1) - S(\alpha/2,\nu (1-p_{ij})^{2/\alpha},1)}$. The result follows by applying the closure property once more. The entries of $\mathbf{Q}_\mathbf{s}$ are not independent, as they are obtained from a linear combination of the independent variables $\{s_\ell\}_{\ell = 1}^L$.
\end{proof}
The value of this corollary does not lie in a numerical or computational speed-up, since the obtained marginals are not independent, but rather in the interpretability that it lends to the model. Explicitly, it indicates that when the points lie closer their conditional covariance is more likely to be positive and when they lie farther apart the covariance is more likely to be negative \citep[see Proposition 1.2.14 of][]{SamorodnitskyTaqqu}.
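
The symmetry parameter $2p_{ij}-1$ can be checked with a short calculation. For independent $X_1\sim S(a,\nu_1,1)$ and $X_2\sim S(a,\nu_2,1)$ with a common index $a\neq 1$ and zero shift, the closure property together with the fact that $-X_2\sim S(a,\nu_2,-1)$ gives
\begin{equation*}
X_1-X_2\sim S\left(a,\;(\nu_1^{a}+\nu_2^{a})^{1/a},\;\frac{\nu_1^{a}-\nu_2^{a}}{\nu_1^{a}+\nu_2^{a}}\right).
\end{equation*}
Setting $a=\alpha/2$, $\nu_1=\nu p_{ij}^{2/\alpha}$ and $\nu_2=\nu(1-p_{ij})^{2/\alpha}$ yields $\nu_1^{a}=\nu^{\alpha/2}p_{ij}$ and $\nu_2^{a}=\nu^{\alpha/2}(1-p_{ij})$, so the scale is $\nu$ and the symmetry parameter is $p_{ij}-(1-p_{ij})=2p_{ij}-1$. The symbols $a$, $X_1$, $X_2$, $\nu_1$ and $\nu_2$ are introduced for this check only. In particular, the entry is skewed toward positive values when $p_{ij}>1/2$ and toward negative values when $p_{ij}<1/2$.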

Now that the probability model is clear, we proceed to the next problem of interest: prediction. 
To this end, we present the following proposition, characterizing the posterior predictive density.
\begin{proposition}\label{prop:d_dim_pred}
Consider a vector of $n$ real-valued observations $\mathbf{y}=(y_1,\dots,y_n)$, each with respective input variables ${\mathbf{x}_1,\dots,\mathbf{x}_n \in \mathbb{R}^I}$, under the model $y_i=f(\mathbf{x}_i)+\varepsilon_i$, $\varepsilon_i\overset{i.i.d.}{\sim}\mathcal{N}(0,\sigma^2)$, and $m$ new input variable locations: $\mathbf{x}_1^*,\dots,\mathbf{x}_m^*\in\mathbb{R}^I$, with future observations at those locations denoted by $\mathbf{y}^*=(y^*_1, \ldots, y^*_m)$. Denote the matrices $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_n]^T$ and $\mathbf{X}^*=[\mathbf{x}_1^*,\dots,\mathbf{x}_m^*]^T$. The posterior distribution at these new input variables satisfies the following properties:
\begin{enumerate}
    \item The conditional posterior  of $\mathbf{y}^*\mid \mathbf{y},\mathbf{X},\mathbf{X}^*,\mathbf{Q}_\mathbf{s}$ is an $m$-dimensional Gaussian. Specifically:
    \begin{equation}
        \mathbf{y}^*\mid \mathbf{y},\mathbf{X},\mathbf{X}^*,\mathbf{Q}_\mathbf{s}\sim \mathcal{N}_m (\boldsymbol{\mu}^*,\mathbf{\Sigma}^*),\label{eq:posterior_normal}
    \end{equation}
    where $\boldsymbol{\mu}^*=\mathbf{Q}_{*,1:n}\mathbf{Q}_{1:n,1:n}^{-1}\mathbf{y}$, and $\mathbf{\Sigma}^*={\mathbf{Q}_{*,*} - \mathbf{Q}_{*,1:n}\mathbf{Q}_{1:n,1:n}^{-1}\mathbf{Q}_{1:n,*}}$, using $\mathbf{Q}_\mathbf{s}$ as previously defined for the $n+m$ input variables, and denoting by `$*$' the entries $(n+1):(n+m)$.

    \item The posterior predictive density at the $m$ new locations conditional on the observations $\mathbf{y}$, is given by:
    \begin{align}
        p(\mathbf{y}^*\mid \mathbf{y},\mathbf{X},\mathbf{X}^*) =& \int_{(\mathbb{R}^+)^L} p(\mathbf{y}^*\mid \mathbf{y},\mathbf{X},\mathbf{X}^*,\mathbf{Q}_\mathbf{s}) \nonumber \\
        &\times p(\mathbf{Q}_\mathbf{s}\mid \mathbf{y},\mathbf{X})d\mathbf{Q}_\mathbf{s},\label{eq:ppd}
    \end{align}
    where $p(\mathbf{y}^*\mid \mathbf{y},\mathbf{X},\mathbf{X}^*,\mathbf{Q}_\mathbf{s})$ is the conditional posterior density of $\mathbf{y}^*$, $\mathbf{Q}_\mathbf{s}$ is as previously described for the $n+m$ input variables, and the integral is over the values determined by $\{s_\ell\}_{\ell=1}^L$.
\end{enumerate}
\end{proposition}
\begin{proof}
    The first part is an immediate application of Theorem~\ref{th:d_dim_pdf} and the conditional density of a multivariate Gaussian. The second part follows from a standard application of marginal probabilities.
\end{proof}
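As a minimal illustration of Part 1 of Proposition~\ref{prop:d_dim_pred}, take $n=m=1$ and write $Q_{11}$, $Q_{1*}$ and $Q_{**}$ for the entries of the $2\times 2$ matrix $\mathbf{Q}_\mathbf{s}$ built from the inputs $(\mathbf{x}_1,\mathbf{x}_1^*)$; this scalar notation is introduced here for illustration only. The conditional mean and variance become
\begin{equation*}
\mu^* = \frac{Q_{1*}}{Q_{11}}\,y_1, \qquad \Sigma^* = Q_{**} - \frac{Q_{1*}^2}{Q_{11}}.
\end{equation*}
Since the entries of each $\boldsymbol{\tau}_\ell$ are $\pm 1$, we have $\lvert Q_{1*}\rvert \leq Q_{11}-\sigma^2 < Q_{11}$, so the conditional mean shrinks $y_1$ toward zero, with a sign given by the conditional covariance between the two locations.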
 
\section{AN MCMC SAMPLER FOR THE POSTERIOR PREDICTIVE DISTRIBUTION}
Dealing with $\alpha$-stable random variables is complicated by the fact that their absolute moments are finite only for orders below $\alpha$. Specifically, for $\alpha<2$, if $X\sim S(\alpha,\nu,\beta)$, then $\mathbb{E}[\lvert X \rvert ^r]=\infty$ if $r\geq\alpha$, and is finite for $0<r<\alpha$ \citep[][Property 1.2.16]{SamorodnitskyTaqqu}.
To circumvent dealing with potentially ill-defined moments, we propose to sample from the full posterior. For fully Bayesian inference, we assign $\sigma^2$ a half-Cauchy prior \citep{gelman2006prior} and iteratively sample from the posterior predictive distribution by cycling through $(\mathbf{y}^*,\mathbf{Q},\sigma^2)$ in an MCMC scheme, 
as described in Algorithm~\ref{alg:samp_full_d}, which has computational complexity of the order of $\mathcal{O}(T [(n+m)^In^2 + m^3])$, where $T$ is the number of MCMC iterations. The method of \citet{chambers1976method} is used for simulating the stable variables.
An implementation of our algorithms is freely available at \href{https://github.com/loriaJ/alphastableNNet}{https://github.com/loriaJ/alphastableNNet}.
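
For readers who wish to experiment, one standard representation of a positive stable variable, closely related to the method of \citet{chambers1976method}, is sketched below. It is stated under an assumed normalization and is not necessarily the exact variant used in the released code. For an index $a\in(0,1)$, let $V\sim\mathrm{Uniform}(0,\pi)$ and $W\sim\mathrm{Exponential}(1)$ be independent, and set
\begin{equation*}
S = \frac{\sin(aV)}{\{\sin(V)\}^{1/a}}\left\{\frac{\sin\{(1-a)V\}}{W}\right\}^{(1-a)/a}.
\end{equation*}
Then $S>0$ and $\mathbb{E}[\exp(-\lambda S)]=\exp(-\lambda^{a})$ for $\lambda\geq 0$. Taking $a=\alpha/2$ gives a positive $\alpha/2$-stable draw, up to a constant scale factor that depends on the parametrization adopted for $S^+(\alpha/2)$. The symbols $a$, $V$ and $W$ are introduced for this sketch only.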
\begin{algorithm}[!htb]
	\caption{A Metropolis--Hastings sampler for the posterior predictive distribution}\label{alg:samp_full_d}
	\begin{algorithmic}[-1] 
        {\Require Observations $\mathbf{y}\in\mathbb{R}^n$, with $I$-dimensional input variables $\mathbf{X}\in\mathbb{R}^{n\times I}$, new input variables $\mathbf{X}^*\in\mathbb{R}^{m\times I}$, and number of MCMC iterations $T$.  \\ \hspace{-.6cm}\textbf{Output:} Posterior predictive samples $\{\mathbf{y}^*_{k}\}_{k=1}^T$}
		\State Obtain $\Lambda$ for $(\mathbf{X},\mathbf{X}^*)$ using Algorithm~\ref{alg:goodman}.
        \State Compute $\{q_\ell\}_{\ell = 1}^L$ as described in Supplementary Section~\ref{sup:comp_q_s_general}. 
		\State Initialize $\mathbf{Q}^{(0)}_\mathbf{s}$ using independent samples of $s_\ell$ from the prior distributions.
		\For{$k=1,\dots,T$}
		      \State Simulate $\mathbf{Q}^{(k)}_\mathbf{s}\mid \mathbf{y},\mathbf{Q}^{(k-1)}_\mathbf{s}$ using Algorithm~\ref{alg:samp_Q}.
		      \State Compute $\boldsymbol{\mu}^*_k$ and $\mathbf{\Sigma}^*_k$ using $\mathbf{Q}^{(k)}_\mathbf{s}$ in Part 1 of Proposition~\ref{prop:d_dim_pred}.
		      \State Simulate $\mathbf{y}_{k}^{*}\mid (\mathbf{y},\mathbf{Q}^{(k)}_\mathbf{s})\sim \mathcal{N}_m(\boldsymbol{\mu}^*_k,\boldsymbol{\Sigma}^*_k)$.
		\EndFor
		\State \Return $\{\mathbf{y}_{k}^{*}\}_{k=1}^T$.
	\end{algorithmic}
\end{algorithm}
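
As a hedged illustration of the Gaussian step in Algorithm~\ref{alg:samp_full_d}, suppose that, conditionally on the latent scales, $(\mathbf{y},\mathbf{y}^*)$ is a zero-mean Gaussian vector whose covariance is partitioned into a block $\mathbf{A}\in\mathbb{R}^{n\times n}$ for $\mathbf{y}$, a block $\mathbf{C}\in\mathbb{R}^{m\times m}$ for $\mathbf{y}^*$, and a cross-covariance block $\mathbf{B}\in\mathbb{R}^{n\times m}$. The block names $\mathbf{A},\mathbf{B},\mathbf{C}$ are notation assumed here for the sketch and are not taken from Proposition~\ref{prop:d_dim_pred}. The standard Gaussian conditioning formula then reads:
\begin{equation*}
    \mathbf{y}^*\mid \mathbf{y} \sim \mathcal{N}_m\left(\mathbf{B}^T\mathbf{A}^{-1}\mathbf{y},\; \mathbf{C}-\mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}\right).
\end{equation*}
Sampling from this $m$-dimensional Gaussian typically requires a factorization of an $m\times m$ matrix, which is consistent with the $m^3$ term in the stated complexity.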

There are two hyper-parameters in our model: $\alpha$ and $\nu$. We propose to select them by cross-validation over a grid of $(\alpha,\nu)$ values, choosing the pair with the smallest mean absolute error (MAE). Another possible way to select $\alpha$ is to assign it a prior. The natural choice is a uniform prior on $(0,2)$; however, the update rule would then need to evaluate the $L$ positive $\alpha/2$-stable densities $p_{S^+}$, which would be computationally intensive, as these densities have no closed form apart from specific values of $\alpha$. A prior for $\nu$ could also be included, but a potential issue of identifiability emerges, similar to that identified for the Mat\'ern kernel \citep{Zhang2004}. We leave these open for future research.

\section{NUMERICAL EXPERIMENTS}\label{sec:nums}
We compare our method against the predictions obtained from three other methods. 
The first two are methods for Gaussian processes that correspond to the two main approaches in GP inference: maximum likelihood with a Gaussian covariance kernel \citep{mlegp}, and an MCMC-based Bayesian procedure using the Mat\'ern kernel \citep{tgp}; the third method is a two-layer Bayesian neural network (BNN) using a single hidden layer of 100 nodes with Gaussian priors, implemented in \texttt{pytorch} \citep{pytorch}, and fitted via a variational approach. The choice of a modest number of hidden nodes is intentional, so that we are away from the infinite width GP limit, and the finite-dimensional behavior can be visualized. The respective implementations are in the \texttt{R} packages \texttt{mlegp} and \texttt{tgp}, and the \texttt{python} libraries \texttt{pytorch} and \texttt{torchbnn}. The estimates used from these methods are respectively the kriging estimate, posterior median and posterior mean. We tune our method by cross-validation over a grid of $(\alpha,\nu)$. We use the point-wise posterior median as the estimate, and report the values with smallest mean absolute error (MAE) and the optimal parameters. Results on timing and additional simulations are in Supplementary Section \ref{sup:simulations}, including the posterior quantiles of $\mathbf{Q}_\mathbf{s}\mid \mathbf{y}$, showing the posterior is non-degenerate and stochastic. This suggests learning a non-degenerate posterior $\mathbf{Q}_\mathbf{s}\mid \mathbf{y}$ is possible, unlike in the GP limit where the kernel is degenerate \citep{pmlr-v139-aitchison21a,pmlr-v202-yang23k}. We use a data generating mechanism of the form $y=f(x)+\varepsilon$, where $f$ is the true function. The overall summary is that when $f$ has at least one discontinuity, our method performs better at prediction than the competing methods, and when $f$ is continuous the proposed method performs just as well as the other methods. 
This provides empirical support that the assumption of \emph{continuity} of the true function cannot be disregarded in the \emph{universal approximation} property of neural networks \citep{hornik1989multilayer}, and the adoption of infinite variance prior weights might be a crucial missing ingredient for successful posterior prediction when the truth is discontinuous.

\subsection{Experiments in One Dimension}\label{sec:1d}
We consider a function with three jumps: $f(x) = 5 \times \mathbf{1}_{\{x\geq 1\}} +5 \times \mathbf{1}_{\{ -1 \leq x < 0\}}$, to which we add Gaussian noise with $\sigma=0.5$. We consider $x\in [-2,2]$, with 40 equally spaced points as the training set and 100 equally spaced points as the testing set.
We display the comparisons between the four methods in the two-panel Figure~\ref{fig:3jumps_all}. The boxplots in the left panel show that the proposed method, which we term \textit{``Stable,''} has the smallest prediction error. The right panel shows that the BNN, the GP-based fully Bayesian, and the maximum likelihood methods have much smoother predictions than the Stable method. This indicates that these methods are unable to capture sharp jumps as well as the Stable method, which captures them clearly. The Stable method obtains the smallest cross-validation error for this case with $\alpha^*=1.1$ and $\nu^*=1$.
\begin{figure}[!h]
    \centering
    \includegraphics[width=\linewidth]{images/merge_gen_case_alpha_1.1_sigma_0.5_sigma_unk_nu_1_type_4_dim_1.jpg}
    \caption{\emph{Left:} Boxplots of mean absolute error of out-of-sample prediction over test points, and \emph{Right:} predicted values over 100 points on a regular grid on $[-2,2]$. Training points in black dots.}
    \label{fig:3jumps_all}
\end{figure}

Figure~\ref{fig:UQ_1d} displays the uncertainty of the GP Bayes and Stable methods, for the same setting, using the $90\%$ posterior predictive intervals. In general, the intervals are narrower for the Stable method.
\begin{figure}[!h]
    \centering
    \includegraphics[width =\linewidth]{images/UQ_results_1d_2.jpg}
    \caption{The point-wise $90\%$ posterior predictive intervals for GP Bayes and Stable over 100 points on a regular grid on $[-2,2]$, training points in black.}
    \label{fig:UQ_1d}
\end{figure}
\subsection{Experiments in Two Dimensions}\label{sec:2d}
We consider the function: $f(x_1,x_2)=5\times \mathbf{1}_{\{x_1 >0\}} + 5\times \mathbf{1}_{\{x_2 > 0\}}$, with additive Gaussian noise with $\sigma=0.5$, and observations on an equally-spaced grid of 49 points in the square $[-1,1]^2$. In Figure~\ref{fig:2d_all} we display the boxplots and contour plots for all methods for out-of-sample prediction on an equally spaced grid of $9\times9$ points in the same square. The methods that employ Gaussian processes (GP Bayes and GP MLE) and BNN seem to have smoother transitions between the different quadrants, whereas the Stable method captures the sharp jumps better. This is reinforced by the prediction errors displayed in the left panel. For this example, the Stable method obtains the smallest cross validation error with $\alpha^*=1.1,\nu^*=1$.
\begin{figure}[!h]    
    \centering
    \includegraphics[width=\linewidth]{images/merge_gen_case_alpha_1.1_sigma_0.5_sigma_unk_nu_1_type_4_dim_2_2.jpg}
    \caption{\emph{Left:} Boxplots of mean absolute error (MAE) of out-of-sample prediction over test points, and \emph{Right:} predicted values over a $9\times 9$ grid on $[-1,1]^2$.}
    \label{fig:2d_all}
\end{figure}

We present quantiles of the posterior predictive distribution for GP Bayes and Stable methods in Figure~\ref{fig:UQ_2d}. Our results show sharper jumps using the Stable method, when the true function has jump discontinuities.
\begin{figure}[!h]
    \centering
    \includegraphics[width =\linewidth]{images/UQ_results_2d_2.jpg}
    \caption{Posterior predictive quantiles at the $5\%,50\%$, and $95\%$ levels for GP Bayes (\emph{upper}) and Stable (\emph{lower}) over a $9\times 9$ grid on $[-1,1]^2$.}
    \label{fig:UQ_2d}
\end{figure}
\section{OUT OF SAMPLE PREDICTION ON REAL ESTATE VALUATION DATA IN TAIPEI}\label{sec:real_state}
Valuations of real estate properties in Taipei, Taiwan, were collected by \citet{IChengTzuKuang} at different locations. The data are available from the UCI repository\footnote{\url{https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set}} and constitute a benchmark dataset. We apply our method to the spatial locations of the properties to predict their valuations. We use 276 locations for training and 138 for testing. We compare the performance of our method to the three methods mentioned in the previous section through the mean absolute error. The results are displayed in Table~\ref{tab:MAE_real_estate}, showing a competitive MAE under the proposed approach. Figure~\ref{fig:posterior_predictions_real_estate} displays the posterior predictive quantiles on the validation set, with narrower intervals under the proposed Stable method in most cases. 
\begin{table}[!h]
    \centering
    \caption{Mean absolute error of predictions by method and standard errors computed on 10 random training--testing splits in the real estate data set.} 
    \label{tab:MAE_real_estate}
       \begin{tabular}{c c c c c}
     \toprule
         & Stable & GP MLE & GP Bayes & Bayes NNet \\ \midrule
         MAE & 0.415 & 0.483 & 0.402  & 0.501 \\
         (SE) & (0.07) & (0.07) & (0.05) & (0.07) \\
    \bottomrule
    \end{tabular}
\end{table}

Supplementary Sections~\ref{sec:ablation} and~\ref{sec:common_data} present results on ablation experiments and additional data sets with larger sample sizes, including recent data on S\&P stock index.
\begin{figure}[!t]
    \centering
    \includegraphics[width=\linewidth]{images/posterior_predictions_real_estate.jpg}
    \caption{Posterior predictive quantiles at the 5\%, 50\%, and 95\% levels for GP Bayes (\emph{upper}) and Stable (\emph{lower}) on validation.}
    \label{fig:posterior_predictions_real_estate}
\end{figure}
\section{CONCLUSIONS}
We develop a novel method for posterior inference and prediction for infinite width limits of shallow (one hidden layer) BNNs under weights with infinite prior variance. While the $\alpha$-stable forward scaling limit in this case has been known in the literature \citep{DerLee2005, Peluchetti}, the lack of a covariance function precludes the inverse problem of feasible posterior inference and prediction, which we overcome using a conditionally Gaussian representation. There is a wealth of literature on the universal approximation property of both shallow and deep neural networks, following the pioneering work of \citet{hornik1989multilayer}, but they work under the assumption of a \emph{continuous} true function. Our numerical results demonstrate that when the truth has jump discontinuities, it is possible to obtain much better results with a BNN using weights with unbounded prior variance. The fully Bayesian posterior also allows straightforward probabilistic uncertainty quantification for the infinite width scaling limit under $\alpha$-stable priors on network weights.

Several future directions could naturally follow from the current work. The most immediate is perhaps an extension to posterior inference for deep networks under stable priors, where the width of each layer simultaneously approaches infinity, and we strongly suspect this should be possible. The role of the non-degenerate posterior of $\mathbf{Q}_\mathbf{s}\mid \mathbf{y}$ on deep generalizations and representation learning merits a  thorough investigation and suggests crucial differences from a GP limit \citep{pmlr-v139-aitchison21a,pmlr-v202-yang23k}. Developing  analogous results under non-i.i.d. or tied weights to perform posterior inference under the scaling limits for Bayesian convolutional neural networks should also be of interest. Finally, one may of course investigate alternative activation functions, such as the hyperbolic tangent, which will lead to a different characteristic function for the scaling limit. 

\section{PROOF OF THEOREM \ref{th:d_dim_pdf}}\label{pf:d_dim_pdf}
Our derivations rely on Equation 5.4.6 of \citet{UchaikinZolotarev}, which states that for $\alpha_0\in (0,1)$ and for all positive $\lambda$ one has: ${\exp(-\lambda^{\alpha_0}) = \int_0^\infty \exp(-\lambda t) p_{S^+}(t)dt},$ where $p_{S^+}$ is the density function of a positive $\alpha_0$-stable random variable. Using $\lambda=\nu z^2$ and $\alpha_0=\alpha/2$ and the fact that $z^2=\lvert z \rvert^2$, we obtain for $\alpha\in (0,2)$ that: 
\small
\begin{equation}
  \exp(-\nu^{\alpha/2}\lvert z\rvert^\alpha)=\int_{0}^\infty \exp(-\nu z^2t)p_{S^+}(t)dt \label{eq:normal_mixture},
\end{equation}
\normalsize
where $p_{S^+}$ is the density function of a positive $\alpha/2$-stable random variable and $z\in\mathbb{R}$. By Equation~\eqref{eq:expectation} of \citet{DerLee2005},  and the characteristic function of independent normally distributed error terms with a common variance $\sigma^2$, we have that:
\small
\begin{align*}
    \phi_\mathbf{y}(\mathbf{t})
    =& \exp\left(- \sum_{\ell = 1}^L2^{-\alpha/2}\nu^{\alpha/2}q_\ell\left\lvert \sum_{j=1}^nt_j\tau_\ell(\mathbf{x}_j)\right\rvert^{\alpha} \right. \\
    & \hspace{1cm} \left. -\frac{1}{2}\sum_{j=1}^n \sigma^2t_j^2\right)\\
    =& \prod_{\ell=1}^L\exp\left(-2^{-\alpha/2}\nu^{\alpha/2}q_\ell \left\lvert \sum_{j=1}^n t_j\tau_\ell(\mathbf{x}_j) \right\rvert^\alpha\right) \\
    &\times\exp\left(-\frac{1}{2}\sum_{j=1}^n \sigma^2t_j^2\right)\\
    =& \prod_{\ell=1}^L\Bigg\{\int_{0}^\infty \exp\left(-\frac{1}{2}\nu s_\ell q_\ell^{2/\alpha} \left( \sum_{j=1}^n t_j\tau_\ell(\mathbf{x}_j) \right)^2\right) \\
    & \times  p_{S^+}(s_\ell)ds_\ell \Bigg\}\times \exp\left(-\frac{1}{2}\sum_{j=1}^n \sigma^2t_j^2\right)\\
    =& \prod_{\ell=1}^L \Bigg\{\int_{0}^\infty \exp\left(-\frac{1}{2}\nu s_\ell q_\ell^{2/\alpha} \mathbf{t}^T \mathbf{M}_\ell \mathbf{t} \right)  \\
    &\times p_{S^+}(s_\ell)ds_\ell\Bigg\}\times \exp\left(-\frac{1}{2}\sum_{j=1}^n \sigma^2t_j^2\right),
\end{align*}
\normalsize
where the third equality follows by using Equation~\eqref{eq:normal_mixture}, and in the last equality we define $\mathbf{M}_\ell$ as the matrix with ones on the diagonal and with the $(i,j)$th entry given by $\tau_\ell(\mathbf{x}_i)\tau_\ell(\mathbf{x}_j)$ for $i\ne j$; since $\tau_\ell(\mathbf{x}_j)\in\{-1,+1\}$, this is the rank-one matrix $\mathbf{M}_\ell=\boldsymbol{\tau}_\ell\boldsymbol{\tau}_\ell^T$ with $\boldsymbol{\tau}_\ell=(\tau_\ell(\mathbf{x}_1),\dots,\tau_\ell(\mathbf{x}_n))^T$. Next, using the fact that the densities are over the independent variables $\{s_\ell\}_{\ell=1}^L$, we bring the product inside the integrals and combine the exponentials to obtain:
\small
\begin{align*}
    \phi_{\mathbf{y}}(\mathbf{t}) 
    =&\int_{(\mathbb{R}^+)^{L}} \exp\left(- \frac{1}{2}\sum_{j=1}^nt_j^2\sigma^2 - \frac{1}{2}\nu\sum_{\ell=1}^L  s_\ell q_\ell^{2/\alpha}\mathbf{t}^T \mathbf{M}_\ell\mathbf{t} \right) \\
    & \times\prod_{\ell=1}^Lp_{S^+}(s_\ell)ds_\ell\\
    =& \int_{(\mathbb{R}^+)^{L}} \exp\left(- \frac{1}{2}\mathbf{t}^T\mathbf{Q}_\mathbf{s}\mathbf{t} \right)\prod_{\ell=1}^Lp_{S^+}(s_\ell)ds_\ell,
\end{align*}
\normalsize
using on the second line the definition of $\mathbf{Q}_\mathbf{s}$. The required density is now obtained by the use of the inverse Fourier transform on the characteristic function:
\small
\begin{align*}
    p(\mathbf{y}\mid \mathbf{X}) 
    =& (2\pi)^{-n}\int_{\mathbb{R}^n}\phi_\mathbf{y}(\mathbf{t})\exp(-i\langle\mathbf{t},\mathbf{y}\rangle)\prod_{j=1}^ndt_j\\
    =& (2\pi)^{-n}\int_{\mathbb{R}^n}\left\{
    \int_{(\mathbb{R}^+)^{L}} \exp\left(- \frac{1}{2}\mathbf{t}^T\mathbf{Q}_\mathbf{s}\mathbf{t} \right)\prod_{\ell=1}^Lp_{S^+}(s_\ell)ds_\ell\right\}  \\ 
    &\times\exp(-i\langle\mathbf{t},\mathbf{y}\rangle)\prod_{j=1}^ndt_j\\
    =& (2\pi)^{-n}\int_{(\mathbb{R}^+)^{L}} \int_{\mathbb{R}^n} \exp\left(- \frac{1}{2}\mathbf{t}^T\mathbf{Q}_\mathbf{s}\mathbf{t} \right) \\ 
    & \times \exp(-i\langle\mathbf{t},\mathbf{y}\rangle) \prod_{j=1}^ndt_j \prod_{\ell=1}^Lp_{S^+}(s_\ell)ds_\ell,
\end{align*}
\normalsize
where the second line follows by the derived expression for the characteristic function, and the third line follows by Fubini's theorem, since the integrand is absolutely integrable. We recognize that the term $\exp(-(1/2)\mathbf{t}^T \mathbf{Q}_\mathbf{s} \mathbf{t})$ is proportional to a multivariate Gaussian density in $\mathbf{t}$ with covariance matrix $\mathbf{Q}^{-1}_\mathbf{s}$, lacking only the usual normalizing constant. Evaluating the inner integral with the characteristic function of Gaussian variables, we finally obtain the result:
\small
\begin{align*}
    p(\mathbf{y}\mid \mathbf{X}) =& (2\pi)^{-n/2}\int_{(\mathbb{R}^+)^{L}}
    \exp\left(- \frac{1}{2}\mathbf{y}^T\mathbf{Q}_\mathbf{s}^{-1}\mathbf{y} \right)\\
    &\times \det(\mathbf{Q}_\mathbf{s})^{-1/2}\prod_{\ell=1}^Lp_{S^+}(s_\ell)ds_\ell.
\end{align*}
\normalsize
In using $\mathbf{Q}^{-1}_\mathbf{s}$ freely, we assumed through the previous steps that $\mathbf{Q}_\mathbf{s}$ is positive-definite. We proceed to prove this fact. Note that $\mathbf{Q}_\mathbf{s}$ is obtained from the sum of $L$ rank-one matrices and a diagonal matrix, where each of the rank-one matrices is $q_\ell^{2/\alpha}s_\ell \nu \boldsymbol{\tau}_\ell\boldsymbol{\tau}^T_\ell$. Let $\mathbf{w}\in\mathbb{R}^n\backslash \{0\}$. Then:
\small
\begin{align*}
    \mathbf{w}^T\mathbf{Q}_\mathbf{s}\mathbf{w} &= \mathbf{w}^T\left(\sigma^ 2 \mathbf{I} + \nu\sum_{\ell =1}^Ls_\ell q_{\ell}^{2/\alpha}\boldsymbol{\tau}_\ell\boldsymbol{\tau}_\ell^T\right)\mathbf{w}\\
    &= \sigma^2 \mathbf{w}^T\mathbf{w} + \nu\sum_{\ell=1}^L s_\ell q_\ell^{2/\alpha}\mathbf{w}^T\boldsymbol{\tau}_\ell\boldsymbol{\tau}^T_\ell\mathbf{w}\\
    &= \sigma^2 \sum_{j=1}^nw_j^2 +\nu\sum_{\ell = 1}^L s_\ell q_\ell^{2/\alpha}\left(\sum_{j=1}^nw_j\tau_\ell(\mathbf{x}_j)\right)^2\\
    &>0,
\end{align*}
\normalsize
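where the first term is strictly positive because $\sigma^2>0$ and $\mathbf{w}\neq 0$, and the second term is non-negative because $\nu>0$, $q_\ell\geq 0$, and each $s_\ell$ is positive with probability 1, implying that $\mathbf{Q}_\mathbf{s}$ is positive-definite with probability 1.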
\section*{SUPPLEMENTARY MATERIAL}
The Supplementary Material contains technical details and numerical results in pdf. Computer code is freely available at: \href{https://github.com/loriaJ/alphastableNNet}{https://github.com/loriaJ/alphastableNNet}. 
\section*{Acknowledgments}
Bhadra was supported by U.S. National Science Foundation Grant DMS-2014371.
\clearpage


\renewcommand{\bibname}{References}
\renewcommand{\bibsection}{\subsubsection*{\bibname}}

\bibliography{sample, hs-review}

\clearpage
\onecolumn
	\title{Posterior Inference on Shallow Infinitely Wide Bayesian Neural Networks under Weights with Unbounded Variance \\ (Supplementary Material)}
    \maketitle
    
	% \pagestyle{empty}
	\setcounter{page}{0}
	\renewcommand\thepage{S.\arabic{page}} 
	\setcounter{page}{1}
	
	
	\setcounter{table}{0}
	\renewcommand{\thetable}{S\arabic{table}}%
    \renewcommand{\theHtable}{Supplement.\thetable}

	\setcounter{figure}{0}
	\renewcommand{\thefigure}{S\arabic{figure}}
    \renewcommand{\theHfigure}{Supplement.\thefigure}
	\setcounter{section}{0}
	\renewcommand{\thesection}{S.\arabic{section}}
    \renewcommand{\theHsection}{Supplement.\thesection}
 
	\setcounter{algorithm}{0}
	\renewcommand{\thealgorithm}{S.\arabic{algorithm}}
    \renewcommand{\theHalgorithm}{Supplement.\thealgorithm}

\setcounter{theorem}{0}
	\renewcommand{\thetheorem}{S.\arabic{theorem}}

 
    % \appendix 
\section{Some relevant properties of \texorpdfstring{$\alpha$}{alpha}-stable random variables}\label{sup:alpha_st}
%
One of the most important properties of stable random variables is the closure property \citep[][Property 1.2.1]{SamorodnitskyTaqqu}, which states that if $X_i\sim S(\alpha,\nu_i,\beta_i)$ independently for $i=1,2$, then $X_1+X_2 \sim S(\alpha,\xi,\gamma)$, where $\gamma = (\beta_1  \nu_1^{\alpha} + \beta_2  \nu_2^{\alpha})/(\nu_1^{\alpha}+\nu_2^{\alpha})$, and $\xi = ( \nu_1^{\alpha}+ \nu_2^{\alpha})^{1/\alpha}$. This means that the sum of two independent $\alpha$-stable variables is again $\alpha$-stable. This is a generalization of the well-known closure under convolutions of Cauchy ($\alpha=1$) and Gaussian ($\alpha=2$) random variables. In terms of moments, \citet[Property 1.2.16]{SamorodnitskyTaqqu} indicate that for $X\sim S(\alpha,\nu,\beta)$ with $\alpha\in(0,2)$, we have $\mathbb{E}[\lvert X \rvert^{r}] =\infty$ for $r\ge \alpha$ and $\mathbb{E}[\lvert X \rvert^{r}]$ is finite for $0<r<\alpha$. In particular, this implies that $\alpha$-stable random variables have infinite variance when $\alpha<2$. 

The property of closure under convolutions is easily generalized to the sum of a sequence of i.i.d. $\alpha$-stable variables, which gives rise to a convergence in a non-Gaussian domain, for $\alpha<2$. Formally, the \emph{generalized} central limit theorem \citep{gnedenko1954limit} proves that for i.i.d. scaled random variables with infinite variance the convergence is no longer to a Gaussian random variable. Rather the convergence is to an $\alpha$-stable random variable. A statement of the theorem is below. 
\begin{theorem}\citep[Generalized central limit theorem,][p. 62]{UchaikinZolotarev}
    Let $X_1,\dots,X_n$ be independent and identically distributed random variables with cumulative distribution function $F(x)$ satisfying the conditions:
    \begin{align*}
        1-F(x) &\sim c\lvert x\rvert ^{-\gamma},\; x\to\infty,\\
        F(x) & \sim d \lvert x\rvert^{-\gamma}, \; x\to -\infty,
    \end{align*}
    with $\gamma>0$. Then there exist sequences $a_n \in \mathbb{R}$ and $b_n>0$, such that the distribution of the centered and normalized sum:
    \begin{align*}
        Z_n &= b_n^{-1}\left(\sum_{i=1}^n X_i - a_n\right),
    \end{align*}
    weakly converges to $S(\alpha,1,\beta)$ as $n\to\infty$, where $\alpha=\min(\gamma,2)$, $\beta=(c-d)/(c+d)$, and $a_n$ and $b_n$ are as given in  Table~\ref{tab:params_gclt}.
\end{theorem}
\begin{table}[]
    \centering
    \caption{Parameters for the generalized central limit theorem.}
    \begin{tabular}{ccc}
         \toprule $\gamma$  & $a_n$ & $b_n$  \\\midrule
         $\gamma \in (0,1)$ &  0  & $[\pi(c+d)]^{1/\gamma}[2\Gamma(\gamma)\sin(\gamma\pi/2)]^{-1/\gamma}n^{1/\gamma}$ \\
         $\gamma =1$        & $\beta(c+d)n\ln(n)$ & $(\pi/2)(c+d)n$  \\
         $\gamma \in(1,2)$  &  $n\mathbb{E}[X]$  & $[\pi(c+d)]^{1/\gamma}[2\Gamma(\gamma)\sin(\gamma\pi/2)]^{-1/\gamma}n^{1/\gamma}$\\
         $\gamma = 2$       & $n\mathbb{E}[X]$   & $(c+d)^{1/2}[n\ln(n)]^{1/2}$ \\
         $\gamma > 2$       & $n\mathbb{E}[X]$   & $[(1/2)\mathrm{Var}(X)]^{1/2}n^{1/2}$\\\bottomrule
    \end{tabular}
    \label{tab:params_gclt}
\end{table}

Finally, the Laplace transforms of positive $\alpha$-stable random variables exist. For $\alpha<1$ and $X\sim S(\alpha,1,1)$ the Laplace transform is given by:
\begin{equation*}
    \mathbb{E}[\exp(-\lambda X)] = \exp(-\lambda^{\alpha}),
\end{equation*}
for $\lambda >0$. 
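
As an illustrative special case, added here as a sanity check under the same normalization of the Laplace transform, take $\alpha=1/2$. The density $p(t)=(4\pi)^{-1/2}t^{-3/2}\exp\{-1/(4t)\}$, $t>0$, which is that of a L\'evy distribution, satisfies:
\begin{equation*}
    \int_0^\infty \exp(-\lambda t)\,(4\pi)^{-1/2}t^{-3/2}\exp\left(-\frac{1}{4t}\right)dt = \exp(-\lambda^{1/2}),\quad \lambda>0,
\end{equation*}
which follows from the classical identity $\int_0^\infty t^{-3/2}\exp(-a/t-\lambda t)dt = (\pi/a)^{1/2}\exp\{-2(a\lambda)^{1/2}\}$ with $a=1/4$. Substituting $\lambda=\nu z^2$ gives $\exp(-\nu^{1/2}\lvert z\rvert)$, which is the Cauchy case $\alpha=1$ of the Gaussian scale mixture in Equation~\eqref{eq:normal_mixture}.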

\section{Computation of \texorpdfstring{$q_\ell$}{q ell} and \texorpdfstring{$\tau_\ell$}{tau ell}}\label{sup:comp_q_s_general}
Since the size of $\Lambda$ is $2^n$, indicating an exponential complexity of na\"ive enumeration, we further simplify Equation~\eqref{eq:expectation}. When the input points are arranged in a way that $\tau_\ell$ is not possible, then  $q_\ell$ must be zero. We can identify the elements in $\Lambda$ with positive probabilities by considering arbitrary values of $b_0,w_1,\dots,w_I$. The corresponding $\tau$ function is determined by: ${\tau(\mathbf{x}_j)=\mathrm{sign}(b_0 + w_1x_{j1}+\cdots+w_Ix_{jI})}$, for each $j$. This corresponds to labeling with $+1$ the points above a hyperplane, and with $-1$ the points that lie below the hyperplane. When $I=1$, without loss of generality, let $x_1<\cdots<x_n$. In this case the elements of $\Lambda$ with positive probability correspond to the possible locations of the sign change relative to the inputs. Specifically, the sign change can occur before $x_1$, between $x_1$ and $x_2$, \dots, between $x_{n-1}$ and $x_n$, and after $x_n$; the first and last cases give the same $\tau$ up to sign, which leaves $n$ distinct cases. A similar argument is made in Example 2.1.1 of \citet{DerLee2005}, suggesting the possibility of considering more than one dimension.

For $I>1$, \citet{harding1967number} studies the possible partitions of $n$ points in $\mathbb{R}^I$ by an $(I-1)$-dimensional hyperplane, which corresponds to our problem, and determines that for points in general configuration there are $O(n^I)$ partitions. \citet{GoodmanPollack} give an explicit algorithm for finding the elements of $\Lambda$ that have non-zero probabilities in any possible configuration. We summarize their algorithm for $I=2$ as Algorithm~\ref{alg:goodman} in Supplementary Section~\ref{sup:algs}. This algorithm runs in a computational time of order $n^I\log(n)$, which is reasonable for moderate $I$. For the rest of the article, we denote the number of elements in $\Lambda$ that have positive probability by $L$, with the understanding that $L$ will depend on the input vectors $\mathbf{x}_i$ that are used and their dimension. This solves the issue of computing deterministically the values of $\tau_\ell$ that have positive probability after integrating through input-to-hidden weights.

Next we compute the probability $q_\ell$ for the determined $\tau_\ell$. For $I=1$, the $q_\ell$s correspond to probabilities obtained from a Cauchy cumulative distribution function, which we state explicitly in Supplementary Section~\ref{sup:probs_1d}. For general dimension of the input $I>1$, the value of $q_\ell$ is given by $\mathbb{P}(\mathbf{Z}^{(\tau_\ell)} > 0)$, where the $n$-dimensional Gaussian vector $\mathbf{Z}^{(\tau_\ell)}$ has $i$-th entry given by $\tau_\ell(\mathbf{x}_i)(b_0 + \sum_{j=1}^Iw_jx_{ij})$. This implies that $\mathbf{Z}^{(\tau_\ell)}\sim \mathcal{N}_{n}(0,\boldsymbol{\Sigma}^{(\tau_\ell)})$, where the (possibly singular) variance matrix is ${\Sigma^{(\tau_\ell)}_{i,j} = \tau_\ell(\mathbf{x}_i)\tau_\ell(\mathbf{x}_j)(1+\sum_{k=1}^Ix_{ik}x_{jk})}$. 
This means we can compute $q_\ell=\mathbb{P}(\mathbf{Z}^{(\tau_\ell)}>0)$ using for example the \texttt{R} package \texttt{mvtnorm}, which implements the method of \citet{BretzGenz} for evaluating multivariate Gaussian probabilities. Since the $q_\ell$s require independent procedures to be computed, this is easily parallelized after obtaining the partitions.
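
For completeness, the covariance of $\mathbf{Z}^{(\tau_\ell)}$ can be checked directly. The following is a short sketch assuming, as in Supplementary Section~\ref{sup:probs_1d}, that $b_0,w_1,\dots,w_I$ are independent standard normal variables:
\begin{align*}
    \mathrm{Cov}\left(Z^{(\tau_\ell)}_i,Z^{(\tau_\ell)}_j\right) 
    &= \tau_\ell(\mathbf{x}_i)\tau_\ell(\mathbf{x}_j)\,\mathbb{E}\left[\left(b_0+\sum_{k=1}^Iw_kx_{ik}\right)\left(b_0+\sum_{k'=1}^Iw_{k'}x_{jk'}\right)\right]\\
    &= \tau_\ell(\mathbf{x}_i)\tau_\ell(\mathbf{x}_j)\left(\mathbb{E}[b_0^2]+\sum_{k=1}^I\mathbb{E}[w_k^2]x_{ik}x_{jk}\right)\\
    &= \tau_\ell(\mathbf{x}_i)\tau_\ell(\mathbf{x}_j)\left(1+\sum_{k=1}^Ix_{ik}x_{jk}\right),
\end{align*}
since all cross terms have zero expectation. The resulting matrix has rank at most $I+1$, so it is singular whenever $n>I+1$.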
    
\subsection{Computation of \texorpdfstring{$q_\ell$}{q ell} and \texorpdfstring{$\tau_\ell$}{tau ell} in one  dimension}\label{sup:probs_1d}
Assume, since we are in one dimension, that $x_1<\cdots<x_n$. In the one-dimensional case $\Lambda$ consists of the different locations where the change in sign can be located. This corresponds to: 
 \begin{enumerate}
     \item Before the first observation, which corresponds to $\tau(x_k) = 1$ for all $k$, we call this $\tau_0$.
     \item Between $x_j$ and $x_{j+1}$ for some $j=1,\dots,n-1$, which corresponds to $\tau(x_k) = -1$ for $k\leq j$ and $\tau(x_k)=+1$ otherwise, we call this $\tau_j$.
     \item After $x_n$, $\tau(x_k)=-1$ for all $k$, and we call this $\tau_n$.
 \end{enumerate}
Note that the first and last items correspond to linearly dependent vectors as $\tau_0(x_k) = -\tau_n(x_k)$, for all $k=1,\dots,n$. Now, we compute the probability for the first and last items:
\begin{align*}
    q_{\tau_0} + q_{\tau_n} &= \mathbb{P}(-b_0/w_1 < x_1)+\mathbb{P}(-b_0/w_1 > x_n) \\
    &= \mathbb{P}(C < x_1) + \mathbb{P}(C > x_n)\\
    &= \frac{1}{2} + \frac{1}{\pi} \arctan(x_1) + \frac{1}{2}-\frac{1}{\pi}\arctan(x_n)\\
    &= 1 + \frac{1}{\pi}(\arctan(x_1) -\arctan(x_n)),
\end{align*}
 where $C\sim \mathrm{Cauchy}(0,1)$ since $b_0,w_1$ are independent standard normal variables.

 Next, the change in sign occurring between $x_j$ and $x_{j+1}$, yields:
 \begin{align*}
     q_{\tau_j} &= \mathbb{P}(\mathrm{sign}(b_0 + w_1 x_j) = -1, \; \mathrm{sign}(b_0 + w_1 x_{j+1}) = 1) \\
     &= \mathbb{P}(b_0 + w_1 x_j < 0,\; b_0 + w_1 x_{j+1} > 0)\\ 
     &= \mathbb{P}(x_j < -b_0/w_1,\; x_{j+1} > -b_0/w_1)\\
     &= \mathbb{P}(x_j < C < x_{j+1})\\
     &= \frac{\arctan(x_{j+1}) - \arctan(x_j)}{\pi},
 \end{align*}
which corresponds to the desired probability. 
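
As a consistency check, added for illustration, the interior probabilities telescope:
\begin{equation*}
    \sum_{j=1}^{n-1}q_{\tau_j} = \frac{1}{\pi}\sum_{j=1}^{n-1}\left\{\arctan(x_{j+1})-\arctan(x_j)\right\} = \frac{\arctan(x_n)-\arctan(x_1)}{\pi},
\end{equation*}
so the probability that no sign change falls between the observed inputs is $1-\{\arctan(x_n)-\arctan(x_1)\}/\pi$. For example, with the hypothetical inputs $(x_1,x_2,x_3)=(-1,0,1)$ we get $q_{\tau_1}=q_{\tau_2}=1/4$, which leaves probability $1/2$ for the constant sign patterns.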

\section{Supporting algorithms}\label{sup:algs}
Algorithm~\ref{alg:goodman} is modified from \citet{GoodmanPollack} for $I=2$ dimensions, which we employ in Algorithm~\ref{alg:samp_full_d}, and has a computational complexity of $\mathcal{O}(n^I\log(n))$ for $n$ input points and a general input dimension $I$. We present Algorithm~\ref{alg:samp_Q} to sample the latent scales $\{s_\ell\}_{\ell=1}^L$ and the error variance $\sigma^2$; it is an independence Metropolis--Hastings procedure in which we propose from the priors and iteratively update the matrix $\mathbf{Q}_\mathbf{s}$. To evaluate the density functions we use the Woodbury formula and the matrix determinant lemma, which allow an efficient update of $\mathbf{Q}_\mathbf{s}$ and avoid computationally intensive matrix inversions. Algorithm~\ref{alg:samp_Q} has computational complexity of $\mathcal{O}(Ln^2)=\mathcal{O}((n+m)^In^2)$.
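
To make the role of the Woodbury formula and the matrix determinant lemma concrete, the following is a sketch of the rank-one update for a single latent scale. The shorthand $\delta$, $a$, and $b$ is introduced here for illustration only, with $\mathbf{Q}$ and $\boldsymbol{\tau}_k$ restricted to the $n$ observed inputs. Writing $\delta=\nu q_k^{2/\alpha}(s_k^*-s_k)$, so that $\mathbf{Q}^{(prop)}=\mathbf{Q}+\delta\boldsymbol{\tau}_k\boldsymbol{\tau}_k^T$, and setting $a=\boldsymbol{\tau}_k^T\mathbf{Q}^{-1}\boldsymbol{\tau}_k$ and $b=\boldsymbol{\tau}_k^T\mathbf{Q}^{-1}\mathbf{y}$, we have:
\begin{align*}
    \left(\mathbf{Q}^{(prop)}\right)^{-1} &= \mathbf{Q}^{-1}-\frac{\delta}{1+\delta a}\mathbf{Q}^{-1}\boldsymbol{\tau}_k\boldsymbol{\tau}_k^T\mathbf{Q}^{-1},\\
    \det\left(\mathbf{Q}^{(prop)}\right) &= (1+\delta a)\det(\mathbf{Q}),\\
    \log\frac{p(\mathbf{y}\mid \mathbf{Q}^{(prop)})}{p(\mathbf{y}\mid \mathbf{Q})} &= -\frac{1}{2}\log(1+\delta a)+\frac{\delta b^2}{2(1+\delta a)}.
\end{align*}
Each quantity requires only matrix-vector products with $\mathbf{Q}^{-1}$, that is, $\mathcal{O}(n^2)$ operations per latent scale, in line with the $\mathcal{O}(Ln^2)$ cost of a full sweep.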

\begin{algorithm}[!h]
    \caption{\citep{GoodmanPollack}. Multidimensional sorting for $I=2$} 
    \label{alg:goodman}
     \begin{algorithmic}[-1] 
        \Require Matrix $\mathbf{X}\in\mathbb{R}^{n\times 2}$. \\
        \hspace{-0.6cm}\textbf{Output:} partition vectors $\{\tau_\ell\}_{\ell =1}^L$.
    \For{$i = 1,\dots,n-1$}
    \For{$j = i+1,\dots,n$}
        \State Let $u_j=x_{j,1} - x_{i,1}$, and $v_j = x_{j,2} - x_{i,2}$. If $(u_j,v_j)\neq(0,0)$ call $j$ ``good''.
        \State Let $u_{n+j} = -u_{j}, v_{n+j}=-v_j$, and let $m_j=m_{n+j}=v_j/u_j$.
    \EndFor
    \State Sort the indices $\{j : j \text{ is good}\}\cup \{n+j: j \text{ is good}\}$ into subsets:
    \begin{enumerate}
        \item for those for which $u_j > 0$, using $m_j$ as key \label{it:one}
        \item for those for which $u_j = 0$ and $v_j>0$
        \item for those for which $u_j < 0$, using $m_j$ as key \label{it:three}
        \item for those for which $u_j = 0$, and $v_j<0$
    \end{enumerate}
    \State From the sorting in Items~\ref{it:one} and~\ref{it:three} we obtain a list of subsets. Say: $\{J_{11},\dots,J_{1p_1},\dots,J_{r1},\dots,J_{rp_r}\}$, where the points with indices $J_{k1},\dots, J_{kp_k}$ constitute an entire subset, and denote $J^{(k)}$ as their union, and there are $r$ subsets all together. Denote by $k(j)$ the number of the subset within which $j$ lies.
    \State For each $k=1,\dots,r$, let: $n_k = \#\{m : 1 \leq m, J_{km} \leq n\}$, the number of points in each ray.
    \State For each good $j$, consider $A_0^{(ij)}=\{i\} \cup (J^{(k(j))} -\{n+1,\dots,2n\})$ as the points in the same ray as $ij$.
    \If{$k(n+j) > k(j)$}
        \State Define the points in the positive side by: $A_+^{(ij)} = \cup_{k=k(j)+1}^{k(n+j)-1} J^{(k)}-\{n+1,\dots,2n\}$, the points in the negative side by: $A_-^{(ij)} = \{1,\dots,n\} - A_+^{(ij)} - A_0^{(ij)} $.
    \ElsIf{$k(n+j) < k(j)$}
        \State Define the points in the positive side by: $A_+^{(ij)}= \cup_{k=k(j)+1}^{r} J^{(k)}\cup \cup_{k=1}^{k(n+j)-1} J^{(k)}$, and the points in the negative side by: $A_-^{(ij)} = \{1,\dots,n\} - A_+^{(ij)} - A_0^{(ij)}$.
    \EndIf
    \State For each $j=i+1,\dots,n$, let $A_0^{(ij)}$ have $L_j$ ordered items denoted by $\{a_{m}:m =1,\dots,L_j\}$. Add $2L_j$ vectors: $\boldsymbol{\tau}_{\ell}$ for $\ell =1,\dots, L_j$, with entry $k$ equal to $+1$ if $k\in A_{+}^{(ij)} \cup \{a_m:m =1,\dots,\ell\}$, and $-1$ otherwise, and the vectors $\boldsymbol{\tau}_{\ell+L_j}$ for $\ell =1,\dots, L_j$, with entry $k$ equal to $+1$ if $k\in A_{+}^{(ij)} \cup \{a_m:m =\ell+1,\dots,L_j\}$, and $-1$ otherwise. 
    \State Repeat using $A_+^{(ij)}$ as the negative, and $A_-^{(ij)}$ as the positive.
    \EndFor
    \State Discard repeated vectors.
    \State \Return the collection of vectors $\{\boldsymbol{\tau}_\ell\}_{\ell=1}^L$
    \end{algorithmic}
\end{algorithm}


\begin{algorithm}[H]
    \caption{A Metropolis--Hastings sampler for $\mathbf{Q}$ by simulating the latent scales from the prior}\label{alg:samp_Q}
		\begin{algorithmic}[-1]
    \Require Vector $\mathbf{y}\in\mathbb{R}^n$, previous latent scales $\{s_{\ell}\}_{\ell = 1}^L$, vectors $\{\boldsymbol{\tau}_\ell\}_{\ell = 1}^L$, partition probabilities $\{q_\ell\}_{\ell=1}^L$, previous variance matrix ${\mathbf{Q}=\nu\sum_{\ell = 1}^Ls_\ell q_\ell^{2/\alpha}\boldsymbol{\tau}_\ell\boldsymbol{\tau}_\ell^T+\sigma^2 \mathbf{I}}$, and magnitude of errors $\sigma^2$. \\
    \hspace{-0.6cm}\textbf{Output:} Updated $\mathbf{Q}$ matrix.
	\For{$k=1,\dots,L$}
		\State Propose $s_k^*\sim S^+(\alpha/2)$.
		\State Define $\mathbf{Q}^{(prop)}=\nu\sum_{\ell \neq k}s_\ell q_\ell^{2/\alpha}\boldsymbol{\tau}_\ell\boldsymbol{\tau}_\ell^T+ \nu s_k^*q_k^{2/\alpha}\boldsymbol{\tau}_k\boldsymbol{\tau}_k^T+\sigma^2 \mathbf{I}$ .
		\State Accept $s_k^*$ with probability $\min\{p(\mathbf{y}\mid \mathbf{Q}^{(prop)}_{1:n,1:n})/ p(\mathbf{y}\mid \mathbf{Q}_{1:n,1:n}),1\}$.\label{step:log_ratio}
        \State If $s_k^*$ is accepted, replace $s_k$ by $s_k^*$.
		\EndFor
    \State Propose $\sigma^2_*\sim \mathrm{Cauchy}^+(0,1)$.
    \State Compute ${\mathbf{Q}^{(prop)}=\nu\sum_{\ell = 1}^Lq_\ell^{2/\alpha}s_\ell\boldsymbol{\tau}_\ell\boldsymbol{\tau}_\ell^T+\sigma^2_* \mathbf{I}}$.
    \State Accept $\mathbf{Q}^{(prop)}$ with probability $\min\{p(\mathbf{y}\mid \mathbf{Q}^{(prop)}_{1:n,1:n})/ p(\mathbf{y}\mid \mathbf{Q}_{1:n,1:n}), 1\}$.
	\State \Return $\mathbf{Q}^{(prop)}$ if it was accepted, \textbf{return} $\mathbf{Q}$ otherwise.
  \end{algorithmic}
\end{algorithm}
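
\paragraph{Remark on the acceptance ratio.}
The following sketch rests on an assumption that is not restated in Algorithm~\ref{alg:samp_Q}: that $p(\mathbf{y}\mid \mathbf{Q}_{1:n,1:n})$ is the zero-mean Gaussian density with covariance $\mathbf{K}=\mathbf{Q}_{1:n,1:n}$. When each proposal is drawn from the corresponding prior, the prior and proposal densities cancel in the Metropolis--Hastings ratio, leaving only the likelihood ratio that appears in the acceptance steps. Under the Gaussian assumption, its logarithm would be
\begin{align*}
\log\frac{p(\mathbf{y}\mid \mathbf{K}^{(prop)})}{p(\mathbf{y}\mid \mathbf{K})}
= -\frac{1}{2}\mathbf{y}^T\big(\mathbf{K}^{(prop)}\big)^{-1}\mathbf{y}
+\frac{1}{2}\mathbf{y}^T\mathbf{K}^{-1}\mathbf{y}
-\frac{1}{2}\log\frac{\det \mathbf{K}^{(prop)}}{\det \mathbf{K}},
\end{align*}
where $\mathbf{K}^{(prop)}=\mathbf{Q}^{(prop)}_{1:n,1:n}$. For the update of $s_k$, the proposal differs from the current matrix by a rank-one term, $\mathbf{Q}^{(prop)}-\mathbf{Q}=c_k\boldsymbol{\tau}_k\boldsymbol{\tau}_k^T$ with $c_k=\nu(s_k^*-s_k)q_k^{2/\alpha}$. Writing $\mathbf{t}$ for the first $n$ entries of $\boldsymbol{\tau}_k$ and $d=1+c_k\mathbf{t}^T\mathbf{K}^{-1}\mathbf{t}$ (notation introduced only for this remark), the matrix determinant lemma and the Sherman--Morrison formula give
\begin{align*}
\det \mathbf{K}^{(prop)} &= d\,\det \mathbf{K}, &
\big(\mathbf{K}^{(prop)}\big)^{-1} &= \mathbf{K}^{-1}-\frac{c_k}{d}\,\mathbf{K}^{-1}\mathbf{t}\mathbf{t}^T\mathbf{K}^{-1},
\end{align*}
so that, under the same assumption,
\begin{align*}
\log\frac{p(\mathbf{y}\mid \mathbf{K}^{(prop)})}{p(\mathbf{y}\mid \mathbf{K})}
= \frac{c_k}{2d}\big(\mathbf{t}^T\mathbf{K}^{-1}\mathbf{y}\big)^2-\frac{1}{2}\log d .
\end{align*}
This form suggests that each scale update could be evaluated from $\mathbf{K}^{-1}\mathbf{y}$ and $\mathbf{K}^{-1}\mathbf{t}$ alone, without refactorizing the proposed matrix.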


\section{Additional numerical results}\label{sup:simulations}
We report MCMC convergence diagnostics, run times, and posterior quantiles of $\mathbf{Q}_\mathbf{s}\mid \mathbf{y}$. We also include simulation results for a variety of functions in one and two dimensions, where we demonstrate the flexibility of the proposed method.

\subsection{MCMC diagnostics and computation times}

\begin{figure}[!htb]
    \centering
    \includegraphics[width =\linewidth]{images/mcmc_traceplots2.jpg}
    \caption{Trace plots for the proposed MCMC sampler (Algorithm~\ref{alg:samp_full_d}) for the simulations in Section~\ref{sec:nums}, indicating good mixing after about 1000 burn-in iterations. \emph{Left:} one-dimensional case. \emph{Right:} two-dimensional case. Numerical results were obtained using the last 2000 iterations.}
    \label{fig:trace_plots}
\end{figure}


\begin{table}[!htb]
    \centering
    \caption{Total (in seconds) and per-iteration (in milliseconds) computation times of the competing methods for the simulations in Section~\ref{sec:nums}.} 
    \label{tab:times_simulations}
    \begin{tabular}{c c c c c c }
    \toprule
                 &     & Stable & GP MLE & GP Bayes & Bayes NNet  \\ \midrule
Total time (s) & 1-d & 24.047 & 0.067 & 18.073 & 6.91 \\
             & 2-d & 1339.543 & 0.178 & 22.185 & 7.193 \\ \midrule
      Per iteration time (ms) & 1-d & 8.016 & 13.400 & 0.602 & 2.303 \\
        & 2-d & 446.514 & 35.600 & 0.740 & 2.398 \\
    \bottomrule
    \end{tabular}
\end{table}
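
As a consistency check on Table~\ref{tab:times_simulations}, assume the Stable sampler was run for $3000$ iterations in total, that is, about $1000$ burn-in iterations plus the $2000$ retained ones; this count is inferred from the caption of Figure~\ref{fig:trace_plots} rather than stated explicitly. The per-iteration times for the Stable method then follow from the totals as
\begin{align*}
\frac{24.047\ \mathrm{s}}{3000}\approx 8.016\ \mathrm{ms}, \qquad
\frac{1339.543\ \mathrm{s}}{3000}\approx 446.514\ \mathrm{ms},
\end{align*}
so that, under this assumption, one iteration in the two-dimensional example costs roughly $56$ times as much as one in the one-dimensional example.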

\subsection{Posterior quantiles of \texorpdfstring{$\mathbf{Q}_\mathbf{s}\mid \mathbf{y}$}{Qs}}\label{sec:post_Qs}
We present results showing that the posterior distribution of $\mathbf{Q}_\mathbf{s} \mid \mathbf{y}$ is non-degenerate under a stable limit. In a vanilla GP limit, the prior on the kernel converges to a point mass, resulting in a degenerate posterior \citep{pmlr-v139-aitchison21a,pmlr-v202-yang23k}. However, as shown by our Corollary~\ref{cor:random_Q}, $\mathbf{Q}_{\mathbf{s}}$ is stochastic under a stable limit. Figure~\ref{fig:stoch_q} displays the 25th, 50th, and 75th posterior percentiles of $\mathbf{Q}_{\mathbf{s}}\mid \mathbf{y}$ for the examples shown in Sections~\ref{sec:1d} and~\ref{sec:2d}, confirming that this posterior is non-degenerate. This is a key feature that distinguishes the current work from prior work on GP limits.
\begin{figure}[!h]
    \centering   
    \includegraphics[scale=0.1]{images/posterior_all_qs.jpg}
    \caption{Posterior percentiles (\emph{left:} 25th, \emph{center:} 50th, \emph{right:} 75th) of the kernel $\mathbf{Q}_\mathbf{s}$, for the 1-$d$ (\emph{upper}) and 2-$d$ (\emph{lower}) examples, showing a non-degenerate posterior for $\mathbf{Q}_\mathbf{s}$.}
    \label{fig:stoch_q}
\end{figure}

\subsection{Additional results in one dimension}
We show, using a variety of one-dimensional functions, that the Stable procedure performs better in the presence of discontinuities, while performing similarly to GP-based methods or finite-width networks for smooth functions. Posterior uncertainty quantification results are omitted. 
    \paragraph{One-dimensional one-jump function.}
    Consider the function with a single jump given by $f(x)=5\times \mathbf{1}_{\{ x> 0\}}$. We use forty equally-spaced observations between $-2$ and $2$ with Gaussian noise of standard deviation $0.5$. The results are displayed in Figure~\ref{fig:one_one_jump}, with optimal hyper-parameters $\alpha^*=1.1$ and $\nu^*=1$.
    \begin{figure}[!h]
        \centering
        \includegraphics[width=\linewidth]{images/merge_gen_case_alpha_1.1_sigma_0.5_sigma_unk_nu_1_type_1_dim_1.jpg}
        \caption{Out-of-sample error comparison and predictions with scatter plot of the observations for the four methods for a function with a single jump.}
        \label{fig:one_one_jump}
    \end{figure}
    \paragraph{One-dimensional two-jump function.}
    Consider the function with two jumps given by $f(x)=5\times \mathbf{1}_{\{-2/3\leq x< 2/3\}}$. We use forty equally-spaced observations between $-2$ and $2$ with Gaussian noise of standard deviation $0.5$. The results are displayed in Figure~\ref{fig:one_d_two_jump}, with optimal hyper-parameters $\alpha^*=1.3$ and $\nu^*=1$.
    \begin{figure}[!h]
        \centering
        \includegraphics[width=\linewidth]{images/merge_gen_case_alpha_1.3_sigma_0.5_sigma_unk_nu_1_type_2_dim_1.jpg}
        \caption{Out-of-sample error comparison and predictions with scatter plot of the observations for the four methods for a function with two jumps.}
        \label{fig:one_d_two_jump}
    \end{figure}
    
    \paragraph{One-dimensional piece-wise smooth.}
    Consider the piece-wise smooth function with a single jump, given by 
    \begin{align*}
     f(x)=\begin{cases}
     -2x^2 + 8, & x\geq 0, \\
     -3 x +2, & x < 0.
     \end{cases}
    \end{align*}
    We use forty equally-spaced observations between $-2$ and $2$ with Gaussian noise of standard deviation $0.5$, and display the results in Figure~\ref{fig:one_d_pw_smooth}, using the optimal hyper-parameters $\alpha^*=1$ and $\nu^*=1$.
    \begin{figure}[!htb]
        \centering
        \includegraphics[width=\linewidth]{images/merge_gen_case_alpha_1_sigma_0.5_sigma_unk_nu_1_type_3_dim_1.jpg}
        \caption{Out-of-sample error comparison and predictions with scatter plot of the observations for the four methods for a piece-wise smooth function.}
        \label{fig:one_d_pw_smooth}
    \end{figure}
    \paragraph{One-dimensional smooth function.}
    Finally, consider the smooth function $f(x)=-2\cos(x)^2 + 3\tanh(x) -2x$. We use forty equally-spaced observations between $-2$ and $2$ with Gaussian noise of standard deviation $0.5$. The results are shown in Figure~\ref{fig:smooth_1d}. The optimal hyper-parameters are $\alpha^*=1.9$ and $\nu^*=1$.
    \begin{figure}[!htb]
        \centering
        \includegraphics[width=\linewidth]{images/merge_gen_case_alpha_1.9_sigma_0.5_sigma_unk_nu_1_type_5_dim_1_2.jpg}
        \caption{Out-of-sample error comparison and predictions with scatter plot of the observations for the four methods for a smooth function.}
        \label{fig:smooth_1d}
    \end{figure}

    \subsection{Additional results in two dimensions}
    We show, using a variety of two-dimensional functions, that the Stable procedure performs better in the presence of discontinuities, while performing similarly to GP-based methods or finite-width networks for smooth functions. Posterior uncertainty quantification results are available but omitted here. 
\paragraph{Two-dimensional one-jump function.}
Consider the function $f(x_1,x_2) = 5\times \mathbf{1}_{\{x_1 + x_2 > 0\}}$. Using the grid of points on $[-1,1]^2$ detailed in Section~\ref{sec:nums}, and additive Gaussian noise with $\sigma=0.5$, we obtain the prediction results shown in Figure~\ref{fig:one_jump_2d}. The optimal hyper-parameters are $\alpha^*=0.1$ and $\nu^*=1$.

\begin{figure}[!htb]
    \centering
    \includegraphics[width=\linewidth]{images/merge_gen_case_alpha_0.3_sigma_0.5_sigma_unk_nu_1_type_1_dim_2_2.jpg}
    \caption{Out-of-sample error comparison and predictions for the four methods for a jump function in two-dimensions.}
    \label{fig:one_jump_2d}
\end{figure}

\paragraph{Two-dimensional smooth edge.}
Consider the function $f(x_1,x_2) = 5\times \mathbf{1}_{\{x_1^2 + 2x_2 -0.4 > 0\}}$. Note that the jump boundary is determined by a smooth curve. Using the grid of points on $[-1,1]^2$ detailed in Section~\ref{sec:nums}, and additive Gaussian noise with $\sigma=0.5$, we obtain the prediction results shown in Figure~\ref{fig:smooth_jump_2d}. The optimal hyper-parameters are $\alpha^*=1$ and $\nu^*=1$.
The Stable method is able to capture the smoothness of the jump boundary without losing predictive power.

\begin{figure}[!htb]
    \centering
    \includegraphics[width=\linewidth]{images/merge_gen_case_alpha_1_sigma_0.5_sigma_unk_nu_1_type_2_dim_2_2.jpg}
    \caption{Out-of-sample error comparison and predictions for the four methods for a function with a jump that is parameterized by a smooth function.}
    \label{fig:smooth_jump_2d}
\end{figure}

\paragraph{Two-dimensional smooth function.}
Consider the smooth function $f(x_1,x_2) = x_1^2 + x_2^2 - x_1x_2$. We use the grid of points on $[-1,1]^2$ detailed in Section~\ref{sec:nums}, and additive Gaussian noise with $\sigma=0.5$. Since this function is continuous, we expect the Stable method to perform similarly to the competing methods. The prediction results are shown in Figure~\ref{fig:smooth_2d}. The optimal hyper-parameters are $\alpha^*=1.3$ and $\nu^*=1$.
\begin{figure}[!htb]
    \centering
    \includegraphics[width=\linewidth]{images/merge_gen_case_alpha_1.3_sigma_0.5_sigma_unk_nu_1_type_5_dim_2_2.jpg}
    \caption{Out-of-sample error comparison and predictions for the four methods for a smooth function in two-dimensions.}
    \label{fig:smooth_2d}
\end{figure}


\section{Ablation study on \texorpdfstring{$\alpha$}{alpha} and \texorpdfstring{$\nu$}{nu}}\label{sec:ablation}
We perform an ablation study on the tuning parameters $\alpha$ and $\nu$ for the examples we consider in Section~\ref{sec:nums}. Figure~\ref{fig:ablation} displays the mean absolute error (MAE) over a grid of the tuning parameters for both examples. The results show that the smaller MAEs are mostly concentrated close to $\nu=1$. These results suggest that $\nu=1,\alpha=1$ are good default choices.
\begin{figure}[!h]
    \centering
    \includegraphics[scale=0.1]{images/ablation_1d_2d.jpg}
    \caption{Ablation study for the two numerical examples, displaying the mean absolute error for varying $\alpha$ and $\nu$ parameters. \emph{Left:} one dimension. \emph{Right:} two dimensions.}
    \label{fig:ablation}
\end{figure}

\section{Results on S\&P 500}\label{sec:common_data}
We provide out-of-sample prediction results for S\&P 500 closing prices\footnote{Closing prices between July 1, 2019 and June 30, 2021, obtained from \url{https://www.nasdaq.com/market-activity/index/spx/historical}.}, using $336$ and $169$ as the training and testing set sizes, respectively, with $\alpha^*=1.9$ and $\nu^*=1$. Table~\ref{tab:mae_SP} reports the mean absolute error (MAE) of each method: the Stable method attains the lowest MAE ($0.054$), tied with GP Bayes, and improves on GP MLE ($0.071$) and Bayes NNet ($0.210$).
\begin{table}[!h]
    \centering
    \caption{Mean absolute error of predictions and standard errors computed on 10 random training--testing splits in the S\&P 500 closing prices by method.} 
    \label{tab:mae_SP}
       \begin{tabular}{c c c c c}
     \toprule
          & Stable & GP MLE & GP Bayes & Bayes NNet \\ \midrule
         MAE & 0.054 & 0.071 & 0.054 & 0.210 \\
         (SE) & (0.008) & (0.006)  &  (0.005)  & (0.016)\\
    \bottomrule
    \end{tabular}
\end{table}
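
For reference, the entries of Table~\ref{tab:mae_SP} can be read under the usual definitions, which we assume here because they are not restated in this section. With $m$ test observations $y_i$ and predictions $\hat{y}_i$ in a given split, and $R=10$ random splits with per-split errors $\mathrm{MAE}_r$ (the symbols $m$, $\hat{y}_i$, $R$, and $\mathrm{MAE}_r$ are introduced only for this illustration),
\begin{align*}
\mathrm{MAE}_r = \frac{1}{m}\sum_{i=1}^{m}\lvert y_i-\hat{y}_i\rvert, \qquad
\overline{\mathrm{MAE}} = \frac{1}{R}\sum_{r=1}^{R}\mathrm{MAE}_r, \qquad
\mathrm{SE} = \sqrt{\frac{1}{R(R-1)}\sum_{r=1}^{R}\big(\mathrm{MAE}_r-\overline{\mathrm{MAE}}\big)^2}.
\end{align*}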

\end{document}