\documentclass[
  journal=proceedings,
  manuscript=article-type,
  year=2024
]{PMET_proc}

\usepackage{amsmath}
\usepackage[nopatch]{microtype}
\usepackage{booktabs}
\newcommand{\proglang}[1]{{\normalfont\fontseries{b}\selectfont #1}}

\title{Incorporating Sparsity into Bayesian Stacking Procedures}
\author{Kjorte Harra}
\affiliation{Department of Educational Psychology, University of Wisconsin-Madison, Madison, WI, United States}
\email[K. Harra]{harra@wisc.edu}

\author{David Kaplan}
%\affiliation{Second Division, Organization, City, Pincode, State, Country}
%\alsoaffiliation{Joint first authors}

% \author{T. Author}
% \affiliation{Second Division, Organization, City, Pincode, State, Country}

% \author{F.T. Author}
% \affiliation{Fourth Division, Organization, City, Pincode, State, Country}

\addbibresource{example.bib}

\keywords{Bayesian regularization, Bayesian stacking, predictive performance} %% First letter not capped

\begin{document}

\begin{abstract}
Bayesian stacking is a procedure adapted from machine learning that allows researchers to combine multiple unique models and optimize overall predictions, with the added benefit of not relying on the strong assumptions necessary for Bayesian model averaging (BMA). For individual models, Bayesian regularization via sparsity-inducing priors yields stronger predictive accuracy than unregularized modeling approaches. While model stacking is not intended to serve as a method for performing variable selection, we are unaware of any systematic investigation of whether sparsity-inducing priors applied to member models in a stack lead to more accurate predictions. The present work investigates whether adding Bayesian regularization via sparsity-inducing priors to individual member models is a worthwhile practice when using Bayesian stacking procedures. Against our expectations, we find that inducing sparsity in stacking member models does not improve the predictive performance of the stack. Other results and limitations of this work are also discussed.
\end{abstract}

 
To optimize predictive performance for a given outcome, there are many approaches researchers can take. Bayesian stacking, a model ensembling procedure adopted from machine learning, optimizes predictions by combining multiple unique models \citep{Breiman1996, Wolpert1992, yao2018, ClydeIversen2013}. Bayesian stacking forms a weighted mixture of predictive distributions from an ensemble of individual models. This Bayesian model ensembling method is an improvement over the more classical approach of \emph{Bayesian model averaging} (BMA) \citep{madiganraftery94,draper95,hoeting99} in that Bayesian stacking does not assume that the true data generating model is in the space of models being averaged, and is theoretically expected to yield stronger predictive performance than that of any single model chosen for predictive purposes.

Another approach known to boost predictive performance is Bayesian regularization. Implemented through sparsity-inducing priors, these methods have demonstrated improved model accuracy and predictive performance under many modeling methods as compared to unregularized approaches, particularly with small samples \citep{Harra2023, Jacobucci2018, vanErp2019}. Sparsity-inducing priors, or shrinkage priors, such as the lasso \citep{Tibshirani1996} and horseshoe priors \citep{carvalho2009, carvalho2010, piironen2017} can perform variable selection and introduce model simplicity without sacrificing model performance. Although these methods have been well studied for individual model performance, it remains unclear whether they could also benefit model ensembling methods such as Bayesian stacking.

While incorporating sparsity through Bayesian regularization has been hypothesized to improve prediction accuracy \citep{Breiman1996, yao2018, loo2023}, this remains an open question, particularly with the use of newer priors such as the regularized horseshoe prior \citep{piironen2017}. Our present work seeks to investigate the potential benefits, if any, of incorporating sparsity into member models within Bayesian stacking procedures for improving predictive accuracy. The following sections provide the necessary context for this work, and we then examine this question in a simulation study using a stack of Bayesian linear regression models.

\section{Bayesian Stacking}

Model stacking is essentially a weighted combination of predictions from a set of $K$ specified models ($k = 1, 2, \ldots, K$). Model predictions are combined (stacked) to yield a weighted combination of predictive distributions \citep{kaplan2024}. This method of model ensembling was originally developed in the machine learning literature by \citeauthor{Wolpert1992} (\citeyear{Wolpert1992}) and \citeauthor{Breiman1996} (\citeyear{Breiman1996}) and brought into the Bayesian framework by \citeauthor{ClydeIversen2013} (\citeyear{ClydeIversen2013}).
    
We can define a set of weights on a simplex as

\vspace{-6pt}
    \begin{equation}
        \mathcal{W}_1^K = \Biggl\{w \in [0,1]^K : \sum_{k=1}^K w_k =1\Biggr\}. 
    \end{equation}

    %\vspace{-6pt}
        
To approximate the full predictive distribution, $p(\tilde{y}\mid y,M_k)$, we use the leave-one-out (LOO) predictive distribution \citep{yao2018},

\vspace{-6pt}

    \begin{equation}
        \hat{p}_{k,-i}(y_{i}) \equiv \hat{p}(y_{i}\mid y_{-i},M_k) = \int p(y_{i}\mid\theta_k, M_k)\,p(\theta_k\mid y_{-i},M_k)\,d\theta_k.
    \end{equation}

    %\vspace{-6pt}
    
The stacking weights using the log score are the solution to 

\vspace{-6pt}

    \begin{equation}\label{logscoreW}
        \max_{w\in\mathcal{W}_1^K}\frac{1}{n}\sum_{i=1}^n \log\sum_{k=1}^K w_k\,\hat{p}(y_{i}\mid y_{-i},M_k).
    \end{equation}
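
To illustrate how the solution to Equation (\ref{logscoreW}) is used, write $\hat{w} = (\hat{w}_1, \ldots, \hat{w}_K)$ for the optimized weights (our notation for this sketch). The stacked predictive distribution for a new observation $\tilde{y}$ is then the mixture \citep{yao2018}

\begin{equation}
    \hat{p}(\tilde{y}\mid y) = \sum_{k=1}^K \hat{w}_k\, p(\tilde{y}\mid y, M_k).
\end{equation}

\noindent In the simplest case of $K = 2$ member models, the weights reduce to $(w, 1-w)$ with $w \in [0,1]$, and the objective in Equation (\ref{logscoreW}) becomes the one-dimensional problem

\begin{equation}
    \max_{w\in[0,1]}\frac{1}{n}\sum_{i=1}^n \log\bigl[w\,\hat{p}(y_{i}\mid y_{-i},M_1) + (1-w)\,\hat{p}(y_{i}\mid y_{-i},M_2)\bigr],
\end{equation}

\noindent which is concave in $w$ because each summand is the logarithm of a function that is linear in $w$.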

Various weighting methods are available. $\mbox{ELPD}_{loo}$ weighting is based on the $\mbox{ELPD}$ (expected log point-wise predictive density) of a model, which is our primary focus for this paper. Other weighting methods include Pseudo-BMA (PBMA) and Pseudo-BMA+ (PBMA+) \citep{yao2018}. However, preliminary analyses for this work demonstrated no noteworthy differences in performance between weighting strategies, so the remainder of this work will implement $\mbox{ELPD}_{loo}$ weighting.
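
For reference, the Pseudo-BMA weights mentioned above can be sketched as follows, as we understand them from \citep{yao2018}. Writing $\widehat{\mbox{ELPD}}{}^{k}_{loo}$ for the estimated ELPD of model $k$ (the superscript is our notation for this sketch), the weight for model $k$ is

\begin{equation}
    w_k = \frac{\exp\bigl(\widehat{\mbox{ELPD}}{}^{k}_{loo}\bigr)}{\sum_{k'=1}^K \exp\bigl(\widehat{\mbox{ELPD}}{}^{k'}_{loo}\bigr)}.
\end{equation}

\noindent Unlike the stacking weights, which are optimized jointly over the simplex, these weights are computed from each model's own score. Pseudo-BMA+ additionally accounts for the uncertainty in the LOO estimates by combining this weighting with the Bayesian bootstrap \citep{yao2018}.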

% \vspace{-12pt}

%  \begin{equation}
%     \prod_{i=1}^n p(y_{i}|y_{-i},M_k).        
% \end{equation}

% Pseudo-BMA+ (PBMA+) better accounts for the uncertainty of the LOO estimation of weights via combining the Bayesian bootstrap with $\mbox{ELPD}$ weighting, where the final weight for model $k$ given a bootstrap sample $b$ \citep{yao2018}: 

% \vspace{-12pt}

%  \begin{equation}
%    w_k = \frac{1}{B}\sum_{b=1}^B w_{k,b}.
% \end{equation}

% \vspace{-6pt}

\section{Overview of Bayesian Regularization}

Bayesian regularization penalizes small regression coefficients by attaching a prior distribution to model parameters \citep{Jacobucci2018}. Many regularization priors are available, beginning with the ridge prior \citep{Hsiang75}, which seeks to shrink parameters toward zero and mitigate the effects of collinearity. The Bayesian lasso \citep{Park2008} improves upon the ridge prior in that it enables shrinkage of coefficients to zero, allowing for variable selection.

The Bayesian ridge and lasso priors, described below, are extensions of frequentist methods to the Bayesian context. Strictly Bayesian approaches include the horseshoe prior \citep{carvalho2009, carvalho2010}, which allows for greater shrinkage than the ridge and the lasso while leaving large coefficients unregularized. The regularized horseshoe \citep{piironen2017} prevents large coefficients from escaping shrinkage, allows more flexibility than the original horseshoe prior, and has been shown to further improve model predictive performance \citep{piironen2017, Harra2023}.

Previous research has shown that Bayesian regularization can perform as well as, if not better than, classical methods of regularization in linear regression \citep{vanErp2019}. This finding has not been extended to ensemble modeling methods such as Bayesian stacking, particularly with a focus on optimizing out-of-sample predictive performance. Thus, this paper focuses on the performance of three Bayesian regularization priors, particularly the regularized horseshoe, in the context of Bayesian stacking procedures for linear regression. We investigate this via a simulation study comparing several regularized model stacks to unregularized model stacks in terms of the amount of shrinkage induced and out-of-sample predictive performance.


\subsection{Priors to be investigated}

Figure \ref{shrinkplots} shows the density plots for the three regularization priors that we will be studying in this paper.

\begin{figure}
    \centering
    \includegraphics[keepaspectratio = true,scale = 0.45]{plots_nohorse.png}
    \caption{Regularization priors used in this paper. From left to right: the ridge (normal) prior $\mathcal{N}(0,1)$; the lasso (Laplace) prior with location 0 and scale 4; and the regularized horseshoe prior with $\beta_{j}\mid\lambda_{j}, \tau, c \sim \mathcal{N}(0, \tau^{2}\tilde{\lambda_{j}^{2}})$, where $\tilde{\lambda_{j}^{2}} = \frac{c^{2}\lambda^{2}_{j}}{c^{2}+\tau^{2}\lambda_{j}^{2}}$ and $\lambda_{j} \sim \mathcal{C}^+(0,1)$.}
    \label{shrinkplots}
\end{figure}

\vspace{-12pt}

\subsubsection{The Ridge Prior} 

Frequentist ridge regression \citep{hoerl1970ridge,hoerl1985ridge} aims to yield a regularized regression model in the presence of highly correlated variables. The Bayesian specification of ridge regression was suggested by \citeauthor{Hsiang75} (\citeyear{Hsiang75}), who showed that if the regression coefficients, $\boldsymbol{\beta}$, have a prior with mean zero and covariance matrix $\boldsymbol{\Sigma} = (\sigma^2_{\epsilon}/\lambda)\mathbf{I}$, and if $\epsilon \sim N(0,\sigma^2_{\epsilon}\mathbf{I})$, then the posterior mean of $\boldsymbol{\beta}$ is $(\boldsymbol{x'x}+\lambda \mathbf{I})^{-1}\boldsymbol{x'y}$, which is the ridge estimator. The penalty term ($\lambda$) is captured through normally distributed independent priors placed on the regression slope parameters. These normal priors have mean hyperparameter values fixed at zero in order to control shrinkage toward zero. The variance hyperparameter is typically rescaled to be in standard deviation form and is set to define the degree of spread that the distribution exhibits. Note that we specify a half-Cauchy prior distribution, denoted as $\mathcal{C}^+(0,1)$, for the residual standard deviation, but other priors could be specified as well. A representation of the ridge prior is given in the left of Figure \ref{shrinkplots}.
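
The correspondence between the normal prior and the ridge penalty can be sketched as follows. For this illustration we assume the likelihood $\boldsymbol{y}\mid\boldsymbol{\beta} \sim N(\boldsymbol{x\beta}, \sigma^2_{\epsilon}\mathbf{I})$, the prior $\boldsymbol{\beta} \sim N(\mathbf{0}, (\sigma^2_{\epsilon}/\lambda)\mathbf{I})$, and a known residual variance $\sigma^2_{\epsilon}$. Up to an additive constant, the log posterior is

\begin{equation}
    \log p(\boldsymbol{\beta}\mid\boldsymbol{y}) = -\frac{1}{2\sigma^2_{\epsilon}}\Bigl[(\boldsymbol{y}-\boldsymbol{x\beta})'(\boldsymbol{y}-\boldsymbol{x\beta}) + \lambda\,\boldsymbol{\beta}'\boldsymbol{\beta}\Bigr] + \text{const}.
\end{equation}

\noindent Setting the gradient with respect to $\boldsymbol{\beta}$ to zero gives $(\boldsymbol{x'x}+\lambda \mathbf{I})\boldsymbol{\beta} = \boldsymbol{x'y}$. Because the posterior is normal under these assumptions, its mode and mean coincide, so the posterior mean is $(\boldsymbol{x'x}+\lambda \mathbf{I})^{-1}\boldsymbol{x'y}$.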

\subsubsection{The Lasso Prior}

A drawback of ridge regression is that it does not improve parsimony in that all of the variables still remain in the model after penalization \citep{ZouHastie2005}. A method that appears similar to ridge regression but can yield a parsimonious model is the \textit{least absolute shrinkage and selection operator} \citep{Tibshirani1996}. 

The Bayesian lasso \citep{Park2008} uses a double exponential or Laplace prior where 

\begin{equation}
p(\beta_j) = \frac{1}{2\tau}\exp\left(-\frac{|\beta_j|}{\tau}\right),
\end{equation}
where $\tau = 1/\lambda$.
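
The link between this prior and the lasso penalty can be sketched in the same way as for the ridge. For this illustration we assume a likelihood $\boldsymbol{y}\mid\boldsymbol{\beta} \sim N(\boldsymbol{x\beta}, \sigma^2_{\epsilon}\mathbf{I})$ with known $\sigma^2_{\epsilon}$ and independent Laplace priors on the $p$ slope coefficients. Using $1/\tau = \lambda$, the negative log posterior is, up to an additive constant,

\begin{equation}
    -\log p(\boldsymbol{\beta}\mid\boldsymbol{y}) = \frac{1}{2\sigma^2_{\epsilon}}(\boldsymbol{y}-\boldsymbol{x\beta})'(\boldsymbol{y}-\boldsymbol{x\beta}) + \lambda\sum_{j=1}^{p}|\beta_j| + \text{const}.
\end{equation}

\noindent The posterior mode under this prior therefore minimizes a residual sum of squares plus an absolute-value ($L_1$) penalty, which is the form of the frequentist lasso criterion \citep{Tibshirani1996}.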

The middle of Figure \ref{shrinkplots} shows the double exponential distribution. This distribution is well suited to regularization because it peaks sharply at zero, shrinking small coefficients toward zero, while its tails can be set to be thick, allowing larger coefficients to remain large. Because the distribution is centered at zero to control shrinkage toward zero, the location hyperparameter is fixed at zero. The scale, or dispersion, of the double exponential distribution is the configurable hyperparameter when implementing the lasso. It defines the amount of spread and the thickness of the tails, which controls the degree of shrinkage in coefficients. Again, a $\mathcal{C}^+(0,1)$ prior can be specified on the standard deviation of the residuals, if desired.

\subsubsection{The Regularized Horseshoe Prior}

The regularized horseshoe is a variant of the original horseshoe prior \citep{carvalho2009, carvalho2010}. The original horseshoe prior can be characterized as a scale mixture of normals with half-Cauchy tails, offering unique features in enacting shrinkage that distinguish it from other regularization priors. More specifically, the tails of its $\mathcal{C}^+$ distribution permit large parameters to remain unregularized, while the global shrinkage parameter $\tau$ severely shrinks parameters that are small.

A limitation of the original horseshoe prior relates to cases where large coefficients can transcend the global scale set by $\tau_0$, with the result that the posteriors of these large coefficients can become quite diffuse, particularly in the case of weakly identified coefficients \citep{betancourtSparsity, piironen2017, kaplanbayesbook2}. To remedy this issue, \citeauthor{piironen2017} (\citeyear{piironen2017}) proposed a \emph{regularized} version of the horseshoe prior. Following the notation used in \citeauthor{betancourtSparsity} (\citeyear{betancourtSparsity}), the regularized horseshoe prior takes the following form.

For $j = 1, \ldots, p$, where $p$ is the number of predictors,

\vspace{-12pt}

\begin{subequations}\label{reghorse}
\begin{align}
     \beta_{j} &\sim \mathcal{N}(0, \tau^2\tilde{\lambda_{j}^2}),\\
     \tilde{\lambda_{j}} &= \frac{c\lambda_{j}}{\sqrt{c^{2} + \tau^{2}\lambda^{2}_{j}}},\\
      \lambda_{j} &\sim \mathcal{C}^+(0,1),\\
      c^{2} &\sim \mathcal{IG}\left(\frac{\nu}{2}, \frac{\nu}{2}s^{2}\right),\\
    \tau &\sim \mathcal{C}^+(0, \tau_{0}),
\end{align}
\end{subequations}

\noindent where $c > 0$ and $s^2$ is the variance for each of the $p$ predictor variables. Those variables that have large variances would be considered more relevant a priori, and while it is possible to provide predictor-specific values for $s^2$, generally we scale the variables ahead of time so that $s^2 = 1$. Finally, $c^2$ is the slab width, which controls the size of the large regression coefficients \citep{piironen2017}. The density plot for the regularized horseshoe is given on the right of Figure \ref{shrinkplots}. 
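
The role of the slab width can be made concrete by examining two limiting cases of the prior variance $\tau^2\tilde{\lambda_{j}^2}$ in Equation (\ref{reghorse}); this is a brief sketch following the argument in \citep{piironen2017}. Since $\tilde{\lambda_{j}^{2}} = c^{2}\lambda_{j}^{2}/(c^{2}+\tau^{2}\lambda_{j}^{2})$,

\begin{align}
    \tau^{2}\lambda_{j}^{2} \ll c^{2} &\;\Rightarrow\; \tau^2\tilde{\lambda_{j}^{2}} \approx \tau^{2}\lambda_{j}^{2}, \\
    \tau^{2}\lambda_{j}^{2} \gg c^{2} &\;\Rightarrow\; \tau^2\tilde{\lambda_{j}^{2}} \approx c^{2}.
\end{align}

\noindent In the first case the prior behaves like the original horseshoe, so small coefficients are shrunk strongly toward zero. In the second case the prior variance is capped at $c^2$, so that even the largest coefficients receive an approximately $\mathcal{N}(0, c^2)$ prior and are still regularized to some degree.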

\section{Present Study}

A Monte Carlo simulation was conducted to evaluate shrinkage and out-of-sample predictive performance across 6 prior distributions and 4 sample size conditions ($n = 50, 100, 500, 1000$). For each iteration, a population of 10,000 observations was generated with 40 normal predictors grouped into 5 Bayesian linear regression models and an intercept-only model. Each model had half the coefficients set as small (ranging from 0 to 1), and half large (ranging from 10 to 20) with coefficient values varying across models. The outcome variable, $y$, was generated using these coefficients and the full population data, from which a standardized random sample of size $n$ was drawn for  model fitting and analyses. The same sample data were used for all prior conditions.

\begin{table}[hbt!]
\centering
\renewcommand{\arraystretch}{1.1}
\begin{tabular}{r|c}
\toprule
\textbf{Prior Condition} & \textbf{Specification for $\beta_j$} \\ 
\midrule
\textit{Non-Informative} & $\mathcal{N}(0,100)$             \\
\textit{\proglang{rstanarm} Default}   & $\mathcal{N}(0,2.5)$             \\
\textit{Informative}     & $\mathcal{N}(\bar{x}_j, 1)$      \\
\textit{Ridge}           & $\mathcal{N}(0, 1)$               \\
\textit{Lasso}           & $\frac{1}{2\tau}\exp\left(-\frac{|\beta_j|}{\tau}\right)$ \\
\textit{Reg. Horseshoe}  & $ \mathcal{N}(0, \tau^2\tilde{\lambda_{j}^2})$ \\ 
\bottomrule
\end{tabular}
\caption{Specification of the prior on the regression coefficients $\beta_j$ under each of the six prior conditions.}
\end{table}

Hyperparameters for the regularized prior conditions were selected based on previous literature recommendations to control shrinkage toward zero, as detailed in previous sections \citep{Hsiang75, Park2008, piironen2017}. Each of the 24 study conditions was run for 500 iterations. 

\subsection{Evaluating Predictive Performance}

For this paper, we use Bayesian leave-one-out cross-validation (LOO-CV) to evaluate out-of-sample predictive performance \citep{vehtari2017loo}. Bayesian LOO-CV is a special case of \emph{k}-fold cross-validation, in which the data set is divided into $k$ folds; the model of interest is fit to $k-1$ folds (the training set), and its predictions are evaluated on the held-out fold (the test set). LOO-CV is the case where $k=n$, so that each held-out fold consists of a single observation, $y_i$.

LOO-CV is well suited to the question of out-of-sample predictive performance \citep{Allen74, Stone74}. It is closely related to the \emph{widely applicable information criterion} (WAIC), a fully Bayesian counterpart to the AIC \citep{Watanabe2010}.

The \textit{expected log point-wise predictive density} (ELPD) for LOO-CV, the $\text{ELPD}_{loo}$, is defined as:

\vspace{-6pt}

\begin{equation}\label{ELPDloo}
\mbox{ELPD}_{loo} = \sum_{i=1}^{n}\log p(y_i\mid y_{-i}),
\end{equation}

\noindent where 

\vspace{-12pt}

\begin{equation}
p(y_i\mid y_{-i}) = \int p(y_i\mid\theta)p(\theta\mid y_{-i})d\theta
\end{equation}

\noindent is the LOO predictive density given the data with the $i^{th}$ data point left out \citep{vehtari2017loo}. The sum of the log predictive densities in Equation (\ref{ELPDloo}) is the LOO-CV estimate of the ELPD \citep{GelmanHwangVehtari2014, Gronau2019, vehtari2017loo}.

An information criterion based on LOO, referred to as the \emph{LOO-IC}, can be derived as

\begin{equation}
\mbox{LOO-IC} = -2\,{\widehat{\mbox{ELPD}}_{loo}}
\end{equation}

\noindent which places the LOO-IC on the deviance scale. Among a set of competing models, the one with the smallest LOO-IC is considered the best from an out-of-sample point-wise predictive point of view. We use the LOO-IC for the comparison of our regularization priors in our simulation study.
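
As a small arithmetic illustration with hypothetical values (these are not results from our simulation), suppose two models, labeled $A$ and $B$, have estimated ELPD values of $-150$ and $-140$, respectively. Then

\begin{align*}
    \mbox{LOO-IC}_A &= -2(-150) = 300, \\
    \mbox{LOO-IC}_B &= -2(-140) = 280,
\end{align*}

\noindent so model $B$, with the higher ELPD and therefore the lower LOO-IC, would be preferred on out-of-sample predictive grounds.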

\section{Results}

For this study, we aimed to examine differences in member model-induced shrinkage and model stack predictive performance across simulation conditions. %removed model weighting strategies,

% \noindent
% \begin{figure}[hbt!]
% \centering
% \includegraphics[keepaspectratio = true, scale = 0.45]{wts_combined.pdf}
%   \caption{Composition of mean stacking weights across prior, sample size, and weighting method conditions.} \label{wts}
% \end{figure}

%First, we compared the distribution of stacking model weights across our simulation conditions and weighting methods. The dispersion of model weights across weighting methods, sample size, and prior condition are contained in Figure \ref{wts}. We observe differences between weighting methods across conditions, as $\textup{ELPD}_{loo}$ model weights vary across the stack member models, with one model not dominating the others. Whereas for pseudo-BMA and pseudo-BMA+, especially with large samples, we can see a single model carrying a large majority of the weight across conditions. This may be due to dominance of predictive performance of a single model in the larger sample size conditions, as depicted in Figure \ref{loo_all}. We did not observe consistent patterns in model weights across the sparsity conditions. Note for the remainder of the analysis, we will be using the  $\textup{ELPD}_{loo}$ weights. 

\noindent
\begin{figure}[hbt!]
    \centering
    \includegraphics[keepaspectratio = true, scale = 0.45]{coef_shrinkage_v2.pdf}
    \caption{Mean total coefficient estimates for each linear member model across prior and sample size conditions, demonstrating induced shrinkage via regularized priors. Note: Model 1 is omitted as it is an intercept model with no regularized coefficients or variation across conditions.}
    \label{coef_shrink}
\end{figure}


Induced shrinkage for the linear member models was compared across prior and sample size conditions, as shown in Figure \ref{coef_shrink}, which depicts the sum of coefficient estimates for each linear model across conditions. We observe that, for small samples in particular, the lasso and regularized horseshoe induced the greatest amount of shrinkage for the linear member models compared to the other prior conditions. This expected finding demonstrates that the choice of prior distribution affects the amount of sparsity introduced into the stack member models, particularly for small samples where priors are more influential. %However, the main purpose of this work is to investigate whether these sparsity-inducing regularization priors benefit the performance of the stack specifically. 

\noindent
\begin{figure}[hbt!]
    \centering
    \includegraphics[keepaspectratio = true, scale = 0.45]{loo_plot_vertical.pdf}
    \caption{Comparison of mean LOO-IC estimates for individual models and model stacks across prior distribution and sample size conditions. The black line represents the stack's mean LOO-IC. Note: Model 1 is omitted as it is an intercept model with no regularized coefficients or variation across conditions.}
    \label{loo_all}
\end{figure}

Lastly, we compared the out-of-sample predictive performance of each linear member model to the stacked prediction across conditions, visualized in Figure \ref{loo_all}. We find that, for the linear member models, the lasso and regularized horseshoe improved predictive performance as measured by the LOO-IC, particularly when samples are small. We also observed, especially with small samples, that the stacked predictions outperform all the member models. However, we saw no improvement in predictive performance for the model stack from regularization via the lasso and regularized horseshoe priors. Prior selection did not affect LOO-IC estimates for the model stack despite benefiting the linear member models.

\section{Discussion}

In this work, we found that inducing sparsity in Bayesian stacking member models boosts individual model out-of-sample predictive performance, especially when $n$ is small, as expected \citep{Harra2023}. However, there appears to be no meaningful boost in predictive performance for the stacked models. In line with previous work on this topic, we observed that stacked predictions have stronger predictive accuracy than any individual member model \citep{yao2018, kaplan2024}. We also found that introducing regularization priors to the linear member models introduced sparsity and improved the predictive performance of the individual member models. 
  
Our work here aligns with previous research demonstrating that stacked models dominate in predictive performance over any individual model \citep{Breiman1996,yao2018,kaplan2024}. As we expected, particularly with small samples, the model stack demonstrated improved out-of-sample predictive performance over the member models. This finding, and similar previous findings, demonstrate the effectiveness of Bayesian stacking. 
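
One way to see why the stack is expected to do at least as well as its members on the LOO log score is the following sketch, which assumes (our assumption for this illustration) that the stack and the member models are scored with the same LOO predictive densities. Placing all weight on a single model $k$, that is, $w_k = 1$ and all other weights equal to zero, is a feasible point of the simplex $\mathcal{W}_1^K$, so the maximized objective in Equation (\ref{logscoreW}) satisfies

\begin{equation}
    \max_{w\in\mathcal{W}_1^K}\frac{1}{n}\sum_{i=1}^n \log\sum_{k=1}^K w_k\,\hat{p}(y_{i}\mid y_{-i},M_k) \;\geq\; \max_{k}\,\frac{1}{n}\sum_{i=1}^n \log \hat{p}(y_{i}\mid y_{-i},M_k).
\end{equation}

\noindent Multiplying both sides by $-2n$ shows that the corresponding LOO-IC of the stack is no larger than the smallest LOO-IC among the member models. This inequality concerns the LOO score on which the weights were optimized and is not by itself a guarantee about performance on new data.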

While introducing sparsity can help with variable selection for individual models, there is insufficient evidence that sparsity can also help improve stacked predictions. It is possible that the stacking procedure negates the gains in predictive performance that regularization introduces, or that the stacked models outperform any individual model to such an extent that regularization via priors like the regularized horseshoe does not further improve performance.
 
Our findings are limited to this particular simulation study. It is possible that other model ensemble scenarios, such as those with highly correlated variables or variables with vastly different effect sizes, may benefit from incorporating sparsity-inducing priors into Bayesian stacking. Situations where the number of predictors $p$ exceeds the number of observations $n$ may also be worth investigating, as regularization has been shown to be particularly useful when $p > n$ \citep{vanErp2019}. Future research may aim to identify the scenarios, if any, in which introducing Bayesian regularization into Bayesian stacking proves useful. Given the findings of this paper, we still recommend that researchers explore a variety of priors and weighting methods to optimize prediction for their models.


% \begin{acknowledgement}
% Insert the Acknowledgment text here.
% \end{acknowledgement}

\paragraph{Funding Statement}

The research reported in this paper was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant \#R305D220012 to the University of Wisconsin-Madison. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.

\paragraph{Competing Interests}

The authors of this work declare no competing interests.


%\endnote in some journals will behave like \footnote; and \printendnotes will not output anything. 
%\printendnotes

\printbibliography

\appendix

%\section{Example Appendix Section}


\end{document}