\begin{figure*}
    \centering
    \includegraphics[width=\linewidth]{floats/figs/methods.png}
    \caption{Visual depiction of prior work and our approach for a single layer. Blocks shaded red denote parameters which are trained and blue blocks denote frozen parameters. White blocks with hatching denote parameters which are sampled from learned variational distributions.}
    \label{fig:methods}
\end{figure*}

\section{Introduction}\label{sec:intro}
The use of large language models (LLMs) has become ubiquitous across many domains, including healthcare \citep{llmsmed}, scientific discovery \citep{zhang2024comprehensive}, cyber-physical systems \citep{aircraftverse}, code generation \citep{jiang2024survey}, and general everyday use \citep{anil2023gemini}. Therefore, ensuring that these models are reliable and trustworthy has never been more vital. However, it is well known that LLMs output incorrect information in the form of ``hallucinations'' \citep{huang2024survey} and are often poorly calibrated \citep{zhu2023calibration,spiess2024calibration}. One direction of research aimed at addressing these issues quantifies the uncertainty of LLM outputs. A variety of post-hoc approaches have been proposed for this task, such as verbalized confidence \citep{tian2023just,xiongcan}, token-level uncertainty quantification \citep{sement,farquhar2024detecting}, and conformal prediction \citep{kaur2024addressing}.

In contrast, Bayesian deep learning (BDL) provides a principled approach to the uncertainty quantification of deep models. In this family of approaches, uncertainty quantification is performed by directly inferring a distribution over the weights of the model \citep{mcdropout,bbb,deepensembles}. Here, a model's predictive distribution for a test instance $\x$, denoted $P(\y|\x, \D)$, is obtained by marginalizing over the parameter posterior distribution given by Bayes' rule, denoted $P(\W|\D)$, via the following integral:
\begin{align}
    P(\y|\x, \D) = \int P(\y|\x, \W) P(\W|\D) \mathrm{d} \W
\end{align}
where $\D$ is a training (or fine-tuning) dataset, and $\W$ are the model parameters. However, when scaling such techniques to LLMs, providing a good approximation of this intractable integral becomes increasingly challenging due to the large dimensionality of $\W$. For this reason, recent work has considered performing Bayesian inference over the smaller subset of parameters learned in popular parameter-efficient fine-tuning (PEFT) approaches \citep{fu2023effectiveness}.
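
In practice, this integral is rarely computed exactly. As a minimal illustrative sketch (the sample count $N$, the samples $\W_n$, and the approximate posterior $q$ are notation assumed here for illustration only), a standard Monte Carlo estimate averages the predictions of $N$ sampled parameter settings:
\begin{align*}
    P(\y|\x, \D) \approx \frac{1}{N} \sum_{n=1}^{N} P(\y|\x, \W_n), \qquad \W_n \sim q(\W) \approx P(\W|\D).
\end{align*}
Each sample requires a forward pass through the model, so the cost of this estimate grows linearly with $N$.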

In the widely used low-rank adaptation (LoRA) technique of \citet{lora}, only a small set of parameters is updated, saving considerable resources compared to updating the entire parameter set, while still enjoying most of the performance of the base model. Conveniently, the low dimensionality of these parameters additionally makes them well suited for BDL techniques. However, \citet{lap} and \citet{blob} have shown that directly applying BDL techniques such as Deep Ensembles \citep{deepensembles} or Monte Carlo Dropout \citep{mcdropout} over LoRA yields only marginal improvements on uncertainty quantification metrics compared to straightforward fine-tuning approaches such as maximum likelihood estimation (MLE) or maximum a posteriori (MAP) estimation.

The first success in this space came from \citet{lap}, who perform a Laplace approximation of the parameter posterior after MAP fine-tuning.
The state-of-the-art approach of \citet{blob} instead uses stochastic variational inference in a technique they call Bayesian LoRA by Backprop (BLoB). Although this approach outperforms previous approaches, it comes at the cost of needing ${\sim}40\%$ more parameters than LoRA. This can be a major memory bottleneck in high-stakes, resource-constrained deployments where computing the Bayesian model average already stresses the available memory budget \citep{ursabench}.

In this work, we introduce \textbf{Scala}ble \textbf{B}ayesian \textbf{L}ow Rank Adaptation via Stochastic Variational Subspace Inference (ScalaBL). As shown in Figure~\ref{fig:methods}, we perform Bayesian inference inside a much smaller subspace of the full weight space $\W$, with dimensionality equal to the LoRA rank $r$. We show how to repurpose the LoRA parameters $\A$ and $\B$ as projection matrices that map samples from the low-dimensional subspace into the full weight space $\W$. We then learn the parameters of our approach using stochastic variational inference.
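
As a rough sketch of this idea, and not a substitute for the construction in Section~\ref{sec:subspace}, one way to read it is shown below. The frozen pre-trained weight $\W_0$, the subspace sample $\mathbf{s}$, and the variational parameters $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are notation assumed here for illustration, as is the diagonal Gaussian form:
\begin{align*}
    \mathbf{s} \sim \mathcal{N}\big(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2)\big), \qquad
    \W = \W_0 + \B \, \mathrm{diag}(\mathbf{s}) \, \A ,
\end{align*}
where $\mathbf{s}$, $\boldsymbol{\mu}$, and $\boldsymbol{\sigma}$ each have $r$ entries. Under this reading, all stochasticity lives in the $r$-dimensional vector $\mathbf{s}$, while $\A$ and $\B$ deterministically map each sample to a full-size weight update, and the variational distribution contributes $2r$ parameters per layer.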

A major benefit of our approach is that it requires learning only $2r$ additional variational parameters for each LoRA layer, compared to the $rd$ parameters required by BLoB, where $d$ is the embedding dimension of the LLM. For example, when fine-tuning an LLM with 7 billion parameters where $d=3584$ using a rank of $r=8$, BLoB requires millions of additional parameters, while ScalaBL requires only ${\sim}1000$. Furthermore, so long as the rank $r$ remains constant, our approach requires the same number of additional parameters per layer regardless of the embedding dimension of the base LLM. As a result, we are able to scale our approach to a base model with 32 billion parameters where $d=5120$, compared to the 7 billion parameter models considered in the prior work of \citet{lap} and \citet{blob}. Through extensive experimentation, we show that ScalaBL has competitive or superior performance compared to these state-of-the-art baselines on a suite of commonsense reasoning benchmarks in both in- and out-of-distribution settings.
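
To make the per-layer comparison concrete, the following back-of-the-envelope calculation uses the values above and assumes a single adapted layer whose input dimension equals $d$ (an assumption made here for illustration):
\begin{align*}
    \text{BLoB:}\quad & rd = 8 \times 3584 = 28{,}672, \\
    \text{ScalaBL:}\quad & 2r = 2 \times 8 = 16, \\
    \text{ratio:}\quad & \frac{rd}{2r} = \frac{d}{2} = 1792.
\end{align*}
The per-layer ratio depends only on $d$; the totals quoted above additionally scale with the number of adapted layers, which this sketch does not fix.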

We highlight our main contributions as follows:
\begin{itemize}[leftmargin=*, topsep=0pt, noitemsep]
    \item We propose ScalaBL, a Bayesian LoRA approach which performs stochastic variational inference inside a low dimensional subspace.
    \item ScalaBL enjoys considerable parameter efficiency compared to prior work and requires ${\sim}2000 \times$ fewer additional parameters.
    \item ScalaBL achieves competitive or superior performance to state-of-the-art approaches in terms of uncertainty quantification metrics, while requiring fewer parameters.
    \item Our work is the first to scale a Bayesian LoRA approach to a pre-trained base model with 32 billion parameters, compared to the 7 billion parameter models of prior work.
\end{itemize}

The structure of the paper is as follows.
In Section \ref{sec:prior_work}, we discuss relevant prior work that our approach builds on.
In Section \ref{sec:subspace}, we demonstrate our approach for building a parameter-efficient subspace and in Section \ref{sec:subspace_inference}, we discuss how to train a probabilistic model in this subspace using stochastic variational inference.
In Section \ref{sec:experiments}, we provide results of our extensive experiments.
Finally, in Sections \ref{sec:limitations} and \ref{sec:conclusions}, we discuss limitations and conclude. Additional details and experimental results are included in the Appendix.
Our code is available at \url{github.com/SRI-CSL/BayesAdapt}. 
