\input{floats/tables/qwen7B_main}
\section{Experiments}\label{sec:experiments}
In this section we provide experimental comparisons between ScalaBL, several standard baselines, and current state-of-the-art approaches.

\subsection{Datasets}
Following the experimental protocol of \cite{lap} and \cite{blob}, we fine-tune and evaluate our approach using a suite of commonsense reasoning datasets listed in Table \ref{tab:datasets}. These datasets are posed as multiple choice questions. Given an input prompt with a question, we elicit the LLM's softmax distribution over the next token. We then select the logits for each possible answer (e.g., A, B, C, D) and renormalize. In this way, we transform these commonsense reasoning tasks into classification tasks, which makes it straightforward to compute standard uncertainty metrics. In particular, we report classification accuracy (ACC), expected calibration error (ECE) \citep{guo2017calibration}, and the negative log likelihood (NLL) of the correct class. See Appendix Section \ref{sec:metric} for further details on these metrics.
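
As a sketch of this renormalization (the symbols $z_k$, $\mathcal{C}$, $\hat{p}$, and $y^{*}$ are notation introduced here for illustration only), let $z_k$ denote the next-token logit of answer token $k$ and let $\mathcal{C}$ be the set of answer tokens for a given question. The class probabilities would then be
\[
\hat{p}(y = k \mid x) = \frac{\exp(z_k)}{\sum_{j \in \mathcal{C}} \exp(z_j)}, \qquad k \in \mathcal{C},
\]
which is equivalent to restricting the full-vocabulary softmax to $\mathcal{C}$ and dividing by the retained probability mass. Under this notation, the predicted class is $\arg\max_{k \in \mathcal{C}} \hat{p}(y = k \mid x)$ and the per-example NLL is $-\log \hat{p}(y = y^{*} \mid x)$, where $y^{*}$ is the correct answer.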

\subsection{Baselines}
We compare against a suite of standard baselines. First we consider the standard LoRA training procedure without and with weight decay regularization (labeled MLE and MAP, respectively). Next we compare against the standard BDL baselines of Deep Ensembles \citep{deepensembles} and Monte Carlo Dropout \citep{mcdropout}. Finally, we present results against the two most recent state-of-the-art approaches: the Laplace approximation approach of \cite{lap} and BLoB \citep{blob}.

\subsection{Implementation Details}
We build our approach using the \texttt{bayesian-peft} library of \cite{blob}. This provides implementations of the standard baselines as well as BLoB. For the Laplace approximation we use the official code provided by \cite{lap}. In contrast to \cite{lap} and \cite{blob}, we present results on the newer \texttt{Qwen2.5} \citep{yang2024qwen2} family of models, rather than the older \texttt{Llama-2-7b} model \citep{llama2} of prior work. For the sake of comparison, results using \texttt{Llama-2-7b} are provided in Appendix Section \ref{sec:llama2}. 

Following \cite{lap} and \cite{blob}, we apply LoRA to the query and value parameters of each self-attention layer as well as the softmax output head of the LLM, using a rank of $r=8$. We follow the training procedure and hyperparameters of BLoB.
 All approaches are trained for 5000 steps using the AdamW optimizer. Training was performed using a batch size of 4 for the 7 billion parameter models and a batch size of 2 for the 32 billion parameter model. In contrast to \cite{blob}, we train all approaches using 16-bit precision for the frozen model parameters instead of using 8-bit quantization. The learnable model parameters remain 32-bit. All experiments were performed on a single 80GB NVIDIA A100 GPU.

For ScalaBL, we use the same KL weighting schedule as BLoB with a maximum value of $\beta=0.1$. We do not use the Flipout technique \citep{wen2018flipout} that was utilized by BLoB, as we found that it did not noticeably affect performance. This makes our approach simpler to implement than BLoB. As in BLoB, we use a standard $\mathcal{N}(0,I_r)$ as the prior $P(\s)$. We initialize $\s_{\mu}$ and $\A$ by performing an SVD on a randomly initialized matrix. This is a fast operation due to the low-rank nature of the LoRA matrices. As in BLoB, the variance parameters $\s_{\sigma}$ are initialized to small uniformly random values. We use a log parametrization for $\s_{\sigma}$ to ensure that the variance remains positive. Following the intuition that $\s_{\mu}$ is analogous to the singular values of $\A$, we ensure its positivity using a log parametrization as well.
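
One way to write the log parametrization described above is sketched below. The unconstrained $r$-dimensional vectors $\rho_{\mu}$ and $\rho_{\sigma}$ are notation assumed here for illustration, and the exact form used in the implementation may differ:
\[
\s_{\mu} = \exp(\rho_{\mu}), \qquad \s_{\sigma} = \exp(\rho_{\sigma}),
\]
with the exponential applied elementwise. Under this sketch, gradient updates act on $\rho_{\mu}$ and $\rho_{\sigma}$, so every entry of $\s_{\mu}$ and $\s_{\sigma}$ stays strictly positive without explicit constraints or clipping.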

For the variational approaches, BLoB and ScalaBL, we present our main results using $N=10$ posterior weight samples during evaluation, which \cite{blob} found to give the best performance. The effect of this hyperparameter is explored further in Appendix Section \ref{sec:samples}. Similarly, we perform 10 forward passes for the MC-Dropout baseline. For Deep Ensembles, we use an ensemble size of 3.
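
A common way to combine the $N$ samples, sketched here under the assumption that class probabilities (rather than logits) are averaged, is the Monte Carlo estimate of the posterior predictive distribution,
\[
p(y \mid x) \approx \frac{1}{N} \sum_{n=1}^{N} p(y \mid x, \theta_n), \qquad \theta_n \sim q(\theta),
\]
where $\theta_n$ and $q$ are notation introduced here for the $n$-th weight sample and the variational posterior. With $N=10$ this corresponds to ten forward passes per test input. An analogous average over dropout masks or ensemble members would apply to the MC-Dropout and Deep Ensembles baselines.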

\input{floats/tables/qwen7B_ood}
\subsection{In-Distribution Results}
In Table \ref{tab:qwen7B_main} we present test set results for a standard in-distribution setting using the \texttt{Qwen2.5-7B} LLM. We first notice that a straightforward MLE fine-tuning approach leads to high accuracy across all datasets, but often overfits as evidenced by the poor ECE results. The MAP result is equivalent to MLE with a weight decay penalty of $10^{-2}$, which marginally improves final calibration. We see minor improvements in ECE and NLL when moving to Monte Carlo Dropout and Deep Ensembles, with Deep Ensembles performing the best of the standard baselines, albeit at significantly higher resource cost.

Validating the results of \cite{lap} and \cite{blob}, we see that the more recent state-of-the-art approaches outperform the baselines in terms of ECE and NLL, with minimal reduction in classification accuracy. Furthermore, we see that ScalaBL consistently achieves performance that is competitive with BLoB, and even achieves state-of-the-art performance on the Winogrande-Medium dataset in terms of ECE.

Unsurprisingly, BLoB often performs the best out of all methods and regularly outperforms ScalaBL by a small margin. However, BLoB has strictly greater representational power than ScalaBL or Laplace due to its higher parameter count. Compared to MLE, BLoB requires ${\sim}1.4\times$ as many parameters, while ScalaBL requires only ${\sim}1.0001\times$ as many. For this choice of LLM and rank, this results in BLoB adding ${\sim}1.6$ million parameters on top of MLE, while ScalaBL adds only 912. With that in mind, ScalaBL achieves very competitive performance at much lower cost compared to BLoB. For example, on the ARC-Challenge dataset, BLoB achieves ${\sim}1.3\times$ better ECE than ScalaBL with similar accuracy. However, BLoB requires $1792\times$ as many additional parameters as ScalaBL.
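
The reported counts can be reproduced with a short calculation under two assumptions that are not stated in this section: each adapted module contributes $r \cdot d$ variance parameters in BLoB (one per entry of an $r \times d$ LoRA factor with input dimension $d$), and $2r$ parameters in ScalaBL (the $r$ entries each of $\s_{\mu}$ and $\s_{\sigma}$). \texttt{Qwen2.5-7B} has 28 decoder layers with hidden dimension $d = 3584$, so adapting the query and value projections of every layer plus the output head gives $M = 2 \cdot 28 + 1 = 57$ modules, where $M$ is notation introduced here. Then
\[
M \cdot r \cdot d = 57 \cdot 8 \cdot 3584 = 1{,}634{,}304 \approx 1.6 \times 10^{6} \quad \mathrm{(BLoB)},
\]
\[
M \cdot 2r = 57 \cdot 16 = 912 \quad \mathrm{(ScalaBL)},
\]
\[
\frac{M \cdot r \cdot d}{M \cdot 2r} = \frac{d}{2} = \frac{3584}{2} = 1792.
\]
Under these assumptions the ratio depends only on the hidden dimension $d$, not on the rank or the number of adapted modules.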

\subsection{Out-of-Distribution Results}
Next we consider an out-of-distribution experiment where models are trained on the OpenBookQA (OBQA) dataset, which consists of grade school level, multiple choice science questions. First we evaluate this tuned model on the ARC datasets, which also consist of grade school level multiple choice questions, representing a smaller distribution shift. Next we investigate a larger distribution shift by evaluating on the more challenging MMLU-Chemistry and MMLU-Physics datasets, which consist of undergraduate level chemistry and physics multiple choice questions, respectively. The results of this experiment for all methods are displayed in Table \ref{tab:qwen7B_ood}.

We again notice that the recent state-of-the-art approaches outperform the standard baselines in terms of uncertainty quantification with comparable accuracy. We see that all methods experience worse calibration when tested under large distribution shift. We additionally point out the poor accuracy of the Laplace method on the MMLU datasets. We see strong performance of our proposed method, with ScalaBL outcompeting BLoB and Laplace on several datasets in terms of ECE. Under both small and large amounts of distribution shift, ScalaBL achieves comparable performance to BLoB across all metrics.

\input{floats/tables/qwen32B_main}
\subsection{Scaling to Larger Models}\label{sec:larger_models}
A limitation of the prior work of \cite{lap} and \cite{blob} is their use of relatively small LLMs with only 7 billion parameters. This makes it unclear if their experimental conclusions generalize to the much larger model sizes which are currently in use \citep{anil2023gemini}. For this reason we consider scaling our approach to the largest Bayesian LLM to date, \texttt{Qwen2.5-32B}, with more than four times as many base parameters as prior work. We conduct the same in-distribution experiments as before and present test set results in Table \ref{tab:qwen32B_main}. We do not report results using the Laplace baseline, as its post-hoc procedure exceeded the available memory of our 80GB A100 GPU even when using 8-bit parameters and a test-time batch size of 1, underscoring the poor scalability of this method.

In contrast to earlier results, standard baselines are much more competitive when using a larger base model. We see that even simple techniques, such as MLE or MAP, lead to models which are much better calibrated than their smaller counterparts. This phenomenon has been noticed in prior work \citep{xiongcan,spiess2024calibration}. Furthermore, we see that Deep Ensembles is often the best performing approach across all three metrics. However, this comes at significantly higher resource usage. 

We observe that our proposed approach, ScalaBL, continues to show competitive performance against the baselines, including BLoB. It is often the second-best method in terms of ECE and NLL, while achieving classification accuracy similar to that of BLoB. When using a larger base model, the efficiency and scalability of our method are even more pronounced. Moving from \texttt{Qwen2.5-7B} to \texttt{Qwen2.5-32B} increases the model's embedding dimension from 3584 to 5120 and adds 36 layers. Since the number of variance parameters in BLoB scales with the embedding dimension, it now requires an additional ${\sim}5.3$ million parameters compared to MLE. By contrast, ScalaBL's additional parameter count per LoRA module scales only with $r$, which was not changed for this larger model. For this reason, ScalaBL requires only 2064 additional parameters. In fact, for this choice of LLM and rank, BLoB adds $2560\times$ as many parameters as ScalaBL for similar performance. For this reason we feel that ScalaBL is the only method that is capable of scaling to current frontier models, which already have over a trillion base parameters \citep{anil2023gemini}.
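
These counts can be checked with a short calculation under two assumptions that are not stated in this section: each adapted module contributes $r \cdot d$ variance parameters in BLoB (one per entry of an $r \times d$ LoRA factor with input dimension $d$), and $2r$ parameters in ScalaBL (the $r$ entries each of $\s_{\mu}$ and $\s_{\sigma}$). \texttt{Qwen2.5-32B} has 64 decoder layers with hidden dimension $d = 5120$, so adapting the query and value projections of every layer plus the output head gives $M = 2 \cdot 64 + 1 = 129$ modules, where $M$ is notation introduced here. Then
\[
M \cdot r \cdot d = 129 \cdot 8 \cdot 5120 = 5{,}283{,}840 \approx 5.28 \times 10^{6} \quad \mathrm{(BLoB)},
\]
\[
M \cdot 2r = 129 \cdot 16 = 2064 \quad \mathrm{(ScalaBL)},
\]
\[
\frac{M \cdot r \cdot d}{M \cdot 2r} = \frac{d}{2} = \frac{5120}{2} = 2560.
\]
Under these assumptions, the additional parameter count of ScalaBL grows with the number of adapted modules but is independent of $d$, whereas that of BLoB grows with both.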


