\section{Additional Implementation Details}
\subsection{Uncertainty Quantification Metrics}\label{sec:metric}
In our experiments we report two popular metrics for evaluating uncertainty quantification: negative log-likelihood (NLL) and expected calibration error (ECE). For a dataset of $N$ test instances $\x_n$ with correct labels $y_n$ and a probabilistic model $P_{\btheta}$, NLL is defined as:
\begin{align}
    \text{NLL}=\frac{1}{N} \sum_{n=1}^N - \log P_{\btheta} (y_n|\x_n)
\end{align}
That is, NLL is the average negative log probability that the model assigns to the correct class.
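As a small illustration with hypothetical numbers (not taken from our experiments), suppose $N=3$ and the model assigns probabilities $0.9$, $0.6$, and $0.3$ to the correct labels. Assuming the natural logarithm,
\begin{align*}
    \text{NLL} = -\frac{1}{3}\left(\log 0.9 + \log 0.6 + \log 0.3\right) \approx \frac{1}{3}\left(0.105 + 0.511 + 1.204\right) \approx 0.607.
\end{align*}
The least confident prediction ($0.3$) contributes roughly two thirds of the total, which illustrates how strongly NLL penalizes low probability on the correct class.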

ECE measures how a model's confidence aligns with the accuracy of its predictions. It can be computed by binning the predictions by their confidence. We then compute a weighted average of the difference between the accuracy and confidence within each bin:
\begin{align}
\text{ECE} = \sum_{k=1}^{K} \frac{|B_k|}{N} \left| \text{acc}(B_k) - \text{conf}(B_k) \right|
\end{align}
where $K$ is the number of bins and $B_k$ is the set of samples in the $k$-th bin. Following \cite{blob}, we use $K=15$ in all experiments.
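For concreteness, the per-bin quantities can be sketched as follows, assuming the standard definitions (the symbol $\hat{y}_n$ for the predicted label is introduced here for illustration only):
\begin{align*}
\text{acc}(B_k) &= \frac{1}{|B_k|}\sum_{n \in B_k} \mathbf{1}[\hat{y}_n = y_n], &
\text{conf}(B_k) &= \frac{1}{|B_k|}\sum_{n \in B_k} \max_{y} P_{\btheta}(y|\x_n).
\end{align*}
As a hypothetical example with $N=10$, suppose only two bins are non-empty: one with $4$ samples, accuracy $0.50$, and mean confidence $0.65$, and one with $6$ samples, accuracy $1.00$, and mean confidence $0.90$. Empty bins contribute zero, so
\begin{align*}
\text{ECE} = \frac{4}{10}\left|0.50 - 0.65\right| + \frac{6}{10}\left|1.00 - 0.90\right| = 0.06 + 0.06 = 0.12.
\end{align*}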

\input{floats/tables/prompts}
\subsection{Prompts}
We use the same prompts as \cite{blob}. The prompt for each dataset is shown in Table \ref{tab:prompts}.

\subsection{Runtime Analysis}
\input{floats/tables/runtime}
The main efficiency gain of ScalaBL over BLoB is the reduction in the number of parameters that need to be learned. This also translates into efficiency gains during training. In Table \ref{tab:runtime} we show the training resource usage for both ScalaBL and BLoB. We see that ScalaBL has lower peak memory usage and trains slightly faster than BLoB.

\section{Additional Experimental Results}
\subsection{Effect of Number of Samples}\label{sec:samples}
An important hyperparameter for any variational approach is the number of weight samples $N$ to draw when computing the test-time Bayesian model average. Using models fine-tuned on the Winogrande-Small dataset, we explore different choices of this hyperparameter for both ScalaBL and BLoB. The results are shown for all three metrics in Figure \ref{fig:combined_results} (Top). We see that performance across all metrics saturates around $N=10$, validating the choice of \cite{blob}.

It is important to remember that the number of parameters sampled from the variational distribution is much smaller in ScalaBL than in BLoB. In Figure \ref{fig:combined_results} (Bottom) we use a log-scale plot to compare how many parameters each method has to draw as the number of samples increases.
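As a rough illustration with hypothetical sizes (not the exact counts of either model), consider a single adapted layer with rank $r=8$ and dimension $d=4096$, and assume that the number of values sampled per draw is $r$ for ScalaBL and $rd$ for BLoB. With $N=10$ samples this gives
\begin{align*}
    N \cdot r &= 10 \cdot 8 = 80 && \text{(ScalaBL)},\\
    N \cdot r \cdot d &= 10 \cdot 8 \cdot 4096 = 327{,}680 && \text{(BLoB)}.
\end{align*}
Under these assumptions the ratio between the two counts is $d$ regardless of $N$, which appears as a constant vertical offset on a log-scale plot.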
\input{floats/effect_of_samples}


\input{floats/tables/choice_of_subspace}
\subsection{Choice of Subspace}
In this section we consider different choices for the subspace used in our method. In Table \ref{tab:qwen7B_subspace}, we present results using the SVD subspace defined in Equation \ref{eq:svdsubspace}. We also include results for an experiment where the $\A$ matrix is frozen during fine-tuning. This is similar to the random subspace approach put forward by \cite{izmailov2020subspace}.

We first notice that the difference in performance between the SVD subspace and the subspace used in ScalaBL is negligible. This is not surprising, as adding the extra $\U$ matrix of parameters does not change the expressive power of the model, as discussed in the main paper. An interesting upside of using the random subspace is that it further reduces the number of parameters that need to be learned. We see that for some datasets its performance is comparable to that of the subspaces with more parameters. However, on other datasets (such as Winogrande-Medium) there is a considerable reduction in classification accuracy when using a random subspace.

\input{floats/tables/choice_of_cov}
\subsection{Using a Full Rank Covariance}
Following prior work, we only considered a diagonal covariance for $q_{\btheta}(\s)$ in the experiments in the main paper. For the approaches of \cite{lap} and \cite{blob} this is a necessary limitation, as instantiating a full rank covariance with millions of dimensions would be intractable. However, the Gaussian distribution used in ScalaBL is only $r$-dimensional. This makes it straightforward to consider a full rank Gaussian by adding a few more parameters.

We parameterize a full rank covariance matrix $\boldsymbol{\Sigma}$ through its eigendecomposition. We treat the eigenvalues $\mathbf{e} \in \R^{r}$ and the matrix of eigenvectors $\mathbf{E} \in \R^{r \times r}$ as learnable parameters. This adds $r + r^2$ additional parameters to the approach. We use the QR factorization to ensure that the eigenvectors are orthogonal.
\begin{align}
    \mathbf{E}, \mathbf{R} = \text{QR}(\mathbf{\hat{E}})\\
    \boldsymbol{\Sigma} = \mathbf{E} \diag{\mathbf{e}} \mathbf{E}^T
\end{align}
where $\mathbf{\hat{E}}$ are free parameters.

We then update the reparameterization trick to use the Cholesky factor of the covariance matrix. We apply the Cholesky factorization on-the-fly during learning. 
\begin{align}
    \mathbf{L} = \text{Cholesky}(\boldsymbol{\Sigma} )\\
    \W_t = \W_0 + \B \diag{\s_{\mu} + \mathbf{L}\eps_t}\A
\end{align}
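To see why this yields the desired covariance, the following short derivation assumes, as in the standard reparameterization trick, that $\eps_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and that $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^T$ is positive definite (i.e., all entries of $\mathbf{e}$ are positive). Writing $\s_t = \s_{\mu} + \mathbf{L}\eps_t$ (the symbol $\s_t$ is introduced here for illustration only), we have
\begin{align*}
    \mathbb{E}[\s_t] &= \s_{\mu} + \mathbf{L}\,\mathbb{E}[\eps_t] = \s_{\mu},\\
    \text{Cov}[\s_t] &= \mathbf{L}\,\mathbb{E}[\eps_t \eps_t^T]\,\mathbf{L}^T = \mathbf{L}\mathbf{L}^T = \boldsymbol{\Sigma},
\end{align*}
so that $\s_t \sim \mathcal{N}(\s_{\mu}, \boldsymbol{\Sigma})$. When $\boldsymbol{\Sigma}$ is diagonal, $\mathbf{L}$ is diagonal as well, and the expression reduces to an elementwise scaling of $\eps_t$ by the standard deviations.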

We compare using a diagonal and a full rank covariance matrix in Table \ref{tab:qwen7B_cov}. We see that using a full rank covariance leads to very similar performance to using a diagonal one, with some datasets even exhibiting worse calibration.

\subsection{Effect of LoRA Rank}
In the prior work of \cite{lap} and \cite{blob}, a LoRA rank of $r=8$ was used in all experiments. In the main paper, we use this value for the rank as well. In this section we explore the effect of the LoRA rank on performance for both ScalaBL and BLoB. We ran additional in-distribution experiments using Qwen2.5-7B with $r \in \{4, 16, 32\}$ to compare against the results for $r=8$, which are already shown in Table \ref{tab:qwen7B_main}. These results are shown in Figures \ref{fig:r_sweep1} and \ref{fig:r_sweep2}.

We first note that the $x$-axes of these plots show the total number of model parameters, rather than the rank. This captures the fact that BLoB's additional parameters grow more quickly as $r$ increases ($O(rd)$) than those of ScalaBL ($O(r)$). We see that BLoB often suffers noticeable drops in accuracy as $r$ increases across multiple datasets (WG-S, ARC-E, ARC-C, WG-M, OBQA). These drops are accompanied by increases in NLL. By comparison, ScalaBL sees small increases in accuracy across most datasets as $r$ increases, albeit accompanied by small increases in ECE. Furthermore, BLoB sees larger increases in ECE than ScalaBL across multiple datasets (ARC-E, ARC-C, OBQA). The only time that BLoB sees a reduction in ECE is when the accuracy also decreases significantly (WG-S, WG-M). Overall, ScalaBL is robust to changes in $r$ across all three metrics.
\input{floats/tables/llama7B_main}
\input{floats/tables/llama7B_ood}

\subsection{Llama2 Results} \label{sec:llama2}
For the sake of comparison with \cite{lap} and \cite{blob}, we present experimental results using the older \texttt{Llama-2-7b} LLM in Tables \ref{tab:llama2_main} and \ref{tab:llama2_ood}. We note that we reran BLoB and the standard baselines using 16-bit frozen parameters instead of 8-bit quantized weights. The reported results for Laplace and Bayes By Backprop (BBB) \citep{bbb} are repeated from the tables of \cite{blob}. We observe the same general trends as seen with \texttt{Qwen2.5-7B}. Our proposed approach achieves competitive performance with BLoB while requiring significantly fewer parameters.
\input{floats/r_sweep}
