\section{Prior Work}\label{sec:prior_work}
This section reviews the prior work on which our approach builds.

\subsection{Low-Rank Adaptation}
The low-rank adaptation (LoRA) approach of \cite{lora} has become a standard technique for tractable fine-tuning of LLMs. Consider a linear layer inside a pretrained LLM with weights $\W_0 \in \R^{n \times d}$, where $d$ is the embedding dimension of the model and $n$ is the output dimension of the layer. A forward pass through the layer for a batch of $b$ input features $\x \in \R^{b \times d}$ is given by $\y = \x\W_0^T$.
In LoRA, rather than updating the pretrained weights $\W_0$, we keep them fixed and learn a pair of low-rank matrices $\A \in \R^{r \times d}$ and $\B \in \R^{n \times r}$ such that:
\begin{align}
    \y = \x\W_0^T + \x (\B \A)^T
\end{align}
The value $r \ll \min(n,d)$ is commonly known as the LoRA rank. In this way, only $r(n+d)$ parameters need to be learned rather than $nd$, leading to considerable resource savings with minimal performance penalty.
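As a rough illustration, with hypothetical sizes chosen only for this example ($n = d = 4096$ and $r = 8$), the two parameter counts are
\begin{align}
    nd = 4096^2 = 16{,}777{,}216, \qquad r(n+d) = 8 \times 8192 = 65{,}536,
\end{align}
so the low-rank pair holds roughly $0.4\%$ of the parameters of the full layer. The update can also be applied without forming the $n \times d$ product $\B\A$, since $\x(\B\A)^T = (\x\A^T)\B^T$, where the intermediate $\x\A^T$ lies in $\R^{b \times r}$.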

\subsection{Laplace LoRA}
\cite{lap} were the first to apply uncertainty quantification to LoRA layers, using a Laplace approximation over the low-rank parameters. They treat the fine-tuned maximum a posteriori (MAP) estimate as the mean of a multivariate Gaussian distribution whose covariance is derived from the inverse Hessian. However, even when the Laplace approximation is restricted to the LoRA parameters, evaluating the full Hessian is infeasible. \cite{lap} therefore impose structure on the Hessian through a Kronecker factorization \citep{ritter2018scalable,daxberger2021laplace}. These Kronecker factors are still memory intensive, so \cite{lap} apply a further approximation based on an iterative truncated singular value decomposition. The Laplace approximation is performed post hoc, after the LoRA parameters have been fine-tuned. An additional limitation is that, at test time, the method must backpropagate through the model to build the approximate covariance matrix. This limits the scalability of the approach and its use in resource-constrained environments.
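
For reference, a generic Laplace approximation takes the following form. The symbols $\theta$ (the parameters treated probabilistically), $\theta_{\mathrm{MAP}}$ and $\mathbf{H}$ are our own notation for this sketch and are not taken from \cite{lap}:
\begin{align}
    P(\theta|\D) \approx \N\big(\theta_{\mathrm{MAP}}, \mathbf{H}^{-1}\big), \qquad \mathbf{H} = -\nabla^2_{\theta} \log P(\theta|\D) \big|_{\theta = \theta_{\mathrm{MAP}}}
\end{align}
For the LoRA matrix $\A \in \R^{r \times d}$ of a single layer, the corresponding block of $\mathbf{H}$ has $(rd)^2$ entries. A Kronecker-factored approximation replaces such a block with the product of two much smaller factors, which under the usual construction for a linear layer would be of size $d \times d$ and $r \times r$. The $d \times d$ factor is the one that remains costly when $d$ is the embedding dimension.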


\subsection{BLoB}
The current state-of-the-art approach in this space is Bayesian Low-Rank Adaptation by Backpropagation (BLoB) \citep{blob}.
BLoB moves away from the two-stage approach of Laplace LoRA and instead performs stochastic variational inference over the LoRA parameters $\A$. More specifically, it follows the Bayes by Backprop approach introduced by \cite{bbb}: $\A$ is recast as the mean of a low-rank Gaussian distribution, denoted $\A_{\mu}$, and a set of variance parameters $\A_{\sigma}$ is learned alongside it. Using the reparameterization trick \citep{vae}, samples from this low-rank distribution are projected into the full weight space:
\begin{align}
    \W_t = \W_0 + \B (\A_{\mu}+\A_{\sigma} \cdot \eps_t)
\end{align}
where $\eps_t \sim \N(0,1)$.
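Assuming that the product $\A_{\sigma} \cdot \eps_t$ is elementwise and that the entries of $\eps_t$ are independent (our reading of the notation above), each entry of the sampled matrix $\A_t = \A_{\mu} + \A_{\sigma} \cdot \eps_t$ is Gaussian:
\begin{align}
    [\A_t]_{ij} \sim \N\big([\A_{\mu}]_{ij}, \, [\A_{\sigma}]_{ij}^2\big)
\end{align}
and the corresponding forward pass has the same form as in LoRA, $\y_t = \x\W_t^T = \x\W_0^T + \x(\B\A_t)^T$.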
\cite{blob} show empirically that their approach leads to better performance than Laplace LoRA. However, a notable upside of the Laplace approximation is that it does not require learning any additional parameters. Due to the variance parameters $\A_{\sigma}$, BLoB requires learning $1.4\times$ more parameters than the base LoRA fine-tuning process. Even for smaller 7-billion-parameter models, this can amount to millions of additional parameters.
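A per-layer count makes the source of this overhead explicit. Assuming that $\A_{\sigma}$ has the same shape as $\A_{\mu} \in \R^{r \times d}$ and that $\B$ remains deterministic, as the sampling equation above suggests, a single adapted layer has
\begin{align}
    \underbrace{r(n+d)}_{\text{LoRA}} + \underbrace{rd}_{\A_{\sigma}} = r(n+2d)
\end{align}
trainable parameters, a factor of $(n+2d)/(n+d)$ over plain LoRA. This per-layer factor lies between $1$ and $2$ and depends on the layer shape, so any aggregate figure depends on which layers are adapted.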

\subsection{Bayesian Subspace Inference}
Rather than trying to approximate the parameter posterior directly, \cite{izmailov2020subspace} propose to perform Bayesian inference in a much smaller $k$-dimensional subspace of the full parameter space defined by vectors $\z \in \R^k$. They then learn a model $P(\z|\D)$ and a projection matrix $\bP$. This allows them to project samples into the full weight space:
\begin{align}
    \W_t = \W_0 + \bP \z_t \\
    \z_t \sim P(\z|\D)
\end{align}
As highlighted by \cite{izmailov2020subspace}, this model is not a reparameterization of the original parameter posterior as the projection into the full parameter space is not invertible. However, it has the upside that performing Bayesian inference in the subspace enables the use of many common Bayesian inference techniques that would otherwise be intractable. To the best of our knowledge, such subspace inference techniques have never been applied to LLMs.
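
To make the non-invertibility concrete, consider the following sketch. We assume that the weights are flattened, so that $\bP \in \R^{nd \times k}$ with $k \ll nd$, and, purely for illustration, that the subspace posterior is Gaussian, $P(\z|\D) = \N(\boldsymbol{\mu}_z, \boldsymbol{\Sigma}_z)$; neither assumption is taken from \cite{izmailov2020subspace}. Writing $\mathrm{vec}(\cdot)$ for flattening, the induced distribution over the weights is
\begin{align}
    \mathrm{vec}(\W_t) \sim \N\big(\mathrm{vec}(\W_0) + \bP\boldsymbol{\mu}_z, \; \bP\boldsymbol{\Sigma}_z\bP^T\big)
\end{align}
whose covariance has rank at most $k$. All of the probability mass therefore lies on an affine subspace of dimension at most $k$, and weights outside this subspace have no preimage in $\z$.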