\input{floats/algo}
\section{Methods}\label{sec:methods}
In this section we provide the details of our proposed approach. We first show how we construct an $r$-dimensional subspace of the full weight space $\W$ and then discuss how we train a probabilistic model in this subspace using stochastic variational inference.

\subsection{Subspace Construction}\label{sec:subspace}
To begin, consider a LoRA layer with rank $r$, initial weights $\W_0$, and low-rank factors $\A$ and $\B$. We would like to construct an $r$-dimensional subspace, parameterized by vectors $\s \in \R^{r}$, that can be projected into the full weight space $\W$. We retain $\B$ and use it as a projection matrix as in LoRA, allowing us to focus on building a subspace over $\A$. It is tempting to follow \cite{izmailov2020subspace} and learn a simple linear projection matrix $\bP$, resulting in the following subspace:
\begin{align}
    \mathcal{S}_{\text{lin}} = \{\W | \W = \W_0 +  \B \bP \s\}
\end{align}
However, $\A$ is an $r \times d$ matrix, so the product $\bP \s$ would have to be a vector of length $rd$ that is then reshaped into an $r \times d$ matrix. This requires $\bP$ to have dimensionality $rd \times r$, and since $\bP$ would be a matrix of learnable parameters, this choice of subspace would eliminate the parameter savings of LoRA.
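
As a rough illustration of this cost, take the hypothetical sizes $r = 8$ and $d = 4096$ (values assumed here purely for the arithmetic, not taken from the experiments):
\begin{align}
    \underbrace{rd}_{\text{entries of } \A} = 8 \cdot 4096 = 32{,}768, \qquad
    \underbrace{rd \cdot r}_{\text{entries of } \bP} = 32{,}768 \cdot 8 = 262{,}144
\end{align}
Under these assumed sizes, the linear projection alone would hold $r = 8$ times as many parameters as the factor $\A$ it replaces.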

To motivate a more parameter-efficient subspace construction, consider the truncated Singular Value Decomposition (SVD) of $\A$:
\begin{align}
    \U\diag{\s}\V = \text{SVD}(\A)
\end{align}
where $\s \in \R^r$ is the vector of singular values, $\U \in \R^{r \times r}, \V \in \R^{r \times d}$ are the left and right singular vectors, and $\diag{\cdot}$ is the diagonal embedding operator. Using $\U$ and $\V$ as projection matrices naturally defines the following subspace:
\begin{align}
    \mathcal{S}_{\text{SVD}} &= \{ \W | \W = \W_0 + \B \U \diag{\s} \V \} 
    \label{eq:svdsubspace}
\end{align}
Since $\B$ is itself a learnable matrix, the product $\B \U$ has the same representational power as $\B$ alone, so we absorb $\U$ into $\B$. Furthermore, since $\V$ has the same dimensionality as $\A$, we simply rename $\V$ to $\A$. This results in the following subspace:
\begin{align}
    \mathcal{S}_{\text{ScalaBL}} &= \{ \W | \W = \W_0 + \B \diag{\s} \A \} \\
    &= \{ \W | \W = \mathbf{f}(\s) \} 
\end{align}
where $\mathbf{f}$ is a projection function defined for notational convenience.
Intuitively, we are repurposing the LoRA parameters as projection matrices for an $r$-dimensional subspace that sits ``in-between'' $\A$ and $\B$.
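
Both steps of this construction can be made explicit with a short sketch. Here $m$ denotes the output dimension of the layer, so that $\B \in \R^{m \times r}$, and $\mathbf{b}_i$ and $\mathbf{a}_i^{\top}$ denote the $i$-th column of $\B$ and the $i$-th row of $\A$; these symbols are introduced only for this illustration. Assuming $\U$ is orthogonal, as in a standard SVD, any matrix $\B' \in \R^{m \times r}$ can be written as $\B' = (\B' \U^{\top}) \U$, so
\begin{align}
    \{ \B \U \mid \B \in \R^{m \times r} \} = \R^{m \times r}
\end{align}
and dropping $\U$ loses no expressiveness. The resulting update then expands into a sum of rank-one terms:
\begin{align}
    \B \diag{\s} \A = \sum_{i=1}^{r} s_i \, \mathbf{b}_i \mathbf{a}_i^{\top}
\end{align}
so each coordinate $s_i$ of the subspace scales exactly one rank-one direction of the LoRA update.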

\input{floats/tables/datasets}
\subsection{Variational Subspace Inference}\label{sec:subspace_inference}
Next we build a probabilistic model in this subspace with data likelihood given by:
\begin{align}
    P(\D | \s)= P(\D | \W=\mathbf{f}(\s) )
\end{align}
We set our variational approximation over $\s$ as an $r$-dimensional diagonal Gaussian distribution:
\begin{align}
    q_{\btheta}(\s)=\N(\s|\s_{\mu}, \diag{\s_{\sigma}^{2}})
\end{align}
with mean and standard deviation parameters $\btheta=[\s_{\mu},\s_{\sigma}]$, where the square is applied elementwise.
To learn these variational parameters we use stochastic variational inference. At training step $t$, we use the reparameterization trick \citep{vae} to generate a sample from $q_{\btheta}(\s)$ and project into the full weight space:
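\begin{align}
    \W_t = \W_0 + \B \diag{\s_{\mu} + \s_{\sigma} \odot \eps_t} \A
    \label{eq:reparam}
\end{align}
where $\eps_t \sim \N(\mathbf{0},\mathbf{I})$ is an $r$-dimensional standard normal sample and $\odot$ denotes the elementwise product.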
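Because the projection $\mathbf{f}$ is affine in $\s$, the sampled weights split into a deterministic mean and a zero-mean perturbation. As a sketch, writing $\odot$ for the elementwise product:
\begin{align}
    \W_t &= \underbrace{\W_0 + \B \diag{\s_{\mu}} \A}_{\mathbf{f}(\s_{\mu})} + \B \diag{\s_{\sigma} \odot \eps_t} \A \\
    \mathbb{E}_{\eps_t}[\W_t] &= \mathbf{f}(\s_{\mu})
\end{align}
The perturbation term has rank at most $r$, so the sampling noise stays within the same low-rank structure as the LoRA update itself.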
We then maximize the evidence lower bound (ELBO) \citep{elbo} for each batch $\D_t$:
\begin{align}
    \mathcal{L}_t &= \log P(\D_t|\W_t) - \beta D_{KL}( q_{\btheta}(\s) \,\|\, P(\s))
\end{align}
Here the first term is the log-likelihood of the batch $\D_t$ under the LLM using the weight sample $\W_t$. The second term regularizes $q_{\btheta}(\s)$ against a prior $P(\s)$, where $\beta$ is a scalar hyperparameter that controls the regularization strength \citep{betavae}. Our full training approach is shown in Algorithm~\ref{alg:scalabl}.
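
For intuition about the regularizer, suppose the prior is a standard normal, $P(\s) = \N(\s | \mathbf{0}, \mathbf{I})$; this is an assumption made only for this sketch, since the form of the prior is not fixed by the derivation above. Treating $\s_{\sigma}$ as standard deviations, consistent with Equation~\ref{eq:reparam}, and writing $\mu_i$ and $\sigma_i$ for the $i$-th entries of $\s_{\mu}$ and $\s_{\sigma}$, the KL term then has the familiar closed form for diagonal Gaussians:
\begin{align}
    D_{KL}( q_{\btheta}(\s) \,\|\, P(\s)) = \frac{1}{2} \sum_{i=1}^{r} \left( \sigma_i^{2} + \mu_i^{2} - 1 - \log \sigma_i^{2} \right)
\end{align}
Under this assumption the regularizer is a sum over only $r$ coordinates per LoRA layer and requires no sampling.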

At test time, we draw $N$ samples $\s_n \sim q_{\btheta}(\s)$, project them into the full weight space, and approximate the Bayesian model average by Monte Carlo:
\begin{align}
    \mathbb{E}_{\s \sim q_{\btheta}(\s)} [ P(\y | \x, \mathbf{f}(\s) )] \approx \frac{1}{N} \sum_{n=1}^N  P(\y | \x, \mathbf{f}(\s_n))
\end{align}

We maintain $\A$ and $\B$ as learnable parameters, as in LoRA, and additionally learn just $2r$ variational parameters, $\s_{\mu}$ and $\s_{\sigma}$, per LoRA layer. We note that BLoB can also be cast as a form of subspace inference in which the LoRA layer itself defines the subspace of $\W$. That subspace is much higher-dimensional than the one used in ScalaBL, so performing variational inference in it requires learning $rd$ additional parameters per LoRA layer.
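
To give a sense of scale, consider a hypothetical LoRA layer with $r = 8$ and $d = 4096$ (illustrative values only, not a configuration reported here). The additional variational parameters per layer would be:
\begin{align}
    \text{ScalaBL: } 2r = 16, \qquad \text{BLoB: } rd = 8 \cdot 4096 = 32{,}768
\end{align}
so under these assumed sizes the variational overhead differs by a factor of $d/2 = 2048$.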


