Keywords: Knowledge Distillation; Low-Rank Adaptation; LoRA; Parameter-Efficient Fine-Tuning; Model Compression; Spectral Alignment
TL;DR: SAD-LoRA improves low-rank knowledge distillation by aligning the LoRA adapter subspace with the teacher’s data-weighted spectral update, yielding stronger rank efficiency and robust gains on spectrally structured tasks such as STS-B, CoLA, and RTE.
Abstract: Distilling a fine-tuned teacher into a LoRA-adapted student is a standard recipe for parameter-efficient compression, but output-level KD does not explicitly control which rank-$r$ weight subspace the adapter occupies. We propose SAD-LoRA (Spectral Alignment Distillation), which selects this subspace from the data-weighted student-space reference update and maintains it during training via a differentiable principal-angle loss on $\mathrm{colspan}(B)$. We show that the data-weighted distillation error decomposes exactly into subspace misalignment, within-subspace coefficient mismatch, and an irreducible rank residual; standard KD can affect the first term only indirectly through output gradients. On controlled synthetic problems with a flat teacher spectrum, SAD-LoRA reduces the subspace-misalignment term from $51\%$ to nearly zero and lifts final subspace alignment from $0.49$ to $1.00$. On RoBERTa-large-to-RoBERTa-base distillation across six GLUE tasks, SAD-LoRA improves rank efficiency: at $r{=}4$, it matches or exceeds the strongest included spectral baseline on five of six tasks, and at $r{=}8$ it yields the best result on SST-2 and CoLA. Ablations identify subspace alignment as the load-bearing component, while coefficient matching is auxiliary.
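To make the principal-angle objective concrete, below is a minimal PyTorch sketch of one plausible realization: the reference subspace is taken as the top-$r$ left singular vectors of a data-weighted reference update, and the alignment penalty is the chordal form $\sum_i \sin^2\theta_i = r - \lVert Q_B^\top U_{\mathrm{ref}}\rVert_F^2$ over the principal angles between $\mathrm{colspan}(B)$ and that reference. The function names, the use of QR for the basis, and the specific data weighting are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch


@torch.no_grad()
def reference_subspace(delta_ref: torch.Tensor, r: int) -> torch.Tensor:
    """Top-r left singular vectors of a (data-weighted) reference update.

    `delta_ref` (d_out x d_in) stands in for the paper's data-weighted
    student-space reference update; how it is weighted is an assumption here.
    """
    U, _, _ = torch.linalg.svd(delta_ref, full_matrices=False)
    return U[:, :r]  # (d_out, r), orthonormal columns


def subspace_alignment_loss(B: torch.Tensor, U_ref: torch.Tensor) -> torch.Tensor:
    """Differentiable penalty on the principal angles between colspan(B)
    and the fixed reference subspace U_ref (orthonormal columns).

    Uses sum_i sin^2(theta_i) = r - ||Q_B^T U_ref||_F^2, where Q_B is an
    orthonormal basis of colspan(B); this avoids an explicit SVD while
    remaining zero exactly when the two rank-r subspaces coincide.
    """
    Q_B, _ = torch.linalg.qr(B)                  # (d_out, r); QR is differentiable in PyTorch
    cos = Q_B.transpose(-2, -1) @ U_ref          # (r, r); singular values are cos(theta_i)
    r = U_ref.shape[-1]
    return r - (cos ** 2).sum()                  # = sum_i sin^2(theta_i), >= 0


# Hypothetical usage: penalize the LoRA "B" factor toward the reference subspace
# alongside a standard KD loss (weighting lambda_align is illustrative).
d_out, d_in, r = 768, 768, 4
B = torch.randn(d_out, r, requires_grad=True)          # LoRA up-projection factor
delta_ref = torch.randn(d_out, d_in)                   # placeholder reference update
U_ref = reference_subspace(delta_ref, r)
loss = 0.1 * subspace_alignment_loss(B, U_ref)         # added to the KD objective
loss.backward()
```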
Submission Number: 133