Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates

Published: 24 May 2026, Last Modified: 02 Jun 2026ICML 2026 Workshop WSS PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Model Compression, Spurious Correlations
TL;DR: We find that compressing fine-tuning updates can effectively mitigate learned spurious correlations
Abstract: Fine-tuning often introduces spurious correlations alongside task knowledge, causing systematic failures on underrepresented groups. Existing mitigations require retraining, group labels, or curated counterfactual data. We show a simple post-hoc intervention reduces shortcut reliance without any of these: truncating the tail of the SVD of $\Delta W = W_\mathrm{ft} - W_\mathrm{base}$ reduces the spurious-group gap while preserving task accuracy. Across three instruction-tuned models ($0.5$B--$7$B) and four classification benchmarks, top-$k$ truncation reduces the gap on every cell at $<2$ pp accuracy loss, by up to $5\times$ on CivilComments. We propose this works because the shortcut response sits in the tail of the singular ordering of $\Delta W$, a behavioural claim invisible in the broadly-distributed raw spectrum but vivid across all four datasets, where gap and accuracy decouple under truncation. A controlled boundary case in which fine-tuning has only a shortcut to learn shows the predicted FT-to-base collapse, and bottom-/random-$k$ and matched-rank LoRA controls rule out generic low-rank approximation and rank-constrained training as the explanation. We read this as preliminary evidence that the singular basis of $\Delta W$ is a useful coordinate system for studying what fine-tuning has learned.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 27
Loading