Steering Fine-Tuning Generalization with Targeted Concept Ablation

Published: 05 Mar 2025, Last Modified: 17 Apr 2025 · BuildingTrust · CC BY 4.0
Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: Interpretability, Mechanistic Interpretability, Fine-Tuning, Artificial Intelligence, AI, Sparse Autoencoders, Machine Learning
TL;DR: We present a novel technique for controlling what models learn during fine-tuning by identifying and ablating specific sparse autoencoder latents that represent undesired concepts.
Abstract: Models often learn unintended behaviors during fine-tuning, such as adopting spurious correlations present in the training data. We present a novel technique for controlling what models learn during fine-tuning by identifying and ablating specific sparse autoencoder latents that represent undesired concepts. Our approach steers models toward the intended generalization even when multiple policies correctly fit the training data. We evaluate our method on two tasks, on which it significantly outperforms baselines: a gender bias task containing spurious correlations and a double multiple-choice task where models must learn to answer the intended question while ignoring the other. On gender bias, our method completely eliminates the spurious correlation, leading to strong out-of-distribution performance. On double multiple choice, it succeeds in 10 out of 16 scenarios. Our results mark an initial step toward using interpretability techniques to ensure the safe and reliable deployment of frontier AI systems.
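To make the ablation step concrete, below is a minimal sketch (not the authors' released code) of how undesired concepts could be removed during fine-tuning: activations at a chosen layer are encoded by a sparse autoencoder, the latents flagged as representing the undesired concept are zeroed, and the reconstruction is substituted back before training continues. The SAE interface (`encode`/`decode`), the hooked layer, and the latent indices are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: an overcomplete ReLU dictionary over model activations."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

def make_ablation_hook(sae: SparseAutoencoder, ablate_ids: list[int]):
    """Forward hook: project activations through the SAE, zero the latents
    representing the undesired concept, and return the reconstruction."""
    def hook(module, inputs, output):
        acts = output[0] if isinstance(output, tuple) else output
        z = sae.encode(acts)
        z[..., ablate_ids] = 0.0  # ablate the undesired-concept latents
        recon = sae.decode(z)
        return (recon, *output[1:]) if isinstance(output, tuple) else recon
    return hook

# Hypothetical usage: register the hook on the layer the SAE was trained on,
# then run a standard fine-tuning loop. The layer path and latent ids below
# are placeholders.
# handle = model.transformer.h[6].register_forward_hook(
#     make_ablation_hook(sae, ablate_ids=[1234, 5678]))
# ... fine-tune as usual; the model cannot rely on the ablated concept ...
# handle.remove()
```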
Submission Number: 123