Steering Fine-Tuning Generalization with Targeted Concept Ablation

Published: 05 Mar 2025, Last Modified: 06 Mar 2025
Venue: BuildingTrust
License: CC BY 4.0
Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: Interpretability, Mechanistic Interpretability, Fine-Tuning, Artificial Intelligence, AI, Sparse Autoencoders, Machine Learning
TL;DR: We present a novel technique for controlling what models learn during fine-tuning by identifying and ablating specific sparse autoencoder latents that represent undesired concepts.
Abstract: During fine-tuning, multiple solutions may emerge that perform similarly on the training data but generalize differently out of distribution. For instance, a deceptive model may be indistinguishable from an aligned model during training, yet perform catastrophically at deployment. We present a novel technique for controlling what models learn during fine-tuning by identifying and ablating specific sparse autoencoder latents that represent undesired concepts. Our approach steers models toward intended generalizations when multiple policies correctly fit the training data. We evaluate our method on two tasks, where it significantly outperforms baselines: a gender bias task containing spurious correlations and a double multiple choice task in which models must learn to focus on the intended question while ignoring the other. On the gender bias task, our method completely eliminates the spurious correlations, leading to strong out-of-distribution performance. On double multiple choice, it succeeds in 12 out of 16 scenarios. Our results mark an initial step toward using interpretability techniques to ensure the safe and reliable deployment of frontier AI systems.
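The abstract describes ablating sparse autoencoder latents that encode undesired concepts while the model is fine-tuned. As a rough illustration only, not the paper's implementation, the PyTorch sketch below zeroes selected SAE latents in a hooked layer's activations and writes the edited reconstruction back during the forward pass; the `sae` interface (`encode`/`decode`), the hooked layer, and the latent indices are all assumptions.

```python
# Minimal sketch (illustrative, not the authors' code): ablate chosen SAE latents
# in a transformer layer's activations during fine-tuning.
import torch

def make_ablation_hook(sae, latent_ids):
    """Forward hook that removes the chosen latents' contribution from activations."""
    def hook(module, inputs, output):
        acts = output[0] if isinstance(output, tuple) else output
        codes = sae.encode(acts)              # (batch, seq, n_latents), assumed SAE API
        recon = sae.decode(codes)             # full SAE reconstruction
        ablated = codes.clone()
        ablated[..., latent_ids] = 0.0        # zero the undesired-concept latents
        ablated_recon = sae.decode(ablated)
        # Keep the SAE's reconstruction error, so only the ablated latents'
        # contribution is subtracted from the original activation.
        edited = acts - recon + ablated_recon
        return (edited,) + output[1:] if isinstance(output, tuple) else edited
    return hook

# Hypothetical usage around an otherwise standard fine-tuning loop:
# handle = model.transformer.h[8].register_forward_hook(
#     make_ablation_hook(sae, latent_ids=[12345]))
# ...train as usual...
# handle.remove()
```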
Submission Number: 123