Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Published: 05 Mar 2024, Last Modified: 08 May 2024, Venue: ICLR 2024 R2-FM Workshop Poster, Readers: Everyone, License: CC BY 4.0
Keywords: mechanistic interpretability, fine-tuning
TL;DR: We demonstrate that fine-tuning models rarely alters their underlying capabilities
Abstract: Fine-tuning large pre-trained models has become the de facto strategy for developing models that are safe to deploy. However, there has been little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model’s underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a ‘wrapper’, is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient “revival” of the capability, i.e., the model begins reusing this capability in a few gradient steps. This indicates practitioners can unintentionally remove a model’s safety wrapper by merely fine-tuning it on a superficially unrelated task.
Submission Number: 86