How does fine-tuning affect your model? Mechanistic analysis on procedural tasks

Published: 02 Nov 2023, Last Modified: 18 Dec 2023 | UniReps Poster
Keywords: Fine-Tuning, Interpretability, Mechanisms
TL;DR: We demonstrate that fine-tuning models rarely alters their underlying capabilities.
Abstract: Fine-tuning large pre-trained models has become the *de facto* strategy for developing models that are safe to deploy. However, there has been little work explaining how fine-tuning alters the underlying capabilities a model learns during pre-training: does fine-tuning yield entirely novel capabilities, or does it merely modulate existing ones? We address this question empirically in *synthetic* settings, using mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities change. Our extensive analysis of the effects of fine-tuning shows that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a "wrapper", is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival" of the capability, i.e., the model begins reusing the capability within a few gradient steps. *This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a superficially unrelated task.*
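To illustrate the probing methodology mentioned in the abstract, below is a minimal sketch (not the authors' exact setup) of a linear probe used to test whether a capability remains linearly decodable from a model's hidden activations after fine-tuning. The activation arrays, shapes, and labels here are placeholder assumptions; in practice they would be extracted (e.g., via a forward hook on a hidden layer) from the pre-trained and fine-tuned models on the same evaluation inputs.

```python
# Sketch of a linear probe over frozen activations. Random arrays stand in
# for real hidden states here; the shapes and labels are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 2000 examples, 256-dim hidden states, and binary labels
# marking whether the probed capability applies to each input.
acts_pretrained = rng.normal(size=(2000, 256))
acts_finetuned = rng.normal(size=(2000, 256))
labels = rng.integers(0, 2, size=2000)

def probe_accuracy(acts: np.ndarray, labels: np.ndarray) -> float:
    """Fit a logistic-regression probe on frozen activations and return
    held-out accuracy; high accuracy suggests the capability is still
    linearly decodable from the representations."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        acts, labels, test_size=0.25, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

print("pre-trained probe acc:", probe_accuracy(acts_pretrained, labels))
print("fine-tuned probe acc:", probe_accuracy(acts_finetuned, labels))
```

Under this reading, comparable held-out probe accuracy before and after fine-tuning would be evidence for the paper's claim that the capability is preserved (merely "wrapped") rather than deleted.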
Track: Extended Abstract Track
Submission Number: 81