Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

ICLR 2025 Conference Submission 8140 Authors

26 Sept 2024 (modified: 02 Dec 2024), ICLR 2025 Conference Submission, CC BY 4.0
Keywords: Large Language Models, Interpretability, AI Safety
TL;DR: We give a principled metric quantifying how much the fine-tuning stage contributed to the output of an LLM, and explore its relationship to model behavior and safety.
Abstract: Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a way to quantitatively and systematically analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution ($\mathrm{TuCo}$) as the ratio of the magnitudes of the fine-tuning component and the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that $\mathrm{TuCo}$ is consistently lower on prompts where the attacks succeed than on ones where they fail. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks. In summary, $\mathrm{TuCo}$ enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
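To make the idea concrete, below is a minimal, illustrative sketch of a TuCo-style score. It assumes (i) HuggingFace-style models exposing per-layer hidden states, (ii) that the fine-tuning component is approximated as the difference between the fine-tuned and pre-trained models' per-layer hidden-state updates on the same prompt, and (iii) a simple sum-of-norms notion of magnitude. The paper's exact decomposition and normalization may differ; all function and variable names here are hypothetical.

```python
# Illustrative only: a TuCo-style score computed from two forward passes.
# Assumes the pre-trained and fine-tuned models share the same architecture
# (same number of layers and hidden size), as in the setting described above.
import torch

def tuning_contribution_sketch(pre_model, fine_model, input_ids):
    """Hypothetical per-prompt score: magnitude of the fine-tuning component
    relative to the total (pre-training + fine-tuning) update magnitude."""
    with torch.no_grad():
        # HuggingFace-style API assumed: hidden_states is a tuple containing
        # the embedding output followed by each layer's output.
        h_pre = pre_model(input_ids, output_hidden_states=True).hidden_states
        h_fine = fine_model(input_ids, output_hidden_states=True).hidden_states

    ptc_norm, ftc_norm = 0.0, 0.0
    for layer in range(len(h_pre) - 1):
        # Pre-training component: the base model's per-layer residual update.
        ptc = h_pre[layer + 1] - h_pre[layer]
        # Fine-tuning component: what the fine-tuned model adds on top of it.
        ftc = (h_fine[layer + 1] - h_fine[layer]) - ptc
        ptc_norm += ptc.norm().item()
        ftc_norm += ftc.norm().item()

    # One plausible normalization: the fraction of total update magnitude
    # attributable to fine-tuning.
    return ftc_norm / (ptc_norm + ftc_norm)
```

In the same spirit, up- or down-scaling the `ftc` term by a scalar before adding it back into the residual stream would correspond to the steering intervention mentioned in the abstract; applying this during generation requires forward hooks and is omitted from this sketch.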
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8140