Keywords: Large Language Models, Interpretability, AI Safety
TL;DR: We give a principled metric quantifying how much the fine-tuning stage contributed to the output of an LLM, and explore its relationship to model behavior and safety.
Abstract: Past work has studied the effects of fine-tuning on the overall performance of large language models (LLMs) on certain tasks.
However, a way to quantitatively and systematically analyze its effect on individual outputs is still lacking.
In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model.
We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component.
Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass.
Motivated by this finding and our theoretical analysis, we define the Tuning Contribution ($\mathrm{TuCo}$) in terms of the ratio of the magnitudes of the fine-tuning component and the pre-training component.
We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that $\mathrm{TuCo}$ is consistently lower on prompts where the attacks succeed than on prompts where they fail.
This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks.
In summary, $\mathrm{TuCo}$ enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
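To make the decomposition and the scaling experiment concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a residual architecture in which each layer's update splits into the pre-trained layer's output plus the difference between the fine-tuned and pre-trained layers' outputs. The names `forward_scaled` and `alpha`, the tanh toy layers, and the specific normalization `ft / (pt + ft)` are illustrative assumptions; the abstract states only that the metric is defined in terms of the ratio of the two components' magnitudes.

```python
import math
import random


def make_layer(dim, rng, scale=0.1):
    """Toy layer: a dim x dim weight matrix stored as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, x):
    """Apply the toy layer: tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def plain_forward(layers, x):
    """Ordinary residual forward pass: x <- x + layer(x)."""
    for w in layers:
        x = [xi + u for xi, u in zip(x, apply_layer(w, x))]
    return x


def forward_scaled(pt_layers, ft_layers, x, alpha=1.0):
    """Residual forward pass with the fine-tuning component scaled by alpha.

    Per layer (assumed decomposition):
      pre-training component = pt_layer(x)
      fine-tuning component  = ft_layer(x) - pt_layer(x)
    alpha = 1 reproduces the fine-tuned model, alpha = 0 the pre-trained one.
    Returns the final hidden state and an illustrative magnitude ratio.
    """
    dim = len(x)
    ptc_sum = [0.0] * dim
    ftc_sum = [0.0] * dim
    for w_pt, w_ft in zip(pt_layers, ft_layers):
        ptc = apply_layer(w_pt, x)
        ftc = [f - p for f, p in zip(apply_layer(w_ft, x), ptc)]
        x = [xi + p + alpha * f for xi, p, f in zip(x, ptc, ftc)]
        ptc_sum = [s + p for s, p in zip(ptc_sum, ptc)]
        ftc_sum = [s + alpha * f for s, f in zip(ftc_sum, ftc)]
    pt_mag, ft_mag = norm(ptc_sum), norm(ftc_sum)
    total = pt_mag + ft_mag
    # Assumed normalization: share of the fine-tuning magnitude, in [0, 1].
    ratio = ft_mag / total if total > 0 else 0.0
    return x, ratio


if __name__ == "__main__":
    rng = random.Random(0)
    dim, depth = 4, 3
    pt = [make_layer(dim, rng) for _ in range(depth)]
    # Hypothetical "fine-tuned" weights: pre-trained weights plus a small shift.
    ft = [[[w + rng.gauss(0.0, 0.05) for w in row] for row in layer] for layer in pt]
    x0 = [1.0, -0.5, 0.25, 2.0]
    for alpha in (0.0, 0.5, 1.0, 2.0):
        _, ratio = forward_scaled(pt, ft, x0, alpha)
        print(f"alpha={alpha}: ratio={ratio:.4f}")
```

In this sketch the decomposition is exact by construction: the two components sum to the fine-tuned layer's output at every layer, so `alpha = 1` matches the ordinary fine-tuned forward pass up to floating-point error.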
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8140