TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We give a principled metric quantifying how much the fine-tuning stage contributed to the output of an LLM, and explore its relationship to model behavior and safety.
Abstract: Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a way to quantitatively and systematically analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method takes into account the model's intermediate hidden states, giving a more fine-grained insight into the effects of fine-tuning than a simple comparison of the final outputs of pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution ($\mathrm{TuCo}$) in terms of the ratio of the magnitudes of the fine-tuning component and the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that $\mathrm{TuCo}$ is consistently lower on prompts where the attacks succeed compared to ones where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks. In short, $\mathrm{TuCo}$ enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
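The abstract describes $\mathrm{TuCo}$ as a ratio of the magnitudes of the fine-tuning and pre-training components obtained from intermediate hidden states. The snippet below is a minimal illustrative sketch of that idea only, not the authors' exact implementation (see the linked repository for that). It assumes one can run the same prompt through both the pre-trained and fine-tuned model and collect per-layer hidden states, approximates the fine-tuning component as the layer-wise difference between the two, and the function names, tensor shapes, and the $\|\mathrm{FTC}\| / (\|\mathrm{PTC}\| + \|\mathrm{FTC}\|)$ normalization are assumptions made for illustration.

```python
# Illustrative sketch only -- not the paper's exact decomposition.
# Assumes per-layer hidden states of shape (num_layers, hidden_dim),
# e.g. taken at the final token position of the same prompt.
import torch


def tuco_sketch(h_pretrained: torch.Tensor, h_finetuned: torch.Tensor) -> float:
    """Ratio-of-magnitudes sketch of the Tuning Contribution (TuCo).

    Treats the pre-trained model's hidden states as the pre-training
    component (PTC) and the layer-wise difference to the fine-tuned
    model's hidden states as the fine-tuning component (FTC).
    """
    ptc = h_pretrained                     # pre-training component (assumed)
    ftc = h_finetuned - h_pretrained       # fine-tuning component (assumed)
    ptc_norm = ptc.norm(dim=-1).sum()      # total magnitude across layers
    ftc_norm = ftc.norm(dim=-1).sum()
    return (ftc_norm / (ptc_norm + ftc_norm)).item()


def scale_finetuning_component(h_pretrained: torch.Tensor,
                               h_finetuned: torch.Tensor,
                               alpha: float) -> torch.Tensor:
    """Up- or down-scale the fine-tuning component by a factor alpha,
    mimicking the steering experiment mentioned in the abstract."""
    return h_pretrained + alpha * (h_finetuned - h_pretrained)


if __name__ == "__main__":
    torch.manual_seed(0)
    layers, dim = 32, 4096                 # hypothetical model dimensions
    h_pt = torch.randn(layers, dim)        # stand-in for real hidden states
    h_ft = h_pt + 0.3 * torch.randn(layers, dim)
    print(f"TuCo (sketch): {tuco_sketch(h_pt, h_ft):.2%}")
```

With random stand-in tensors as above, the printed value is meaningless; in practice the hidden states would come from running a real pre-trained/fine-tuned model pair on the same prompt.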
Lay Summary: AI writing tools like ChatGPT first learn from vast collections of internet text, and are afterwards trained to follow instructions and safety rules. But it's hard to know how much each stage (learning from the internet versus learning to follow instructions) contributes to any single reply, making it difficult to quantitatively analyse how the AI works and behaves. We introduce a way to peek at the AI's internal signals as it answers each question and split each reply into two contributions: one from internet data and one from instruction data. From that split, we compute the Tuning Contribution (TuCo), a simple percentage showing how much the instruction data shaped the response compared to the internet data (for example, "30% tuning contribution"). TuCo can help researchers spot when the AI's instruction learning phase has less effect than intended, letting the AI go into "unfamiliar" territory for which it does not have instructions. It can reveal hidden blind spots, like trick prompts that quietly undermine safeguards, and can guide teams in strengthening defences. It can also point out questions where instruction training barely changes the answer, so developers can refine their training data and make AI systems more reliable.
Link To Code: https://github.com/FelipeNuti/tuning-contribution
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Interpretability, AI Safety
Submission Number: 7754