Understanding Pre-trained and Fine-tuned model behaviour using Model Diffing

Published: 23 Sept 2025, Last Modified: 17 Feb 2026 · CogInterp @ NeurIPS 2025 · Reject · CC BY 4.0
Keywords: Model Diffing, Foundational Research, Model Behaviour
Abstract: Fine-tuning large language models (LLMs) for specialized domains can alter both their output distributions and internal mechanisms in ways that standard task metrics obscure. We study model diffing between a pretrained base (DeepSeek-R1-Distill-Qwen-1.5B) and a LoRA-adapted variant trained for medical reasoning on HuatuoGPT-o1-style data. Our protocol couples (i) next-token KL divergence, measured across general and medical corpora to quantify output-level shift, with (ii) activation patching to localize where domain knowledge and reasoning procedures are encoded. We target the LoRA-modified projections and MLP pathways and analyze the behavioral impact of swapping per-layer activations across models. Complementary experiments explore Complex Chain-of-Thought fine-tuning and a Kahneman-Tversky Optimization (KTO) objective from the Human-Aware Loss family to encourage structured reasoning without preference labels. Empirically, we observe domain-selective distributional drift with minimal degradation on general text, and layerwise concentration of medical competence, consistent with prior findings that factual/semantic knowledge often resides in mid-to-late MLP blocks. Our contributions are: (1) a unified, reproducible KL-plus-patching diffing protocol; (2) evidence on how LoRA placements mediate domain specialization; and (3) an analysis of reasoning-oriented post-training (CoT/KTO) and its interaction with representational localization. We release code and scripts to support systematic model diffing in domain-specific alignment.
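As a rough illustration of the output-level half of the protocol, the sketch below computes the mean per-position next-token KL divergence KL(p_base || p_tuned) between the two models on a single passage. The Hugging Face transformers loading path, the helper name next_token_kl, and the commented-out peft step for the LoRA checkpoint are assumptions for illustration; the authors' released scripts may differ.

```python
# Minimal sketch, assuming Hugging Face transformers; not the paper's exact script.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(BASE).eval()
# In practice `tuned` would load the LoRA-adapted weights, e.g. via peft:
# tuned = PeftModel.from_pretrained(tuned, "path/to/medical-lora")

@torch.no_grad()
def next_token_kl(text: str) -> float:
    """Mean per-position KL(base || tuned) over next-token distributions."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logp_base = F.log_softmax(base(ids).logits, dim=-1)
    logp_tuned = F.log_softmax(tuned(ids).logits, dim=-1)
    # KL(P || Q) = sum_v P(v) * (log P(v) - log Q(v)), averaged over positions
    kl = (logp_base.exp() * (logp_base - logp_tuned)).sum(-1)
    return kl.mean().item()

print(next_token_kl("The patient presented with acute chest pain."))
```

Averaging this quantity over general versus medical corpora gives the domain-selective drift comparison the abstract describes.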
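The localization half relies on activation patching, i.e., splicing one model's per-layer activations into the other's forward pass and measuring the behavioral effect. Below is a minimal sketch using PyTorch forward hooks; the model.model.layers[...] module path and the tuple-shaped decoder-layer outputs follow the Qwen2 implementation in transformers and are assumptions, not a description of the paper's code.

```python
# Minimal activation-patching sketch, assuming Qwen2-style decoder layers
# whose forward() returns a tuple with hidden states in position 0.
import torch

def patch_layer_output(target_model, donor_model, layer_idx, ids):
    """Run donor, cache one decoder layer's output, splice it into target."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["act"] = output[0].detach()  # donor layer's hidden states

    def swap_hook(module, inputs, output):
        # Replace the target layer's hidden states with the donor's;
        # downstream layers then consume the patched activation.
        return (cache["act"],) + output[1:]

    donor_layer = donor_model.model.layers[layer_idx]
    target_layer = target_model.model.layers[layer_idx]

    handle = donor_layer.register_forward_hook(save_hook)
    with torch.no_grad():
        donor_model(ids)
    handle.remove()

    handle = target_layer.register_forward_hook(swap_hook)
    with torch.no_grad():
        patched_logits = target_model(ids).logits
    handle.remove()
    return patched_logits
```

Sweeping layer_idx across the network and comparing patched logits against each model's unpatched output is one way to surface the layerwise concentration of medical competence reported in the abstract.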
Submission Number: 59