A Model Diffing Approach For Identifying Causal Latents of Fine-Tuning-Induced Hallucinations

20 Sept 2025 (modified: 25 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Sparse Autoencoders, Steering, AI Safety, Interpretability, Alignment
Abstract: Large Language Models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next-token prediction. Subsequent post-training stages involving instruction-tuning or safety training often introduce new facts not present in the model's parametric knowledge, giving rise to hallucinations. While it has been empirically shown that Supervised Fine-Tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. In this work, we conduct a controlled fine-tuning experiment, focusing on closed-book QA, and identify latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B base separately on 7 distinct QA datasets, controlling for the percentage of new knowledge and the number of training epochs. By measuring performance on the test set, we validate that incrementally introducing new knowledge increases hallucinations, with the effect being more pronounced under prolonged training. We leverage pre-trained Sparse Autoencoders (SAEs) to mechanistically analyze residual stream activations across 42 checkpoints and propose a model-diffing approach for capturing causally relevant latents in this context. Our findings underscore the importance of curating post-training datasets that align with the base model's parametric knowledge and offer novel insights into how models change when integrating new knowledge.
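To make the model-diffing idea concrete, here is a minimal sketch of one plausible reading of the approach: encode residual-stream activations from the base and a fine-tuned checkpoint with the same pretrained SAE, then rank latents by how much their mean activation shifts. All names, shapes, and the random stand-in tensors below are illustrative assumptions, not the authors' code; the paper additionally validates causal relevance (e.g. via steering), which this sketch does not cover.

```python
import torch

def encode_sae(acts: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor) -> torch.Tensor:
    """Encode residual-stream activations [n_tokens, d_model] into SAE latents [n_tokens, d_sae]."""
    return torch.relu(acts @ W_enc + b_enc)

def rank_diff_latents(base_acts, ft_acts, W_enc, b_enc, top_k=20):
    """Rank SAE latents by the shift in mean activation between base and fine-tuned checkpoints."""
    base_lat = encode_sae(base_acts, W_enc, b_enc).mean(dim=0)  # [d_sae]
    ft_lat = encode_sae(ft_acts, W_enc, b_enc).mean(dim=0)      # [d_sae]
    diff = ft_lat - base_lat                                    # positive: latent fires more after SFT
    top = torch.topk(diff.abs(), k=top_k)
    return top.indices, diff[top.indices]

# Illustrative usage with random tensors standing in for cached residual-stream activations.
d_model, d_sae, n_tokens = 4096, 16384, 1024
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5
b_enc = torch.zeros(d_sae)
base_acts = torch.randn(n_tokens, d_model)   # activations from the base checkpoint
ft_acts = torch.randn(n_tokens, d_model)     # activations from a fine-tuned checkpoint
idx, shift = rank_diff_latents(base_acts, ft_acts, W_enc, b_enc)
print(idx[:5], shift[:5])
```

Latents surfaced this way are only candidates; establishing that they causally contribute to hallucinations requires an intervention such as ablating or steering along the corresponding decoder directions and measuring the change in QA accuracy.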
Primary Area: interpretability and explainable AI
Code Of Ethics: true
Submission Guidelines: true
Anonymous Url: true
No Acknowledgement Section: true
Submission Number: 25272