A Model Diffing Approach For Identifying Causal Latents of Fine-Tuning-Induced Hallucinations

20 Sept 2025 (modified: 25 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Sparse Autoencoders, Steering, AI Safety, Interpretability, Alignment
Abstract: Large Language Models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next-token prediction. Subsequent post-training stages involving instruction-tuning or safety training often introduce new facts not present in the model's parametric knowledge, giving rise to hallucinations. While it has been empirically shown that Supervised Fine-Tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. In this work, we conduct a controlled fine-tuning experiment, focusing on closed-book QA, and identify latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B base separately on 7 distinct QA datasets, controlling for the percentage of new knowledge and the number of training epochs. By measuring performance on the test set, we validate that incrementally introducing new knowledge increases hallucinations, with the effect being more pronounced under prolonged training. We leverage pre-trained Sparse Autoencoders (SAEs) to mechanistically analyze residual stream activations across 42 checkpoints and propose a model-diffing approach for capturing causally relevant latents in this context. Our findings underscore the importance of curating post-training datasets that align with the base model's parametric knowledge and offer novel insights into how models change when integrating new knowledge.
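To make the model-diffing idea concrete, here is a minimal sketch of one plausible reading of the approach: encode residual-stream activations from the base and a fine-tuned checkpoint with the same pretrained SAE, then rank latents by how much their mean activation shifts. All names, shapes, and the random stand-in tensors below are illustrative assumptions, not the authors' code; the paper additionally validates causal relevance (e.g. via steering), which this sketch does not cover.

```python
import torch

def encode_sae(acts: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor) -> torch.Tensor:
    """Encode residual-stream activations [n_tokens, d_model] into SAE latents [n_tokens, d_sae]."""
    return torch.relu(acts @ W_enc + b_enc)

def rank_diff_latents(base_acts, ft_acts, W_enc, b_enc, top_k=20):
    """Rank SAE latents by the shift in mean activation between base and fine-tuned checkpoints."""
    base_lat = encode_sae(base_acts, W_enc, b_enc).mean(dim=0)  # [d_sae]
    ft_lat = encode_sae(ft_acts, W_enc, b_enc).mean(dim=0)      # [d_sae]
    diff = ft_lat - base_lat                                    # positive: latent fires more after SFT
    top = torch.topk(diff.abs(), k=top_k)
    return top.indices, diff[top.indices]

# Illustrative usage with random tensors standing in for cached residual-stream activations.
d_model, d_sae, n_tokens = 4096, 16384, 1024
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5
b_enc = torch.zeros(d_sae)
base_acts = torch.randn(n_tokens, d_model)   # activations from the base checkpoint
ft_acts = torch.randn(n_tokens, d_model)     # activations from a fine-tuned checkpoint
idx, shift = rank_diff_latents(base_acts, ft_acts, W_enc, b_enc)
print(idx[:5], shift[:5])
```

Latents surfaced this way are only candidates; establishing that they causally contribute to hallucinations requires an intervention such as ablating or steering along the corresponding decoder directions and measuring the change in QA accuracy.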
Primary Area: interpretability and explainable AI
Code Of Ethics: true
Submission Guidelines: true
Anonymous Url: true
No Acknowledgement Section: true
Submission Number: 25272