Abstract: Hallucination remains a critical failure mode of large language models (LLMs), undermining their trustworthiness in real-world applications.
In this work, we focus on confabulation, a foundational form of hallucination in which the model fabricates facts about unknown entities.
We introduce a targeted dataset designed to isolate and analyze this behavior across diverse prompt types. Using this dataset, and building on recent progress in interpreting LLM internals, we extract latent directions associated with confabulation using sparse projections. A simple vector-based steering method demonstrates that these directions can modulate model behavior with minimal disruption, shedding light on the inner representations that drive factual and non-factual output. Our findings contribute to a deeper mechanistic understanding of LLMs and pave the way toward more trustworthy and controllable generation. We release the code and dataset at https://anonymous.4open.science/r/Confabulation-discovery.
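Note: the abstract's pipeline of extracting a latent direction from contrastive activations and steering with it can be illustrated with a short sketch. The snippet below is only a generic illustration under assumed choices: it uses a toy module in place of a real LLM, a difference-of-means direction as a stand-in for the paper's sparse-projection extraction, and an arbitrary layer index and steering strength ALPHA. None of these reflect the authors' actual implementation.

```python
# Minimal sketch of vector-based activation steering (illustrative assumptions only):
# derive a "confabulation" direction from contrastive hidden states, then add/subtract
# it in the forward pass. The toy model, layer index, and ALPHA are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)
D_MODEL, N_LAYERS = 64, 4

# Stand-in for a transformer's residual stream: a stack of MLP blocks.
model = nn.Sequential(*[
    nn.Sequential(nn.Linear(D_MODEL, D_MODEL), nn.GELU()) for _ in range(N_LAYERS)
])
LAYER_IDX = 2  # hypothetical layer at which to probe and steer

# --- 1. Cache hidden states at the chosen layer for two contrastive prompt sets ---
acts = {}

def cache_hook(_module, _inp, out):
    acts["h"] = out.detach()

handle = model[LAYER_IDX].register_forward_hook(cache_hook)

def hidden_states(batch):
    model(batch)          # forward pass; cache_hook stores the layer output
    return acts["h"]

factual_inputs = torch.randn(32, D_MODEL)  # placeholder for prompts about known entities
confab_inputs = torch.randn(32, D_MODEL)   # placeholder for prompts about unknown entities

# Difference-of-means direction separating the two sets (a common contrastive-probing
# choice, used here as a stand-in for the paper's sparse-projection extraction).
direction = hidden_states(confab_inputs).mean(0) - hidden_states(factual_inputs).mean(0)
direction = direction / direction.norm()
handle.remove()

# --- 2. Steer: shift the residual stream away from the confabulation direction ---
ALPHA = 4.0  # illustrative steering strength

def steer_hook(_module, _inp, out):
    # Returning a value from a forward hook replaces the module's output.
    return out - ALPHA * direction

model[LAYER_IDX].register_forward_hook(steer_hook)
steered_out = model(torch.randn(1, D_MODEL))
print(steered_out.shape)
```

With a real LLM, the same pattern applies by hooking a decoder block's residual stream (e.g. via `register_forward_hook` on a HuggingFace model's layer module) rather than a toy `nn.Sequential`.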
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Confabulation, Factual Retrieval, Latent Representations, Activation Steering, Contrastive Probing, Behavioral Control, Trustworthiness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, French
Submission Number: 5697