Abstract: Hallucination remains a critical failure mode of large language models (LLMs), undermining their trustworthiness in real-world applications.
In this work, we focus on confabulation, a foundational form of hallucination in which the model fabricates facts about unknown entities.
We introduce a targeted dataset designed to isolate and analyze this behavior across diverse prompt types. Using this dataset, and building on recent progress in interpreting LLM internals, we extract latent directions associated with confabulation using sparse projections. A simple vector-based steering method demonstrates that these directions can modulate model behavior with minimal disruption, shedding light on the inner representations that drive factual and non-factual output. Our findings contribute to a deeper mechanistic understanding of LLMs and pave the way toward more trustworthy and controllable generation. We release the code and dataset at https://anonymous.4open.science/r/Confabulation-discovery.
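Note: the abstract's pipeline of extracting a latent direction from contrastive activations and steering with it can be illustrated with a short sketch. The snippet below is only a generic illustration under assumed choices: it uses a toy module in place of a real LLM, a difference-of-means direction as a stand-in for the paper's sparse-projection extraction, and an arbitrary layer index and steering strength ALPHA. None of these reflect the authors' actual implementation.

```python
# Minimal sketch of vector-based activation steering (illustrative assumptions only):
# derive a "confabulation" direction from contrastive hidden states, then add/subtract
# it in the forward pass. The toy model, layer index, and ALPHA are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)
D_MODEL, N_LAYERS = 64, 4

# Stand-in for a transformer's residual stream: a stack of MLP blocks.
model = nn.Sequential(*[
    nn.Sequential(nn.Linear(D_MODEL, D_MODEL), nn.GELU()) for _ in range(N_LAYERS)
])
LAYER_IDX = 2  # hypothetical layer at which to probe and steer

# --- 1. Cache hidden states at the chosen layer for two contrastive prompt sets ---
acts = {}

def cache_hook(_module, _inp, out):
    acts["h"] = out.detach()

handle = model[LAYER_IDX].register_forward_hook(cache_hook)

def hidden_states(batch):
    model(batch)          # forward pass; cache_hook stores the layer output
    return acts["h"]

factual_inputs = torch.randn(32, D_MODEL)  # placeholder for prompts about known entities
confab_inputs = torch.randn(32, D_MODEL)   # placeholder for prompts about unknown entities

# Difference-of-means direction separating the two sets (a common contrastive-probing
# choice, used here as a stand-in for the paper's sparse-projection extraction).
direction = hidden_states(confab_inputs).mean(0) - hidden_states(factual_inputs).mean(0)
direction = direction / direction.norm()
handle.remove()

# --- 2. Steer: shift the residual stream away from the confabulation direction ---
ALPHA = 4.0  # illustrative steering strength

def steer_hook(_module, _inp, out):
    # Returning a value from a forward hook replaces the module's output.
    return out - ALPHA * direction

model[LAYER_IDX].register_forward_hook(steer_hook)
steered_out = model(torch.randn(1, D_MODEL))
print(steered_out.shape)
```

With a real LLM, the same pattern applies by hooking a decoder block's residual stream (e.g. via `register_forward_hook` on a HuggingFace model's layer module) rather than a toy `nn.Sequential`.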
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Confabulation, Factual Retrieval, Latent Representations, Activation Steering, Contrastive Probing, Behavioral Control, Trustworthiness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, French
Submission Number: 5697