Disentangling the Diversity of Truth in Large Language Models

ACL ARR 2025 July Submission 1008 Authors

29 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) often produce factually incorrect statements while offering no citations or visible reasoning. We believe the safe deployment of LLMs requires a deeper understanding of how truth is represented within these models. In this paper, we study the internal representation of truth in LLMs and introduce a taxonomy of truth types, including arithmetic, logical, symbolic, and consensus-based truths. We use linear probes to identify where in the model different truth types become linearly decodable, and we apply control tasks to distinguish genuine encoding from superficial correlations. Our findings reveal that distinct truth types emerge at different layers; for instance, single-digit sums are encoded earlier than multi-digit ones, suggesting increasing abstraction with depth. To further interpret these internal representations, we train sparse autoencoders on hidden states, revealing human-interpretable features such as “[person] lived in [place]” patterns or arithmetic involving specific digits. These results highlight structure and specialization in how truth is encoded across transformer layers and neurons. To support future work, we release a tool for probing and visualizing internal representations across models and datasets.
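The layer-wise probing setup described in the abstract can be illustrated with a minimal sketch: extract hidden states at every layer for a set of labeled statements, then fit one logistic-regression probe per layer and see where true/false becomes linearly decodable. The model name (gpt2), the toy statements, and the last-token pooling below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal layer-wise linear-probing sketch, assuming a HuggingFace causal LM
# and a small labeled set of (statement, is_true) pairs. All specifics here
# (model, data, pooling) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # assumption: any causal LM exposing hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Hypothetical toy data: true/false arithmetic statements.
statements = ["2 + 2 = 4", "3 + 5 = 8", "2 + 2 = 5", "3 + 5 = 9"]
labels = [1, 1, 0, 0]

def hidden_states(text):
    """Return per-layer last-token hidden states, shape (n_layers + 1, d_model)."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return torch.stack([h[0, -1] for h in out.hidden_states])

feats = torch.stack([hidden_states(s) for s in statements])  # (N, L+1, d)

# Fit one logistic-regression probe per layer; probe accuracy indicates
# at which depth the true/false distinction becomes linearly decodable.
# Control task (sketch): refit each probe on shuffled labels; a genuine
# encoding shows a large gap between real and control accuracy.
for layer in range(feats.shape[1]):
    X = feats[:, layer].numpy()
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"layer {layer}: train acc = {probe.score(X, labels):.2f}")
```

In practice one would report held-out accuracy alongside a shuffled-label control probe at each layer; the gap between the two, rather than raw accuracy, is what distinguishes genuine encoding from superficial correlations.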
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, knowledge tracing/discovering/inducing, hierarchical & concept explanations, feature attribution
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4
B2 Discuss The License For Artifacts: No
B2 Elaboration: Section 4. The artifacts are open to be used in this manner.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: Section 4. The artifacts are open to be used in this manner.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 4
B6 Statistics For Data: Yes
B6 Elaboration: Section 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4. The number of parameters is reported; the computational budget is not discussed in detail because it was negligible (experiments ran on a MacBook Pro).
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 6
C3 Descriptive Statistics: N/A
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: LLM-based assistance was used for some aspects of coding and for light text editing, not for writing the paper.
Author Submission Checklist: Yes
Submission Number: 1008