Disentangling the Diversity of Truth in Large Language Models

ACL ARR 2025 July Submission 1008 Authors

29 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) often produce factually incorrect statements while offering no citations or visible reasoning. We believe the safe deployment of LLMs requires a deeper understanding of how truth is represented within these models. In this paper, we study the internal representation of truth in LLMs and introduce a taxonomy of truth types, including arithmetic, logical, symbolic, and consensus-based truths. We use linear probes to identify where in the model different truth types become linearly decodable, and we apply control tasks to distinguish genuine encoding from superficial correlations. Our findings reveal that distinct truth types emerge at different layers; for instance, single-digit sums are encoded earlier than multi-digit ones, suggesting increasing abstraction with depth. To further interpret these internal representations, we train sparse autoencoders on hidden states, revealing human-interpretable features such as “[person] lived in [place]” patterns or arithmetic involving specific digits. These results highlight structure and specialization in how truth is encoded across transformer layers and neurons. To support future work, we release a tool for probing and visualizing internal representations across models and datasets.
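The layer-wise probing setup described in the abstract can be illustrated with a minimal sketch: extract hidden states at every layer for a set of labeled statements, then fit one logistic-regression probe per layer and see where true/false becomes linearly decodable. The model name (gpt2), the toy statements, and the last-token pooling below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal layer-wise linear-probing sketch, assuming a HuggingFace causal LM
# and a small labeled set of (statement, is_true) pairs. All specifics here
# (model, data, pooling) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # assumption: any causal LM exposing hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Hypothetical toy data: true/false arithmetic statements.
statements = ["2 + 2 = 4", "3 + 5 = 8", "2 + 2 = 5", "3 + 5 = 9"]
labels = [1, 1, 0, 0]

def hidden_states(text):
    """Return per-layer last-token hidden states, shape (n_layers + 1, d_model)."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return torch.stack([h[0, -1] for h in out.hidden_states])

feats = torch.stack([hidden_states(s) for s in statements])  # (N, L+1, d)

# Fit one logistic-regression probe per layer; probe accuracy indicates
# at which depth the true/false distinction becomes linearly decodable.
# Control task (sketch): refit each probe on shuffled labels; a genuine
# encoding shows a large gap between real and control accuracy.
for layer in range(feats.shape[1]):
    X = feats[:, layer].numpy()
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"layer {layer}: train acc = {probe.score(X, labels):.2f}")
```

In practice one would report held-out accuracy alongside a shuffled-label control probe at each layer; the gap between the two, rather than raw accuracy, is what distinguishes genuine encoding from superficial correlations.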
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, knowledge tracing/discovering/inducing, hierarchical & concept explanations, feature attribution
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4
B2 Discuss The License For Artifacts: No
B2 Elaboration: Section 4. The artifacts are open to be used in this manner.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: Section 4. The artifacts are open to be used in this manner.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 4
B6 Statistics For Data: Yes
B6 Elaboration: Section 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4. The number of parameters is reported; the computational budget is not discussed in detail because it was negligible (experiments ran on a MacBook Pro).
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 6
C3 Descriptive Statistics: N/A
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: LLM-based assistance was used for some aspects of coding and for light text editing, not for writing the paper.
Author Submission Checklist: Yes
Submission Number: 1008