Trilemma of Truth in Large Language Models

Published: 30 Sept 2025, Last Modified: 14 Nov 2025
Venue: Mech Interp Workshop (NeurIPS 2025), Poster
License: CC BY 4.0
Open Source Links: Code repository: https://github.com/carlomarxdk/trilemma-of-truth | Dataset: https://huggingface.co/datasets/carlomarxx/trilemma-of-truth
Keywords: Applications of interpretability, Understanding high-level properties of models, Interpretability tooling and software
Other Keywords: Probing method, veracity, facts, multiple-instance learning, conformal prediction, uncertainty calibration, knowledge representation, interpretability, model auditing
TL;DR: Three-valued probing of veracity in large language models
Abstract: The public often attributes human-like qualities to large language models (LLMs) and assumes they "know" certain things. In reality, LLMs encode information retained during training as internal probabilistic knowledge. This study examines existing methods for probing the veracity of that knowledge and identifies several flawed underlying assumptions. To address these flaws, we introduce sAwMIL (Sparse-Aware Multiple-Instance Learning), a multiclass probing framework that combines multiple-instance learning with conformal prediction. sAwMIL leverages internal activations of LLMs to classify statements as true, false, or neither. We evaluate sAwMIL across 16 open-source LLMs, including default and chat-based variants, on three new curated datasets. Our results show that (1) common probing methods fail to provide a reliable and transferable veracity direction and, in some settings, perform worse than zero-shot prompting; (2) truth and falsehood are not encoded symmetrically; and (3) LLMs encode a third type of signal that is distinct from both true and false.
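The abstract describes classifying statements as true, false, or neither from internal activations using a probe calibrated with conformal prediction. The following is a minimal illustrative sketch of that general idea, not the authors' sAwMIL implementation: it assumes precomputed statement-level activation vectors (e.g., mean-pooled token states as a crude stand-in for the multiple-instance step), uses an L1-penalized logistic regression as a stand-in for the sparse-aware learner, and applies split conformal prediction to produce calibrated prediction sets over the three labels. All data and parameter choices below are hypothetical.

```python
# Minimal sketch of a three-way (true / false / neither) veracity probe with
# split conformal prediction. NOT the sAwMIL method from the paper; a toy
# illustration under the assumptions stated in the lead-in.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_probe(X, y, seed=0):
    """Fit a sparse (L1-penalized) linear probe on statement-level activations."""
    clf = LogisticRegression(penalty="l1", solver="saga", C=0.5,
                             max_iter=5000, random_state=seed)
    clf.fit(X, y)
    return clf


def conformal_threshold(clf, X_cal, y_cal, alpha=0.1):
    """Split-conformal quantile of nonconformity scores (1 - prob of true label)."""
    probs = clf.predict_proba(X_cal)
    scores = 1.0 - probs[np.arange(len(y_cal)), y_cal]
    n = len(scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q_level, 1.0), method="higher")


def predict_sets(clf, X_test, q):
    """For each statement, return the set of labels whose nonconformity <= q."""
    probs = clf.predict_proba(X_test)
    return [set(np.where(1.0 - p <= q)[0]) for p in probs]


if __name__ == "__main__":
    # Toy stand-in data: 64-dim "activations", labels 0=false, 1=true, 2=neither.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 64))
    y = rng.integers(0, 3, size=600)
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
    X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    probe = fit_probe(X_tr, y_tr)
    q = conformal_threshold(probe, X_cal, y_cal, alpha=0.1)
    sets = predict_sets(probe, X_te, q)
    coverage = np.mean([y_te[i] in s for i, s in enumerate(sets)])
    print(f"Empirical coverage at alpha=0.1: {coverage:.2f}")
```

With real activations, the calibration step guarantees (marginally) that the prediction set contains the correct label with probability at least 1 - alpha; statements whose set is not a confident singleton can be treated as "neither" or abstained on, which is one simple way to operationalize the three-valued output the abstract describes.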
Submission Number: 150