Truthfulness in LLMs: A Layer-wise Comparative Analysis of Representation Engineering and Contrast-Consistent Search

ICLR 2025 Workshop BuildingTrust Submission 73 Authors

10 Feb 2025 (modified: 06 Mar 2025), Submitted to BuildingTrust, CC BY 4.0
Track: Long Paper Track (up to 9 pages)
Keywords: Truthfulness, Large Language Models, LLM Transparency, Representation Engineering, Contrast-Consistent Search, Interpretability
Abstract: The rapid advancement of Large Language Models (LLMs) has intensified the need for greater transparency in their internal representations. This study presents a layer-wise analysis of where truthfulness is stored in LLMs, comparing two state-of-the-art knowledge probing methodologies: Representation Engineering (RepE) and Contrast-Consistent Search (CCS). Our goal is to isolate truthfulness, defined as the factual accuracy of LLM outputs, from the general knowledge encoded across model layers, and to examine where and how this information is stored. RepE applies low-rank transformations within the model’s internal vector space, while CCS leverages pre-trained fixed vectors with an additional transformation layer to define truthfulness. Through experiments on Google’s Gemma models, evaluated across five diverse datasets, we find that truthfulness is embedded within pre-trained LLMs and can be amplified by specific input words. Our analysis reveals general trends in truthfulness storage and transferability, with CCS demonstrating greater stability in assessing truthfulness and RepE showing potential in deeper layers but requiring further refinement. Surprisingly, truthfulness differences in the final layer, often considered the most critical, were not statistically significant. This study provides empirical insights into the internal encoding of truthfulness in LLMs, highlighting the strengths and limitations of representation-based transparency methods.
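For readers unfamiliar with the probing setup the abstract compares, the sketch below illustrates the standard contrast-consistent probe introduced by Burns et al. (2022): an unsupervised linear probe trained on one layer's hidden states so that the probabilities assigned to a statement and its negation are both consistent (summing to roughly one) and confident. This is a minimal illustrative reconstruction, not the authors' implementation; the probe architecture, optimizer settings, and names such as `ccs_loss` and `train_probe` are assumptions, and the hidden-state normalization used in the original CCS paper is omitted for brevity.

```python
import torch
import torch.nn as nn


class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability that the statement is true."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)


def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """Consistency + confidence loss from Burns et al. (2022).

    p_pos: probe outputs on hidden states of the affirmatively phrased statements
    p_neg: probe outputs on hidden states of the negated statements
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2      # the two answers should be complementary
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the degenerate p_pos ≈ p_neg ≈ 0.5 solution
    return (consistency + confidence).mean()


def train_probe(h_pos: torch.Tensor, h_neg: torch.Tensor, epochs: int = 100) -> CCSProbe:
    """Train one probe on hidden states from a single layer (shape: [n_examples, hidden_dim])."""
    probe = CCSProbe(h_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe
```

In a layer-wise analysis of the kind the paper describes, one such probe would be trained per layer and its accuracy compared across depths; RepE-style approaches instead extract a truth direction directly from contrastive hidden-state differences rather than training a probe against this loss.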
Submission Number: 73