Challenges with unsupervised LLM knowledge discovery

09 May 2024 (modified: 06 Nov 2024) · Submitted to NeurIPS 2024 · CC BY 4.0
Keywords: Eliciting latent knowledge, large language models, deception
TL;DR: Theorems and experiments reveal that past attempts to discover latent knowledge in large language models in fact discover things other than knowledge, and this challenge seems likely to continue.
Abstract: We reveal novel pathologies in existing unsupervised methods that seek to discover latent knowledge from large language model (LLM) activations: instead of knowledge, they seem to discover whatever feature of the activations is most prominent. These methods search for hypothesised consistency structures of latent knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a popular unsupervised knowledge-elicitation method, contrast-consistent search. We then present a series of experiments showing settings in which this and other unsupervised methods yield classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks for evaluating future knowledge-elicitation methods. Finally, we offer conceptual arguments, grounded in identification issues such as distinguishing a model's knowledge from that of a simulated character, for why these problems are likely to persist in future unsupervised methods.
Primary Area: Interpretability and explainability
Submission Number: 3383
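
The abstract refers to the consistency structure that contrast-consistent search (CCS; Burns et al., 2022) looks for in contrast-pair activations. Below is a minimal sketch of that objective, assuming a PyTorch-style linear probe over precomputed activations; the names and shapes are illustrative, not the authors' code, and details such as activation normalisation are omitted.

```python
# Minimal sketch of the contrast-consistent search (CCS) objective.
# Assumes precomputed activations for a statement (pos) and its negation (neg).
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Maps an activation vector to a probability in (0, 1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x))

def ccs_loss(probe: nn.Module, acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """CCS loss on contrast pairs.

    Consistency term: p(pos) and p(neg) should sum to 1 (negation consistency).
    Confidence term: discourages the degenerate solution p = 0.5 everywhere.
    """
    p_pos = probe(acts_pos)
    p_neg = probe(acts_neg)
    consistency = (p_pos - (1 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Usage sketch on hypothetical contrast-pair activations.
dim = 512
probe = LinearProbe(dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
acts_pos, acts_neg = torch.randn(256, dim), torch.randn(256, dim)
for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe, acts_pos, acts_neg)
    loss.backward()
    opt.step()
```

As the abstract notes, the paper's theoretical result is that arbitrary binary features whose value flips between a statement and its negation can satisfy this structure, so a low CCS loss does not by itself show that the probe has found knowledge.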
