Keywords: LLMs, representations
TL;DR: We demonstrate that representations can be extracted from black-box language models by querying them about their behavior; these representations are useful for predicting performance and for detecting whether a model has been adversarially influenced.
Abstract: As large language models (LLMs) are increasingly relied on in AI systems, predicting and understanding their behavior is crucial. Although a great deal of work in the field uses internal representations to interpret models, these representations are inaccessible when only black-box access is available through an API. In this paper, we extract representations of LLMs in a black-box manner by asking simple elicitation questions and using the probabilities of different responses \emph{as} the representation itself. These representations can, in turn, be used to produce reliable predictors of model behavior. We demonstrate that training a linear model on these low-dimensional representations produces reliable and generalizable predictors of model performance (e.g., accuracy on question-answering tasks). Remarkably, these predictors can often outperform white-box linear predictors that operate over a model's hidden state or the full distribution over its vocabulary. In addition, we demonstrate that these extracted representations can be used to evaluate more nuanced aspects of a language model's state. For instance, they can distinguish between GPT-3.5 and a version of GPT-3.5 affected by an adversarial system prompt that causes it to answer incorrectly much of the time. Furthermore, these representations can reliably distinguish between different model architectures and sizes, enabling the detection of misrepresented models provided through an API (e.g., identifying if GPT-3.5 is supplied instead of GPT-4).
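To make the pipeline concrete, below is a minimal sketch, under stated assumptions, of how elicitation-based representations and a linear probe might be wired together. The elicitation questions, the `query_yes_probability` stub, and the synthetic labels are hypothetical placeholders standing in for the paper's actual protocol and for a real API call that returns response probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical elicitation questions; the paper's actual question set may differ.
ELICITATION_QUESTIONS = [
    "Are you confident in your answers to math questions?",
    "Do you often refuse to answer questions?",
    "Would you answer the previous question the same way if asked again?",
]

def query_yes_probability(model_id: str, question: str, rng: np.random.Generator) -> float:
    """Stand-in for a black-box API call that returns P('yes') for a question.

    In practice this would query the model (e.g., via token log-probabilities
    over 'yes'/'no'); here it is simulated so the sketch runs end to end.
    """
    return float(rng.beta(2.0, 2.0))

def black_box_representation(model_id: str, rng: np.random.Generator) -> np.ndarray:
    # The representation is simply the vector of response probabilities.
    return np.array([query_yes_probability(model_id, q, rng) for q in ELICITATION_QUESTIONS])

# Build a toy dataset: one representation per (model, prompt-condition) instance,
# with a binary label such as "answered the benchmark question correctly".
rng = np.random.default_rng(0)
X = np.stack([black_box_representation("some-api-model", rng) for _ in range(200)])
y = (X @ np.array([1.5, -1.0, 0.5]) + 0.1 * rng.standard_normal(200) > 0.4).astype(int)

# A linear probe over the low-dimensional representation predicts behavior.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
probe = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```

The same recipe applies to the other tasks mentioned above (e.g., detecting an adversarial system prompt or a misrepresented model): only the labels change, while the representation remains the vector of elicited response probabilities.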
Submission Number: 19