Keywords: large language models, monitoring, black-box
TL;DR: We provide a technique to monitor and predict the behavior of LLMs in a black-box setting.
Abstract: Reliably predicting the behavior of language models---such as whether their outputs are correct or have been adversarially manipulated---is a fundamentally challenging task.
This is often made even more difficult as frontier language models are offered only through closed-source APIs, providing only black-box access.
In this paper, we predict the behavior of black-box language models by asking follow-up questions and using the probabilities of their responses \emph{as} representations for training reliable predictors.
We first demonstrate that training a linear model on these responses reliably and accurately predicts model correctness on question-answering and reasoning benchmarks.
Surprisingly, this can \textit{even outperform white-box linear predictors} that operate over model internals or activations.
Furthermore, we demonstrate that these follow-up question responses can reliably distinguish between a clean version of an LLM and one that has been adversarially influenced via a system prompt to answer questions incorrectly or to introduce bugs into generated code.
Finally, we show that they can also be used to differentiate between black-box LLMs, enabling the detection of misrepresented models provided through an API.
Overall, our work shows promise for the reliable monitoring of black-box LLM behavior, supporting their responsible deployment in autonomous systems.
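To make the core idea concrete, here is a minimal sketch (not the authors' released code) of the approach the abstract describes: treat a black-box LLM's probabilities of answering fixed yes/no follow-up questions as a feature vector, and fit a linear probe that predicts whether the model's original answer was correct. The follow-up wording and the `get_yes_prob` callback are illustrative placeholders for whatever black-box API call returns P("Yes").

```python
# Hedged sketch of the follow-up-question probing idea described in the abstract.
# `get_yes_prob(question, answer, followup)` is a hypothetical wrapper around a
# black-box API that returns P("Yes") for a yes/no follow-up about the model's
# own answer; correctness labels are assumed available for a training split.
from typing import Callable, Iterable, Tuple
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative follow-up questions; the actual elicitation set is a design choice.
FOLLOWUPS = [
    "Are you confident in your previous answer? Answer Yes or No.",
    "Could your previous answer contain a mistake? Answer Yes or No.",
    "Would most experts agree with your previous answer? Answer Yes or No.",
]

def followup_features(
    question: str,
    answer: str,
    get_yes_prob: Callable[[str, str, str], float],
) -> np.ndarray:
    """Stack P('Yes') for each follow-up question into a feature vector."""
    return np.array([get_yes_prob(question, answer, f) for f in FOLLOWUPS])

def train_correctness_probe(
    examples: Iterable[Tuple[str, str, bool]],
    get_yes_prob: Callable[[str, str, str], float],
) -> LogisticRegression:
    """examples: (question, model_answer, answer_was_correct) triples."""
    X = np.stack([followup_features(q, a, get_yes_prob) for q, a, _ in examples])
    y = np.array([int(correct) for _, _, correct in examples])
    # A plain linear (logistic) probe over the follow-up response probabilities.
    return LogisticRegression(max_iter=1000).fit(X, y)

# At deployment, probe.predict_proba(followup_features(q, a, get_yes_prob))[0, 1]
# estimates the probability that the black-box model's answer to q was correct.
```

The same feature construction could, under the paper's framing, be reused for the other monitoring tasks mentioned above (detecting adversarial system prompts or misrepresented models) by swapping the correctness labels for the relevant behavior labels.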
Submission Number: 78