Yes, no, maybe? Revisiting language models' response stability under paraphrasing for the assessment of political leaning
Research Area: Societal implications, Science of LMs, Human mind, brain, philosophy, laws and LMs
Keywords: interpretability, social sciences, bias, robustness
TL;DR: We assessed the stability of model responses to statements from the political compass test using large-scale paraphrasing.
Abstract: An increasing number of studies aim to uncover characteristics of language models (LMs), such as personality traits or political leanings, using questionnaires developed for human respondents. From this body of work, it is evident that models are highly sensitive to prompt design, including the phrasing of questions and statements as well as the format of the expected response (e.g., forced choice vs. open-ended). These sensitivities often lead to inconsistent responses. However, most studies assess response stability on a small scale with low statistical power, e.g., using fewer than ten paraphrases of the same question.
In this work, we investigate the stability of responses to binary forced-choice questions using a large number of paraphrases. Specifically, we probe both masked language models (MLMs) and left-to-right generative language models (GLMs) on the political compass test, assessing response validity (i.e., the proportion of valid responses to a prompt) and response stability (i.e., the variability under paraphrasing) across 500 paraphrases of each statement. This large-scale assessment allows us to approximate the underlying distribution of model responses more precisely, both in terms of a model's overall stability under paraphrasing and the stability of specific items (i.e., paraphrases sharing a question's intended meaning). In addition, to investigate whether structural biases drive model responses in a certain direction, we test the association between different word- and sentence-level features and the models' responses.
We find that while all MLMs exhibit a high degree of response validity, GLMs do not consistently produce valid responses when assessed via forced choice. In terms of response stability, we show that even models with high overall stability scores flip their responses given certain paraphrases. Crucially, even within a single model, response stability can vary considerably between items. We also find that models tend to agree more with statements that have high positive sentiment scores.
Based on our results, we argue that human-centered questionnaires might not be appropriate for probing LMs, as both response validity and stability differ considerably between items. Moreover, although stability metrics are useful descriptions of model properties, it should be emphasized that even for models exhibiting fairly high stability, specific paraphrases can lead to substantially different model responses.
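As a minimal sketch of the two metrics defined in the abstract, the following Python snippet computes response validity and a simple stability score from a set of model responses to paraphrases of one statement. The specific stability definition used here (fraction of valid responses matching the majority label) is an illustrative assumption, not necessarily the metric used in the paper, and the example responses are hypothetical.

```python
from collections import Counter

VALID_LABELS = {"agree", "disagree"}  # binary forced-choice options

def validity_and_stability(responses):
    """responses: list of raw model answers to paraphrases of one statement.

    Returns (validity, stability):
      validity  - proportion of responses that are valid forced-choice labels
      stability - fraction of valid responses matching the majority label
                  (an illustrative choice; other stability metrics are possible)
    """
    valid = [r for r in responses if r in VALID_LABELS]
    validity = len(valid) / len(responses) if responses else 0.0
    if not valid:
        return validity, 0.0
    majority_count = Counter(valid).most_common(1)[0][1]
    stability = majority_count / len(valid)
    return validity, stability

# Hypothetical example: 500 responses to paraphrases of one statement
example = ["agree"] * 430 + ["disagree"] * 50 + ["I cannot answer"] * 20
print(validity_and_stability(example))  # -> (0.96, ~0.896)
```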
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 880