['3c3', "< Abstract: Surveys have recently gained popularity as a tool to study large language models. By comparing survey responses of models to those of human reference populations, researchers aim to infer the demographics, political opinions, or values best represented by current language models. In this work, we critically examine this methodology on the basis of the well-established American Community Survey by the U.S. Census Bureau. Evaluating 43 different language models using de-facto standard prompting methodologies, we establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, towards survey responses labeled with the letter 'A'. Second, when adjusting for these systematic biases through randomized answer ordering, models across the board trend towards uniformly random survey responses, irrespective of model size or pre-training data. As a result, in contrast to conjectures from prior work, survey-derived alignment measures often permit a simple explanation: models consistently appear to better represent subgroups whose aggregate statistics are closest to uniform for any survey under consideration.", '---', "> Abstract: Despite their growing popularity, the reliability of using surveys to study large language models (LLMs) remains largely unexamined. By comparing survey responses of models to those of human reference populations, researchers aim to infer the demographics, political opinions, or values best represented by current language models. In this work, we critically examine this methodology on the basis of the well-established American Community Survey by the U.S. Census Bureau. Evaluating 43 different language models using de-facto standard prompting methodologies, we establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, towards survey responses labeled with the letter 'A'. 
Second, when adjusting for these systematic biases through randomized answer ordering, models across the board trend towards uniformly random survey responses, irrespective of model size or pre-training data. Our findings critically challenge existing interpretations of survey-derived alignment, revealing that perceived alignment often stems from a model's inherent tendency towards uniform responses rather than genuine demographic representation. This work necessitates a re-evaluation of current methodologies for assessing LLM biases and alignment.", '9c9', '< It is tempting to prompt LLMs with survey questions, due to their syntactic similarity to question answering tasks [Brown et al., 2020, Liang et al., 2022]. However, it is a priori unclear how to interpret their answers. Rather than knowledge testing, surveys seek to elicit aggregate statistics over individuals, providing an unbiased view on the properties of the population they are targeting. The quality of survey data hinges on the validity and robustness of the conclusions that can be drawn from it. Clearly, running a survey on LLMs is different from interrogating humans and thus it comes with distinct challenges. While much research has gone into carefully designing surveys to ensure faithful human responses, it is unclear whether prompting LLMs with the same surveys satisfies similar premises out-of-the-box. We devote this work to gain systematic insights into the survey responses of LLMs, what we can expect to learn from them, and to what extent they resemble those of human populations.', '---', '> It is tempting to prompt LLMs with survey questions, due to their syntactic similarity to question answering tasks [Brown et al., 2020, Liang et al., 2022]. However, it is a priori unclear how to interpret their answers. The application of human-centric survey methodologies to LLMs lacks systematic validation. 
Surveys, fundamentally designed to elicit aggregate statistics from individuals for an unbiased population view, present distinct challenges when applied to LLMs. While extensive research ensures faithful human responses, the direct transferability of these premises to LLM prompting remains an open question. This work systematically investigates the survey responses of LLMs to provide clarity on their interpretability, what can be learned from them, and their resemblance to human populations.', '12c12', '< The basis of our investigation is the American Community Survey1 (ACS), a demographic survey conducted by the U.S. Census Bureau at a national level, on a yearly basis. We curate a questionnaire composing of 25 multiple choice questions from the 2019 ACS. We prompt 43 language models of varying size with these questions, individually and in sequence, and we record their probability distribution over answers. Based on the collected data, we investigate the following two questions: What can we infer about LLMs, and the data they have been trained on, from their survey responses? Does the data generated by prompting models to answer the ACS questionnaire qualitatively resemble the census data collected by surveying the U.S. population? See Figure 1.', '---', '> The basis of our investigation is the American Community Survey (ACS), a demographic survey conducted by the U.S. Census Bureau at a national level, on a yearly basis. From the 2019 ACS, we curate a representative questionnaire comprising 25 multiple-choice questions covering a broad range of demographic and socio-economic topics. We prompt 43 language models of varying size with these questions, individually and in sequence, and record their probability distribution over answers. Based on the collected data, our work addresses two primary research objectives: (1) To what extent can we infer properties about LLMs, or their training data, from their survey responses? 
(2) Do the survey responses generated by prompting LLMs with the ACS questionnaire qualitatively resemble the census data collected from the U.S. population? See Figure 1.', '14,16c14,15', "< Comparing models' adjusted responses to those of the U.S. census population, we find that natural variations in entropy across questions are not reflected in the responses. Instead, on average across questions, models' responses are no closer to the census population, or the population of any state within the US, than to a fixed uniform baseline. This qualitative difference between model responses and human data puts into question the insights that can be gained from such comparisons. We find that even after instruction-tuning this trend persists, and model responses have consistently higher entropy than any human population we compare to, independent of the survey used. Only for models of size larger than 70 billion parameters we can recognize a trend that the divergence between model responses and the census data decreases after instruction-tuning.", "< With these insights in mind, we inspect conjectures from prior work related to survey derived alignment metrics, that is, that differences in similarity between models' and populations' responses might be attributable to certain demographics being better represented in the training data. Instead, our results suggest a much simpler explanation: the relative alignment of model responses with different demographic subgroups can be explained by the entropy of the subgroups' responses, irrespective of the data or training procedure employed to train the model. We demonstrate this beyond the ACS on other surveys considered by prior work. 
As such, our findings provide important context to prior studies that employ surveys to examine the biases of LLMs.", "< More broadly, our findings suggest caution when treating language models' survey responses as a faithful representation of any human population, at least a present time, as it could lead to potentially misguided conclusions about alignment.", '---', "> Comparing models' adjusted responses to those of the U.S. census population, we find that natural variations in entropy across questions are not reflected in the responses. Instead, on average across questions, models' responses are no closer to the census population, or the population of any state within the US, than to a fixed uniform baseline. This qualitative difference between model responses and human data fundamentally challenges the validity of insights derived from direct comparisons. We find that even after instruction-tuning, this trend persists, with model responses exhibiting consistently higher entropy than any human population we compare to, independent of the survey used. A notable exception is observed only for models exceeding 70 billion parameters, where a limited trend of decreasing divergence between model responses and census data emerges after instruction-tuning.", "> With these insights in mind, we inspect conjectures from prior work related to survey-derived alignment metrics, specifically that differences in similarity between models' and populations' responses might be attributable to certain demographics being better represented in the training data. Instead, our results unequivocally demonstrate a simpler explanation: the observed relative alignment of model responses with different demographic subgroups is primarily driven by the entropy of the subgroups' responses, rather than reflecting inherent demographic representation or specific training data biases. We demonstrate this beyond the ACS on other surveys considered by prior work. 
As such, our findings provide crucial context to prior studies that employ surveys to examine the biases of LLMs. More broadly, our findings underscore the critical need for caution when interpreting language models' survey responses as faithful representations of human populations, as current practices risk yielding fundamentally misguided conclusions about alignment and representativeness.", '20c19', '< When evaluating LLMs on the basis of survey questions, the focus is not on the model\'s most likely completion but rather on the probability distribution that the model assigns to various answer choices. For example, not whether the model is more likely to answer "Yes" than "No" to a given survey question, but the normalized probability assigned to each of the two answer choices. See Figure 1. More concretely, Santurkar et al. [2023] study LLMs\' answer distributions for multiple-choice opinion polling questions, measuring their similarity to those of various U.S. demographic groups. They extract models\' answer distributions from the next token probabilities corresponding to each answer choice. Subsequent works employ a similar methodology but instead consider transnational opinion surveys [Durmus et al., 2023, AlKhamissi et al., 2024] and moral beliefs surveys [Scherrer et al., 2024]. We adopt this popular methodology to systematically investigate the properties of models\' answer distributions on the basis of a well-established demographic survey.', '---', '> When evaluating LLMs on the basis of survey questions, the focus is not on the model\'s most likely completion but rather on the probability distribution that the model assigns to various answer choices. For example, not whether the model is more likely to answer "Yes" than "No" to a given survey question, but the normalized probability assigned to each of the two answer choices. See Figure 1. More concretely, Santurkar et al. 
[2023] study LLMs\' answer distributions for multiple-choice opinion polling questions, measuring their similarity to those of various U.S. demographic groups. They extract models\' answer distributions from the next token probabilities corresponding to each answer choice. Subsequent works employ a similar methodology but instead consider transnational opinion surveys [Durmus et al., 2023, AlKhamissi et al., 2024] and moral beliefs surveys [Scherrer et al., 2024]. While these works adopt this methodology for specific applications, we employ it to *systematically and critically investigate the fundamental properties* of models\' answer distributions using a well-established demographic survey, thereby scrutinizing the underlying assumptions of such evaluations.', '22c21', '< Lastly, there is an emerging body of research that integrates LLMs into computational social science [Ziems et al., 2024]. This includes tasks such as taxonomic labeling, where language models are employed for tasks such as opinion prediction [Kim andLee, 2023, Mellon et al., 2022], and free-form coding, where language models are used to generate explanations for social science constructs [Nelson et al., 2021]. Recent studies have also investigated the feasibility of using LLMs to simulate human participants in psychological, psycholinguistic, and social psychology experiments [Dillion et al., 2023, Aher et al., 2023], or as proxies for specific human populations in social science research [Argyle et al., 2023, Lee et al., 2023, Sanders et al., 2023] and economics [Brand et al., 2023, Horton, 2023]. Within this context, our work suggests caution in relying on the survey responses of LLMs to elicit synthetic responses that resemble those of human populations and highlights potential pitfalls.', '---', '> Lastly, there is an emerging body of research that integrates LLMs into computational social science [Ziems et al., 2024]. 
This includes taxonomic labeling, where language models are employed for tasks such as opinion prediction [Kim and Lee, 2023, Mellon et al., 2022], and free-form coding, where language models are used to generate explanations for social science constructs [Nelson et al., 2021]. Recent studies have also investigated the feasibility of using LLMs to simulate human participants in psychological, psycholinguistic, and social psychology experiments [Dillion et al., 2023, Aher et al., 2023], or as proxies for specific human populations in social science research [Argyle et al., 2023, Lee et al., 2023, Sanders et al., 2023] and economics [Brand et al., 2023, Horton, 2023]. Within this context, our work critically intervenes by demonstrating the substantial limitations and potential pitfalls of relying on LLM survey responses to elicit synthetic data that faithfully resembles human populations.', '28c27', '< 2. We query language models with the input prompt and obtain their output distribution over next-token probabilities. We select the k q output probabilities corresponding to each answer choice (e.g., the tokens "A", "B", etc.), and we renormalize to obtain the probability distribution over survey answers.2 .', '---', '> 2. We query language models with the input prompt and obtain their output distribution over next-token probabilities. We select the k_q output probabilities corresponding to each answer choice (e.g., the tokens "A", "B", etc.), and we renormalize to obtain the probability distribution over survey answers.', '32c31', "< Reference data & evaluation. We use the responses collected by the U.S. Census Bureau when surveying the U.S. population as our reference data. In particular, we use the 2019 ACS public use microdata sample 3 (henceforth census data). The data contains the anonymized responses of around 3.2 million individuals in the United States. 
For each survey question q ∈ Q, we denote the census' population-level response as a categorical random variable C q whose event probabilities are the relative frequency of each answer choice among survey respondents. We use U q to denote the uniform distribution over answers. Given these two reference points, we evaluate language models' responses R m q along two dimensions:", '---', "> Reference data & evaluation. We use the responses collected by the U.S. Census Bureau when surveying the U.S. population as our reference data. In particular, we use the 2019 ACS public use microdata sample (henceforth census data). The data contains the anonymized responses of around 3.2 million individuals in the United States. For each survey question q ∈ Q, we denote the census' population-level response as a categorical random variable C q whose event probabilities are the relative frequency of each answer choice among survey respondents. We use U q to denote the uniform distribution over answers. Given these two reference points, we evaluate language models' responses R m q along two dimensions:", '35c34', '< Note that the KL divergence between any distribution and the uniform distribution corresponds to the entropy difference. For normalized entropy this yields KL(C q ∥ U q ) = k q (1 -H(C q )).  Randomized choice ordering. For several investigations we survey models under randomized choice ordering. This means, for a given question q, we prompt models with different permutations of the answer choice ordering, i.e., the assignment of answers (e.g., "male", "female") to choice labels ("A", "B", etc), while the choice labels are kept in alphabetic order. We evaluate models\' survey responses under all possible choice orderings and we use Rm q to denote the expected distribution over answers and Ōm q to denote the expected distribution over selected choice labels. For questions with more than 6 answers we evaluate a maximum of 5000 permutations. 
For OpenAI\'s models we evaluate up to 50 permutations due to the costs of querying the OpenAI API. This distinction serves to decouple a model\'s tendency towards picking a particular answer from its tendency towards picking a particular choice label. In the following we refer to the expected survey response Rm q under uniformly distributed choice ordering as the adjusted survey response. We will come back to this in Section 4.', '---', '> Note that the KL divergence between any distribution and the uniform distribution corresponds to the entropy difference. For normalized entropy this yields KL(C q ∥ U q ) = k q (1 -H(C q )). Randomized choice ordering. For several investigations we survey models under randomized choice ordering. This means, for a given question q, we prompt models with different permutations of the answer choice ordering, i.e., the assignment of answers (e.g., "male", "female") to choice labels ("A", "B", etc), while the choice labels are kept in alphabetic order. We evaluate models\' survey responses under all possible choice orderings and we use Rm q to denote the expected distribution over answers and Ōm q to denote the expected distribution over selected choice labels. For questions with more than 6 answers we evaluate a maximum of 5000 permutations. For OpenAI\'s models we evaluate up to 50 permutations due to the costs of querying the OpenAI API. This distinction serves to decouple a model\'s tendency towards picking a particular answer from its tendency towards picking a particular choice label. In the following we refer to the expected survey response Rm q under uniformly distributed choice ordering as the adjusted survey response. We will come back to this in Section 4.', '39c38', '< For a first investigation, we consider the normalized entropy of models\' responses to the "SEX", "HICOV", and "FER" questions. 
The SEX question inquiries about the person\'s sex, encoded as male female, the HICOV question inquiries whether the person is currently covered by any health insurance plan, and the FER question inquires whether the person has given birth in the past 12 months. When surveying the U.S. population, these three questions elicit responses with very different entropy; responses to the SEX question are almost uniformly distributed, whereas most people answer "No" to the FER question. In contrast, as shown in Figure 2(a), the entropy of models\' responses to these three questions are surprisingly similar. In particular, we find that the entropy of models\' responses tends to increase log-linearly with model size, independent of the question asked. This trend is consistent across all ACS survey questions, see Figure 8 in Appendix B.1.', '---', '> For a first investigation, we consider the normalized entropy of models\' responses to the "SEX", "HICOV", and "FER" questions. The SEX question inquires about the person\'s sex, with binary choices \'male\' or \'female\'; the HICOV question inquires whether the person is currently covered by any health insurance plan; and the FER question inquires whether the person has given birth in the past 12 months. When surveying the U.S. population, these three questions elicit responses with very different entropy; responses to the SEX question are almost uniformly distributed, whereas most people answer "No" to the FER question. In contrast, as shown in Figure 2(a), the entropy of models\' responses to these three questions is surprisingly similar. In particular, we find that the entropy of models\' responses tends to increase log-linearly with model size, independent of the question asked. 
This trend is consistent across all ACS survey questions, see Figure 8 in Appendix B.1.', '44c43', "< Overall, we find that models' response distributions seem to be widely independent of the survey question asked, and variations across models are much larger than variations across questions. This lead us to suspect that observed differences across models might arise mostly due to systematic biases.", '---', "> Overall, we find that models' response distributions seem to be widely independent of the survey question asked, and variations across models are much larger than variations across questions. This led us to strongly suspect that observed differences across models primarily arise due to systematic biases.", '51c50', '< We measure A-bias for each question q and model m. Results are illustrated in Figure 3. We again sort models by their size. We observe all models exhibit substantial A-bias. However, models in the order of a few billion parameters or fewer consistently exhibit particularly strong A-bias, and tend towards mono answers. We additionally observe that the strength of A-bias in instruction or RLHF tuned models is similar to that of base models, see Appendix B.2. A plausible explanation for small models exhibiting strong A-bias is that the ability to answer MMLU-style multiple-choice questions only emerges for models of sufficient scale [Dominguez-Olmedo et al., 2024].', '---', '> We measure A-bias for each question q and model m. Results are illustrated in Figure 3. We again sort models by their size. We observe all models exhibit substantial A-bias. However, models in the order of a few billion parameters or fewer consistently exhibit particularly strong A-bias, and tend towards unimodal answers. We additionally observe that the strength of A-bias in instruction or RLHF tuned models is similar to that of base models, see Appendix B.2. 
A plausible explanation for small models exhibiting strong A-bias is that the ability to answer MMLU-style multiple-choice questions only emerges for models of sufficient scale [Dominguez-Olmedo et al., 2024].', '70c69', '< Inspired by the alignment measures proposed by Santurkar et al. [2023] and Durmus et al. [2023], we investigate the similarity of model responses to the census data by evaluating the average divergence across questions between model responses and the census statistics. 4 As we focus on categorical questions, we evaluate average KL divergence between each language model m and each reference population Ref, as follows:', '---', '> Inspired by the alignment measures proposed by Santurkar et al. [2023] and Durmus et al. [2023], we investigate the similarity of model responses to the census data by evaluating the average divergence across questions between model responses and the census statistics. As we focus on categorical questions, we evaluate average KL divergence between each language model m and each reference population Ref, as follows:', '73c72', "< Looking at Figure 5 we find no consistent trend that instruction-tuning would move responses closer to the census, despite the increased deviation from uniform and the larger variations in entropy (recall Figure 4). Only for larger models the divergence seems to clearly decrease with instruction-tuning. However, all models' responses still remain significantly closer to the uniform baseline than to the U.S. census. For instance, for the GPT-4 model whose answers exhibit the highest similarity to the human reference populations, only 6 out of 25 questions (24%) are closer to the U.S. census than to the uniform baseline. 
Given these results, drawing conclusions about the relative alignment of models with subgroups is prone to resulting in brittle conclusions.", '---', "> Looking at Figure 5 we find no consistent trend that instruction-tuning would move responses closer to the census, despite the increased deviation from uniform and the larger variations in entropy (recall Figure 4). Only for larger models the divergence seems to clearly decrease with instruction-tuning. However, all models' responses still remain significantly closer to the uniform baseline than to the U.S. census. For instance, for the GPT-4 model whose answers exhibit the highest similarity to the human reference populations, only 6 out of 25 questions (24%) are closer to the U.S. census than to the uniform baseline. Given these pervasive results, drawing robust conclusions about the relative alignment of models with specific subgroups proves highly challenging and prone to significant misinterpretation.", '76c75', '< Our findings add important context to previous works studying the alignment of language models with different human subpopulations. In particular, we highlighted the tendency of models towards balanced answers. Due to varying entropy in the responses of subgroups this leads to a strong correlation between model alignment and the reference population\'s entropy. The linear trend in Figure 6 visualizes this. For any given model, it consistently appears to be more "aligned" with the subpopulations exhibiting high entropy in their answers. Interestingly, we find that this trend also holds pre-adjustment, suggesting that the transformation of the response through randomized choice ordering is orthogonal to differentiating aspects of any specific population. In contrast, when comparing different models in Figure 6, we can see how adjustment has a large influence on their relative order. 
Differences across models that we see under naive prompting disappear after adjustment, which means that they should largely be attributed to systematic biases, rather than inherent properties of the model.  Taken together our findings imply that the survey-derived alignment measure is more informative of differences in the reference populations rather than the language models is aims to evaluate. Model particularities, such as the pre-training data used, instruction tuning or the use of reinforcement learning with human feedback, seem to have little impact on which population is best represented.', '---', '> Our findings add important context to previous works studying the alignment of language models with different human subpopulations. In particular, we highlighted the tendency of models towards balanced answers. Due to varying entropy in the responses of subgroups this leads to a strong correlation between model alignment and the reference population\'s entropy. The linear trend in Figure 6 visualizes this. For any given model, it consistently appears to be more "aligned" with the subpopulations exhibiting high entropy in their answers. Interestingly, we find that this trend also holds pre-adjustment, suggesting that the transformation of the response through randomized choice ordering is orthogonal to differentiating aspects of any specific population. In contrast, when comparing different models in Figure 6, we can see how adjustment has a large influence on their relative order. Differences across models that we see under naive prompting disappear after adjustment, which means that they should largely be attributed to systematic biases, rather than inherent properties of the model. Taken together, our findings imply that the survey-derived alignment measure primarily reflects differences in the reference populations rather than revealing inherent properties of the language models it aims to evaluate. 
Model particularities, such as the pre-training data used, instruction tuning or the use of reinforcement learning with human feedback, seem to have little impact on which population is best represented.', '79c78', '< To inspect whether this trend changes with the content of the questions asked, we reproduce our experiments with additional surveys. We use the American Trends Panel (ATP) opinion surveys considered by Santurkar et al. [2023], and the Pew Research\'s Global Attitudes Surveys (GAS) and World Values Surveys (WVS) considered by Durmus et al. [2023]. These surveys encompass around 1500 questions and 60 U.S. demographic subgroups, and around 2300 questions and 60 national populations, respectively. We adopt the alignment metrics considered by the aforementioned works. We find that our insights gained from the ACS also hold for the ATP and GAS/WVS surveys. In particular, we similarly find a linear trend between the alignment metrics and subgroups\' entropy of responses, in particular after adjustment, see Figure 7. Note here that alignment and divergence are negatively correlated by definition. Interestingly, this observation explains some of the findings in prior works. For example, Santurkar et al. [2023] find that "all the base models share striking similarities-e.g., being most aligned with lower income, moderate, and Protestant or Roman Catholic groups" and "our analysis [.  Figure 7: Alignment beyond ACS for selected models. We adopt the measures of Santurkar et al. [2023] and Durmus et al. [2023] on ATP and GAS/VVS opinion surveys. Again, the alignment between models and a given subpopulation correlates with the entropy of the subpopulations\' responses. groups]". Our results suggest that this could be an artifact of systematic biases. For the ATP surveys, we observe three outliers for which its alignment before adjustment is not correlated with the entropy of subgroup\'s responses: Llama 2 70B Chat and the two Llama 3 Instruct models. 
These are the models with largest pre-training compute considered. However, after adjustment, the alignment trends of Llama 2 70B Chat and the Llama 3 Instruct models are remarkably similar to that of their corresponding base models and all other LLMs.', '---', '> To inspect whether this trend changes with the content of the questions asked, we reproduce our experiments with additional surveys. We use the American Trends Panel (ATP) opinion surveys considered by Santurkar et al. [2023], and Pew Research\'s Global Attitudes Surveys (GAS) and World Values Surveys (WVS) considered by Durmus et al. [2023]. These surveys encompass around 1500 questions and 60 U.S. demographic subgroups, and around 2300 questions and 60 national populations, respectively. We adopt the alignment metrics considered by the aforementioned works. We find that our insights gained from the ACS also hold for the ATP and GAS/WVS surveys. In particular, we similarly find a linear trend between the alignment metrics and the entropy of subgroups\' responses, especially after adjustment; see Figure 7. Note here that alignment and divergence are negatively correlated by definition. Interestingly, this observation explains some of the findings in prior works. For example, Santurkar et al. [2023] find that "all the base models share striking similarities-e.g., being most aligned with lower income, moderate, and Protestant or Roman Catholic groups." Our results suggest that this could be an artifact of systematic biases, since the alignment between models and a given subpopulation consistently correlates with the entropy of the subpopulations\' responses (see Figure 7). For the ATP surveys, we observe three outliers whose alignment before adjustment is not correlated with the entropy of subgroups\' responses: Llama 2 70B Chat and the two Llama 3 Instruct models. These are the models with the largest pre-training compute considered. 
However, after adjustment, the alignment trends of Llama 2 70B Chat and the Llama 3 Instruct models are remarkably similar to those of their corresponding base models and all other LLMs.', '82c81', "< We used a popular methodology to elicit LLMs' answer distributions to survey questions and closely examined the responses on the basis of the prime US demographic survey. We found that model responses are dominated by systematic ordering biases and do not exhibit the natural variations in entropy found in the human reference data collected by the US census. Even after adjusting for ordering biases, LLMs' responses still do not resemble those of human populations. Instead, they exhibit consistently high entropy, independent of the question asked. This holds true irrespective of model size or fine-tuning with human preferences.", '---', "> This work rigorously examined a popular methodology to elicit LLMs' answer distributions to survey questions, focusing on the premier US demographic survey. We found that model responses are dominated by systematic ordering biases and do not exhibit the natural variations in entropy found in the human reference data collected by the US census. Even after adjusting for ordering biases, LLMs' responses still do not resemble those of human populations. Instead, they exhibit consistently high entropy, independent of the question asked. This holds true irrespective of model size or fine-tuning with human preferences.", '84c83', '< We want to reiterate that our focus lies on questioning a popular methodology of eliciting survey responses from large language models using multiple choice prompting. At the example of this methodology our results highlight an important pitfall and suggest caution to expect robust insights when comparing such responses against those of human populations. 
The robustness and quality of an established survey does not seamlessly translate from the results obtained by surveying human populations to the logits output by LLMs. More research is urgently needed to design methodologies for getting insights into the inherent biases of LLMs and the population they might represent. Here public surveys and their accompanying data offer exciting potential and the could play an important role as a benchmarking tool for systematic evaluations of LLMs, see [Cruz et al., 2024] ', '---', '> We want to reiterate that our focus lies on questioning a popular methodology of eliciting survey responses from large language models using multiple-choice prompting. Using this methodology as an example, our results highlight an important pitfall and suggest caution when expecting robust insights from comparisons of LLM responses against those of human populations. The robustness and quality of an established survey does not seamlessly translate from the results obtained by surveying human populations to the logits output by LLMs. More research is urgently needed to design methodologies for gaining insights into the inherent biases of LLMs and the populations they might represent. Public surveys and their accompanying data offer exciting potential, and they could play an important role as a benchmarking tool for systematic evaluations of LLMs, see [Cruz et al., 2024].', '87c86', '< We use the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files made available by the U.S. Census Bureau. 5 The data itself is governed by the terms of use provided by the Census Bureau. 6 We download the data directly from the U.S. Census using the Folktables Python package [Ding et al., 2021]. We download the files corresponding to the year 2019.', '---', '> We use the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files made available by the U.S. Census Bureau. 
The data itself is governed by the terms of use provided by the Census Bureau. We download the data directly from the U.S. Census using the Folktables Python package [Ding et al., 2021]. We download the files corresponding to the year 2019.', '89c88', '< We open source the code to replicate all experiments.7 ', '---', '> We open source the code to replicate all experiments.', '109c108', "< results presented here complement those of Section 5. We plot the average KL divergence between each language model and each demopgrahic subpopulation (U.S. state) against the average entropy of the subgroup's responses. For readability, we split models into GPT-2 and GPT-Neo (Figure 6 ", '---', "> The results presented here complement those of Section 5. We plot the average KL divergence between each language model and each demographic subpopulation (U.S. state) against the average entropy of the subgroup's responses. For readability, we split models into groups, as illustrated in Figure 6 and Appendix Figure 11.", '113,114c112', '< 1. We randomize both the order in which choices are presented and the label (i.e., letter) assigned to each answer choice. For example, for the "sex" question, the possible combinations are "A. Male B. Female", "A. Female B. Male", "B. Male A. Female", and "B. Female A.', '< Male". Note that in the experiments presented in Section 3.1 we only randomized over the order in which choices are presented (i.e., the "A" choice was always presented first). 2. We compute the output distribution over responses for choice position (the probability assigned to the first, second, etc., answer choice presented) and letter assignment (the probability assigned to the answer choice assigned "A", "B", etc.).', '---', '> 1. We randomize both the order in which choices are presented and the label (i.e., letter) assigned to each answer choice. For example, for the "sex" question, the possible combinations are "A. Male B. Female", "A. Female B. Male", "B. Male A. 
Female", and "B. Female A. Male". Note that in the experiments presented in Section 3.1 we only randomized over the order in which choices are presented (i.e., the "A" choice was always presented first). 2. We compute the output distribution over responses for choice position (the probability assigned to the first, second, etc., answer choice presented) and letter assignment (the probability assigned to the answer choice assigned "A", "B", etc.).', '119c117', '< We find that models exhibit significant positioning and labelling for most survey questions, see Figure 10. We observe that labelling is more prevalent that positioning bias. While both tend to decrease with model size, order bias decreases more significantly with model size, whereas labeling bias tends to be very prevalent across all model sizes. In Figure 11 we plot both the strength of A-bias and first-choice bias across survey questions. The strength of A-bias tends to be greater than that of first-choice bias, particularly for the smaller models.  ', '---', '> We find that models exhibit significant positioning and labeling bias for most survey questions, see Figure 10. We observe that labeling bias is more prevalent than positioning bias. While both tend to decrease with model size, positioning bias decreases more markedly, whereas labeling bias remains prevalent across all model sizes. In Figure 11 we plot both the strength of A-bias and first-choice bias across survey questions. The strength of A-bias tends to be greater than that of first-choice bias, particularly for the smaller models.  ', '125c123', '< Motivated by the I-bias experiment, we now examine whether labeling bias can be mitigated by using letters that have similar frequency in written English. Therefore, instead of assigning to choices the  Figure 13: "R", "S", "N", etc. randomization experiment. All models, irrespective of size, exhibit statistically significant letter and positioning bias for most survey questions. 
labels "A", "B", etc. we assign the following labels: "R", "S", "N", "L", "O", "T", "M", "P", "W", "U", "Y", "V". We find that, compared to the "A", "B", etc. randomization experiment, the percentage of questions for which models exhibit significant labeling bias somewhat decreases (Figure 13). However, models tend to exhibit substantially more position bias. This indicates that, in the absence of a label that provides a strong signal (e.g., "A" or "I"), models tend to exhibit significantly higher choice-ordering bias, irrespective of model size.', '---', '> Motivated by the I-bias experiment, we now examine whether labeling bias can be mitigated by using letters that have similar frequency in written English. Therefore, instead of assigning to choices the labels "A", "B", etc., we assign the following labels: "R", "S", "N", "L", "O", "T", "M", "P", "W", "U", "Y", "V" in a randomization experiment, as shown in Figure 13. In this setup, all models, irrespective of size, continue to exhibit statistically significant letter and positioning bias for most survey questions. We find that, compared to the "A", "B", etc. randomization experiment, the percentage of questions for which models exhibit significant labeling bias somewhat decreases. However, models tend to exhibit substantially more position bias. This indicates that, in the absence of a label that provides a strong signal (e.g., "A" or "I"), models tend to exhibit significantly higher choice-ordering bias, irrespective of model size.', '128c126', "< We reproduce our experiments using different prompts to query the model. Due to the cost of querying OpenAI's models, we only perform these ablations for models with publicly available weights. The notebooks with all figures can be retrieved from our Github repository.8 ", '---', "> We reproduce our experiments using different prompts to query the model. 
Due to the cost of querying OpenAI's models, we only perform these ablations for models with publicly available weights. The notebooks with all figures can be retrieved from our GitHub repository.", '130,131c128,129', '< D.1 System rompt used for GPT-3.5 and GPT-4', '< When querying GPT-3.5, GPT-4, and GPT-4 Turbo, we use the system prompt Please respond with a single letter., as otherwise for most questions none of the top-5 logits correspond to answer choice labels (e.g., "A", "B"). Note that this problematic arises due to the fact that the OpenAI API only allows access to the top 5 logits. We adapt the system prompt used by Dorner et al. [2023] in the context of surveying GPT-4 with standarized personality tests.', '---', '> D.1 System prompt used for GPT-3.5 and GPT-4', '> When querying GPT-3.5, GPT-4, and GPT-4 Turbo, we use the system prompt "Please respond with a single letter.", as otherwise for most questions none of the top-5 logits correspond to answer choice labels (e.g., "A", "B"). Note that this problem arises because the OpenAI API only allows access to the top-5 logits. We adapt the system prompt used by Dorner et al. [2023] in the context of surveying GPT-4 with standardized personality tests.', '145c143', "< We reproduce the experiments of Sections 3 and 4 using the ATP, and GAS/WVS used by Santurkar et al. [2023] and Durmus et al. [2023], where questions are presented individually of one another. We additionally reproduce the experiments of Section 5 using the 2016 ANES questionnaire considered by Argyle et al. [2023], where questions are presented in sequence. We do not consider OpenAI's models as the cost to reproduce the experiments via the OpenAI API exceeds our budget. We obtain very similar results to those of the ACS presented in the main text of the paper. 
The notebooks with all figures can be retrieved from our Github repository.9 ", '---', "> We reproduce the experiments of Sections 3 and 4 using the ATP and GAS/WVS surveys used by Santurkar et al. [2023] and Durmus et al. [2023], where questions are presented independently of one another. We additionally reproduce the experiments of Section 5 using the 2016 ANES questionnaire considered by Argyle et al. [2023], where questions are presented in sequence. We do not consider OpenAI's models as the cost to reproduce the experiments via the OpenAI API exceeds our budget. We obtain very similar results to those of the ACS presented in the main text of the paper. The notebooks with all figures can be retrieved from our GitHub repository.", '148c146', '< We obtain the ATP survey questions and their corresponding human responses from the OpinionsQA repository. 10 We present all answer choices when querying the models, but exclude the answer choices corresponding to refusals from our analysis similarly to Santurkar et al. [2023]. When comparing the similarity of models\' responses to different demographic subgroups, we use the demographic subgroups and the alignment metric considered by Santurkar et al. [2023]. For such metric, higher values of alignment indicate that models\' responses are more similar to the reference demographic group. We find that all models are more "aligned" with the uniformly random baseline than with any of the demographic subgroups, see Figure 14.', '---', '> We obtain the ATP survey questions and their corresponding human responses from the OpinionsQA repository. We present all answer choices when querying the models, but exclude the answer choices corresponding to refusals from our analysis, similarly to Santurkar et al. [2023]. When comparing the similarity of models\' responses to different demographic subgroups, we use the demographic subgroups and the alignment metric considered by Santurkar et al. [2023]. 
For this metric, higher values of alignment indicate that models\' responses are more similar to the reference demographic group. We find that all models are more "aligned" with the uniformly random baseline than with any of the demographic subgroups, see Figure 14.', '151c149', "< We obtain the ATP survey questions and their corresponding human responses from the GlobalOpin-ionsQA repository.11 When comparing the similarity of models' responses to the population-level survey responses of different countries, we use the countries and the similarity metric considered by Durmus et al. [2023]. We find that all models produce survey responses that are more similar to those of the uniformly random baseline than to those of any of the demographic subgroups, see Figure 15. Figure 16: The discriminator test performed on datasets generated using the 2016 ANES survey questionnaire (with choice randomization).", '---', "> We obtain the ATP survey questions and their corresponding human responses from the GlobalOpinionsQA repository. When comparing the similarity of models' responses to the population-level survey responses of different countries, we use the countries and the similarity metric considered by Durmus et al. [2023]. We find that all models produce survey responses that are more similar to those of the uniformly random baseline than to those of any of the demographic subgroups, see Figure 15.", '157c155', '< We present questions in the multiple-choice format described in Section 2, using the Interviewer:, Me: prompt style described by Argyle et al. [2023]. We retrieve the 2016 ANES data from the official website12 , and process it such that it matches in form the questionnaire designed by Argyle et al. [2023]. 
We find that the trained classifiers can discriminate between the model-generated data and the ANES data with very high accuracy (≥99%), see Figure 16.', '---', '> We present questions in the multiple-choice format described in Section 2, using the Interviewer:, Me: prompt style described by Argyle et al. [2023]. We retrieve the 2016 ANES data from the official website, and process it such that it matches in form the questionnaire designed by Argyle et al. [2023]. We find that the trained classifiers can discriminate between the model-generated data and the ANES data with very high accuracy (≥99%), see Figure 16.', '160,162c158', '< Motivated by recent findings of Argyle et al. [2023] we conducted an additional investigation where we seek to fill entire ACS questionnaires in a sequential manner, in order to generate for each language model a synthetic dataset of responses. This data emulates in form the ACS dataset collected by the For all language models, it is possible to discriminate with very high accuracy between the ACS census data and model-generated data, ( ) before adjustment and ( ) after adjustment. We contrast this against the accuracy value of discriminating between the ACS data of any given U.S. state and the rest of the ACS census data (-).', '< 12 questions from the 2016 American National Election Studies (ANES) survey. For every human respondent, they construct a corresponding "silicon individual" by querying GPT-3 to predict the ANES respondent\'s answer to each survey question given the respondent\'s answers to all other questions. Their results indicate that, for the 2016 ANES survey, GPT-3 can be a fairly calibrated predictor of an individual\'s answer to some survey question conditioned on the respondent\'s answers to all other survey questions. 14However, Argyle et al. [2023] emphasize that important insights can be gained by emulating the survey responses of human populations "prior to or in the absence of human data". 
In this work we have considered precisely the setting where models\' responses are obtained in the absence of human data. 15 To investigate how our findings transfer to the ANES, we reproduce the experiments of Section F using the 2016 ANES survey questionnaire considered by Argyle et al. [2023] and their "interview-style" prompt. We apply the discriminator test, and find that the trained classifiers can discriminate between the model-generated data and the ANES data with accuracy > 99 % (see Appendix E), indicating that models\' responses are markedly different to those in the ANES data.', "< Thus, the fact that models may perform reasonably well at feature imputation tasks (e.g., predicting an individual's answer to some question given their answers to all other questions) does not imply that models can generate synthetic respondents that resemble the responses obtained by surveying human populations. This suggests caution when using LLMs to emulate human populations at present time, in particular in the absence of human data.", '---', '> Motivated by recent findings of Argyle et al. [2023], we conducted an additional investigation where we seek to fill entire ACS questionnaires in a sequential manner, in order to generate for each language model a synthetic dataset of responses. This data emulates in form the ACS dataset collected by the U.S. Census Bureau, and we then study the extent to which such synthetic datasets resemble the ACS dataset.', '164,209d159', '< Section: Answer: [Yes]', '< Justification:', '< Guidelines:', '< • The answer NA means that paper does not include experiments requiring code. • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.', '< ', '< Section: Experimental Setting/Details', '< Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) 
necessary to understand the results?', '< Answer: [Yes] Justification: See Appendix A and the code release.', '< Guidelines:', '< • The answer NA means that the paper does not include experiments.', '< • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. • The full details can be provided either with the code, in appendix, or as supplemental material.', '< ', '< Section: Experiment Statistical Significance', '< Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?', '< Answer: [Yes] Justification: The main figures contain exact measures. We conduct significance tests on Section B.', '< Guidelines:', '< • The answer NA means that the paper does not include experiments.', '< • The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) • The assumptions made should be given (e.g., Normally distributed errors).', '< • It should be clear whether the error bar is the standard deviation or the standard error of the mean. Guidelines:', '< • The answer NA means that there is no societal impact of the work performed.', '< • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. 
• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.', '< • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).', '< ', '< Section: Safeguards', '< Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?', "< Answer: [NA] Justification: We do not release any models. 
We release models' survey responses, which pose no risk of misuse.", '< Guidelines:', '< • The answer NA means that the paper poses no such risks.', '< • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.', '< 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?', '< Answer: [Yes] Justification: We properly credit the authors of the models considered, as well as the sources of the surveys considered.', '< Guidelines:', '< • The answer NA means that the paper does not use existing assets.', '< • The authors should cite the original paper that produced the code package or dataset.', '< • The authors should state which version of the asset is used and, if possible, include a URL. • The name of the license (e.g., CC-BY 4.0) should be included for each asset.', '< • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. 
Their licensing guide can help determine the license of a dataset.', "< • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. • If this information is not available online, the authors are encouraged to reach out to the asset's creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: Guidelines:", '< • The answer NA means that the paper does not release new assets.', '< • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. • The paper should discuss whether and how consent was obtained from people whose asset is used. • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.', '< ', '< Section: Acknowledgements', '< The authors would like to thank Frauke Kreuter and the Social Data Science and AI Lab at Ludwig-Maximilians-Universität Munich for inspiring discussions on an earlier version of this manuscript. Celestine Mendler-Dünner acknowledges financial support from the Hector Foundation.', '< ', '< Section: ', '< U.S. Census Bureau. We then study the extent to which such synthetic datasets resemble the ACS dataset.', '< ', '211c161', "< We present survey questions in the same order as in the ACS questionnaire. When querying a model to answer survey question q, we include a summary of the q -1 previously sampled answers in context. 
13 We then sample from the model's output probability distribution over answers, and continue to the next question. We illustrate this sequential process in Figure 17. We refer to Appendix D.3 for results collected with different variations of how a model's previous answers are integrated into the prompt. We find our results to be robust to these prompt variations.", '---', "> We present survey questions in the same order as in the ACS questionnaire. When querying a model to answer survey question q, we include a summary of the q -1 previously sampled answers in context. We then sample from the model's output probability distribution over answers, and continue to the next question. We illustrate this sequential process in Figure 17. We refer to Appendix D.3 for results collected with different variations of how a model's previous answers are integrated into the prompt. We find our results to be robust to these prompt variations.", '220c170,172', '< Argyle et al. [2023] propose "silicon sampling", a methodology to produce synthetic survey respondents using LLMs by conditioning on actual survey respondents. They focus on a subset of Guidelines:', '---', '> Argyle et al. [2023] propose "silicon sampling", a methodology to produce synthetic survey respondents using LLMs by conditioning on actual survey respondents. They focus on a subset of 12 questions from the 2016 American National Election Studies (ANES) survey. For every human respondent, they construct a corresponding "silicon individual" by querying GPT-3 to predict the ANES respondent\'s answer to each survey question given the respondent\'s answers to all other survey questions. However, Argyle et al. [2023] emphasize that important insights can be gained by emulating the survey responses of human populations "prior to or in the absence of human data". In this work we have considered precisely the setting where models\' responses are obtained in the absence of human data. 
To investigate how our findings transfer to the ANES, we reproduce the experiments of Section F using the 2016 ANES survey questionnaire considered by Argyle et al. [2023] and their "interview-style" prompt. We apply the discriminator test, and find that the trained classifiers can discriminate between the model-generated data and the ANES data with accuracy > 99 % (see Appendix E), indicating that models\' responses are markedly different to those in the ANES data.', "> Thus, the fact that models may perform reasonably well at feature imputation tasks (e.g., predicting an individual's answer to some question given their answers to all other questions) does not imply that models can generate synthetic respondents that resemble the responses obtained by surveying human populations. This suggests caution when using LLMs to emulate human populations at present time, in particular in the absence of human data.", '> Guidelines:', '399d350', '< ']
