Sense and Sensitivity: “Reasoning” Models are More Robust, but can Diverge from Human Consensus in a Legal Interpretation Task
Keywords: Semantic interpretation, legal language, LLMs, human subjects research
TL;DR: Humans are less prompt-sensitive than LLMs. Enabling “reasoning” decreases LLM prompt sensitivity, but even “reasoning” LLMs do not consistently match human judgments.
Abstract: Can LLMs make metalinguistic judgments? While LLM embeddings are often regarded as high-quality semantic representations, it is not clear that prompting an LLM is a useful way to obtain metalinguistic insights (e.g., whether a DIY gun kit is a “firearm”). Although some prior work has suggested that LLM prompting can simulate surveys with human participants, computational studies in the domain of legal interpretation have found that LLMs are unreliable for metalinguistic judgments due to prompt sensitivity. However, these studies did not directly compare humans and LLMs on identical tasks, nor did they test so-called “reasoning” models. The current study addresses these gaps by directly comparing the robustness of human and LLM judgments (with and without $\texttt{reasoning}$) in an English-language legal interpretation task. Our results show that LLMs were more sensitive to irrelevant prompt features than human participants were. Enabling $\texttt{reasoning}$ improved the stability of LLM responses. However, even $\texttt{reasoning}$ model outputs had only moderate correlations with human judgments, and all models sometimes output interpretations that no humans reached in response to the same prompt. We conclude that while $\texttt{reasoning}$ decreases prompt sensitivity, LLMs are still poor proxies for human metalinguistic judgments.
Scope Confirmation: To the best of my judgment, this submission falls within the scope of CoNLL.
Primary Area Selection: Computational Psycholinguistics, Cognition and Linguistics
Secondary Area Selection: Theoretical Analysis and Interpretation of ML Models for NLP
Use Of Generative Artificial Intelligence Tools: Yes, for writing code
Data Collection From Human Subjects: Yes, with details included in the main paper or in an appendix on (1) how the data was obtained (2) how participants were recruited and paid (3) how consent was obtained (4) whether an IRB protocol was approved for this study. Note that providing this information is obligatory.
Submission Type: Archival: I certify that the submission has not been previously published, nor is the material in it under review by another journal or conference. Further, no material in it will be submitted for review at another conference or journal while under review by CoNLL 2026.
Submission Number: 174