Can Large Language Models Find Connections between Social Beliefs?

ACL ARR 2025 July Submission 1119 Authors

29 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Understanding how people’s beliefs about different issues shift in tandem with one another is essential for modeling collective reasoning and social dynamics. However, this problem remains underexplored, largely due to the absence of standardized benchmarks and evaluation protocols. In this work, we introduce \textsc{BeliefBench}, a new benchmark for evaluating whether large language models (LLMs) can detect when shifts in beliefs about one real-world event are accompanied by corresponding shifts in beliefs about another. The benchmark is constructed from Polymarket, a prediction-market platform whose daily updated event probabilities reflect crowd beliefs over time. We formulate a classification task in which event pairs are labeled using a combination of time-series co-movement, semantic similarity, and other metadata, with label quality validated by human annotators. Our evaluation reveals two key findings: (1) LLMs consistently outperform heuristic and neural baselines at identifying meaningful belief correlations across diverse domains; and (2) Chain-of-Thought prompting improves performance in settings that require multi-step reasoning, such as politics and elections, but can hurt performance in domains where surface-level signals are more predictive. \textsc{BeliefBench} thus provides a challenging testbed for evaluating how well LLMs capture the co-evolution of perspectives and the underlying temporal and causal reasoning.
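To make the abstract's labeling recipe concrete, the sketch below illustrates the two main signals it names: Pearson co-movement of two daily probability series and cosine similarity of event-description embeddings. This is a minimal, hypothetical illustration, not the paper's actual pipeline; the function names and thresholds are invented, and the paper additionally incorporates metadata and human validation.

```python
import numpy as np

def pearson_corr(a, b) -> float:
    """Pearson correlation between two aligned daily probability series."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

def cosine_sim(u, v) -> float:
    """Cosine similarity between two event-description embeddings."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def label_pair(prob_a, prob_b, emb_a, emb_b,
               corr_thresh=0.6, sim_thresh=0.5) -> int:
    """Toy labeling rule: mark a pair as 'correlated' (1) when its belief
    series co-move strongly AND the events are semantically related.
    Thresholds here are illustrative assumptions, not the paper's values."""
    co_movement = pearson_corr(prob_a, prob_b)
    similarity = cosine_sim(emb_a, emb_b)
    return int(co_movement >= corr_thresh and similarity >= sim_thresh)

if __name__ == "__main__":
    # Two hypothetical 5-day probability series that move together.
    p_a = [0.40, 0.45, 0.55, 0.60, 0.70]
    p_b = [0.30, 0.36, 0.48, 0.55, 0.62]
    # Hypothetical 4-d embeddings of the two event descriptions.
    e_a = [0.1, 0.7, 0.2, 0.0]
    e_b = [0.2, 0.6, 0.3, 0.1]
    print(label_pair(p_a, p_b, e_a, e_b))  # -> 1 (co-moving and similar)
```

In this toy version, a pair receives a positive label only when both signals clear their thresholds, which mirrors the abstract's statement that labels combine co-movement with semantic similarity before human validation.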
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: Social Belief Modeling, Large Language Models (LLMs), Multi-hop Reasoning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section 10 and Section 11
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Appendix A
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Appendix A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Appendix A
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: See Appendix A. The dataset contains no PII or offensive content. All data is event-level and anonymized.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Appendix A
B6 Statistics For Data: Yes
B6 Elaboration: Appendix A
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: We used black-box APIs (e.g., OpenAI GPT-4, Claude 3, Gemini 1.5) without direct access to model parameters or FLOPs. This is explained in Appendix A.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 5, Section 6, Appendix B
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 6
C4 Parameters For Packages: No
C4 Elaboration: We did not use traditional NLP libraries (e.g., NLTK, ROUGE); our evaluations used custom scripts and API outputs.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix G
D2 Recruitment And Payment: Yes
D2 Elaboration: Appendix G
D3 Data Consent: Yes
D3 Elaboration: Appendix G
D4 Ethics Review Board Approval: No
D4 Elaboration: The study used publicly available event-level data containing no PII. It was deemed exempt under our institution’s guidelines, so no IRB approval was required or sought.
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Appendix G
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Appendix E
Author Submission Checklist: Yes
Submission Number: 1119