Framing Bias in Arithmetic Reasoning: How Language and Identity Cues Steer LLM Outputs in Objective Tasks

ACL ARR 2025 July Submission 797 Authors

28 Jul 2025 (modified: 19 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: LLMs are expected to reason reliably over objective, verifiable facts, especially in contrast to subjective or open-ended tasks. We introduce MATHCOMP, a diagnostic benchmark comprising over 29,000 prompted instances derived from 300 controlled arithmetic comparison scenarios, systematically varied across 14 linguistic framings and multiple demographic identity conditions (e.g., "a woman", "a Black person"). Across six LLMs and multiple prompting formats, we observe consistent framing bias: systematic, directional shifts in model predictions caused by terms like "more", "less", or "equal", even when logically redundant. Demographic references further amplify these shifts. Chain-of-thought prompting reduces framing effects in free-form outputs, though structured reasoning formats can reintroduce bias by echoing prompt cues. MATHCOMP reveals how even grounded, symbolic tasks are shaped by linguistic and social framing, expanding the evaluation of LLM robustness and, ultimately, fairness beyond standard accuracy metrics and common benchmarks focused on affective or identity-laden content.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: math QA, logical reasoning, model bias/fairness evaluation, mathematical NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=gPRyCsUtYl
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: We respectfully request a new Action Editor and new reviewers for this submission. The paper has undergone extensive revisions, including updated analyses, additional sections, and a restructuring of the framing and writing, making it effectively a new submission. To ensure a fresh and unbiased evaluation of the current version, we believe a new set of reviewers and a new AC with relevant expertise would provide the most constructive and fair assessment.
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: The dataset will be open source and publicly available.
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3
B6 Statistics For Data: Yes
B6 Elaboration: Main paper and in the appendix
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 4
C3 Descriptive Statistics: Yes
C3 Elaboration: Sections 5, 6, and 7
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix A.4 Annotation and Filtering
D2 Recruitment And Payment: No
D2 Elaboration: The task was not subjective.
D3 Data Consent: Yes
D3 Elaboration: The annotators were among the authors.
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: No
D5 Elaboration: The task is objective, and we do not collect or report the demographics of the annotators.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Section 3
Author Submission Checklist: yes
Submission Number: 797