Can Reasoning Help Large Language Models Capture Human Annotator Disagreement?

ACL ARR 2025 July Submission 792 Authors

28 Jul 2025 (modified: 22 Aug 2025) · ACL ARR 2025 July Submission · Readers: Everyone · License: CC BY 4.0
Abstract: Variation in human annotation (i.e., disagreement) is common in NLP and often reflects important information such as task subjectivity and sample ambiguity. Modeling this variation matters for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically test each reasoning setting across model sizes, distribution expression methods, and steering methods, resulting in 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades disagreement modeling performance, while naive Chain-of-Thought (CoT) reasoning improves the performance of LLMs trained with RLHF (Reinforcement Learning from Human Feedback). These findings underscore the potential risk of replacing human annotators with reasoning LLMs, especially when disagreement is important.
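To make the evaluation target concrete, below is a minimal sketch of how one might score an LLM's predicted label distribution against the distribution of human annotator votes for a single item. The Jensen-Shannon distance, the 3-class example, and the helper name `annotation_distribution` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def annotation_distribution(labels, num_classes):
    """Turn a list of per-annotator labels into a normalized label distribution."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()

# Hypothetical example: 5 annotators label one item on a 3-class task.
human_labels = [0, 0, 1, 2, 1]                      # raw annotator votes (illustrative)
human_dist = annotation_distribution(human_labels, num_classes=3)

# Hypothetical model output: a soft label distribution elicited from an LLM,
# e.g., by asking it to estimate the share of annotators choosing each label.
model_dist = np.array([0.5, 0.3, 0.2])

# Jensen-Shannon distance: 0 = identical distributions, 1 = maximally different.
print(f"JS distance: {jensenshannon(human_dist, model_dist, base=2):.3f}")
```

A lower distance means the model's expressed distribution better matches the spread of human annotations, which is the kind of disagreement-modeling performance the abstract refers to.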
Paper Type: Long
Research Area: Human-Centered NLP
Research Area Keywords: human factors in NLP, values and culture, human-centered evaluation
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=SfMoAphDly
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability)
Justification For Not Keeping Action Editor Or Reviewers: The Area Chair rVgp may lack expertise in the area: they requested that we "use multiple annotators to annotate human reasoning" and then "compare model reasoning with human reasoning". However, it is impossible to hire another group of annotators to explain the subjective decisions made by the earlier annotators. They also criticized the methodological novelty of our paper; however, this is an evaluation paper rather than a methodological one. Our evaluation reveals the existence of a critical problem rather than proposing new methodology.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Ethics Statements
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4, 5, 6, Appendix D, E
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Ethics Statements
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Ethics Statements
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: Ethics Statements
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 4, Appendix A, D, E
B6 Statistics For Data: Yes
B6 Elaboration: Section 4, Appendix A
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4, Appendix D
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 5, Appendix D, E
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 6, Appendix F, G
C4 Parameters For Packages: Yes
C4 Elaboration: Appendix D, E
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 792