HARM: Learning a Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content
Abstract: Explaining why content is hateful using natural language is crucial for fostering transparency in automated content moderation systems. However, evaluating the quality of such explanations remains an open challenge. General-purpose reward models (RMs), commonly used for scoring natural language outputs, are typically optimized for broad notions of safety. As a result, they tend to penalize necessary references to stereotypes or offensive framing, elements that are essential for faithful hate speech explanations.
To address this gap, we introduce SBIC-Explain, a dataset of 370,788 LLM-generated natural language explanations (NLEs) for offensive content, spanning three levels of human-annotated contextual richness: Tier 1 (text-only), Tier 2 (adding classification-aware context), and Tier 3 (further adding semantics-informed context). We hypothesize that as human-annotated context increases, explanations should better reflect human preferences. Yet, we find that existing RMs systematically assign lower scores to more contextually rich (and often more offensive) explanations, revealing a misalignment between model preferences and explanatory fidelity in this setting.
To mitigate this, we propose HARM (Hate-Aware Reward Model), an RM that integrates interpretable signals to better align reward scores with the needs of hate speech explanation. HARM significantly outperforms general-purpose baselines, improving preference accuracy from 0.66 to 0.80.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: free-text/natural language explanations, explanation faithfulness, model bias/fairness evaluation, hate-speech detection, evaluation methodologies, human-centered evaluation, NLP datasets, safety and alignment, sparse models
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Software: zip
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Limitations
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 3, 4, 5
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Appendix A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: 4, Appendix A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B4 Elaboration: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 3, 4
B6 Statistics For Data: Yes
B6 Elaboration: 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix L
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 5, Appendix G, Appendix H, Appendix I, Appendix J
C3 Descriptive Statistics: Yes
C3 Elaboration: 5, Appendix G, Appendix H, Appendix I, Appendix J, Appendix K
C4 Parameters For Packages: Yes
C4 Elaboration: 4, 5
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Appendix M
Author Submission Checklist: Yes
Submission Number: 729