HARM: Learning a Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content
Abstract: Explaining why content is hateful using natural language is crucial for fostering transparency in automated content moderation systems. However, evaluating the quality of such explanations remains an open challenge. General-purpose reward models (RMs), commonly used for scoring natural language outputs, are typically optimized for broad notions of safety. As a result, they tend to penalize necessary references to stereotypes or offensive framing, elements that are essential for faithful hate speech explanations.
To address this gap, we introduce SBIC-Explain, a dataset of 370,788 LLM-generated natural language explanations (NLEs) for offensive content, spanning three levels of human-annotated contextual richness: Tier 1 (text-only), Tier 2 (adding classification-aware context), and Tier 3 (further adding semantics-informed context). We hypothesize that as human-annotated context increases, explanations should better reflect human preferences. Yet, we find that existing RMs systematically assign lower scores to more contextually rich (and often more offensive) explanations, revealing a misalignment between model preferences and explanatory fidelity in this setting.
To mitigate this, we propose HARM (Hate-Aware Reward Model), an RM that integrates interpretable signals to better align reward scores with the needs of hate speech explanation. HARM significantly outperforms general-purpose baselines, improving preference accuracy from 0.66 to 0.80.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: free-text/natural language explanations, explanation faithfulness, model bias/fairness evaluation, hate-speech detection, evaluation methodologies, human-centered evaluation, NLP datasets, safety and alignment, sparse models
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Software: zip
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Limitations
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 3, 4, 5
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Appendix A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: 4, Appendix A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B4 Elaboration: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 3, 4
B6 Statistics For Data: Yes
B6 Elaboration: 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix L
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 5, Appendix G, Appendix H, Appendix I, Appendix J
C3 Descriptive Statistics: Yes
C3 Elaboration: 5, Appendix G, Appendix H, Appendix I, Appendix J, Appendix K
C4 Parameters For Packages: Yes
C4 Elaboration: 4, 5
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Appendix M
Author Submission Checklist: Yes
Submission Number: 729