Enhancing Hate Speech Classifiers through a Gradient-assisted Counterfactual Text Generation Strategy

ACL ARR 2025 May Submission5887 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Counterfactual data augmentation (CDA) is a promising strategy for improving hate speech classification, but automating counterfactual text generation remains a challenge. Strong attribute control can distort meaning, while prioritizing semantic preservation may weaken attribute alignment. We propose Gradient-assisted Energy-based Sampling (GENES) for counterfactual text generation, which accepts only samples that meet a minimum BERTScore threshold and uses gradient-assisted proposal generation to improve attribute alignment. Compared with methods that rely solely on prompting, gradient-based steering, or energy-based sampling, GENES is more likely to jointly satisfy attribute alignment and semantic preservation under the same base model. As a result, using GENES as a counterfactual generator for data augmentation may improve the out-of-domain performance of hate speech classifiers while at least maintaining in-domain performance. In our cross-dataset evaluation, models augmented with GENES achieve the best average performance among methods that rely on a smaller model (Flan-T5-L). Comparable augmentation techniques that rely on a larger model (GPT-4o-mini) are slightly more robust on average. Nonetheless, the results with GENES are comparable, making it a possible lightweight, open-source alternative.
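To make the acceptance rule in the abstract concrete, the following is a toy sketch of energy-based sampling with a hard similarity threshold. It is not the authors' implementation. Every name in it (`jaccard`, `energy`, `constrained_mh`, `min_sim`) is hypothetical, a token-overlap score stands in for BERTScore, a word-count scorer stands in for an attribute classifier, and proposals are uniform random word swaps rather than the gradient-assisted proposals GENES uses.

```python
import math
import random


def jaccard(a, b):
    """Token-overlap similarity; a crude stand-in for BERTScore (assumption)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def energy(tokens, target_words):
    """Toy attribute energy: lower means closer to the target attribute."""
    return -sum(tok in target_words for tok in tokens)


def constrained_mh(source, vocab, target_words, min_sim=0.5, steps=200, seed=0):
    """Metropolis-Hastings over token sequences with a hard similarity floor."""
    rng = random.Random(seed)
    # Include source tokens in the vocabulary so every swap can be reversed,
    # which keeps the uniform single-token proposal symmetric.
    vocab = sorted(set(vocab) | set(source))
    current = list(source)
    for _ in range(steps):
        proposal = list(current)
        proposal[rng.randrange(len(proposal))] = rng.choice(vocab)
        # Hard constraint: never accept text that drifts too far from the source.
        if jaccard(source, proposal) < min_sim:
            continue
        delta = energy(proposal, target_words) - energy(current, target_words)
        # Standard Metropolis acceptance for a symmetric proposal.
        if delta <= 0 or rng.random() < math.exp(-delta):
            current = proposal
    return current


if __name__ == "__main__":
    src = "the movie was bad and dull".split()
    out = constrained_mh(src, ["good", "great", "fine"], {"good", "great"})
    print(" ".join(out), jaccard(src, out))
```

Because the chain starts at the source text and rejects any proposal below `min_sim`, every returned sample satisfies the similarity floor by construction. In this sketch, the uniform proposal is the part a gradient-assisted scheme would replace.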
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Data Augmentation, Counterfactual Text Generation, Hate Speech Detection
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 5887