Synthesizing Harmful Content via Persona-based Simulation for Robust Evaluation

ACL ARR 2026 January Submission8754 Authors

06 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: synthetic data generation, harmful content, user personas, large language models, content moderation
Abstract: Static benchmarks for harmful content detection suffer from data contamination and fail to capture the evolving nature of online toxicity. To address these limitations, we propose a synthetic harmful content generation framework leveraging persona-guided large language model (LLM) agents. Our approach constructs two-dimensional user personas by integrating demographic identities and topical interests with situational harmful strategies, enabling the simulation of diverse and contextually grounded harmful interactions. Both human and LLM-based evaluations confirm that our framework achieves a high success rate in generating valid harmful content. Experiments across multiple detection systems reveal that our synthetic scenarios are more challenging to detect than those in existing benchmarks. Furthermore, a multi-faceted analysis confirms that our approach achieves linguistic and topical diversity comparable to human-curated datasets, establishing our framework as an effective tool for robust evaluation of harmful content detection systems. Our code is available at \url{https://anonymous.4open.science/r/synthesizing_harmful_content-122E}.
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: hate-speech detection, NLP tools for social analysis
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 8754