LLM-Based Persona-Driven Text Data Augmentation

Hyeon Seong Jeong, Han Kyeong Ko, Soo Yong Park, Taehoon Kim

Published: 01 Jan 2025, Last Modified: 30 Mar 2026, IEEE Access, CC BY-SA 4.0

Abstract: Illicit online communication, such as drug-dealing dialogues, is increasingly conducted through covert, context-dependent language patterns that evade traditional detection techniques in South Korea. However, developing reliable AI-based detection systems remains challenging due to the scarcity of real-world training data in such sensitive domains. This paper proposes a novel persona-driven data augmentation framework that uses a Large Language Model (LLM) to generate realistic synthetic drug-dealing dialogues. By encoding domain-specific buyer and seller personas along with linguistic behaviour rules, the method produces contextually coherent and semantically diverse dialogues that reflect authentic communication styles. Evaluation results demonstrate that the augmented data preserves key stylistic features (high cosine similarity), maintains lexical diversity (TTR), improves fluency (perplexity), and enhances coherence and lexical richness (ROUGE-L), outperforming traditional augmentation methods. Furthermore, statistical validation confirms the semantic consistency and stability of the generated data. These findings highlight the viability of LLM-based augmentation in low-resource, high-risk domains and suggest its potential transferability to other specialized NLP applications requiring context-preserving generation.
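
As a rough illustration of three of the metrics named in the abstract, the following sketch computes the type-token ratio (TTR), a cosine similarity, and ROUGE-L F1 for a pair of token sequences. It is not the authors' evaluation code. The function names, the whitespace tokenizer, and the bag-of-words vectors used for cosine similarity are all assumptions made for illustration; the paper may use sentence embeddings and a Korean morphological tokenizer instead. Perplexity is omitted because it requires a trained language model.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    # Korean text would normally need a morphological analyzer.
    return text.lower().split()


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between bag-of-words count vectors (an assumption;
    embedding vectors could be substituted)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy, non-sensitive example sentences (hypothetical data).
original = tokenize("the cat sat on the mat")
augmented = tokenize("the cat lay on the mat")

ttr = type_token_ratio(augmented)
cos = cosine_similarity(original, augmented)
rouge = rouge_l_f1(original, augmented)
```

In an augmentation study these scores would be computed between each original dialogue and its synthetic counterpart and then aggregated; how the paper aggregates them is not stated in the abstract.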

External IDs: doi:10.1109/access.2025.3611636