Data Augmentation for Messenger Phishing Detection Using Large Language Models

Keonwoong Noh, Seokjin Oh, Sunyub Kim, Dohyung An, Woohwan Jung

Published: 2025, Last Modified: 06 Jan 2026, ICEIC 2025, CC BY-SA 4.0

Abstract: In this study, we introduce a new dataset specifically designed for detecting messenger phishing, an increasingly significant issue in cybercrime. To overcome the scarcity of labeled phishing data, we employ large language models (LLMs) to generate synthetic data, thereby expanding the dataset and improving detection capabilities. Our experimental results show that a model trained exclusively on synthetic data performs comparably to models trained on labeled data. Furthermore, combining synthetic data with labeled data achieves higher F1 and accuracy scores than using labeled data alone, while also reducing misclassification errors.

External IDs: dblp:conf/elinfocom/NohOKAJ25