Harmfulness Detection of Text Generated by Large Language Models

28 Feb 2025 (modified: 01 Mar 2025) · XJTU 2025 CSUC Submission · CC BY 4.0
Keywords: large language model; text generation; harmfulness detection; pre-trained model; fine-tuning
Abstract: This research addresses the problem of detecting harmful text generated by large language models. To this end, a comprehensive dataset, HateAI-100, is carefully constructed, containing both harmful text written by humans and harmful text generated by large language models. DistilBERT is selected as the pre-trained model and fine-tuned on the IMDB dataset to improve its ability to detect harmful text. Experimental results show that the fine-tuned DistilBERT model performs well on harmfulness detection for human-written text, reaching an accuracy of 88%. However, its performance drops significantly on harmful text generated by large language models, where accuracy falls to only 72%. This finding reveals the capacity of large language models to generate complex and covertly harmful content, and highlights the shortcomings of current harmfulness detection techniques for such text. For future work, this study recommends deepening the understanding of harmful text generated by large language models, continuing to expand datasets of such text, and exploring more advanced deep learning models and algorithms for its detection.
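The abstract describes fine-tuning DistilBERT on IMDB for binary harmfulness classification. A minimal sketch of that setup is shown below, assuming the Hugging Face `transformers` and `datasets` libraries; the paper's exact hyperparameters are not given, so the epoch count, batch size, and learning rate here are illustrative, and HateAI-100 is not public, so evaluation on it is omitted.

```python
# Minimal sketch: fine-tuning DistilBERT for binary text classification,
# as described in the abstract. Hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # 0 = benign, 1 = harmful
)

# The paper fine-tunes on the IMDB dataset; tokenize to fixed-length inputs.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=256
    )

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="distilbert-harmfulness",
    num_train_epochs=3,               # illustrative, not from the paper
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```

After training, the same `model` would be applied to a held-out set of human-written versus LLM-generated harmful text to reproduce the 88% versus 72% accuracy comparison reported above.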
Submission Number: 18