Keywords: large language model; text generation; harmfulness detection; pre-trained model; fine-tuning
Abstract: This research addresses the problem of detecting harmful text generated by large language models. To this end, a comprehensive data set, HateAI-100, is carefully constructed that contains both harmful text written by humans and harmful text generated by large language models. This study selects DistilBERT as the pre-trained model and fine-tunes it on the IMDB data set, aiming to improve its ability to detect harmful text. Experimental results show that the fine-tuned DistilBERT model performs excellently on the harmfulness detection task for human-written text, reaching an accuracy of 88%. However, when dealing with harmful text generated by large language models, its performance drops significantly, to an accuracy of only 72%. This finding not only reveals the powerful ability of large language models to generate complex and covertly harmful content, but also highlights the current shortage of harmfulness detection techniques for text generated by large language models. Looking to the future, this study recommends deepening the understanding of harmful text generated by large language models, continuing to expand data sets containing harmful text generated by large language models, and actively exploring more advanced deep learning models and algorithms for detecting the harmfulness of text generated by large language models.
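The abstract does not give implementation details, but the described setup (DistilBERT fine-tuned on IMDB for binary classification) can be sketched with the Hugging Face Transformers library. The following is a minimal, illustrative sketch only, not the authors' code; the model checkpoint, hyperparameters, and use of the public "imdb" dataset are all assumptions.

```python
# Hypothetical sketch of fine-tuning DistilBERT for binary text
# classification, assuming Hugging Face Transformers and Datasets.
# Hyperparameters and dataset fields are illustrative, not from the paper.
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# num_labels=2: a binary head (e.g., harmful vs. non-harmful).
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# The public Hugging Face "imdb" dataset exposes "text" and "label" fields.
dataset = load_dataset("imdb")

def tokenize(batch):
    # Truncate/pad to a fixed length for batching.
    return tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=256
    )

tokenized = dataset.map(tokenize, batched=True)

# Illustrative training configuration; the paper does not report these values.
args = TrainingArguments(
    output_dir="distilbert-harmful",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```

A model fine-tuned this way could then be evaluated separately on human-written and LLM-generated harmful text to reproduce the kind of accuracy gap (88% vs. 72%) the abstract reports.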
Submission Number: 18