Content Based Bot Detection using Bot Language Model and BERT Embeddings

Shubham Kumar, Shivang Garg, Yatharth Vats, Anil Singh Parihar

Published: 01 Jan 2021, Last Modified: 05 Nov 2023, Venue: ICCCSP 2021, Readers: Everyone

Abstract: Micro-blogging sites such as Twitter, with 187 million daily active users, have become an essential source for getting news and trends, sharing information, and gathering insights. Many user accounts on Twitter are automated with software to share content without human intervention; such accounts are called bots. The open nature of Twitter has also allowed people to misuse it to alter information dynamics, as seen in the 2016 US elections. This research proposes a neural network ensemble of Text CNN and LSTM models with BERT embeddings to classify tweets as bot-generated or not, based on the tweets' textual content. We also address the data imbalance present in the Twitter dataset with the Language Model Oversampling Technique, which creates a bot language model that mimics bot tweets, in place of traditional techniques such as the Synthetic Minority Oversampling Technique (SMOTE). We compare and analyze the tweets generated by the two methods. A comparison with several state-of-the-art models shows that the proposed model outperforms them.
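
For readers unfamiliar with the traditional baseline named in the abstract, the core step of the Synthetic Minority Oversampling Technique can be sketched as follows. This is a minimal illustration on toy two-dimensional vectors, not the authors' pipeline: the function name `smote_sample`, its parameters, and the `bot_vectors` data are hypothetical stand-ins, and a real system would operate on tweet feature vectors.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by SMOTE-style interpolation.

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy vectors standing in for minority-class (bot) feature vectors (assumed data).
bot_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(bot_vectors)
```

Every synthetic point is an interpolation in feature space between two existing minority samples, which is the behaviour the abstract contrasts with generating new bot tweets from a language model.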
