Identification of tweets that mention books

Shuntaro Yada, Kyo Kageura, Cécile Paris

2020 (modified: 09 Dec 2021) · Int. J. Digit. Libr. 2020 · Readers: Everyone

Abstract: We address the task of identifying tweets that mention books among tweets that contain strings identical to book titles. Assuming a comprehensive list of book titles is available, the task can be defined as text classification over tweets that contain such strings. Two types of tweets must be excluded. The first is automatically posted, spam-like tweets that promote book sales or post recommendations (bot tweets). We exclude these because the results of this task are to be used in a system we are developing: an online surrogate for book exposure embedded within human communication on social media. The second is tweets that contain the same string as a book title but are not about books (noise tweets). We propose a two-step, machine learning-based pipeline consisting of bot filtering followed by noise reduction. In our experiments, the proposed method achieved an F1-score of 0.76, which is comparable to the best performance reported in similar tasks and sufficient as a first step towards practical applications. We also analysed detailed performance and errors; the analysis suggested that the proposed method maintains an appropriate balance between precision and recall, and that it can be further improved by increasing the data size and taking word senses into account.
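The two-step structure described in the abstract (candidate tweets matched against a title list, then bot filtering, then noise reduction) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not specify the classifiers or features, so `toy_is_bot` and `toy_is_noise` are hypothetical rule-based stand-ins for the trained models, and the `Tweet` class with its `source` field, the cue-word list, and the example tweets are all assumptions made for illustration. The `f1_from_counts` helper shows the standard definition of F1 as the harmonic mean of precision and recall.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tweet:
    text: str
    source: str  # hypothetical field: name of the client that posted the tweet


def identify_book_tweets(
    tweets: List[Tweet],
    titles: List[str],
    is_bot: Callable[[Tweet], bool],
    is_noise: Callable[[Tweet], bool],
) -> List[Tweet]:
    """Two-step pipeline: bot filtering, then noise reduction."""
    # Candidate selection: keep tweets containing a string equal to a book title
    # (case-insensitive substring match, an assumption of this sketch).
    candidates = [
        t for t in tweets
        if any(title.lower() in t.text.lower() for title in titles)
    ]
    # Step 1: drop automatically posted, promotional tweets (bot tweets).
    human = [t for t in candidates if not is_bot(t)]
    # Step 2: drop tweets that match a title but are not about books (noise tweets).
    return [t for t in human if not is_noise(t)]


def toy_is_bot(tweet: Tweet) -> bool:
    # Stand-in for a trained bot classifier (assumed rule, not from the paper).
    return tweet.source.endswith("Bot") or "http" in tweet.text


def toy_is_noise(tweet: Tweet) -> bool:
    # Stand-in for a trained noise classifier (assumed cue words, not from the paper).
    cues = ("read", "novel", "book", "author")
    return not any(cue in tweet.text.lower() for cue in cues)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


EXAMPLE_TWEETS = [
    Tweet("Just finished reading Dune, what a novel", "Twitter for iPhone"),
    Tweet("Buy Dune now! 50% off http://example.com", "BookPromoBot"),
    Tweet("We climbed a huge sand dune today", "Twitter for Android"),
    Tweet("Lovely weather this morning", "Twitter for Android"),
]

if __name__ == "__main__":
    kept = identify_book_tweets(EXAMPLE_TWEETS, ["Dune"], toy_is_bot, toy_is_noise)
    for tweet in kept:
        print(tweet.text)
```

Passing the two classifiers as callables keeps the pipeline order explicit while leaving each step replaceable; in practice both would be learned models rather than hand-written rules.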
