Semi-Automatic Dataset Annotation Applied to Automatic Violent Message Detection

Beatriz Botella-Gil, Robiert Sepúlveda-Torres, Alba Bonet-Jover, Patricio Martínez-Barco, Estela Saquete

Published: 2024, Last Modified: 14 Jun 2024, Venue: IEEE Access 2024, License: CC BY-SA 4.0

Abstract: Annotated corpora are indispensable for training computational models in Artificial Intelligence and Natural Language Processing. However, manual annotation is costly, arduous, and time-consuming, especially when the annotation is semantically complex. To address this problem, this work applies a methodology for semi-automatic dataset annotation based on the Human-in-the-Loop paradigm. The methodology supports the construction of a resource with fine-grained annotation to aid in the detection of violent messages in Spanish sourced from social media (Twitter/X). Applying the proposed methodology to violence annotation yielded a high-quality resource, hereafter referred to as VILLANOS. The methodology annotates the dataset incrementally, which increases annotator efficiency and thereby validates the suitability of the proposal. Annotation time was reduced by 52% compared to manual annotation, and a model trained on the VILLANOS dataset obtains an $F_{1}$ of 85.2%. These results demonstrate the efficiency and effectiveness of the methodology and support its validity.
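
The incremental Human-in-the-Loop idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the keyword-based "model", the function names (`train_keyword_model`, `predict`, `annotate_incrementally`), the batch size, and the binary violent/non-violent label are all assumptions made for illustration. The sketch only shows the general loop, in which a model trained on the labels collected so far pre-annotates the next batch and a human reviewer confirms or corrects each suggestion.

```python
from collections import Counter
from typing import Callable, List, Set, Tuple

Labeled = List[Tuple[str, bool]]  # (message, is_violent)


def train_keyword_model(labeled: Labeled) -> Set[str]:
    """Toy stand-in for a real classifier (assumption): keep the words
    that occur in violent messages and never in non-violent ones."""
    violent_words: Counter = Counter()
    other_words: Counter = Counter()
    for text, is_violent in labeled:
        target = violent_words if is_violent else other_words
        target.update(text.lower().split())
    return {word for word in violent_words if word not in other_words}


def predict(model: Set[str], text: str) -> bool:
    """Suggest 'violent' if any word of the message is in the keyword set."""
    return any(word in model for word in text.lower().split())


def annotate_incrementally(
    messages: List[str],
    human_review: Callable[[str, bool], bool],
    batch_size: int = 2,
) -> Tuple[Labeled, List[int]]:
    """Annotate messages batch by batch.

    Before each batch, the model is retrained on everything labeled so far
    and pre-annotates the batch; the human reviewer returns the final label.
    Returns the labeled data and the number of corrections per batch.
    """
    labeled: Labeled = []
    corrections_per_batch: List[int] = []
    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        model = train_keyword_model(labeled)  # retrain on labels so far
        corrections = 0
        for text in batch:
            suggestion = predict(model, text)       # automatic pre-annotation
            final = human_review(text, suggestion)  # human confirms or corrects
            if final != suggestion:
                corrections += 1
            labeled.append((text, final))
        corrections_per_batch.append(corrections)
    return labeled, corrections_per_batch
```

In such a loop, a falling number of corrections per batch is one plausible proxy for the reduced annotator effort that the abstract reports; in practice the toy keyword model would be replaced by a trained text classifier and `human_review` by an annotation interface.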