Abstract: Extracting events from news stories as the aim of several Natural Language Processing (NLP) applications (e.g., question answering, news recommendation, news summarization) is not a trivial task, due to the complexity of natural language and the fact that news reporting is characterized by journalistic style and norms. Those aspects entail scattering an event description over several sentences within one document (or more documents), applying a mechanism of gradual specification of event-related information. This implies a widespread use of co-reference relations among the textual elements, conveying non-linear temporal information. In addition to this, despite the achievement of state-of-the-art results in several tasks, high-quality training datasets for non-English languages are rarely available. This paper presents our preliminary study to develop an annotated Dataset for Italian Crime Event news (DICE). The contribution of the paper are: (1) the creation of a corpus of 10,395 crime news; (2) the annotation schema; (3) a dataset of 10,395 news with automatic annotations; (4) a preliminary manual annotation using the proposed schema of 1000 documents. The first tests on DICE have compared the performance of a manual annotator with that of single-span and multi-span question answering models and shown there is still a gap in the models, especially when dealing with more complex annotation tasks and limited training data. This underscores the importance of investing in the creation of high-quality annotated datasets like DICE, which can provide a solid foundation for training and testing a wide range of NLP models.
Loading