PetroGeoNER: A Refined and Unified Dataset for NER in the Oil & Gas Domain

Published: 2025, Last Modified: 21 Dec 2025, STIL 2025, CC BY-SA 4.0
Abstract: Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that finds relevant entities (i.e., word n-grams) in a text and assigns them to predefined semantic categories. The availability of annotated datasets is crucial for developing NER models and assessing their quality. This is a particular problem for underrepresented languages and specialized domains. Furthermore, the word-level annotation required by NER datasets is laborious and prone to inconsistencies. To contribute more resources for Portuguese, this paper presents PetroGeoNER, a NER dataset for the Oil & Gas domain. Creating the dataset involved unifying and revising two existing datasets and resolving their inconsistencies. PetroGeoNER was used to train accurate NER models. Both the models and the dataset are publicly available.