DistALANER: Distantly Supervised Active Learning Augmented Named Entity Recognition in the Open Source Software Ecosystem

Published: 01 Jan 2024, Last Modified: 04 Sept 2025ECML/PKDD (10) 2024EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: As the AI revolution unfolds, the push toward automating support systems in diverse professional fields ranging from open-source software to healthcare, and banking to transportation has become more pronounced. Central to the automation of these systems is the early detection of named entities, a task that is foundational yet fraught with challenges due to the need for domain-specific expert annotations amid a backdrop of specialized terminologies, making the process both costly and complex. In response to this challenge, our paper presents an innovative named entity recognition (NER) framework (https://github.com/NeuralSentinel/DistALANER) tailored for the open-source software domain. Our method stands out by employing a distantly supervised, two-step annotation process that cleverly exploits language heuristics, bespoke lookup tables, external knowledge bases, and an active learning model. This multifaceted strategy not only elevates model performance but also addresses the critical hurdles of high costs and the dearth of expert annotators. A notable achievement of our approach is its capability to enable pre-large language models (pre-LLMs) to significantly outperform specially designed generic/domain specific LLMs for NER tasks. We also show the effectiveness of NER in the downstream task of relation extraction.
Loading