Keywords: Active Learning, Clinical NLP, Named Entity Recognition, Information Extraction, Rare Entity Learning
Abstract: Adverse Drug Events (ADEs) are a major cause of preventable morbidity and mortality, yet ADE mentions in clinical narratives are rare, context-dependent, and often span multiple tokens, which limits the effectiveness of standard active learning (AL) heuristics. We propose a meta-model–driven AL framework that ranks unlabelled sentences by a predicted proxy performance gain (PPG), estimated from uncertainty signals, embedding-level diversity, ontology alignment, and a multi-word ontology match (MWOM) cue. A random forest regressor is used as a surrogate utility model to combine these features.
Evaluated on the n2c2 2018 Track 2 dataset, the proposed method selects 1,250 of 3,949 candidate sentences (32%), resulting in 5,256 annotated sentences in total, compared with 7,955 sentences under full-pool annotation. Across five random seeds, the method achieves a mean Micro F1 of 0.9475 (95% CI [0.9469, 0.9480]), while attaining a Macro F1 of 0.7871 in a representative run (seed 42), outperforming the strongest non–meta-model baseline (R-Cos: Micro 0.9311, Macro 0.6334). Substantial gains are observed for rare entity types in the same run, with ADE improving from 0.0352 to 0.3057, Reason from 0.3712 to 0.5689, and Duration from 0.1276 to 0.7230. These results demonstrate that learned, feature-rich acquisition strategies can more effectively prioritise rare, safety-critical entities while substantially reducing annotation requirements.
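The annotation figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: the variable names are hypothetical, and the size of the initially labelled set is inferred from the quoted totals, not stated in the abstract.

```python
# Figures quoted in the abstract.
selected = 1_250         # sentences chosen by the acquisition strategy
pool = 3_949             # candidate (unlabelled) sentences
total_selective = 5_256  # annotated sentences under the proposed method
total_full = 7_955       # annotated sentences under full-pool annotation

# Assumption (inferred, not stated in the abstract): both totals include the
# same initially labelled set, so its size can be recovered from either one.
seed_from_selective = total_selective - selected
seed_from_full = total_full - pool

selected_fraction = selected / pool          # share of the pool that is annotated
saved = total_full - total_selective         # sentences not sent for annotation
saved_fraction = saved / total_full          # relative reduction in annotation

print(f"inferred initial set: {seed_from_selective} / {seed_from_full}")
print(f"selected share of pool: {selected_fraction:.1%}")
print(f"annotation avoided: {saved} sentences ({saved_fraction:.1%})")
```

Under this reading, both totals imply the same initial labelled set, and the quoted 32% is the selected share of the candidate pool rather than of the full annotation total.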
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: Clinical and Biomedical Applications, Information Extraction, NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 7033