Active Learning Design Choices for NER with Transformers

Robert Vacareanu, Enrique Noriega-Atala, Gus Hahn-Powell, Marco Antonio Valenzuela-Escárcega, Mihai Surdeanu

Published: 2024, Last Modified: 03 Jun 2024, LREC/COLING 2024, CC BY-SA 4.0

Abstract: We explore multiple important choices for active learning for token classification with transformer networks that have not been analyzed in conjunction. These choices are: (i) how to select what to annotate, (ii) whether to annotate entire sentences or smaller sentence fragments, (iii) how to train with incomplete annotations at the token level, and (iv) how to select the initial seed dataset. We explore whether annotating at the sub-sentence level can translate to improved downstream performance by considering two sub-sentence annotation strategies: (i) entity-level and (ii) token-level. These approaches leave some sentences only partially annotated. To address this issue, we introduce and evaluate multiple strategies for handling partially annotated sentences during training. We show that annotating at the sub-sentence level achieves comparable or better performance than sentence-level annotation with fewer annotated tokens. We then explore the extent to which the performance gap remains once annotation time is accounted for, and find that both annotation schemes perform similarly.
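
To make choices (i) and (iii) concrete, the following is a minimal sketch of token-level selection and training with partial annotations. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which selection criterion or which partial-annotation strategies were evaluated. The sketch assumes entropy-based uncertainty for selection and a masked cross-entropy loss that skips unannotated tokens, which is one common option. The names `IGNORE`, `token_entropy`, `select_tokens`, and `masked_cross_entropy` are hypothetical.

```python
import math

IGNORE = -100  # hypothetical sentinel marking a token that has no annotation


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_tokens(sentence_probs, budget):
    """Return indices of the `budget` most uncertain tokens, highest entropy first."""
    ranked = sorted(
        range(len(sentence_probs)),
        key=lambda i: token_entropy(sentence_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


def masked_cross_entropy(sentence_probs, labels):
    """Mean negative log-likelihood computed over annotated tokens only."""
    losses = [
        -math.log(probs[label])
        for probs, label in zip(sentence_probs, labels)
        if label != IGNORE  # unannotated tokens contribute nothing to the loss
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Toy sentence: four tokens, three labels, model probabilities per token.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.30, 0.30],
    [0.34, 0.33, 0.33],
    [0.80, 0.10, 0.10],
]
chosen = select_tokens(probs, budget=2)  # tokens sent to the annotator

# Only the chosen tokens receive labels; the rest stay unannotated.
labels = [IGNORE] * len(probs)
labels[chosen[0]] = 1  # made-up annotator answers for the toy example
labels[chosen[1]] = 0
loss = masked_cross_entropy(probs, labels)
```

In this sketch the sentence remains partially annotated after selection, and the loss simply ignores the positions without labels. Other ways of handling the unlabeled positions exist, and the abstract indicates that the paper compares several.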