Efficiently labelling sequences using semi-supervised active learning

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Withdrawn Submission
Abstract: In natural language processing, deep learning methods are popular for sequence labelling tasks, but training them usually requires large amounts of labelled data. Active learning can reduce the amount of labelled training data required by iteratively acquiring labels for the data points a model is most uncertain about. However, active learning methods typically rely on supervised training and ignore the data points which have not yet been labelled. We propose an approach to sequence labelling using active learning which incorporates both labelled and unlabelled data. We train a locally-contextual conditional random field with deep nonlinear potentials in a semi-supervised manner, treating the missing labels of the unlabelled sentences as latent variables. Our semi-supervised active learning method leverages the sentences which have not yet been labelled to improve on the performance of purely supervised active learning. We also find that using an additional, larger pool of unlabelled data provides further improvements. Across a variety of sequence labelling tasks, our method consistently matches 97% of the performance of state-of-the-art models while using less than 30% of the training data.
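To make the acquisition step concrete, the sketch below scores unlabelled sentences by the mean entropy of their per-token label marginals and acquires the most uncertain ones. This is a minimal illustration of uncertainty-based acquisition, not the paper's implementation: the predict_marginals interface and the entropy scorer are hypothetical stand-ins, and the paper's locally-contextual CRF is not reproduced here.

```python
import numpy as np

def token_entropy(marginals: np.ndarray) -> float:
    """Mean per-token entropy of the label marginals.

    marginals: array of shape (seq_len, num_labels), rows sum to 1.
    """
    eps = 1e-12  # guard against log(0)
    return float(-(marginals * np.log(marginals + eps)).sum(axis=1).mean())

def acquire(unlabelled, predict_marginals, batch_size=10):
    """Return the batch_size sentences the model is most uncertain about."""
    scores = [token_entropy(predict_marginals(s)) for s in unlabelled]
    most_uncertain_first = np.argsort(scores)[::-1]
    return [unlabelled[i] for i in most_uncertain_first[:batch_size]]

# Toy usage: random marginals stand in for a trained model's predictions.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sentences = [f"sentence number {i}".split() for i in range(100)]
    fake_marginals = lambda s: rng.dirichlet(np.ones(5), size=len(s))
    batch = acquire(sentences, fake_marginals, batch_size=5)
```

In practice the loop alternates: train on the labelled pool, score and acquire a batch from the unlabelled pool, obtain its labels, and repeat.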
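For the semi-supervised training itself, one common form of such an objective (a sketch under the assumption of a joint model over sentences and label sequences; the paper's exact formulation with locally-contextual deep potentials may differ) marginalizes over the missing labels of the unlabelled sentences:

```latex
\mathcal{L}(\theta) \;=\;
\sum_{(x, y) \in \mathcal{D}_{\mathrm{lab}}} \log p_\theta(x, y)
\;+\;
\sum_{x \in \mathcal{D}_{\mathrm{unlab}}} \log \sum_{y'} p_\theta(x, y')
```

When the potentials factorize over adjacent labels, the inner sum over all label sequences y' can be computed efficiently with the forward algorithm rather than by enumeration.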
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=Kg06cC2uVB