ULF: UNSUPERVISED LABELING FUNCTION CORRECTION USING CROSS-VALIDATION FOR WEAK SUPERVISION

Anastasiia Sedova; Benjamin Roth

Published: 01 Feb 2023, Last Modified: 13 Feb 2023

Submitted to ICLR 2023

Readers: Everyone

Keywords: nlp, weak supervision, text classification, sentiment analysis

Abstract: One way to avoid expensive and time-consuming manual data labeling is weak supervision: the automatic annotation of data samples via a predefined set of labeling functions (LFs), rule-based mechanisms that generate artificial labels for the classes associated with the LFs. In this work, we investigate noise reduction techniques for weak supervision based on the principle of k-fold cross-validation. We introduce a new algorithm, ULF, for denoising weakly annotated data. ULF uses models trained on all but some LFs to detect and correct biases specific to the held-out LFs. Specifically, ULF refines the allocation of LFs to classes by re-estimating this assignment on highly reliable cross-validated samples. We realize two variants of this algorithm: feature-based ULF (relying on count-based feature vectors) and DeepULF (fine-tuning pre-trained language models). We compare ULF to methods originally developed for detecting erroneous samples in manually annotated data, as well as to our extensions of such methods to the weakly supervised setting. Our new weak supervision-specific methods (ULF and the extensions) leverage the information about which LFs matched each sample, which makes the detection of noisy samples more accurate. Evaluation on several datasets shows that ULF can improve weakly supervised learning without using any manually labeled data.
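
The cross-validation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the nearest-centroid stand-in classifier, the confidence `threshold`, the blending weight `p`, and the toy data are all assumptions made for illustration. The sketch holds out one group of LFs at a time, trains on samples that none of the held-out LFs matched, predicts on the samples they did match, and then re-estimates the LF-to-class matrix `T` from the confident predictions.

```python
import numpy as np

def lf_folds(num_lfs, k, seed=0):
    """Randomly partition LF indices into k folds (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(num_lfs), k)

def centroid_probs(X_train, y_train, X_eval, num_classes):
    """Stand-in classifier (an assumption): softmax over negative
    distances to the class centroids seen in the training split."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_eval[:, None, :] - centroids[None], axis=2)
    scores = np.exp(-dists)
    probs = np.zeros((len(X_eval), num_classes))
    probs[:, classes] = scores / scores.sum(axis=1, keepdims=True)
    return probs

def cross_validated_probs(X, Z, T, k, seed=0):
    """Out-of-sample class probabilities, cross-validating over LFs.
    Z: (n_samples, n_lfs) binary LF matches; T: (n_lfs, n_classes)."""
    num_classes = T.shape[1]
    y = (Z @ T).argmax(axis=1)            # weak labels from summed LF votes
    covered = Z.sum(axis=1) > 0           # samples matched by at least one LF
    probs = np.full((len(X), num_classes), 1.0 / num_classes)
    for held_out in lf_folds(Z.shape[1], k, seed):
        hit = Z[:, held_out].sum(axis=1) > 0
        train, test = covered & ~hit, hit
        if train.any() and test.any():
            probs[test] = centroid_probs(X[train], y[train], X[test], num_classes)
    return probs

def reestimate_T(Z, T, probs, threshold=0.7, p=0.5):
    """Blend T with class frequencies of confident predictions per LF.
    Both threshold and p are illustrative values, not from the source."""
    confident = probs.max(axis=1) >= threshold
    onehot = np.eye(T.shape[1])[probs.argmax(axis=1)] * confident[:, None]
    counts = Z.T @ onehot                 # (n_lfs, n_classes)
    totals = counts.sum(axis=1, keepdims=True)
    T_hat = np.where(totals > 0, counts / np.maximum(totals, 1), T)
    return p * T_hat + (1 - p) * T

# Toy data (invented): LF 1 is assigned to class 0 but also fires on a
# class-1-looking sample near (5.5, 5.5).
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5],
              [5, 6], [6, 5], [5.5, 5.5], [0.5, 0.5]], dtype=float)
Z = np.array([[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
              [0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

probs = cross_validated_probs(X, Z, T, k=4)
T_new = reestimate_T(Z, T, probs)
print(T_new)
```

In this toy run, the row of `T_new` for the noisy LF moves part of its mass from class 0 toward class 1, while the rows of the clean LFs stay unchanged. A real setup would replace the centroid classifier with the feature-based or fine-tuned language model mentioned in the abstract.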

Area: Applications (e.g., speech processing, computer vision, NLP)

TL;DR: We introduce a new algorithm, ULF, for denoising weakly annotated data based on the principle of k-fold cross-validation. ULF uses models trained on all but some LFs to detect and correct biases specific to the held-out LFs.

Supplementary Material: zip
