Training Subset Selection for Weak Supervision

Hunter Lang; Aravindan Vijayaraghavan; David Sontag

Training Subset Selection for Weak Supervision

Hunter Lang, Aravindan Vijayaraghavan, David Sontag

Published: 31 Oct 2022, Last Modified: 06 Apr 2025NeurIPS 2022 AcceptReaders: Everyone

Keywords: weak supervision, data editing, subset selection, weakly-supervised learning

TL;DR: We show how to use pretrained representations to select high-quality subsets of weakly labeled training data. Training with these subsets improves the performance of weak supervision.

Abstract: Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.

Supplementary Material: pdf

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/training-subset-selection-for-weak/code)

15 Replies

Loading