CPWS: Confident Programmatic Weak Supervision for High-Quality Data Labeling

Published: 01 Jan 2025, Last Modified: 29 Sept 2025, ACM Trans. Inf. Syst. 2025, CC BY-SA 4.0
Abstract: Programmatic Weak Supervision (PWS) is a recent data labeling paradigm that employs several Labeling Functions (LFs) to provide weak labels and a Label Model (LM) to aggregate them. Despite significant progress, PWS still faces inherent challenges. From the labeling perspective, LFs may mislabel some data points. From the data perspective, some data points may themselves be low-quality (e.g., ambiguous texts or blurred images). These problems largely stem from the lack of an explicit evaluation mechanism for LFs or data points. To this end, inspired by confident learning, which focuses on label quality, we propose Confident PWS (CPWS), an approach for high-quality data labeling. Specifically, several LFs are first used to provide weak labels for unlabeled data. We then develop an explicit Dual Evaluation Mechanism (DEM) to evaluate the quality of both LFs and data points, which not only uses data to evaluate trained models but also leverages trained models to evaluate data. Building on this, we further design a Distribution-Guided Pruning Strategy (DPS) to prune low-quality data and aggregate weak labels under the guidance of the label class distribution. Extensive experiments on various benchmark datasets demonstrate the effectiveness and generalization ability of the proposed approach.
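To make the basic PWS setup described in the abstract concrete, the following is a minimal sketch of labeling functions plus a label model. It is not the CPWS method: the three keyword-based LFs, the `ABSTAIN` convention, and the use of plain majority voting as the label model are all assumptions chosen for illustration, and the DEM and DPS components are not reproduced here.

```python
from collections import Counter

# Hypothetical label convention for a toy binary sentiment task.
ABSTAIN = -1
NEG, POS = 0, 1


# Illustrative labeling functions (assumptions, not from the paper).
def lf_contains_great(text):
    return POS if "great" in text.lower() else ABSTAIN


def lf_contains_awful(text):
    return NEG if "awful" in text.lower() else ABSTAIN


def lf_exclamation(text):
    # A deliberately noisy heuristic: LFs may mislabel some data points.
    return POS if text.endswith("!") else ABSTAIN


LFS = [lf_contains_great, lf_contains_awful, lf_exclamation]


def apply_lfs(texts, lfs=LFS):
    """Build the weak-label matrix: one row per text, one column per LF."""
    return [[lf(t) for lf in lfs] for t in texts]


def majority_vote(votes):
    """Stand-in label model: majority vote over non-abstaining LFs.

    Returns ABSTAIN when no LF fires or when the top two classes tie.
    """
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


texts = [
    "Great phone!",
    "Awful battery life",
    "It arrived on Tuesday",
    "Awful, just awful!",
]
weak = apply_lfs(texts)
labels = [majority_vote(v) for v in weak]
print(labels)
```

In this toy run the last text receives conflicting votes and the third receives none, so both are left unlabeled. Cases like these, where LFs disagree or the data point itself is uninformative, are the kind of situation that an explicit evaluation of LFs and data points is meant to address.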