Semi-Supervised Majority Voting for Crowdsourcing

Hao Yu, Shichao Zhang, Jiaye Li, Chengqing Li

Published: 10 Jul 2025, Last Modified: 29 Jan 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: Crowdsourced datasets often suffer from missing labels, which significantly degrade classifier performance. In this paper, we propose Semi-Supervised Majority Voting (SSMV), a novel framework that integrates semi-supervised learning into the aggregation process to mitigate these effects. First, SSMV partitions the crowdsourcing label matrix into a "sparse" region (with many missing entries) and a "dense" region (with mostly observed labels), yielding two complementary sample sets. Next, it jointly learns a reconstruction coefficient matrix, regularized by an $\ell_{2,1}$-norm to suppress noise and redundancy, by minimizing the discrepancy between the original and reconstructed label matrices. A graph-based Laplacian term preserves the intrinsic manifold structure during reconstruction, while a learned worker-selection vector filters out low-quality annotators. Finally, we apply classic majority voting to the refined label matrix to infer the final labels. Extensive experiments on synthetic and real-world datasets demonstrate that SSMV consistently outperforms state-of-the-art crowdsourcing classifiers across multiple metrics. By explicitly modeling the relationship between missing-label patterns and overall label distributions, SSMV not only recovers missing labels more accurately but also improves overall classification accuracy. This semi-supervised mechanism is readily extensible to other aggregation algorithms, providing a general strategy for improving crowdsourced label quality.
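
To make the first and last steps of the pipeline concrete, the following is a minimal sketch, not the authors' implementation. It assumes a samples-by-workers label matrix with `-1` marking a missing label, a hypothetical per-sample observed-fraction threshold for the dense/sparse split, and ties broken toward the lowest class index. The function names (`split_by_density`, `l21_norm`, `majority_vote`) are illustrative only. The reconstruction step itself (coefficient matrix, Laplacian term, worker-selection vector) is not specified in the abstract and is omitted here; only the standard definition of the $\ell_{2,1}$-norm (the sum of the row-wise Euclidean norms) is shown.

```python
import numpy as np

MISSING = -1  # assumed sentinel: the worker did not label this sample


def split_by_density(labels, threshold=0.5):
    """Split sample indices by the fraction of observed labels per row.

    The row-wise split and the 0.5 threshold are assumptions for illustration.
    """
    observed_frac = (labels != MISSING).mean(axis=1)
    dense = np.where(observed_frac >= threshold)[0]
    sparse = np.where(observed_frac < threshold)[0]
    return dense, sparse


def l21_norm(W):
    """l2,1-norm: sum of the Euclidean norms of the rows of W."""
    return float(np.linalg.norm(W, axis=1).sum())


def majority_vote(labels, n_classes):
    """Classic majority vote over observed labels only.

    Returns MISSING for samples with no observed label; ties go to the
    lowest class index because argmax returns the first maximum.
    """
    result = np.full(labels.shape[0], MISSING, dtype=int)
    for i, row in enumerate(labels):
        observed = row[row != MISSING]
        if observed.size:
            result[i] = np.bincount(observed, minlength=n_classes).argmax()
    return result


# Toy label matrix: 4 samples (rows) labeled by 4 workers (columns).
labels = np.array([
    [0, 0, 1, 0],
    [1, -1, -1, -1],
    [1, 1, -1, 0],
    [-1, -1, -1, -1],
])

dense, sparse = split_by_density(labels)
votes = majority_vote(labels, n_classes=2)
```

In this toy matrix the fourth sample has no observed labels, so plain majority voting cannot assign it a class; recovering such entries before voting is the gap the reconstruction step in SSMV is described as filling.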