Abstract: Semi-supervised learning (SSL) algorithms often struggle to perform well when trained on imbalanced data. In such scenarios, the generated pseudo-labels tend to exhibit a bias toward the majority class, and models relying on these pseudo-labels can further amplify this bias. Existing imbalanced SSL algorithms explore pseudo-labeling strategies based on either pseudo-label refinement (PLR) or threshold adjustment (THA), aiming to mitigate the bias through heuristic-driven designs. However, through a careful statistical analysis, we find that existing strategies are suboptimal: most PLR algorithms are either overly empirical or rely on the unrealistic assumption that models remain well-calibrated throughout training, while most THA algorithms depend on flawed metrics for pseudo-label selection. To address these shortcomings, we first derive the theoretically optimal form of pseudo-labels under class imbalance. This foundation leads to our key contribution: SEmi-supervised learning with pseudo-label optimization based on VALidation data (SEVAL), a unified framework that learns both PLR and THA parameters from a class-balanced subset of training data. By jointly optimizing these components, SEVAL adapts to specific task requirements while ensuring per-class pseudo-label reliability. Our experiments demonstrate that SEVAL outperforms state-of-the-art SSL methods, producing more accurate and effective pseudo-labels across various imbalanced SSL scenarios while remaining compatible with diverse SSL algorithms.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: **1. Rewrote the "Related Work" section**, which now includes:
* Detailed discussions of related works such as FlexMatch and Dash, along with an analysis of their relationship to the proposed method.
* An update to the references, adding 8 more recent works from the last two years.
**2. Streamlined the "Methods" section** by moving algorithms and less critical analyses to the appendix.
**3. Extended and relocated the implementation details** for threshold learning to the "Methods" section.
**4. Further clarified the experimental setup**, specifying that half of the training set was used for curriculum optimization. The size of the split training dataset (i.e., $n/2$) is now provided for all experiments.
Assigned Action Editor: ~Feng_Liu2
Submission Number: 6142