Abstract: This paper considers binary and multilabel classification problems in a
setting where labels are missing independently and with a known rate. Missing
labels are a ubiquitous phenomenon in extreme multi-label classification (XMC)
tasks, such as matching Wikipedia articles to a small subset out of the
hundreds of thousands of possible tags, where no human annotator can possibly
check the validity of all the negative samples. For this reason,
propensity-scored precision---an unbiased estimate for precision-at-k under a
known noise model---has become one of the standard metrics in XMC. Few
methods, however, account for missing labels already during training, and
all of them are limited to loss functions that decompose into a sum of
contributions from individual labels. A typical approach to training is
to reduce the multilabel problem to a series of binary or multiclass
problems, and it has been shown that if the surrogate task is to be
consistent for optimizing recall, the resulting loss function is not
decomposable over labels. Therefore, this paper develops unbiased estimators
for generic, potentially non-decomposable loss functions. These estimators
suffer from increased variance and may lead to ill-posed optimization
problems, which we address by switching to convex upper bounds. The
theoretical considerations are further supplemented by an experimental study
showing that the switch to unbiased estimators significantly alters the
bias-variance trade-off and may thus require stronger regularization.
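
For illustration, here is a minimal sketch of the quantities the abstract refers to, in our own notation (the paper's exact definitions may differ). With known propensities $p_l = \Pr[\text{label } l \text{ observed} \mid \text{label } l \text{ relevant}]$, observed labels $\tilde{y}_l$, and predicted scores $\hat{y}$, propensity-scored precision-at-$k$ is

$$\mathrm{PSP@}k = \frac{1}{k} \sum_{l \in \mathrm{top}_k(\hat{y})} \frac{\tilde{y}_l}{p_l},$$

which satisfies $\mathbb{E}[\mathrm{PSP@}k] = \mathrm{P@}k$ because $\mathbb{E}[\tilde{y}_l] = p_l y_l$ under the assumed noise model. Analogously, for a label-decomposable loss $\ell$, the estimator

$$\tilde{\ell}(\tilde{y}_l, \hat{y}_l) = \frac{\tilde{y}_l}{p_l}\,\ell(1, \hat{y}_l) + \Bigl(1 - \frac{\tilde{y}_l}{p_l}\Bigr)\ell(0, \hat{y}_l)$$

is unbiased for $\ell(y_l, \hat{y}_l)$. Note that the coefficient on $\ell(0, \hat{y}_l)$ becomes negative whenever $\tilde{y}_l = 1$ and $p_l < 1$, which hints at the increased variance and ill-posedness the abstract mentions.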
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Takashi_Ishida1
Submission Number: 4497