TL;DR: How you can perfectly infer hidden test set labels from just common ML loss scores (which are possibly noised)?
Abstract: Label inference was recently introduced as the problem of reconstructing the ground truth labels of a private dataset from just the (possibly perturbed) cross-entropy loss scores evaluated at carefully crafted prediction vectors. In this paper, we generalize this result to provide necessary and sufficient conditions under which label inference is possible from a broad class of loss functions. We show that for many commonly used loss functions, including linearly decomposable losses, some Bregman divergence-based losses and when common activation functions are used, it is possible to design such attacks for arbitrary noise levels. We demonstrate that these attacks can also be carried out through a lightweight augmentation to any neural network model, enabling the adversary to make these attacks look benign. Our results call to attention these vulnerabilities which might be currently under silent exploitation. Armed with this information, individuals and organizations, which vend these seemingly innocuous aggregate metrics from their classification models, can grasp the potential scope of the resulting information leakage.
Paper Under Submission: The paper is currently under submission at NeurIPS