Keywords: Machine learning, distributed learning, self-supervised learning, vertically-partitioned data
TL;DR: We propose a novel extension of self-supervised learning to vertical federated learning, where unlabeled data is used to train representation networks and labeled data is used to train a downstream prediction network.
Abstract: We consider a system where parties store vertically-partitioned data with a partially overlapping sample space, and a server stores labels on a subset of data samples. Supervised Vertical Federated Learning (VFL) algorithms are limited to training models using only overlapping labeled data, which can lead to poor model performance or bias. Self-supervised learning has been shown to be effective for training on unlabeled data, but the current methods do not generalize to the vertically-partitioned setting. We propose a novel extension of self-supervised learning to VFL (SS-VFL), where unlabeled data is used to train representation networks and labeled data is used to train a downstream prediction network. We present two SS-VFL algorithms: SS-VFL-I is a two-phase algorithm which requires only one round of communication, while SS-VFL-C adds communication rounds to improve model generalization. We show that both SS-VFL algorithms can achieve up to $2\times$ higher accuracy than supervised VFL when labeled data is scarce at a significantly reduced communication cost.
Is Student: Yes