Abstract: Due to rising concerns about privacy protection, building machine learning models from distributed databases with privacy guarantees has gained increasing attention. Vertical federated learning (VFL) trains machine learning models in a privacy-preserving way when the data features are scattered across distributed databases. We study the participant selection problem (PSP) for VFL, which chooses a given number of participants to conduct training while maximizing model accuracy. Compared to training with all participants, PSP can filter out hitch-riders that contribute only marginally to model quality and reduce training time by involving fewer participants. To achieve good model accuracy, we formulate PSP as choosing a set of participants that maximizes the likelihood of the data samples. Then, utilizing the k-nearest neighbors (KNN) classifier as the proxy model, we express the likelihood as a function of the selected participants and prove that the function is submodular. The submodular property is favorable as it can account for the feature diversity among the participants and allows us to greedily select the participant with the maximum marginal gain at each step. However, the selection process requires finding the top-k neighbors of a data sample as the basic operation, which is expensive in the VFL setting as it involves encrypted communication. As such, we adapt Fagin's algorithm, a classic top-k query algorithm, to reduce the amount of encrypted communication. We deploy our solution VFPS-SM across five distributed nodes and conduct experiments with 10 datasets and 3 models to evaluate its performance. The results show that VFPS-SM can reduce the end-to-end running time by up to $35\times$ and the selection time by up to $365\times$, and improve model accuracy by 6.0% compared with state-of-the-art baselines.
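To make the greedy step concrete, below is a minimal sketch of greedy selection under a monotone submodular utility. The function names (`greedy_select`, `utility`), the toy participant feature sets, and the coverage objective are illustrative assumptions, not the paper's method; the actual objective in the paper is a KNN-based data likelihood evaluated over encrypted communication, which is not reproduced here.

```python
from typing import Callable, Set

def greedy_select(candidates: Set[str], k: int,
                  utility: Callable[[Set[str]], float]) -> Set[str]:
    """Greedily pick k participants, adding the one with the largest
    marginal utility gain at each step. For a monotone submodular
    utility, this enjoys the classic (1 - 1/e) approximation guarantee."""
    selected: Set[str] = set()
    for _ in range(k):
        best, best_gain = None, float("-inf")
        for p in candidates - selected:
            gain = utility(selected | {p}) - utility(selected)
            if gain > best_gain:
                best, best_gain = p, gain
        if best is None:  # fewer than k candidates available
            break
        selected.add(best)
    return selected

if __name__ == "__main__":
    # Toy stand-in objective: each participant holds a hypothetical set
    # of feature IDs, and utility(S) counts the distinct features covered
    # by S. Coverage is monotone submodular and mirrors the feature-
    # diversity intuition from the abstract.
    features = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5}, "D": {1, 2}}
    cover = lambda S: len(set().union(*(features[p] for p in S))) if S else 0
    print(greedy_select(set(features), k=2, utility=cover))
```

Under this toy objective, the greedy rule first picks the participant covering the most features and then the one adding the most new ones, capturing why submodularity rewards diversity rather than redundancy among participants.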