Efficient Heterogeneity-Aware Federated Active Data Selection

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Federated Active Learning (FAL) aims to learn an effective global model while minimizing label queries. Privacy requirements make it challenging to design effective active data selection schemes, since cross-client query information cannot be shared. In this paper, we bridge this important gap by proposing the \underline{F}ederated \underline{A}ctive data selection by \underline{LE}verage score sampling (FALE) method. It is designed for regression tasks in the presence of non-i.i.d. client data, enabling the server to select data globally in a privacy-preserving manner. Building on FedSVD, FALE estimates the utility of unlabeled data and performs data selection via leverage score sampling. In addition, a secure model learning framework is designed for federated regression tasks to exploit supervision. FALE can operate without an initial labeled set and selects instances in a single pass, significantly reducing communication overhead. Theoretical analysis establishes the query complexity for FALE to achieve constant-factor approximation and relative-error approximation. Extensive experiments on 11 benchmark datasets demonstrate significant improvements of FALE over existing state-of-the-art methods.
Lay Summary: Training effective AI models usually requires large amounts of labeled data, which is costly and challenging to gather, especially when data privacy regulations restrict direct data sharing among institutions. Existing techniques often fail to efficiently select the most useful data points when data is distributed unevenly across multiple clients. To address this, we developed Federated Active data selection by Leverage score sampling (FALE), a novel method combining federated learning with active learning. FALE utilizes a privacy-preserving federated singular value decomposition to securely understand the data distribution across different clients. Based on this analysis, FALE employs a leverage-score sampling strategy to efficiently select globally informative data points. It then securely trains a robust global AI model on these selected data points, without compromising client privacy. FALE significantly reduces redundant labeling effort and enhances the accuracy of AI models in decentralized environments. Experiments on multiple benchmark datasets demonstrate that FALE consistently outperforms existing methods, achieving better performance with fewer labeled data points. Thus, FALE makes decentralized AI training more practical, efficient, and privacy-aware, with broad implications for secure and collaborative machine learning applications.
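To illustrate the core selection primitive, here is a minimal centralized sketch of leverage-score sampling. This is an assumption-laden simplification: FALE computes the decomposition via FedSVD across clients without sharing raw data, whereas this sketch uses a plain local SVD purely to show what leverage scores are and how sampling proportional to them works. The function names are illustrative, not from the paper.

```python
import numpy as np

def leverage_scores(A):
    """Leverage score of row i is ||U_i||^2, where A = U @ diag(S) @ Vt
    is the thin SVD. Scores sum to rank(A) and measure how much each
    row influences the column span of A."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.sum(U ** 2, axis=1)

def sample_by_leverage(A, k, seed=None):
    """Draw k distinct row indices with probability proportional to
    their leverage scores (a single-pass selection, as in the paper's
    setting, but computed centrally here for illustration)."""
    rng = np.random.default_rng(seed)
    scores = leverage_scores(A)
    probs = scores / scores.sum()
    return rng.choice(A.shape[0], size=k, replace=False, p=probs)

# Toy example: 100 unlabeled points in 3 dimensions.
A = np.random.default_rng(0).standard_normal((100, 3))
selected = sample_by_leverage(A, k=10, seed=0)
```

Rows with high leverage lie in directions of the column space that few other rows cover, so labeling them yields the most information for regression; this is the intuition behind the constant-factor and relative-error guarantees the paper proves for the federated variant.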
Primary Area: General Machine Learning
Keywords: Federated learning, Active learning, Leverage score sampling
Submission Number: 9377