Improved Coresets for Vertical Federated Learning: Regularized Linear and Logistic Regressions

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Novel Coresets for Regularized Linear and Logistic Regressions in VFL.
Abstract: A coreset, as a summary of the training data, offers an efficient way to reduce data processing and storage complexity during training. In the emerging vertical federated learning (VFL) setting, where scattered clients store different features of the data, it directly reduces communication complexity. In this work, we introduce coreset constructions for regularized logistic regression in both the centralized and VFL settings. Additionally, we improve the coreset size for regularized linear regression in the VFL setting, and we eliminate the coreset size's dependency on a property of the data that arises from the VFL setting. These improvements in coreset size come from our novel coreset construction algorithms, which capture the reduced model complexity due to the added regularization, and from their subsequent analysis. In experiments, we provide an extensive empirical evaluation that supports our theoretical claims, and we report the performance of our coresets by comparing models trained on the complete data against models trained on the coreset.
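The abstract describes coresets built by selecting important samples and reweighting them. As a minimal illustration of the general sensitivity-sampling recipe only, not the paper's algorithm, the sketch below draws a weighted coreset for ridge (regularized linear) regression using ridge leverage scores as stand-in sensitivity upper bounds; the function name and the leverage-score choice are our assumptions.

```python
import numpy as np

def sensitivity_coreset(X, y, lam, m, seed=None):
    """Sample a weighted coreset of size m via importance sampling.

    Uses ridge leverage scores l_i = x_i^T (X^T X + lam*I)^{-1} x_i
    as stand-in sensitivity scores (an assumption for illustration;
    the paper's regularized sensitivity scores differ).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Ridge leverage score of each row of X.
    G_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
    lev = np.einsum("ij,jk,ik->i", X, G_inv, X)
    p = lev / lev.sum()                       # sampling distribution
    idx = rng.choice(n, size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])                    # unbiased reweighting
    return X[idx], y[idx], w
```

A model would then be trained on `(X[idx], y[idx])` with per-sample weights `w`, so the weighted coreset objective approximates the full-data objective in expectation.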
Lay Summary: We present a theoretically grounded, more data-efficient approach for training regularized versions of logistic regression and linear regression in a setting where the features of the training data are partitioned across multiple parties. In this setting, the parties holding different partitions of the feature space can maintain data privacy, making it an excellent fit for a consortium of organizations in domains such as finance and healthcare that must keep their customers' information private. The data efficiency comes from selecting only the important samples out of the entire training set. Regularizing the methods ensures that the trained models are suitable not only for the samples on which they are trained, but also for previously unseen samples. To the best of the co-authors' knowledge, this is the first approach of its kind.
Link To Code: https://github.com/dcll-iiitd/CoresetForVFL
Primary Area: General Machine Learning->Scalable Algorithms
Keywords: Vertical Federated Learning, Coresets, Logistic Regression, Ridge Regression, Regularization, Regularized Sensitivity Score
Submission Number: 11116