FedUSD: Unbiased Synthetic Data for Federated Learning

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | Readers: Everyone | License: CC BY 4.0
Abstract: Aggregation-Free Federated Learning enables joint training by sharing synthetic data, aiming to eliminate data heterogeneity across clients. However, existing methods fail to explicitly separate the principal and residual components of a dataset, which leads to biased synthetic data. In this paper, we propose FedUSD, a novel Unbiased Synthetic Data optimization method for Aggregation-Free Federated Learning, which exploits the High-energy Orthogonal Base (HOB) and the variance of a dataset in feature space. FedUSD is inspired by the observation that the principal component concentrates in the HOB while the residual component is independently reflected in the variance, regardless of the network. Based on this observation, we develop a method that optimizes synthetic data by matching both its HOB and its variance with those of the real data. We experimentally show that using the HOB and the variance to separately extract the principal and residual components is more effective than existing methods. We also theoretically prove that FedUSD achieves unbiased synthetic data and thus convergence. Without introducing any constraints, FedUSD yields significant improvements over the state of the art in global model performance under equivalent communication costs. For example, on the SVHN dataset with Dirichlet coefficient $\alpha=0.01$, FedUSD outperforms other methods by 6.74\% to 30.82\%.
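The matching idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not define the HOB formally, so the sketch assumes it can be approximated by the leading right singular vectors of a feature matrix that together capture a chosen share of the spectral energy, and it assumes the variance term is the per-dimension variance of the features. The function names `high_energy_basis` and `matching_loss` and the `energy` threshold are hypothetical.

```python
import numpy as np


def high_energy_basis(features, energy=0.9):
    """Orthonormal directions capturing `energy` of the spectral energy.

    Assumption: the HOB is approximated by the leading right singular
    vectors of the (n_samples, dim) feature matrix.
    """
    _, s, vt = np.linalg.svd(features, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest k whose cumulative energy reaches the threshold.
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    return vt[:k].T  # shape (dim, k), columns are orthonormal


def matching_loss(real_feats, syn_feats, energy=0.9):
    """Basis mismatch plus variance mismatch (hypothetical objective)."""
    real_basis = high_energy_basis(real_feats, energy)
    k = real_basis.shape[1]
    _, _, vt = np.linalg.svd(syn_feats, full_matrices=False)
    syn_basis = vt[:k].T
    # Compare subspaces through projection matrices, which are invariant
    # to the sign and ordering ambiguities of singular vectors.
    proj_real = real_basis @ real_basis.T
    proj_syn = syn_basis @ syn_basis.T
    basis_term = np.sum((proj_real - proj_syn) ** 2)
    # Per-dimension variance stands in for the residual component.
    var_term = np.sum((real_feats.var(axis=0) - syn_feats.var(axis=0)) ** 2)
    return float(basis_term + var_term)


rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1, 0.1, 0.1])
real = rng.normal(size=(200, 8)) * scales   # stand-in for real features
synthetic = rng.normal(size=(20, 8))        # stand-in for synthetic features

loss_same = matching_loss(real, real)
loss_diff = matching_loss(real, synthetic)
```

In this toy setup the loss is zero when the synthetic features equal the real ones and positive otherwise. In practice the features would come from a network and the synthetic samples would be updated by gradient descent on such a loss, which this sketch does not cover.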
Lay Summary: Many organizations and devices collect useful data, but they often cannot share it directly because of privacy, ownership, or communication concerns. Federated learning is a way to train a shared AI model while keeping the original data on each client. However, many federated learning methods still require clients to send model updates, and they may work poorly when different clients have very different data. A newer approach is to let each client create a small set of artificial examples and send only these examples to the server. The server then uses them to train the shared model. The main difficulty is that these artificial examples must accurately represent the original data. If they miss important patterns or fail to preserve natural variation, the final model can become biased and less accurate. In this paper, we propose FedUSD, a method for creating better artificial data in this setting. FedUSD separately preserves the main patterns of the data and the remaining variations among samples. In this way, the artificial data can better reflect the real data from different clients. Our experiments show that FedUSD improves the accuracy of the final shared model compared with existing methods under the same communication cost. The improvement is especially clear when clients have very different data distributions.
Link To Code: https://github.com/HCH-XDU/Dataset-Distillation-FL-FedUSD
Primary Area: Applications->Everything Else
Keywords: Federated Learning; Dataset Distillation; Data Heterogeneity
Originally Submitted PDF: pdf
Submission Number: 7834