We Generate What You Need: Efficient Data Supplement via Model Prediction Discrepancy for Heterogeneous Federated Learning

13 Sept 2025 (modified: 27 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Federated Learning, Distributed Machine Learning
TL;DR: We propose a data synthesis framework that identifies local-global prediction gaps to generate targeted samples, effectively mitigating data heterogeneity in federated learning
Abstract: One emerging approach to mitigating data heterogeneity in Federated Learning (FL) is to employ diffusion models to generate synthetic data for clients, thereby aligning local data distributions with the global distribution. Prior work has primarily focused on balance-oriented augmentation, which assumes a balanced global class distribution and thus generates samples of rare classes to rebalance each client's local dataset. In practice, however, global data distributions are often inherently imbalanced. For example, in weather forecasting, certain regions naturally experience more rainy days than sunny days, resulting in inherently imbalanced global training and testing data for those regions. Moreover, privacy and communication constraints in FL hinder the server's ability to accurately estimate the global distribution, rendering balance-oriented augmentation suboptimal. This raises a key, underexplored challenge: how can synthetic data be generated and selected to align local distributions with the true, yet unknown, global distribution? To address this challenge, we propose a novel framework, FedDPD. The key insight behind our approach is that a model's performance implicitly reflects the data distribution it was trained on. Based on this observation, we use the performance discrepancy between the local and global models to identify the regions in which each client's local dataset is lacking, and supply corresponding synthetic samples to those clients. Furthermore, we adapt the diffusion model to each client through a preference-optimization paradigm, enabling it to generate data that better aligns with the true global distribution and fills the specific gaps in the client's local data. Notably, our approach incurs no additional computational overhead for clients. Extensive experiments on multiple benchmarks demonstrate that FedDPD outperforms state-of-the-art methods, achieving up to a 3.82% improvement, regardless of whether the global distribution is balanced.
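To make the core idea concrete, the sketch below illustrates one plausible reading of the prediction-discrepancy step: compare per-class performance of the local and global models on a probe set, and allocate a client's synthetic-sample budget toward the classes where the local model lags. This is a minimal sketch, not the authors' implementation; the probe labels, the clipping of negative gaps, and the proportional budget rule are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of selecting which classes a
# client's synthetic data supplement should target, based on the per-class
# prediction discrepancy between the local and global models.

import numpy as np

def per_class_accuracy(preds: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Accuracy of `preds` against `labels`, computed separately for each class."""
    acc = np.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        acc[c] = (preds[mask] == c).mean() if mask.any() else 0.0
    return acc

def supplement_budget(local_preds: np.ndarray, global_preds: np.ndarray,
                      labels: np.ndarray, num_classes: int, total_budget: int) -> np.ndarray:
    """Split a synthetic-sample budget across classes where the local model lags the global one."""
    gap = per_class_accuracy(global_preds, labels, num_classes) \
        - per_class_accuracy(local_preds, labels, num_classes)
    gap = np.clip(gap, 0.0, None)          # only supplement where local underperforms
    if gap.sum() == 0:                     # local model matches the global model everywhere
        return np.zeros(num_classes, dtype=int)
    return np.round(total_budget * gap / gap.sum()).astype(int)

# Toy usage: 3 classes; the local model is weak on class 2, so the full
# budget of 100 synthetic samples is directed at that class.
labels       = np.array([0, 0, 1, 1, 2, 2, 2, 2])
global_preds = np.array([0, 0, 1, 1, 2, 2, 2, 1])
local_preds  = np.array([0, 0, 1, 1, 0, 1, 2, 1])
print(supplement_budget(local_preds, global_preds, labels, 3, total_budget=100))  # [0 0 100]
```

Note that this class-level view is only one instantiation; the abstract's phrasing ("regions where each client's local dataset is lacking") leaves room for finer-grained discrepancy measures, and the preference-optimization adaptation of the diffusion model is not covered by this sketch.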
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 4621