Keywords: machine learning, distributed machine learning, federated learning
TL;DR: We propose the concept of precision collaboration, which leverages a pre-learned manifold to avoid negative transfer in federated learning.
Abstract: In distributed learning environments such as federated learning, data heterogeneity across clients is a key challenge that often leads to suboptimal model performance and convergence issues. Most prior efforts address data heterogeneity by assuming a hypothetical clustering structure or a consistent information-sharing mechanism. However, owing to the inherent complexity and diversity of real-world data, these assumptions are often largely violated. In this work, we argue that information sharing in real collaboration networks is mostly fragmented: distribution overlaps are not consistent but scattered among local clients. We propose the concept of Precision Collaboration, which refers to precisely identifying the informative data in other clients while carefully avoiding the negative transfer they may induce. In particular, we propose to pre-learn a global manifold that simultaneously infers the local data manifolds and estimates the exact local data densities. The learned manifold serves to precisely identify the data shared by other clients, and the exact likelihood estimates allow samples to be generated from the manifold precisely. Our pre-training strategy enables reusable and scalable model learning, especially under an ongoing influx of new clients joining the network. Experiments show that our method effectively identifies favorable data in other clients without compromising privacy, and significantly outperforms baselines on benchmarks and a real-world clinical dataset.
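The abstract does not specify the density model, so the following is only a minimal sketch of the selection step it describes: a client screens another client's shared data by exact log-likelihood under a pre-learned global density model (e.g., a normalizing flow), discarding low-likelihood, off-manifold samples to avoid negative transfer. The names `global_manifold`, `select_informative`, and the threshold `tau` are illustrative assumptions, and a simple Gaussian stands in for the actual pre-trained model.

```python
import torch

# Hypothetical pre-learned global density model: any model exposing an
# exact log_prob(x), e.g. a normalizing flow trained during pre-training.
# A multivariate Gaussian stands in here purely for illustration.
global_manifold = torch.distributions.MultivariateNormal(
    loc=torch.zeros(8), covariance_matrix=torch.eye(8)
)

def select_informative(candidates: torch.Tensor, tau: float) -> torch.Tensor:
    """Keep only samples whose exact log-likelihood under the global
    manifold exceeds tau; low-likelihood samples are treated as
    off-manifold and excluded to avoid negative transfer."""
    log_p = global_manifold.log_prob(candidates)  # exact log-density per sample
    return candidates[log_p > tau]

# Usage: a client screens a batch shared by another client before reuse.
remote_batch = torch.randn(32, 8)            # stand-in for shared data
kept = select_informative(remote_batch, tau=-15.0)
print(f"kept {kept.shape[0]} / {remote_batch.shape[0]} samples")
```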
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 16287