Abstract: Collaboration between edge devices has the potential to dramatically scale up machine learning (ML) through access to an unprecedented quantity of data. Federated learning (FL) is a collaborative learning framework in which clients learn from each other without sharing their private data. However, edge devices tend to have diverse data distributions, since they are naturally exposed to different data sources. This heterogeneity of the data, also known as non-IID data distributions, degrades the performance of FL. We study how data sharing among users can mitigate this performance degradation. Data sharing among users can occur naturally within a social group (e.g., friends, colleagues, and family) or can be incentivized by the platform based on different criteria. We test the performance gains of data sharing for several standard ML models and common datasets. Across these experiments, we empirically show that modest data sharing between members of a social group significantly boosts learning performance in the non-IID case. We also show that data sharing can, surprisingly, boost performance in the IID case. By normalizing the dataset sizes, we verify that this performance boost remains significant even when data sharing does not increase the number of data points per client. Data sharing is thus a simple and efficient technique for improving FL.
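The following is a minimal illustrative sketch, not taken from the paper, of how intra-group data sharing with dataset-size normalization might be simulated on a non-IID client partition. All function names, parameters, and the group assignment are hypothetical assumptions for illustration only.

```python
# Illustrative sketch (not the paper's implementation): simulating intra-group
# data sharing on a non-IID client partition, with dataset-size normalization so
# that sharing does not increase the number of data points per client.
import numpy as np

rng = np.random.default_rng(0)

def make_non_iid_clients(labels, n_clients, classes_per_client=2, samples_per_client=200):
    """Assign sample indices to clients so each client sees only a few classes (non-IID)."""
    clients = []
    all_classes = np.unique(labels)
    for _ in range(n_clients):
        chosen = rng.choice(all_classes, size=classes_per_client, replace=False)
        idx = np.flatnonzero(np.isin(labels, chosen))
        clients.append(rng.choice(idx, size=samples_per_client, replace=False))
    return clients

def share_within_group(clients, group, share_frac=0.1, normalize=True):
    """Each group member donates a fraction of its data to every other member.

    With normalize=True, each client is subsampled back to its original dataset
    size, so any benefit comes from added diversity rather than extra data points.
    """
    donations = {c: rng.choice(clients[c], size=int(share_frac * len(clients[c])),
                               replace=False) for c in group}
    new_clients = list(clients)
    for c in group:
        received = np.concatenate([donations[o] for o in group if o != c])
        merged = np.concatenate([clients[c], received])
        if normalize:
            merged = rng.choice(merged, size=len(clients[c]), replace=False)
        new_clients[c] = merged
    return new_clients

# Example: 10 clients, one social group of 3 users sharing 10% of their data.
labels = rng.integers(0, 10, size=10_000)          # stand-in for dataset labels
clients = make_non_iid_clients(labels, n_clients=10)
clients = share_within_group(clients, group=[0, 1, 2], share_frac=0.1)
```

In such a simulation, the shared-and-normalized partition would then be fed to a standard FL training loop (e.g., FedAvg) to compare accuracy against the unshared non-IID baseline.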