Effectiveness of Local Steps on Heterogeneous Data: An Implicit Bias View

ICLR 2026 Conference Submission 14765 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: local steps, distributed learning, implicit bias, overparameterization
TL;DR: We analyze the implicit bias of Local-GD and show that the aggregated global model resulting from Local-GD with an arbitrary number of local steps converges in direction exactly to the centralized model.
Abstract: In distributed training of machine learning models, gradient descent with local iterative steps is a popular method to mitigate the communication burden, commonly known as Local (Stochastic) Gradient Descent (Local-(S)GD). In the interpolation regime, Local-GD can converge to zero training loss. However, with many potential solutions corresponding to zero training loss, it is not known which solution Local-GD converges to. In this work we answer this question by analyzing the implicit bias of Local-GD for classification tasks with {\em linearly separable data}. In the case of highly heterogeneous data, it has been observed empirically that local models can diverge significantly from each other (also known as "client drift''). However, in the interpolation regime, our analysis shows that the aggregated global model resulting from Local-GD with an arbitrary number of local steps converges in direction exactly to the model that would result if all data were in one place (the centrally trained model). Our result gives the exact rate of convergence to the centralized model with respect to the number of local steps. We also obtain the same implicit bias with a learning rate independent of the number of local steps via a Modified Local-GD algorithm. Our analysis provides a new view to understand why Local-GD can still work very well with a very large number of local steps, even for heterogeneous data. Lastly, we discuss the extension of our results to Local SGD and to non-separable data.
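To make the setup concrete, below is a minimal sketch (not the authors' implementation) of Local-GD with logistic loss on linearly separable data split across two heterogeneous clients. All names (e.g. `local_gd`, `logistic_grad`) and the toy data are illustrative assumptions; the abstract's claim is that the direction of the aggregated global model matches that of the centrally trained model, regardless of the number of local steps.

```python
# Illustrative sketch of Local-GD: each client runs several local GD steps
# from the shared global model, then the server averages the local models.
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the logistic loss sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = np.clip(y * (X @ w), -30.0, 30.0)      # clip for numerical stability
    coeff = -y / (1.0 + np.exp(margins))              # per-example weight
    return (coeff[:, None] * X).sum(axis=0)

def local_gd(client_data, num_rounds=100, local_steps=50, lr=0.01, dim=2):
    w_global = np.zeros(dim)
    for _ in range(num_rounds):
        local_models = []
        for X, y in client_data:                       # heterogeneous client datasets
            w = w_global.copy()
            for _ in range(local_steps):               # local iterative steps
                w -= lr * logistic_grad(w, X, y)
            local_models.append(w)
        w_global = np.mean(local_models, axis=0)       # server aggregation
    return w_global

# Toy usage: two clients, each holding only one class (extreme heterogeneity),
# jointly linearly separable.
rng = np.random.default_rng(0)
X1, y1 = rng.normal(loc=[+2, +2], size=(20, 2)), np.ones(20)
X2, y2 = rng.normal(loc=[-2, -2], size=(20, 2)), -np.ones(20)
w = local_gd([(X1, y1), (X2, y2)])
print("global model direction:", w / np.linalg.norm(w))
```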
Primary Area: optimization
Submission Number: 14765