Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
License: CC BY 4.0
Keywords: pretraining, optimization, implicit bias
TL;DR: The implicit bias of the optimizer can yield the same pretraining loss but better downstream performance in modern LLM pretraining.
Abstract: The foundational capabilities of large language models are acquired during pretraining on internet-scale, highly heterogeneous data mixtures. In this work, we investigate a geometric question regarding the converged state of this pretraining process: Does the model converge to a common minimizer across all data sources (e.g., \cref{fig:cwa_illustration} (b)), or merely to a minimizer of the averaged loss (e.g., \cref{fig:cwa_illustration} (a))? We hypothesize that the geometric ``closeness'' of task-specific minima is intrinsically linked to downstream generalization. However, we find that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the \textit{Nexus} optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters demonstrate that Nexus boosts downstream performance, despite achieving nearly \textit{the same pretraining loss} (see \cref{fig:demo:benchmark}). On the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0\% accuracy improvement on complex reasoning tasks (e.g., GSM8K). Our findings challenge the reliance on pretraining loss as the sole proxy for model evaluation and highlight the critical role of optimizer implicit bias in unlocking downstream generalization.
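The distinction the abstract draws between a common minimizer and a minimizer of the averaged loss can be illustrated with a toy example. The sketch below is not the Nexus optimizer and is not taken from the paper; it assumes two hypothetical data sources whose losses are simple quadratics, 0.5 * ||theta - center||^2, and all function names are made up for illustration. It shows that at the minimizer of the averaged loss the per-source gradients can cancel each other (cosine similarity -1) when the per-source minima are far apart, whereas they all vanish when the minima coincide.

```python
import math


def grad_quadratic(theta, center):
    """Gradient of the toy per-source loss 0.5 * ||theta - center||^2."""
    return [t - c for t, c in zip(theta, center)]


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_minimizer(centers):
    """Minimizer of the mean of the toy quadratics: the mean of the centers."""
    n = len(centers)
    dim = len(centers[0])
    return [sum(c[i] for c in centers) / n for i in range(dim)]


def per_source_grads(theta, centers):
    """One gradient per (hypothetical) data source at the point theta."""
    return [grad_quadratic(theta, c) for c in centers]


# Case (a): per-source minima are far apart.
far_centers = [[-1.0, 0.0], [1.0, 0.0]]
theta_far = average_minimizer(far_centers)
g1_far, g2_far = per_source_grads(theta_far, far_centers)
avg_grad_far = [(a + b) / 2 for a, b in zip(g1_far, g2_far)]
sim_far = cosine_similarity(g1_far, g2_far)

# Case (b): both sources share one minimum (a common minimizer).
shared_centers = [[0.5, 0.5], [0.5, 0.5]]
theta_shared = average_minimizer(shared_centers)
g1_shared, g2_shared = per_source_grads(theta_shared, shared_centers)

print("far apart: averaged grad", avg_grad_far, "cosine", sim_far)
print("shared:    per-source grads", g1_shared, g2_shared)
```

In case (a) the averaged gradient is zero even though neither source is at its own minimum, which is the situation the abstract calls a minimizer of the averaged loss. In case (b) every per-source gradient is zero at the same point.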
Submission Number: 16