$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

Benjamin Thérien; Charles-Étienne Joseph; Boris Knyazev; Edouard Oyallon; Irina Rish; Eugene Belilovsky

Published: 26 Jan 2026, Last Modified: 11 Feb 2026

Venue: ICLR 2026 Poster

License: CC BY 4.0

Keywords: Learned Optimizer, Meta Generalization, MuP, Maximal Update Parameterization

TL;DR: We propose a new way to meta-train learned optimizers that, for the first time, allows them to generalize from small meta-training tasks to large unseen tasks.

Abstract: Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (*meta-generalize*), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($\mu$P) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for $\mu$-parameterized LOs ($\mu$LOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (SP) using the same compute budget. We also empirically observe that, compared to SP LOs, $\mu$LOs exhibit unexpectedly improved meta-generalization to deeper networks ($5\times$ the depth seen in meta-training) and surprising generalization to much longer training horizons ($25\times$ the horizon seen in meta-training).

Supplementary Material: zip

Primary Area: transfer learning, meta learning, and lifelong learning

Submission Number: 5246
