You Only Train Once: Gradient-Based Loss Hyperparameter Optimization

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: optimization, hyperparameters, losses, weights, regularization, gradient descent
TL;DR: We optimize the loss-weight hyperparameters of neural networks in one shot via standard backpropagation, treating them as regular network parameters.
Abstract: Creating and optimizing a learned model inevitably involves multiple training runs, potentially featuring different architectural designs, input and output encodings, and losses. Our method, You Only Train Once (YOTO), limits training to a single run with respect to the last of these aspects, the selection and weighting of losses. It does so by optimizing the loss-weight hyperparameters of learned models in one shot via standard gradient-based optimization, treating these hyperparameters as regular network parameters and learning them jointly. To this end, we leverage the differentiability of the composite loss formulation widely used to optimize multiple empirical losses simultaneously. We model this formulation as a novel layer, parameterized with a softmax operation that satisfies the inherent positivity constraints on loss hyperparameters while avoiding degenerate empirical gradients. We complete our joint end-to-end optimization scheme with a novel regularization loss on the learned hyperparameters, which models a uniformity prior among the employed losses while ensuring boundedness of the identified optima. We demonstrate the efficacy of YOTO in jointly optimizing loss hyperparameters and regular model parameters in one shot by comparing it to the commonly used grid search and random search across transformers and CNNs on two representative vision tasks, namely the regression task of 3D estimation and the classification task of semantic segmentation, showing that it consistently outperforms both baselines on unseen test data at a fraction of the compute.
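The abstract describes the mechanism only at a high level; the PyTorch sketch below illustrates one plausible instantiation. The class name CompositeLossLayer, the KL-to-uniform form of the regularizer, and the reg_strength parameter are our assumptions for illustration, not details given by the authors; only the softmax parameterization of the weights and the existence of a uniformity regularizer come from the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositeLossLayer(nn.Module):
    """Hypothetical sketch of a softmax-parameterized composite-loss layer.

    Loss weights are produced by a softmax over learnable logits, so they are
    positive and bounded by construction. A regularizer (assumed here to be
    KL divergence to the uniform distribution) encodes the uniformity prior
    mentioned in the abstract.
    """

    def __init__(self, num_losses: int, reg_strength: float = 0.1):
        super().__init__()
        # Unconstrained logits, learned jointly with the model's parameters.
        self.logits = nn.Parameter(torch.zeros(num_losses))
        self.reg_strength = reg_strength

    def forward(self, losses: list[torch.Tensor]) -> torch.Tensor:
        # Softmax guarantees positive weights that sum to one.
        weights = F.softmax(self.logits, dim=0)
        composite = sum(w * l for w, l in zip(weights, losses))
        # Assumed uniformity prior: KL(uniform || weights) pulls the learned
        # weights toward equal weighting and keeps the logits bounded.
        uniform = torch.full_like(weights, 1.0 / weights.numel())
        reg = F.kl_div(weights.log(), uniform, reduction="sum")
        return composite + self.reg_strength * reg
```

Under this reading, training proceeds by handing both the model's parameters and the layer's logits to a single optimizer, so one backpropagation run updates the loss weights and the model weights together, which is the "one shot" the title refers to.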
Primary Area: optimization
Submission Number: 14207