Aligning Distributionally Robust Optimization with Practical Deep Learning Needs

Dmitrii Feoktistov; Igor Ignashin; Andrey Veprikov; Nikita Borovko; Aleksandr Bogdanov; Savelii Chezhegov; Aleksandr Beznosikov

15 Sept 2025 (modified: 28 Jan 2026) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: optimization, distributionally robust optimization, deep learning

TL;DR: We propose an adaptive distributionally robust optimizer for DL, prove its convergence in the non-convex setting, and evaluate it empirically.

Abstract: Deep learning (DL) models often struggle with real-world data heterogeneity, such as class imbalance or varied data sources, because standard training methods treat all samples equally. Distributionally Robust Optimization (DRO) offers a principled approach by optimizing for a worst-case data distribution. However, a significant gap exists between DRO and current DL practice. DRO methods often lack adaptive parameter updates (like Adam), struggle with the non-convexity of neural networks, and are difficult to integrate with group-based weighting in standard mini-batch training pipelines. This paper aims to bridge this gap by introducing ALSO (Adaptive Loss Scaling Optimizer), a novel optimizer that integrates an adaptive, Adam-like update for the model parameters with an efficient, principled mechanism for learning worst-case data weights. Crucially, it supports stochastic updates for both model parameters and data weights, making it fully compatible with group-based weighting and standard DL training pipelines. We prove the convergence of the proposed algorithm for non-convex objectives, which is the typical case for DL models. Empirical evaluation across diverse DL tasks characterized by different types of data heterogeneity demonstrates that ALSO outperforms both traditional DL approaches and existing DRO methods.
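
To make the idea in the abstract concrete, the following is a minimal sketch of a generic group-weighted DRO step. It is not the ALSO algorithm from the paper, whose exact update rules are not given on this page. It assumes a textbook combination: an exponentiated-gradient (mirror ascent) update for the group weights and an Adam-style update for the model parameter, both computed from the same mini-batch. The toy model `y ~ theta * x`, the function names `dro_adam_step` and `run_demo`, the synthetic data, and all hyperparameter values are hypothetical choices made for illustration.

```python
import math
import random


def dro_adam_step(theta, pi, m, v, t, batch, lr=0.01, lr_pi=0.05,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """One stochastic step on a mini-batch of (x, y, group) triples.

    Toy model: y ~ theta * x with squared loss (a stand-in for a network).
    Illustrative only; not the update rule from the paper.
    """
    n_groups = len(pi)
    loss_sum = [0.0] * n_groups
    grad_sum = [0.0] * n_groups
    count = [0] * n_groups
    for x, y, g in batch:
        resid = theta * x - y
        loss_sum[g] += 0.5 * resid ** 2
        grad_sum[g] += resid * x
        count[g] += 1
    # Per-group mean loss and gradient; groups absent from the batch give 0.
    group_loss = [loss_sum[g] / count[g] if count[g] else 0.0
                  for g in range(n_groups)]
    group_grad = [grad_sum[g] / count[g] if count[g] else 0.0
                  for g in range(n_groups)]
    # Gradient of the pi-weighted objective with respect to theta.
    grad = sum(p * gg for p, gg in zip(pi, group_grad))
    # Exponentiated-gradient (mirror ascent) step: upweights high-loss
    # groups and keeps pi on the probability simplex.
    unnorm = [p * math.exp(lr_pi * gl) for p, gl in zip(pi, group_loss)]
    total = sum(unnorm)
    pi = [u / total for u in unnorm]
    # Adam-style moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, pi, m, v


def run_demo(steps=300, seed=0):
    rng = random.Random(seed)
    # Hypothetical data: group 0 (majority) has slope 1,
    # group 1 (about 10% of samples) has slope 3.
    data = []
    for _ in range(500):
        g = 1 if rng.random() < 0.1 else 0
        x = rng.gauss(0.0, 1.0)
        data.append((x, (3.0 if g else 1.0) * x, g))
    theta, pi, m, v = 0.0, [0.5, 0.5], 0.0, 0.0
    for t in range(1, steps + 1):
        batch = rng.sample(data, 32)
        theta, pi, m, v = dro_adam_step(theta, pi, m, v, t, batch)
    return theta, pi
```

Calling `run_demo()` returns the fitted slope and the learned group weights. The multiplicative weight update is used here because it keeps the weights non-negative and normalized without an explicit projection; other choices (for example, a regularized or projected update) would fit the same description.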

Primary Area: optimization

Submission Number: 5789
