Invariant risks without knowledge of the environment

Published: 09 Mar 2025, Last Modified: 11 Mar 2025, MathAI 2025 Oral, CC BY 4.0
Keywords: Machine learning, ERM, Empirical Risk Minimization, IRM, Invariant Risk Minimization, OOD, Out of distribution, Domain shift
TL;DR: The paper proposes a clustering-based approach to modeling environments in IRM, allowing the paradigm to be applied to tasks without an a priori data partition, together with a modification that simplifies the selection of the penalty hyperparameter in the loss function.
Abstract: Generalization under data shifts is one of the key challenges in machine learning. The invariant risk minimization (IRM) paradigm offers an approach that can improve the generalization of models under data shifts. Unfortunately, the approach is not without drawbacks. One factor limiting its widespread application is the requirement to partition the dataset into environments with different distributions, and producing the necessary subsets is often problematic due to the complexity of the data. To address this issue, we propose a clustering-based approach that makes the IRM paradigm applicable to any task, even without prior knowledge of the environments. Experiments demonstrate that using clusters as environments and training under the IRM paradigm on these environments improves the robustness of models to data shifts compared to training under the empirical risk minimization (ERM) paradigm. For a weather prediction task, the improvement was 10% in terms of the mean squared error (MSE) metric, while for pre-training a small decoder-only language model (LLM) the improvement was 75% on long texts in terms of perplexity. Furthermore, the paper proposes a modification of the invariant risk minimization paradigm that simplifies tuning the hyperparameter of the penalty term in the loss. This modification stabilizes the IRM training process and enhances model robustness compared to traditional IRM hyperparameter tuning. For the weather prediction task, the improvement was 10% in MSE, and for pre-training the decoder LLM, the improvement was 460% on long texts in terms of perplexity compared to classical IRM.
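
The pipeline described in the abstract (pseudo-environments obtained by clustering, followed by IRM training with a penalty on per-environment risks) can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration rather than the paper's implementation: clustering raw features with k-means, the IRMv1-style gradient penalty, and the names assign_environments, irm_loss, n_envs, and penalty_weight are all hypothetical choices made for the example.

```python
# Minimal sketch: clustering-based pseudo-environments + IRMv1-style training loss.
# Assumptions: k-means on input features as the environment model, MSE as the
# per-environment risk, and a fixed penalty_weight; the paper's actual method
# may differ in all of these choices.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def assign_environments(features, n_envs=4):
    """Cluster samples into pseudo-environments when true environments are unknown."""
    labels = KMeans(n_clusters=n_envs, n_init=10).fit_predict(features)
    return torch.as_tensor(labels)


def irm_penalty(preds, targets):
    """IRMv1 penalty: squared gradient of the risk w.r.t. a dummy scale fixed at 1."""
    scale = torch.ones(1, requires_grad=True, device=preds.device)
    risk = F.mse_loss(preds * scale, targets)
    grad = torch.autograd.grad(risk, [scale], create_graph=True)[0]
    return (grad ** 2).sum()


def irm_loss(model, x, y, env_labels, penalty_weight=1.0):
    """Sum of per-environment empirical risks plus the invariance penalty."""
    total_risk, total_penalty = 0.0, 0.0
    for e in env_labels.unique():
        mask = env_labels == e
        preds = model(x[mask])
        total_risk = total_risk + F.mse_loss(preds, y[mask])
        total_penalty = total_penalty + irm_penalty(preds, y[mask])
    return total_risk + penalty_weight * total_penalty
```

In this sketch, assign_environments would be called once on (a representation of) the training data, and irm_loss then replaces the plain ERM objective in the training loop; how the penalty weight is set or adapted is exactly the hyperparameter question the paper's proposed modification addresses.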
Submission Number: 44