layout: post
title: The Risks and Rewards of Invariant Risk Minimization
tags: [Machine Learning, Out-Of-Distribution Generalization, Causality]
authors: Anonymous

The Risks and Rewards of Invariant Risk Minimization

Spurious correlations are one of the most prominent pain points for building and deploying machine learning models. While formally defining them can be challenging, intuitive examples abound. A canonical example is image classification of cows versus camels. If the majority of training images containing cows have grass in the background while the images of camels have sand in the background, then the learned classifier may simply use the background color to make predictions rather than the properties of the animal itself. At test time, if predictions are made on images of cows on beaches, performance will badly degrade.

More impactful examples can be found in the medical domain. Oakden-Rayner (2020) studied the CXR14 dataset, a large public collection of chest X-ray images. The study found that 80% of the images labelled with pneumothorax (a collapsed lung) contained a chest drain, a common form of treatment, but no other visual signs of the condition. There are serious consequences if a classifier trained on this dataset only recognizes a pneumothorax in the presence of a chest drain and cannot detect untreated pneumothorax.

Spurious correlations arise in many other settings as well, such as natural language processing (e.g., Gururangan et al., 2018; Clark et al., 2019).

In this blog post, we describe popular and exciting ways to train invariant models that seek to solve spurious correlations. While promising, recent work (Rosenfeld et al., 2021) studies the types of fundamental guarantees that such approaches can offer, showing that these techniques may fail in a range of reasonable settings. Nevertheless, the promise of such approaches demands further study.

Google Image Search results for "cow" vs. "camel"

1. Thinking Environmentally: A Cure for Spurious Correlations

Invariant risk minimization (IRM) (Arjovsky et al., 2019) is a learning paradigm aimed at reducing the effect of spurious correlations, building on earlier work on causal inference using invariant prediction (Peters et al., 2016). The idea is to replace vanilla empirical risk minimization (ERM), which simply minimizes the training error averaged over the training samples, with an invariant formulation. The motivation to replace ERM is natural: ERM may learn all the correlations found in the training data, even those unrelated to the causal relationship of interest.

To deal with this obstacle, we need a way to inject information about such relationships into the training process. One way to do this is to assume that data may come from a number of different “environments”. Intuitively, if these environments are different, we may be able to distinguish between invariant and non-invariant features, enabling us to train models that generalize even on unseen environments.

What changes when the environment changes? In the animal example, the environmental features can be thought of as the background, while in the medical example, the environmental features include whether or not a chest drain is present. Note that these features (the background, the presence of treatment) are not causes of the variable of interest.

More formally, suppose we have training environments $\varepsilon_{tr} = \{e_1,e_2,\dots,e_E\}$ where each environment $e_i$ induces a distribution $p^{e_i}$ over features and labels $(x,y)$. When does ERM fail in this setting? Consider the following model from Arjovsky et al., 2019:

\[X_1 \leftarrow \text{Gaussian}(0,\sigma^2)\\ Y \leftarrow X_1 + \text{Gaussian}(0,\sigma^2) \\ X_2 \leftarrow Y + \text{Gaussian}(0,1)\]

Across different environments, the equations for $X_1$ and $X_2$ may change, as may the parameter $\sigma^2$. The true causal relationship is between $X_1$ and $Y$. If we regress $Y$ on $(X_1,X_2)$ using least squares regression, which is equivalent to solving

\[\min_{\hat{\alpha}_1,\hat{\alpha}_2}\mathbb{E}[(Y-\hat{\alpha}_1X_1-\hat{\alpha}_2X_2)^2],\]

we get the solution

\[\hat{Y} = \frac{1}{\sigma^2+1}X_1 + \frac{\sigma^2}{\sigma^2+1}X_2.\]

Therefore, for environments with large variance $\sigma^2$, our estimator $\hat{Y}$ will weigh $X_2$ more heavily than $X_1,$ but this will perform poorly on environments with small $\sigma^2$, and so, as expected, this approach is not invariant.
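
As a quick sanity check, these closed-form coefficients can be reproduced numerically. Below is a minimal NumPy sketch (our own, not from the paper); the sample size and the particular values of $\sigma^2$ are arbitrary.

```python
import numpy as np

def simulate_environment(sigma2, n=200_000, seed=0):
    """Sample (X1, X2, Y) from the structural equations above."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0.0, np.sqrt(sigma2), size=n)        # X1 ~ N(0, sigma^2)
    y = x1 + rng.normal(0.0, np.sqrt(sigma2), size=n)    # Y  = X1 + N(0, sigma^2)
    x2 = y + rng.normal(0.0, 1.0, size=n)                # X2 = Y + N(0, 1)
    return x1, x2, y

for sigma2 in [0.25, 1.0, 4.0]:
    x1, x2, y = simulate_environment(sigma2)
    X = np.column_stack([x1, x2])
    alpha, *_ = np.linalg.lstsq(X, y, rcond=None)        # OLS without intercept
    print(f"sigma^2={sigma2}: alpha_hat={alpha.round(3)}, "
          f"theory=({1 / (sigma2 + 1):.3f}, {sigma2 / (sigma2 + 1):.3f})")
```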

Given training data from multiple environments, a simple variation of ERM is to minimize the maximum risk across all environments in $\varepsilon_{tr},$ i.e., to minimize $\max_{e\in\varepsilon_{tr}}R^e(f).$ Here, $R^e(f) = \mathbb{E}_{(x,y) \sim p^e}\,\ell(f(x),y)$ is the risk of $f$ on environment $e$. However, it can be shown (see Arjovsky et al., 2019) that this is essentially the same as minimizing a weighted average of the training environment risks, so our predictor will still assign large weight to $X_2$ if the training environments have large variance. Another solution is needed.

Returning to our intuition that invariant features are meaningful across environments, we want to learn an estimator $f(x)$ that performs well on all environments. We can think of this as minimizing

\[R^{OOD}(f)= \max_{e \in \varepsilon_{all}}R^e{(f)},\]

where $\varepsilon_{all} \supset \varepsilon_{tr}$ is the set of all environments of interest, including ones never seen during training.
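
In code, this worst-case notion of risk is just a maximum of per-environment risks. A minimal sketch, where the `risk` function and the environment dictionary are hypothetical placeholders:

```python
def ood_risk(model, environments, risk):
    """Worst-case risk of `model` over a collection of environments.

    environments: dict mapping environment name -> (x, y) arrays
    risk: function (model, x, y) -> scalar empirical risk
    """
    return max(risk(model, x, y) for x, y in environments.values())
```

Of course, in practice we can only evaluate this maximum over environments we actually have data for; $\varepsilon_{all}$ is unobservable, which is exactly what makes the problem hard.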

2. IRM To The Rescue?

To recap, our goal is to learn a classifier that performs well on environments we have seen during training as well as environments we have not seen but where the causal relationship remains consistent.

Define the risk of an environment $e$ as

\[\mathcal{R}^e(\Phi,\beta) := \mathbb{E}_{(x,y)\sim p^e}\bigg[l(\sigma(\beta^T\Phi(x)),y)\bigg]\]

for a feature embedder $\Phi$ and classifier $\beta$. Then, the IRM objective is

\[\min_{\Phi,\hat{\beta}} \frac{1}{|\varepsilon_{tr}|}\sum_{e\in\varepsilon_{tr}}\mathcal{R}^e(\Phi,\hat{\beta}) \; \text{ s.t. } \; \hat{\beta} \in\text{argmin}_\beta \mathcal{R}^e(\Phi,\beta) \; \forall e\in\varepsilon_{tr} \tag{1}.\]

Intuitively, the idea is to find a feature representation such that the optimal classifier over those features is the same for every environment. This should encourage the learner to only use invariant features since a featurizer that considers environmental features should have different optimal classifiers across environments. However, the objective (1) is hard to solve in practice since each constraint is itself an optimization problem. Therefore, it is replaced by the approximate objective

\[\min_{\Phi,\hat{\beta}} \frac{1}{|\varepsilon_{tr}|} \sum_{e\in\varepsilon_{tr}}\bigg[\mathcal{R}^e(\Phi,\hat{\beta}) + \lambda ||\nabla_{\hat{\beta}}\mathcal{R}^e(\Phi,\hat{\beta})||^2_2 \bigg].\]

This objective is obtained by rewriting (1) as a penalized objective

\[L_{IRM}(\Phi,\hat{\beta}) = \sum_{e \in\varepsilon_{tr}}\Big[\mathcal{R}^e(\Phi,\hat{\beta}) + \lambda\, \mathbb{D}(\Phi,\hat{\beta},e)\Big]\]

where $\mathbb{D}(\Phi,\hat{\beta},e)$ measures how close $\hat{\beta}$ is to minimizing $\mathcal{R}^e(\Phi,\cdot).$ Taking $\mathbb{D}(\Phi,\hat{\beta},e)$ to be the squared gradient norm is a logical choice. See Arjovsky et al. (2019) for more details.
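
To make this concrete, here is a minimal PyTorch-style sketch of the penalized objective exactly as written above. It is our own illustrative code rather than the authors' reference implementation (which, notably, parameterizes the penalty with a fixed scalar "dummy" classifier rather than $\hat{\beta}$ itself).

```python
import torch
import torch.nn.functional as F

def irm_loss(phi, beta, environments, lam=1.0):
    """Penalized IRM objective: mean over environments of
    R^e(phi, beta) + lam * ||grad_beta R^e(phi, beta)||^2.

    phi: feature embedder (torch.nn.Module) mapping x to features of dimension d
    beta: classifier weights, a tensor of shape (d,) with requires_grad=True
    environments: list of (x, y) pairs, with y a float tensor of 0/1 labels
    """
    total = 0.0
    for x, y in environments:
        logits = phi(x) @ beta                                 # beta^T Phi(x)
        risk = F.binary_cross_entropy_with_logits(logits, y)   # R^e(phi, beta)
        # Squared norm of the gradient of the environment risk w.r.t. beta:
        # small when beta is close to optimal for this environment.
        grad = torch.autograd.grad(risk, beta, create_graph=True)[0]
        total = total + risk + lam * grad.pow(2).sum()
    return total / len(environments)
```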

3. IRM Is Not A Magic Bullet

The formulation above seems great: by simply modifying our loss function, we hope to encourage our algorithm to rely only on invariant features. Unfortunately, this is a bit too good to be true. Rosenfeld et al. (2021) show that, under reasonable conditions, IRM may fail.

In order to distinguish environmental and invariant features, let us assume a data model where $x= f(z_c,z_e)$ for an injective function $f$, where $z_c$ denotes the invariant features and $z_e$ denotes the environmental features. $z_c$ depends only on $y$, while $z_e$ may depend on the environment as well. In other words, the relationship between $y$ and $z_c$ remains the same across all environments, while the relationship between $y$ and $z_e$ may change. Returning to the animal classification example, if $x$ is a picture of a cow, we can think of $z_c$ as encoding the features of the cow and $z_e$ as encoding the features of the background. If $z_c$ and $z_e$ represent the pixels explicitly, then we could take $f$ to be the identity map. This is not always possible. To see why, consider a colored version of the MNIST dataset:

Colored MNIST (Nam et al., 2020)

This is just the MNIST dataset with the images from each class colored by a unique color. This color is then (perfectly) correlated with the label, but if the test environment permutes the colors, the correlation is spurious. In this setting, we can think of $z_c$ as representing the original grayscale image, while $z_e$ encodes the color. $f$ can no longer be the identity map, since the raw features (the pixels) all contain color information; instead, $f$ is a function combining pixel locations and color. In general, only $x$ is observed, while $z_c,z_e$ are hidden. It is the job of $\Phi$ to recover the optimal representation of the latent features. Ideally, $\Phi$ produces a representation of just $z_c$.
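
As a concrete illustration, here is a small NumPy sketch of how such a spurious color channel might be injected into grayscale digits. The palette, the correlation strength `p_flip`, and the function name are our own choices for illustration, not the exact construction used by Nam et al. (2020).

```python
import numpy as np

def colorize(images, labels, palette, p_flip=0.0, seed=0):
    """Turn grayscale digits of shape (n, 28, 28), values in [0, 1], into RGB
    images whose color is determined by the label; with probability p_flip a
    random color is used instead, so p_flip=0 gives a perfect (spurious)
    correlation between color and label."""
    rng = np.random.default_rng(seed)
    colors = palette[labels]                                  # (n, 3) per-label color
    flip = rng.random(len(images)) < p_flip
    colors[flip] = palette[rng.integers(0, len(palette), flip.sum())]
    # Broadcast: pixel intensity times color gives a colored digit on black.
    return images[..., None] * colors[:, None, None, :]

# A toy palette: one distinct RGB color per digit class.
palette = np.stack([np.linspace(1, 0, 10), np.linspace(0, 1, 10), np.full(10, 0.5)], axis=1)
```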

To simplify the setting, Rosenfeld et al. consider the following data generation model:

\[y = \begin{cases} 1 & \text{w.p. } \eta\\ -1 & \text{otherwise} \end{cases}\] \[z_c \sim \mathcal{N}(y\cdot \mu_c,\sigma_c^2I), \; z_e \sim \mathcal{N}(y\cdot \mu_e,\sigma_e^2I)\]

with $\mu_c \in \mathbb{R}^{d_c}$ and $\mu_e \in \mathbb{R}^{d_e}$. They then show that if $f$ and $\Phi$ are linear, IRM may fail whenever the number of environments satisfies $E \leq d_e.$ This is a very reasonable setting, since we often expect to have data collected from only a few distinct environments, while the dimension $d_e$ is often large.
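
A minimal NumPy sketch of this data model, with $f$ taken to be simple concatenation and with the dimensions, variances, and environment means chosen arbitrarily for illustration:

```python
import numpy as np

def sample_environment(mu_c, mu_e, sigma_c, sigma_e, eta=0.5, n=1000, seed=0):
    """Sample (x, y) with y in {-1, +1}, z_c ~ N(y * mu_c, sigma_c^2 I) and
    z_e ~ N(y * mu_e, sigma_e^2 I); here f just concatenates (z_c, z_e)."""
    rng = np.random.default_rng(seed)
    y = np.where(rng.random(n) < eta, 1.0, -1.0)
    z_c = y[:, None] * mu_c + sigma_c * rng.standard_normal((n, len(mu_c)))
    z_e = y[:, None] * mu_e + sigma_e * rng.standard_normal((n, len(mu_e)))
    return np.concatenate([z_c, z_e], axis=1), y

mu_c = np.ones(5)                                  # invariant mean, shared by all environments
environment_means = [np.ones(3), -np.ones(3)]      # E = 2 environments, d_e = 3
environments = [sample_environment(mu_c, mu_e, 1.0, 1.0, seed=i)
                for i, mu_e in enumerate(environment_means)]
```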

To understand this failure case, consider the figure below with $d_e =E=2$.

Linear Case with $d_e=E=2$. Dotted lines represent optimal decision boundaries over environmental features

In this case, by simply projecting the environmental features onto the first coordinate, we obtain a representation for which the optimal classifier (using both invariant and environmental features) is the same for both environments, and it achieves lower risk on the training set than the best classifier that uses only invariant features. However, if we test this classifier on a new environment $e_3$ where the relationship between $y$ and $z_e$ is reversed, e.g. $\mu_{e_3} = -\mu_{e_2},$ we would expect this "optimal" classifier to perform poorly. In general, the precise choice of $\Phi$ depends on the variances $\sigma^2_e$, but the idea is that when $E$ is small relative to $d_e$, it is possible to find a linear transformation $\Phi$ such that the optimal classifier over $\Phi(x)$ is the same across all environments.

Rosenfeld et al. also provide an example of how IRM can fail in the nonlinear case. Specifically, they define \(\mathcal{B}_r = \big[\cup_{e\in\varepsilon_{tr}} B_r(\mu_e)\big] \bigcup \big[\cup_{e\in\varepsilon_{tr}} B_r(-\mu_e)\big]\), i.e. the union of balls of radius $r$ around the environmental means and their negations, and choose

\[\Phi(x) = \begin{cases} \begin{bmatrix} z_c \\ 0 \end{bmatrix} & z_e \in \mathcal{B}_r\\ \begin{bmatrix} z_c \\ z_e \end{bmatrix} & z_e \notin \mathcal{B}_r\\ \end{cases}, \;\;\; \hat{\beta} = \begin{bmatrix} \beta_c\\ \hat{\beta}_e\\\beta_0 \end{bmatrix}\]

where $[\beta_c,\beta_0]$ is the optimal invariant classifier and $\hat{\beta}_e$ is the ERM classifier over the environmental features. In other words, this classifier behaves as the optimal invariant classifier on $\mathcal{B}_r$ and as ERM on $\mathcal{B}_r^c.$ Therefore, by choosing $r$ sufficiently large, the IRM objective will be small, since by Gaussian concentration the training points' environmental features fall inside $\mathcal{B}_r$ with high probability. However, if a test environment has an environmental mean far from those of the training set, the classifier will perform similarly to ERM, defeating the purpose of IRM.
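
A sketch of this piecewise featurizer, written as if the latent features $(z_c, z_e)$ were directly observable for clarity (in the actual construction they are recovered from $x$ through the inverse of $f$); the helper names are ours:

```python
import numpy as np

def phi(z_c, z_e, env_means, r):
    """Zero out the environmental features whenever z_e lies within distance r
    of any training-environment mean (or its negation); otherwise pass them through."""
    dists = [np.linalg.norm(z_e - sign * mu) for mu in env_means for sign in (1.0, -1.0)]
    z_e_out = np.zeros_like(z_e) if min(dists) <= r else z_e
    return np.concatenate([z_c, z_e_out])
```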

Interestingly, similar arguments apply to other objectives that have been proposed in addition to IRM, such as those of Krueger et al. (2021) and Xie et al. (2020).

4. The Future of Invariance-Aided Risk Minimization

These concerns raise important questions about invariant prediction and how approaches relying on it should be formulated. In some sense, the results in Rosenfeld et al., 2021 are intuitive: we should not expect a classifier to be guaranteed to perform well on test environments if it has not been exposed to similar environments at training time. Therefore, it remains important to think about what we can reasonably expect from an invariant learning framework, or what assumptions we may need to impose in order to achieve theoretical guarantees.

On the other hand, it is not clear how much counterexamples such as those above play into IRM's empirical behavior on more realistic and complex datasets. Gulrajani and Lopez-Paz (2020) compared a number of domain generalization algorithms, including IRM, and found that none of them substantially improve upon ERM. While pessimistic in some sense, this result is also still mysterious. Are the empirical weaknesses due to behavior similar to that in the theoretical constructions above, or to other reasons? Are they fixable? These are exciting questions demanding further investigation.

A further question is whether we should seek a generic invariant “replacement” for ERM at all! A small amount of human participation may well enable significant improvements in invariance. For example, suppose we have no access to any points from the test-time environment in Colored MNIST. Without many environments, ERM will struggle. However, we can easily ask a human user what type of information is likely to be causal or non-causal; even a non-expert user can specify that color (or line thickness) is not causal. Finding ways to inject this type of simple human suggestion is another promising future direction.


References

  1. Arjovsky, Martin, et al. “Invariant risk minimization.” arXiv preprint arXiv:1907.02893 (2019).

  2. Clark, Christopher, Mark Yatskar, and Luke Zettlemoyer. “Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases.” arXiv preprint arXiv:1909.03683 (2019).

  3. Gulrajani, Ishaan, and David Lopez-Paz. “In search of lost domain generalization.” arXiv preprint arXiv:2007.01434 (2020).

  4. Gururangan, Suchin, et al. “Annotation artifacts in natural language inference data.” arXiv preprint arXiv:1803.02324 (2018).

  5. Krueger, David, et al. “Out-of-distribution generalization via risk extrapolation (rex).” International Conference on Machine Learning. PMLR, 2021.

  6. Nam, Junhyun, et al. “Learning from failure: Training debiased classifier from biased classifier.” arXiv preprint arXiv:2007.02561 (2020).

  7. Oakden-Rayner, Luke. “Exploring Large-scale Public Medical Image Datasets.” Academic Radiology 27.1 (2020): 106-112.

  8. Peters, Jonas, Peter Bühlmann, and Nicolai Meinshausen. “Causal inference by using invariant prediction: identification and confidence intervals.” Journal of the Royal Statistical Society. Series B (Statistical Methodology) (2016): 947-1012.

  9. Rosenfeld, Elan, Pradeep Ravikumar, and Andrej Risteski. “The risks of invariant risk minimization.” International Conference on Learning Representations (2021).

  10. Xie, Chuanlong, et al. “Risk variance penalization: From distributional robustness to causality.” arXiv preprint (2020).