%!TEX root = ../sublime-text.tex
\label{sec:preliminaries}
% \textbf{Invariant Risk Minimization (IRM):}
% \abcomment{This section has no subsections ... that makes it harder to read. Try breaking into sub-sections or use para headings at least a few times}
\subsection{IRM Setting}
We denote a training set $\cD = \{\cD^e\}_{e\in \cE}$ composed of environmental training datasets $\cD^e \coloneqq \{(\vx^e_i, y_i )\}^{n_e}_{i=1}$, $\vx^e_i \in \cX \subseteq \RR^p, y_i \in \cY \subseteq \RR$. Each point 
% $\vx^e_i$ and label $y_i$ pair 
is drawn i.i.d.~from an environmental distribution $P^e(\vx^e,y)$.
Each environmental dataset $\cD^e$ has $n_e$ points, for a total of $n = \sum_{e\in \cE} n_e$ points.
% We denote the total number of training environments $E \coloneqq \ds{\cE_{tr}} \ge 2$. 
% \abdelete{IRM aims to find a predictor $f : \cX \to \RR$ that performs well even on unseen environments $e_{te} \in \cE$, when $\cE_{tr}\subset \cE$. } 
In the IRM paradigm outlined by \citet{arjovsky2020invariant}, the goal is to find a predictor $f : \cX \to \cY$, defined as $f (\vx) = \vv^\top \Phi(\vx)$, with a linear component $\vv \in \RR^{d}$ and a feature extractor 
% $\Phi: \cX \to \cH$. 
$\Phi: \cX \to \RR^{d}$. 
% For simplicity, we let $d' = d$.
% \abcomment{This can lead to confusion. Are we using $\Phi(x)=x$? Otherwise, are we restricting the model in some way?}
% \abcomment{For the generative model in (2), the $x$ features may really correspond in $\Phi(x)$ for a layered model ... we should make this clear early on.}
% \abcomment{why would the risk not depend on $\Phi(x)$ in general?} 

The mapping $\Phi$ is said to be invariant if there exists a $\vv$ such that the risk of $f(\vx) = \vv^\top \Phi(\vx)$
is minimized in all environments simultaneously.
Specifically, we define the population risk $\cR^e(\vv) = R^e(\vv ^\top \Phi(\vx^e))= \EE^e[\ell(f(\vx^e), y)]$ 
and empirical risk $\hat \cR^e(\vv) = \sum_{i=1}^{n_e} \ell(f(\vx^e_i), y_i)$, per environment.
% Overloading the notation of $\Phi$ to also represent the output of the featurizer $\Phi(\vx)$, 
The IRM formulation looks for the best $(\Phi, \vv)$ that minimizes the following constrained problem:
\begin{equation}
\label{eqn:irm}
\begin{aligned}
  &\min _{\substack{\Phi: \mathcal{X} \rightarrow \RR^d \\ \vv: \RR^d \rightarrow \mathcal{Y}}} 
  & & \sum_{e \in \mathcal{E}} R^e(\vv ^\top \Phi
  (\vx^e)
  ), \\
  &\text{subject to} 
  & & \vv \in \underset{\vv^e: \RR^d \rightarrow \mathcal{Y}}{\arg \min } R^e\ps{(\vv^e)^\top \Phi(\vx^e)}
   \quad \forall e \in \mathcal{E}.
\end{aligned}
\end{equation}
% \jdcomment{Should we just delete this? Maybe not if want to draw connection to minimax}
We consider a generative model in the style of previous lines of work in IRM \citep{rosenfeld2020risks, ahuja_2021_irm_ib_bottleneck, zhouSparseInvariantRisk2022} that explicitly has invariant, environmental (``non-invariant''), and random features. 
In the terminology of \citet{ahuja2022invariance}, these are confounder, or anti-causal, models, in which $P^{e_1}(y \mid \Phi(\vx^e) ) \ne P^{e_2}(y \mid \Phi(\vx^e) )$ for environments $e_1 \ne e_2$ when $\Phi(\vx^e) = \vx^e$.
% \abcomment{use of $I$ is unclear, $\Phi$ is a function}. 

\subsection{Data Generation}
\label{sub:data_generation}
% and consider the following model:
% We rename the variable $\Phi $
% We are specifically interested in finding a linear $f$ 
%  Previous works  have provided analyses that show failure cases of IRM, even in the linear case, and we extend the overparameterized linear model in \citet{zhouSparseInvariantRisk2022} specifically to highlight the hazard of overparameterization.
% \jdcomment{\citep{ahuja2022invariance}  does a nonasymptotic analysis, but on polynomial hypothesis class, and need a polynomial number of training environments.
% Although the original IRM presented 
% Theoretical works in IRM have pointed out flaws both in regression and cla
% can fail when certain assumptions on the data generation process are not met 
% In order to explore the non-asymptotic regime of the overparamterization-prone 
% % We consider an anti-causal setting \abcomment{has not been defined; also, what is the motivation?} 
% % Although previous works \citep{arjovsky2020invariant,rosenfeld2020risks} have shown the efficacy of IRM 
% Many works immediately following \citep{arjovsky2020invariant} have begun to examien the 
% There are some that specifically look at sample analysis. \citep{lin2023spurious, zhouSparseInvariantRisk2022}. 
% The latter \citep{zhouSparseInvariantRisk2022}  analyzes the case where samples are limited, but are missing one part in the analysis.
% We build on the problem setting introduced by \citet{zhouSparseInvariantRisk2022}, 
% a regression model with strong correlation between spurious features \abcomment{strictly speaking, those are not spurious features, see footnote in the Risks of IRM paper} and label compared to previous IRM settings \citep{arjovsky2020invariant, rosenfeld2020risks, ahuja2022invariance}.
% We will present a non-asymptotic analysis of the overparameterized regime. 
IRM struggles to discover invariant data representations in the overparameterized regime, where the number of model parameters exceeds the size of the training set \citep{li2018overparam,pmlr-v97-allen-zhu19a}. Even in the simple linear model introduced by \citet{zhouSparseInvariantRisk2022},
% ---where the label is determined by a small number of invariant features and influenced by many environmental and random features---
% , which we state in \Cref{eqn:problem-setting},
unmodified IRM fails to recover the underlying invariant structure.
Because the data representation $\Phi(\vx^e)$ may not completely isolate the invariant features, we are interested in finding the subset of invariant features in the data representation. 

We let $\vx^e = \Phi(\vx^e)$, and work directly with the representation. 
This reflects the interpretation that $\vx^e$ is the output of the all-but-last layer of a deep neural network, which may have captured non-invariant features.
% Then, we use a generalized version of their model to illustrate adjusted, correct bounds for the sample complexity of sparsity constrained invariant learning.
Then, for a given sample $(\vx^e,y)$ drawn from any environment $e\in\cE$, write the feature vector as a concatenation of invariant, spurious, and random feature blocks, i.e., $(\vx^e)^\top = [\vx_{\inv}^\top, (\vx_{s}^e)^\top, \vx_{r}^\top]$, for $\vx^e \in \RR^d$ and $\vx_{\inv} \in \RR^{d_\inv}, \vx_{s}^e\in \RR^{d_s},  \vx_{r}\in \RR^{d_r}$, where $d = d_\inv + d_s + d_r$.
Although the term ``spurious'' formally refers to features that are not caused by the label yet are strongly correlated with it \citep{rosenfeld2020risks}, we follow common usage and also apply it to features that are caused by the label.
We use the superscript $e$ to denote a \textit{dependency} on the environment to which the feature belongs; a variable without the superscript (e.g., $\vx_{\inv}$ or $\vx_{r}$) is independent of the environment. 
The dependencies are illustrated in \Cref{fig:gen-model}.
We use $\odot$ to represent the Hadamard (element-wise) product between two vectors of the same length $d$, i.e., $(\vv \odot \vw)_i = v_i w_i\ \forall i \in [d]$. 
% For a given sample $(\vx^e_i,y_i)$ we can write $(\vx^e_i)^\top = [\vx_{i,\inv}^\top, (\vx_{i,s}^e)^\top, \vx_{i,r}^\top]$, for $\vx^e \in \RR^d$ and $\vx_{i,\inv} \in \RR^{d_\inv}, \vx_{i,s}^e\in \RR^{d_s},  \vx_{i,r}\in \RR^{d_r}$. When ob
The generative model is as follows: 
% \abcomment{ideally, we want to draw as graphical model}
\begin{equation}
\label{eqn:problem-setting}
\begin{aligned}
& y=\gamma^{\top} \vx_{\inv}+\epsilon_{\inv}~, \\
& \vx_s^e=y \vzeta_s+\valpha^e \odot \vepsilon_{s}~, \\
& \vx_r=\vzeta_r \odot \vepsilon_{r}~.
\end{aligned}
\end{equation}

\begin{figure}[t]
\begin{center}
\begin{tikzpicture}[scale=0.13]
\tikzstyle{every node}+=[inner sep=0pt]
\draw [black] (3.5,-16.2) circle (2.9);
\draw (3.5,-16.2) node {$\vx_\inv$};
\draw [black] (15.9,-16.2) circle (2.9);
\draw (15.9,-16.2) node {$\vx_s^e$};
\draw [black] (28.1,-16.2) circle (2.9);
\draw (28.1,-16.2) node {$\vx_r$};
\draw [black] (3.5,-3.1) circle (2.9);
\draw (3.5,-3.1) node {$y$};
\draw [black] (15.9,-3.1) circle (2.9);
\draw (15.9,-3.1) node {$e$};
\draw [black] (3.5,-13.3) -- (3.5,-6);
\fill [black] (3.5,-6) -- (3,-6.8) -- (4,-6.8);
\draw [black] (5.49,-5.21) -- (13.91,-14.09);
\fill [black] (13.91,-14.09) -- (13.72,-13.17) -- (12.99,-13.86);
\draw [black] (15.9,-6) -- (15.9,-13.3);
\fill [black] (15.9,-13.3) -- (16.4,-12.5) -- (15.4,-12.5);
\end{tikzpicture}
\caption{Causal relationship between observed variables in the generative model by \citet{zhouSparseInvariantRisk2022}. All variables are observed at training time, but the type (invariant, spurious, or random) of each individual feature in $\vx^e$ is unknown.}
\label{fig:gen-model}
\end{center}
\end{figure}
As discussed in \citet{zhouSparseInvariantRisk2022}, the label $y$ is generated from a fixed vector $\gamma \in \RR^{d_\inv}$, which is invariant across environments.
Spurious features depend on both the labels as well as the environment; the variable $\valpha^e \in \RR^{d_s}$ controls the environment-dependent noise in each spurious feature and $\epsilon_\inv, \vepsilon_s, \vepsilon_r $ are independent noise variables added to the system. 
We assume they are sub-Gaussian and centered,
and we are interested in the regime in which $d_s, d_r$ are very large.
% and we are concerned with scaling with large $d_s,d_r$.
We additionally introduce scaling parameters $\vzeta_s \in \RR^{d_s}$ and $\vzeta_r \in \RR^{d_r}$. The vector $\vzeta_s$ controls the strength of the correlation between each spurious feature $[\vx^e_s]_j$, $j\in [d_s]$,
% \abcomment{what is $x_i$?} 
and the label, $y$. 
Likewise, $\vzeta_r$ determines the scale of the random features. 
These scaling parameters not only allow more variation across different features and environments, but also appear in the generalization gap, as discussed in Section~4.
% \abdelete{These modifications from the the model provided by \citet{zhouSparseInvariantRisk2022} allow for more variation across different features and environments.} 
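To illustrate the roles of $\vzeta_s$ and $\valpha^e$, consider the following short calculation. It is only a sketch, under assumptions not spelled out above: $\vepsilon_s$ is independent of $y$, and all second moments involved are finite. For any coordinate $j \in [d_s]$, \Cref{eqn:problem-setting} gives
\begin{equation*}
\begin{aligned}
\EE^e\big[[\vx^e_s]_j\, y\big] &= [\vzeta_s]_j\, \EE[y^2] + [\valpha^e]_j\, \EE\big[[\vepsilon_s]_j\, y\big] = [\vzeta_s]_j\, \EE[y^2], \\
\EE^e\big[[\vx^e_s]_j^2\big] &= [\vzeta_s]_j^2\, \EE[y^2] + [\valpha^e]_j^2\, \EE\big[[\vepsilon_s]_j^2\big].
\end{aligned}
\end{equation*}
Under these assumptions, the cross-moment with the label is set by $\vzeta_s$ alone and is shared by all environments, while the environment enters only through the noise scale $\valpha^e$.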

% \begin{remark}
% If we assume $\vzeta_s = \vone^{d_s}$ and $\vzeta_r = \vone^{d_r}$, we get the original linear model by \citet{zhouSparseInvariantRisk2022}. 
% However, this will yield sample complexity and estimation error bounds which are dimension-dependent, i.e., dependent on $d_{\inv}, d_s$, and $d_r$. 
% % \jdcomment{Not exactly. This dependency came from $\err(1/\delta, n)$, which is part of the $\le 0$ from our analysis now.}
% {\color{blue} To motivate variable $\vzeta_s$ as an example, consider that for $d_s$ features, the size of the data $\|\vx^e\|_2$ is $O(\sqrt{d_s})$ when $\vzeta_s = \vone$.
% If we instead let the scaling parameter $\vzeta_s$ be changed, we allow different spurious features to correlate differently with labels. 
% In addition to being a substantially more realistic assumption on the data, it allows us to create scale-dependent bounds. 
% }
% This cost that may be reduced to as low as $O(\frac{d_\inv}{d_s + d_r})$  when instead generating the data with a fixed $\|\vzeta_s^e\|_2^2$, and the complete case analysis is discussed in detail in \Cref{sec:technical}.
% \end{remark}

% \abedit{The reason for introducing the scaling parameters is that we want to assume the basic noise random variables $\epsilon_\inv, \vepsilon_s, \vepsilon_r $ are sub-Gaussian. Then, assuming $\vzeta_s$ and $\vzeta_r$ be all ones vectors will yield sample complexity and estimation error bounds which are dimensionality dependent, i.e., dependent on $d_{\inv}, d_s$, and $d_r$. Instead, our analysis will demonstrate that such bounds are in fact scale dependent, e.g., scale of the correlation, scale of the noise, etc., which reduces to dimensionality under simple special cases, e.g., all ones vectors.} 
% \abcomment{the previous bit can be a remark}

% \jddelete{while also limiting the size of data, $\Ds{\vx^e}_2$}
% \abcomment{reader has no idea at this point why this norm is relevant to anything. This will lead to confusion. If we want to discuss this here, we have to discuss how scaling of features matter in say (standard) least squares regression, what happens if we do/do not do z-scoring of features, how the scaling aspect has played out }.
% \jdcomment{Expand more on ``size" vs ``dimensionality" here?}
% Assume $\Ds{\gamma}_2 = 1$, $\Ds{\vzeta_s}_2 = c_s\ \forall e \in \cE$, $\Ds{\valpha^e}_2= c_a\ \forall e \in \cE$, and $\Ds{\vzeta_r}_2 = c_r$  for positive constants $c_s, c_a, c_r$. 

% \jdcomment{The variables table \Cref{tbl:params} is too big, should I kill columns to make it fit?}
\begin{table}
\caption{{List of variables for the generative model with invariant features. \mbox{*} indicates newly introduced variables. Column header Dim. is short for dimensionality.}}
\label{tbl:params}
\begin{center}
\begin{tabular}{llll} 
\toprule
    \textbf{Variable}  & \textbf{$L_2$ norm} & \textbf{Dim.} & \textbf{Definition} \\ \midrule
    $\gamma$ & 1 & $d_\inv$ & Ground truth \\ 
    $\epsilon_\inv$  & -  & $ 1 $& Label noise \\ 
    $\vzeta_s$ & $c_s$ & $d_s $ & *Label correlation \\ 
    $\valpha^e$ & $c_a$ & $d_s$ & Spurious noise \\
    $\vepsilon_s$ & - & $d_s$ & Sub-Gaussian noise\\
    $\vzeta_r$  & $c_r$ & $d_r$ & *Noise scale
     \\
    $\vepsilon_r$  & - & $d_r$ & Sub-Gaussian noise
    %  \\\midrule
    % $\vx_\inv$ & Yes  & 1& $d_\inv$ & Invariant feature block \\ 
    % $\vx_s$  & No & See \Cref{lemma:norm-x} & $d_s$ & Spurious feature block \\
    % $\vx_r$ & Yes & See \Cref{lemma:norm-x} & $d_r$& Random feature block \\
    \\\bottomrule
    % 1.373 & -146.6 & -137.6 \\
    %  0.343 & 133.2  & 152.4  \\
    %  0.119 & 168.5  & -161.1 \\
    %  0.08  & 25.6   & 90     \\ \midrule
    %  0.097 & -175.6 & -114.7 \\
    %  0.063 & 22.3   & 122.5  \\
    %  0.039 & 141.6  & -122   \\
    %  0.04  & -35.7  & 90     \\ \midrule
    %  0.045 & 133.3  & -106.3 \\
    %  0.034 & -69.4  & 110.9  \\
    %  0.025 & 92.3   & -109.3 \\ \bottomrule
\end{tabular}
\end{center}
\end{table}

% \begin{table*}
% \caption{List of some generative model parameters. \mbox{*} indicates new features introduced in this work.}
% \label{tbl:params}
% \begin{center}
% \begin{tabular}{lllll} 
% \toprule
%     \textbf{Variable}  & \textbf{Invariant} & \textbf{$L_2$ norm} & \textbf{Length} & \textbf{Definition} \\ \midrule
%     $\gamma$ & Yes & 1 & $d_\inv$ & Ground truth \\ 
%     $\epsilon_\inv$ & Yes & -  & $ 1 $& Ground truth \\ 
%     $\vzeta_s$ & Yes & $c_s$ & $d_s $ & *Label correlation in spurious \\ 
%     $\valpha^e$ &  No & $c_a$ & $d_s$ & Spurious noise controller\\
%     $\vepsilon_s$ &  No & - & $d_s$ & Sub-Gaussian noise\\
%     $\vzeta_r$ & Yes & $c_r$ & $d_r$ & *Noise scale
%      \\
%     $\vepsilon_r$ & Yes & - & $d_r$ & Sub-Gaussian noise
%     %  \\\midrule
%     % $\vx_\inv$ & Yes  & 1& $d_\inv$ & Invariant feature block \\ 
%     % $\vx_s$  & No & See \Cref{lemma:norm-x} & $d_s$ & Spurious feature block \\
%     % $\vx_r$ & Yes & See \Cref{lemma:norm-x} & $d_r$& Random feature block \\
%     \\\bottomrule
%     % 1.373 & -146.6 & -137.6 \\
%     %  0.343 & 133.2  & 152.4  \\
%     %  0.119 & 168.5  & -161.1 \\
%     %  0.08  & 25.6   & 90     \\ \midrule
%     %  0.097 & -175.6 & -114.7 \\
%     %  0.063 & 22.3   & 122.5  \\
%     %  0.039 & 141.6  & -122   \\
%     %  0.04  & -35.7  & 90     \\ \midrule
%     %  0.045 & 133.3  & -106.3 \\
%     %  0.034 & -69.4  & 110.9  \\
%     %  0.025 & 92.3   & -109.3 \\ \bottomrule
% \end{tabular}
% \end{center}
% \end{table*}
\subsection{Selecting Invariant Features}
% \jdcomment{workshop subsection name}
Presented with a large feature vector dominated by spurious and random features, we want to find a model $f(\vx^e) = \vv ^\top \vx^e$ that is invariant across $\vx^e$ drawn from different $e\in \cE$ as in \Cref{eqn:problem-setting}. 
By the IRM paradigm \citep{arjovsky2020invariant, rosenfeld2020risks}, this can only be achieved if $f(\vx^e) $ depends only on invariant features, 
i.e., the support of $\vv$ is a subset of the invariant features $\vx_\inv$. Thus, we formulate the problem in terms of subsets of features.
% we are interested in examining which features are used in the final predictor, $\vv$

Formally, 
% \abcomment{rephrase, may confuse the reader ... e.g., used by whom. We have not discussed estimators yet}. 
% Instead of decoupling the predictor as $f = \vv(x) ^\top \Phi(x)$ \abcomment{what is $v(x)$? first use}
% with a feature mask $\Phi(\vx) \coloneqq \vm \odot \vx$ for $\vm \in \{0,1\}^d$, we can look at a \textbf{sparse predictor} $\vv$ with $\Ds{\vv}_0 < d$ with implicit feature selection. 
% \abdelete{In this case,} \abcomment{use another sentence to drive the point home ... we do not need a binary selector $m$ or need $\Phi(x)$ to be sparse}
let $S$ be a subset of features, $S \in 2^d$, that represents the footprint for $\vv$.
% \abcomment{is this $v$? then lets call it that} 
We denote by $\Sp(S)$ the set of all predictors whose support is contained in $S$:
\begin{equation}
\Sp(S) \coloneqq \{ \vv \in \RR^d : v_i = 0 ~\forall i \notin S \}.
\end{equation}
That is, $\Sp(S)$ contains every $\vv$ that takes arbitrary values on the coordinates in $S$ and is zero elsewhere.  
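For instance, with $d = 4$ and $S = \{1, 3\}$ (values chosen purely for illustration),
\begin{equation*}
\Sp(S) = \{ (v_1, 0, v_3, 0)^\top : v_1, v_3 \in \RR \},
\end{equation*}
a two-dimensional subspace of $\RR^4$. In general, $\Sp(S)$ is a linear subspace of $\RR^d$ of dimension $|S|$.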
% This is analogous to the featurizer, $\Phi : \cX \to \cH$ for an intermediary data representation $\cH \in \RR^d$ introduced by \citet{arjovsky2020invariant}, where the final prediction is $\hat y = f(\Phi(\vx^e), \vv)$. By limiting our search to $\Sp(S)$, we are effectively ``baking in" the featurizer into the linear predictor.

We define the invariant footprint, i.e., the subset of invariant features corresponding to $\vx_{\inv}$, as $S_\inv$. Formally,
\begin{equation}
\label{def:sinv}
S_{\inv} \coloneqq \{ i \in [d] \mid \vx^e_i \in \vx_{\inv} \}.
\end{equation}
This is a small subset of all features if $d_\inv \ll d$, and at training time, it is not known which of the available features are members of this set. 
We are then interested in seeking the \textbf{optimal invariant predictor}, as defined below.
\begin{definition}[Optimal Invariant Predictor]
Let the optimal invariant predictor $\beta^* $ be
\begin{equation}
    \label{eqn:betastar}
    \beta^* \coloneqq \argmin_{\vv \in \Sp(S_\inv)} \sum_{e\in \cE}\cR^e(\vv).
\end{equation}
% \begin{equation}
% \label{eqn: define invariant optimal}
% \begin{gathered}
% \beta^* \coloneqq \argmin_{\vv} \cR (\vv) \\
% \text { s.t. } 
% \vv  = [\vv_\inv, \vzero^{d_s}, \vzero^{d_r}], \vv_\inv \in \RR^{d_\inv}.
% \end{gathered}
% \end{equation}
In other words, it is the best predictor that relies only on the invariant features $\vx_\inv$.
\end{definition} 
% \begin{remark}
Two hurdles are evident. First, finding $\beta^*$ requires prior knowledge of which features belong to $S_\inv$, which we do not have.
% Second, we don't have access to the population risk in practice.
Thus, $\beta^*$ is an information-theoretic target, i.e., one defined without consideration for computational demands, since it presupposes a solution to the outer problem of finding the best subset $S$. This involves searching over $\binom{d}{d_{\inv}}$ subsets if we know $d_\inv$; otherwise, the search space consists of all $2^{d}$ subsets.
Second, we will be working with the empirical loss, whereas $\beta^*$ is defined in terms of the population loss. 
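To give a sense of scale, take the illustrative values $d = 100$ and $d_\inv = 5$, which are chosen only for this example. The number of subsets of $[d]$ of cardinality $d_\inv$, and the number of all subsets of $[d]$, are then
\begin{equation*}
\binom{100}{5} = \frac{100 \cdot 99 \cdot 98 \cdot 97 \cdot 96}{5!} = 75{,}287{,}520,
\qquad
2^{100} \approx 1.27 \times 10^{30},
\end{equation*}
so even with $d_\inv$ known, exhaustive search would require tens of millions of restricted fits.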
% \qed 
\begin{remark}
% The challenge of finding \Cref{eqn:betastar} comes from 
In the problem setting defined by \Cref{eqn:problem-setting}, $\beta^* = [\gamma^\top, (\vzero ^{d_s})^\top, (\vzero^{d_r})^\top]^\top$, which is also a solution to \Cref{eqn:irm}; details are provided in \Cref{prop:invariant-optimal-classifier} in the appendix. \qed 
% \abcomment{can this be easily shown, can we cite}
\end{remark}
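The following calculation sketches why $\beta^*$ takes this form; it is not a substitute for \Cref{prop:invariant-optimal-classifier}. We assume, for illustration only, the squared loss $\ell(\hat y, y) = (\hat y - y)^2$, that $\epsilon_\inv$ is independent of $\vx_\inv$, and that $\Sigma_\inv \coloneqq \EE[\vx_\inv \vx_\inv^\top]$ (notation introduced only for this sketch) is positive definite. For $\vv \in \Sp(S_\inv)$ with invariant block $\vv_\inv \in \RR^{d_\inv}$, in every environment $e$,
\begin{equation*}
\begin{aligned}
\cR^e(\vv) &= \EE\big[ (\vv_\inv^\top \vx_\inv - \gamma^\top \vx_\inv - \epsilon_\inv)^2 \big] \\
&= (\vv_\inv - \gamma)^\top \Sigma_\inv (\vv_\inv - \gamma) + \EE[\epsilon_\inv^2],
\end{aligned}
\end{equation*}
where the cross term vanishes because $\epsilon_\inv$ is centered and independent of $\vx_\inv$. The right-hand side does not depend on $e$ and is uniquely minimized at $\vv_\inv = \gamma$.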
%that we wish to approximate  when we have finite samples. 
% % \abcomment{some remarks will help: (a) this will need knowledge of which features are invariant, not just $d_\inv$, ... hence information theoretic; (b) this is a population, do we want the finite sample version?; (c) this is not IRM, we will hopefully show that $\hat{\beta}$ based on non-asymptotic IRM will be `close' to population based estimate }
% \end{remark}
% but this is made more challenging with the huge number of spurious and random features.
We also use the subscript $S$ and superscript $e$ to denote environment- and feature-restricted population optima,
\begin{equation}
\label{eqn:population-optima}
     \beta^e_S \coloneqq \argmin_{\vv \in \Sp(S)}  \cR^e(\vv)
    ,\quad
     \beta^*_S \coloneqq \argmin_{\vv \in \Sp(S)} \sum_{e\in \cE}  \cR^e(\vv).
\end{equation}
We extend this notation to the empirical minimizers,
\begin{equation}
\label{eqn:empirical-optima}
     \hat \beta^e_S \coloneqq \argmin_{\vv \in \Sp(S)}  \hat\cR^e(\vv)
    ,\quad
     \hat \beta_S \coloneqq \argmin_{\vv \in \Sp(S)} \sum_{e\in \cE}  \hat \cR^e(\vv).
\end{equation}
IRM introduces a penalty on non-invariant predictors.
It is generally formulated as
\begin{equation}
\label{irm-penalty}
 % \cL_{\text{IRMv1}}(\vv) \coloneqq 
 \cL(\vv) \coloneqq
% \sum_{e\in \cE_{tr}} \cR^e (\vv) + \rho \cJ(\vv)=
\sum_{e\in \cE} \cR^e (\vv) + \rho 
% \sum_{e\in \cE_{tr}} \Ds { \nabla_{\vv} \cR ^e (\vv)}^2_2.
\sum_{e \in \cE} \cJ^e(\vv),
\end{equation}
with penalty weight $\rho  > 0$, for some  $\cJ^e: \RR^d \rightarrow \RR^+$ that captures a violation of invariance in $\Phi$ across environments. 
% We will exposit two examples.

For the analysis, we first adapt the  IRM minimax penalty~\citep{zhouSparseInvariantRisk2022}, otherwise called the \textit{loss difference} penalty, as a proxy for the constraint imposed by the original bi-level optimization formulation. 
% With a weight hyperparameter $\rho > 0$, and 
With $\vv_S \in \Sp(S) \subseteq \RR^d$, we have
% \begin{equation}
% % \label{eqn:irm-minimax-vspecific}
%     \cJ^e_{\mm}(\vv_S) =
%     % \sum_{e \in \cE} \cR^e(\beta_S) + \rho
%     % \sum_{e\in \cE} 
%     \max_{\vv^e_S \in \Sp(S)}
% \left[ \cR^e(\vv_S)-  \cR^e\left(\vv^e_S \right)\right].
% \end{equation}
\begin{equation}
\begin{split}
\label{eqn:irm-minimax-vspecific}
 \cL(\vv_S) \coloneqq &
\sum_{e \in \cE}  
\cR^e(\vv_S)  \\
&+ \rho\sum_{e\in \cE} 
\max _{\vv^e_S\in \Sp(S)} 
% \left[\cR^e(\vv_S)- \cR^e\left(\vv^e_S \right)\right],
% = \sum_{e \in \cE} \cR^e(\beta_S) + \rho\sum_{e\in \cE} 
\left[ \cR^e(\vv_S)-  \cR^e\left(\vv^e_S \right)\right].
\end{split}
\end{equation}
% and by definition $\cL(S) = \min_{\vv_S \in \Sp(S)} \cL(\vv_S)$.
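Each penalty term in \Cref{eqn:irm-minimax-vspecific} is nonnegative, and it vanishes exactly when $\vv_S$ itself minimizes $\cR^e$ over $\Sp(S)$; a $\vv_S$ with zero penalty in every environment therefore satisfies the constraint of \Cref{eqn:irm} restricted to $\Sp(S)$.
% satisfies $\vv_S \in \$
 % the original IRM constraint in \eqref{eqn:irm} is satisfied.
The minimax loss can also be defined for a given subset of features $S \in 2^d$,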
% as $\cL(S) \coloneqq \min_{\vv_S \in \Sp(S)} \cL(\vv_S) =  \cL(\beta^*_S).$ For our analysis, we will assume $d_{\inv}$ is known, the subsets $S$ of interest have cardinality $|S|=d_{\inv}$.
\begin{equation}
\label{eqn:irm-minimax}
    \cL(S) \coloneqq \min_{\vv_S \in \Sp(S)} \cL(\vv_S) =  \cL(\beta^*_S).
\end{equation}
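The second equality can be checked directly; the following is a sketch that assumes the per-environment minima over $\Sp(S)$ are attained. The maximum over $\vv^e_S$ in \Cref{eqn:irm-minimax-vspecific} only affects the subtracted term, so it is attained at $\beta^e_S$ from \Cref{eqn:population-optima}, and
\begin{equation*}
\cL(\vv_S) = (1 + \rho) \sum_{e \in \cE} \cR^e(\vv_S) - \rho \sum_{e \in \cE} \cR^e(\beta^e_S).
\end{equation*}
The last sum does not depend on $\vv_S$, so minimizing $\cL(\vv_S)$ over $\Sp(S)$ amounts to minimizing $\sum_{e \in \cE} \cR^e(\vv_S)$, whose minimizer is $\beta^*_S$.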
However, computing this loss in practice, e.g., to use with gradient descent, would require solving an inner optimization problem in order to find the second term of the penalty, $\min_{\vv^e \in \Sp(S)} \cR^e(\vv^e)$. 
In practice, the penalty is often replaced with the gradient norm penalty introduced by \citet{arjovsky2020invariant}:
\begin{equation}
\label{eqn:irmv1}
 \cL_{\text{IRMv1}}(\vv) \coloneqq 
% \sum_{e\in \cE_{tr}} \cR^e (\vv) + \rho \cJ(\vv)=
\sum_{e\in \cE} \cR^e (\vv) + \rho 
\sum_{e\in \cE} \Ds { \nabla_{\vv} \cR ^e (\vv)}^2_2.
\end{equation}
We show in \Cref{prop:lossdiff} why this is an appropriate proxy for the minimax loss under reasonable assumptions, which are satisfied with linear least squares.
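For intuition, the link between the two penalties can be made explicit in the unrestricted least-squares case. This is only a sketch and not the statement of \Cref{prop:lossdiff}: we assume the squared loss, write $\Sigma^e \coloneqq \EE^e[\vx^e (\vx^e)^\top]$ (notation introduced only for this sketch) and assume it is positive definite, and let $\beta^e \coloneqq \beta^e_{[d]}$ denote the minimizer of $\cR^e$ over all of $\RR^d$. Then
\begin{equation*}
\cR^e(\vv) - \cR^e(\beta^e) = (\vv - \beta^e)^\top \Sigma^e (\vv - \beta^e),
\qquad
\nabla_{\vv} \cR^e(\vv) = 2\, \Sigma^e (\vv - \beta^e),
\end{equation*}
and therefore
\begin{equation*}
4 \lambda_{\min}(\Sigma^e) \big( \cR^e(\vv) - \cR^e(\beta^e) \big)
\le \Ds{\nabla_{\vv} \cR^e(\vv)}_2^2
\le 4 \lambda_{\max}(\Sigma^e) \big( \cR^e(\vv) - \cR^e(\beta^e) \big).
\end{equation*}
Under these assumptions, the gradient norm penalty and the loss difference penalty agree up to constants determined by the spectrum of $\Sigma^e$.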

\begin{remark}
While the formulation in \Cref{eqn:irmv1} is a commonly used penalty in IRM optimization, and receives a detailed treatment in \citet{fan2024eills}, it does not obviate the need to search over all subsets $S$ with $|S| \le d_\inv$, which may provide candidates for the invariant classifier. 
In fact, the variation across different subsets in \Cref{eqn:irm-minimax}, under loss functions optimized over all of $\RR^d$, prevents the direct application of LASSO and other convex relaxation techniques to \Cref{eqn:irm-minimax}.
As a result, both \citet{zhouSparseInvariantRisk2022} and \citet{fan2024eills} resort to gradient descent over the full space of $\RR^d$.
% We then provide what is, to our knowledge, the first error analysis of a computationally efficient method that recover 
% invariant optimal features, via Iterative Hard Thresholding.
    \qed
\end{remark}




\subsection{Optimization}
% \abcomment{We have not yet defined the problem {\em we} want to solve. Without that, this section is incomplete.}
% \abcomment{(7) and (8) are written in terms of population risk and given sets. Please present what we will solve: first, use empirical version of (8), optimize over $v_S$ and on the outside optimize over $S$. In such combinatorial/info-theoretic setting, we will establish sample complexity results which makes $S_{\inv}$ optimal; we will also show that with the same order sample complexity $\beta^*$ is optimal. Subsequently, we will introduce practical versions based on $L_1$ norm and IHT, etc} 

% While this penalty is an appropriate heuristic for the IRM paradigm, correctly having a minimum when all the environmental minima $\vv^e$ are equal to the resulting classifier $\vv$, we will look for a computational proxy for implementation.
With the loss defined for the population case, we are ready to examine the minimax formulation for finite samples, the empirical counterpart to \Cref{eqn:irm-minimax-vspecific}:
\begin{equation}
\begin{split}
\label{eqn:irm-minimax-empirical}
\hat \cL(\vv_S) \coloneqq &
\sum_{e \in \cE}  
\hat \cR^e(\vv_S)  \\
&+ \rho\sum_{e\in \cE} 
\max _{\vv^e_S\in \Sp(S)} 
% \left[\cR^e(\vv_S)- \cR^e\left(\vv^e_S \right)\right],
% = \sum_{e \in \cE} \cR^e(\beta_S) + \rho\sum_{e\in \cE} 
\left[ \hat \cR^e(\vv_S)- \hat \cR^e\left(\vv^e_S \right)\right].
\end{split} 
\end{equation}
Again, we use the empirical minimizers defined in \Cref{eqn:empirical-optima} to indicate the loss incurred by the minimum of a given subset of features $S$:
\begin{equation}
\label{eqn:irm-minimax-empirical-vspecific}
    \hat \cL(S) = \min_{\vv_S \in \Sp(S)} \hat \cL(\vv_S) = \hat \cL(\hat \beta_S).
\end{equation}
This results in a two-step breakdown of the IRM problem. 
First, for any given subset $S \in 2^d$, we minimize \Cref{eqn:irm-minimax-empirical} over $\vv_S \in \Sp(S)$, which can be done with standard optimization methods.
Second, we optimize over subsets $S \in \cS \subseteq 2^d$ to obtain the minimum over all candidate subsets; when $\cS$ consists of the subsets of cardinality $d_\inv$, this is a combinatorial problem over $\binom{d}{d_{\inv}}$ subsets:
\begin{equation}
    \label{eqn:outer-problem}
    \hat \cL(\bar S) = \min_{S \in \cS} \hat \cL(S).
\end{equation}
In this setting, we first provide a sample complexity result (\Cref{thm:info-theory}) for the optimality of $S_\inv$, i.e., the minimum number of samples $n_0$ needed per environment such that $\bar S = S_\inv$. Thus, with at least $n_0$ samples per environment, running the combinatorial optimization will indeed find the correct set of invariant features. 

% We show additionally that the loss difference penalty can also recover the population loss.
% Then, we show that with the same order of sample complexity, based on the minimax penalty, the \textit{invariant optimal predictor} $\beta^*$ corresponding to the \textit{population loss}, still achieves the minimum empirical loss among all population estimators $\beta^*_S$ with $|S| = d_{\inv}$ (\Cref{thm:info-theory-popn}). 
% Subsequently, we provide practical, efficient implementations to solve \Cref{eqn:outer-problem} in \Cref{sec:implementation}.




\begin{remark}
\citet{zhouSparseInvariantRisk2022} implement such optimization by searching for a robust, sparse data representation, applying ProbMask, a subnet-discovery algorithm \citep{zhou2021effective}, to the IRM problem.
% to promote robustness and computational efficiency. 
However, this approach is not \textit{sparse feature selection}, but rather finding a \textit{robust representation that is sparse}: the output of the representation is not necessarily sparse, and the last linear layer of their model is fully connected and dense. Both our analysis (\Cref{sec:technical}) and experiments (\Cref{sec:experiments}) follow the line of \textit{sparse feature selection} instead, by explicitly applying a sparsity constraint on the last layer. \qed 
% \abcomment{we should avoid referring to specific parts of other papers, the current paper ceases to be self contained. Is it necessary to call out specific details or can this be written more generally} 
\end{remark}
% This algorithm explicitly excludes the final layer $\vv$ from masking, meaning that the final classification is being done on a dense selection of features. 
% As such, the ProbMask approach is not \textit{sparse feature selection}, but rather finding a \textit{robust representation that is sparse}. 
% However, their theoretical analysis presents feature extractor $\Phi$ as a mask, as demonstrated in their Theorem 1, which is sparse feature selection. Both our analysis (\Cref{sec:technical}) and experiments (\Cref{sec:experiments}) follow the line of \textit{sparse feature selection}.
\begin{remark}
% \color{blue}
We also note that although many earlier works consider IRM for classification \citep{rosenfeld2020risks,wangProvableDomainGeneralization2022}, our regression model can be generalized to classification with conditional Bernoulli (or conditional multinomial) models. Further details can be found in \Cref{sec:glm}.
% Using Generalized Linear Models, we can extend these regression results to a wide class of problems, including, e.g., binary classification on a conditional Bernoulli model. .
    \qed
\end{remark}





