%!TEX root = ../sublime-text.tex

% The dominant strategy in the landscape of modern machine learning models consists of finding larger models to run on larger datasets

While overparameterized deep neural networks (DNNs) are ubiquitous in modern machine learning, memorization and other shortcuts that lead to poor out-of-distribution performance remain an issue.
The baseline assumption that training and test data are drawn i.i.d. from the same distribution, which Empirical Risk Minimization (ERM) \citep{vapnik1991} needs in order to provide generalization guarantees, arguably fails in many modern settings, yet it is challenging to work around. %both in practice and in theory.

One approach to out-of-distribution (OOD) generalization is Invariant Causal Prediction (ICP) \citep{peters2015}, in which data are drawn from different training environments, but the parent ``causes'' of the label, or target variable, are unchanging and independent of the environment. In other words, given the set of causal features, the conditional distribution of the label must be identical across all training environments.
A popular recent line of work is Invariant Risk Minimization (IRM)~\citep{arjovsky2020invariant}, which aims to find an invariant data representation that induces a classifier that performs uniformly across all environments, including unseen test environments.
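
As a sketch, the ICP invariance condition described above can be written in symbols. The notation is introduced here only for illustration and is an assumption, not taken from the cited works: $\mathcal{E}$ denotes the set of training environments, $(X^e, Y^e)$ the features and label in environment $e$, and $S^*$ the index set of causal features. The condition then reads
\[
  P\bigl(Y^e \mid X^e_{S^*}\bigr) = P\bigl(Y^{e'} \mid X^{e'}_{S^*}\bigr) \quad \text{for all } e, e' \in \mathcal{E}.
\]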

As an illustrative example, consider classifying cows and camels. \citet{beery2018} and \citet{arjovsky2020invariant} show that a model may be fooled into learning the background (green pastures and yellow desert, respectively) rather than the actual identifying features, thereby misclassifying, e.g., a cow on beach sand.
Many subsequent works build on this paradigm and adapt it to a variety of experimental and theoretical frameworks~\citep{ahuja2022invariance,pmlr-v119-ahuja20a_games,linBayesianInvariantRisk2022,pmlr-v139-creager21a}.
 % IRM has seen success in applications  in NLP, deep learning, and ... \jdcomment{more, + cite}
% Invariant Causal Prediction \citep{peters2015} constructs a framework in which training data are sampled from a number of separate domains, or environments,
% % generated by a Structural Equation Model (SEM). 
% To perform well on an unseen test domain, the model must learn causal features. 
% ICP suggests that the causal relationships between features and label are persistent across different environments, but other features may change. 
% Then, to find the causal features, a model should seek features that are invariant across relationships. 
% paradigm addresses the problem of OOD generalization by defining ``invariant" and ``spurious" features. 
% If a predictor performs equally across environments in order to discard  ``spurious" features, keeping only the ``invariant" features. 
% The Invariant Risk Minimization (IRM) \citep{arjovsky2020invariant} principle
% aims to learn a data representation that lets a linear classifier perform equally across different trianing environments. It is then said to an ``invariant" feature representation, ideally having discarded ``spurious" features that are environment-specific.
% Many following works build upon this paradigm, extending it with concepts from information bottleneck \citep{ahuja2022invariance}, game theory, bayesian networks, and REx \citep{pmlr-v139-creager21a} \jdcomment{cite} \citep{pmlr-v139-creager21a,ahujaInvariancePrincipleMeets2021}. 
However, others have identified settings, in both the linear and nonlinear regimes, in which formal guarantees are impossible \citep{rosenfeld2020risks} and in which generalization can be poorer than with unconstrained ERM \citep{pmlr-v130-kamath21a}.
Empirically, prior works report a large train-test gap across a variety of models and domain generalization datasets \citep{Lin_2022_CVPR, zhouSparseInvariantRisk2022, pmlr-v139-krueger21a, gulrajani2020search}.
% \jdcomment{some practical tests to cite, find more: (Lin et al., 2022; Zhou et al., 2022b), environment difficulty (Dranker et al., 2021; Krueger et al., 2021), and dataset type (Gulrajani  Lopez-Paz, 2020)}. 
% to perform less effectively in cases \citep{gulrajani2020search, }, and subsequent works have indentified  that stringent conditions are required for formal guarantees in the linear case, \citep{rosenfeld2020risks}, catastrophic failure in the non-linear regime when the train and test distributions are sufficiently different, and \jdcomment{cite more} \citep{zhang2023missing, gulrajani2020search}. \citet{zhouSparseInvariantRisk2022} . 


\citet{fan2024eills} address the statistical challenge of estimating a stable linear relationship across multiple environments. Their data model relaxes restrictions on the heterogeneity of the environments: for invariant features, only the conditional expectation of the response, and not the joint distribution, must remain invariant. However, their approach is restricted to linear models and does not extend to overparameterized deep models. In contrast, \citet{zhouSparseInvariantRisk2022} suggest that IRM fails to drop spurious features when paired with deep models. They propose a global sparsity constraint to further eliminate spurious features from the feature representation, based on a probabilistic approach \citep{zhou2021effective} to the lottery ticket hypothesis \citep{lotteryfrankle2019}.

Two key challenges arise from this line of work. First, the sample complexity result of \citet{zhouSparseInvariantRisk2022} does not correctly capture the non-asymptotic case: errors in the analysis mix up empirical and population terms, and the result incurs an additional dependence on the ambient dimensionality when working with finite samples.
Second, existing methods for sparse IRM are computationally inefficient. They either require searching over subsets of features \citep{fan2024eills} or probabilistically prune network weights, which is computationally slow \citep{zhouSparseInvariantRisk2022}.


We address the first challenge by providing a correct result through a generalized information-theoretic analysis. 
With $d_{\inv}$ invariant features and $d$ total features, we analyze an $L_0$-norm constraint that selects $d_{\inv}$ features. We show that a variant of the IRM formulation provably finds the correct $d_{\inv}$ features, with sample complexity depending polynomially on $d_{\inv}$ and logarithmically on $d$.
The analysis is purely information-theoretic and does not consider computational demands: it implicitly works with all $\binom{d}{d_{\inv}}$ $L_0$-constrained problems and shows that doing so identifies the correct invariant features.
For the second challenge, we focus on practical, efficient algorithms, namely projected gradient descent (PGD) with $L_1$-norm constraints and iterative hard thresholding (IHT) \citep{blumensath_2009iht, jain_iterativehardthreshold_2014}, which avoid the combinatorial complexity of the $L_0$-constrained approach. Our approach is efficient and guaranteed to recover the invariant optimal predictors. To summarize, our work makes the following contributions:
% Their method was demonstrated to have better generalization on a number of Domain Generalization datasets, and they additionally provide a theoretical analysis of a linear model with strong spurious correlations.
% \abcomment{We need to have 2 paras explaining what we do in the paper. I can take first crack at this by Sat.}

{\bf Non-Asymptotic Theory.} We present a non-asymptotic analysis of using sparsity to select invariant features with the proposed IdepRM penalty. Our results show that $L_0$-constrained estimation in IRM finds the correct invariant features under suitable assumptions. The sample complexity is $O(\text{poly}(d_{\inv}) \log (d))$, where $d$ counts all features, including the undesirable ones.
Our model captures more realistic scenarios in which spurious features vary in their correlation with the label, using novel \textit{scale} parameters. Expressing the complexity in terms of these parameters decouples it from dataset size (see Section~\ref{sub:data_generation}).



 % shows the complexity's dependence on two \textit{scale} parameters which capture correlation between spuriuous features and label. We show that even with 


{\bf Modularity.} Prior work on sparse IRM necessarily trains deep neural networks with sparse subnetwork selection, i.e., it changes the training procedure to obtain invariant features.
In contrast, our approach, based on sparsity in the last layer of the neural network, can be applied directly in many different settings, including to the myriad pretrained models,
without the need to change their training.
Further, because feature selection happens at the last layer, our approach is modular and flexible: different sparse estimators can be ``hot-swapped,'' e.g., those based on IHT or on PGD with convex relaxations.
%such as the $L_1$ norm.
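
For concreteness, a generic form of the IHT iteration \citep{blumensath_2009iht} is sketched here. The symbols are illustrative assumptions rather than the specific estimator analyzed in this paper: $w_t$ denotes the last-layer weights at iteration $t$, $\eta$ a step size, $\widehat{\mathcal{L}}$ an empirical objective, and $s$ the target sparsity level.
\[
  w_{t+1} = H_s\bigl(w_t - \eta \nabla \widehat{\mathcal{L}}(w_t)\bigr),
\]
where $H_s(\cdot)$ keeps the $s$ largest-magnitude entries of its argument and sets the rest to zero. Replacing $H_s$ with the Euclidean projection onto an $L_1$-ball gives the corresponding PGD update for the convex relaxation, which is what makes the two estimators interchangeable at the last layer.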

{\bf Experiments.} We present experimental results with different instances of our sparse IRM, demonstrating better performance than that of existing IRM methods, including Sparse IRM with subnetworks. We also show that our methods are computationally efficient.
%Also, their work focuses on finding a sparse subnet of the feature extractor, and not sparsity on the features themselves. Our methods focus on the sparsity of the final feature selected, providing similar or better performance on fewer features while bringing the experiments closer to the theory. 









