\subsection{todos}

\begin{enumerate}
    \item Fill in the L0 to L1 gap; take a more careful look at statistical incoherence (and leverage the knowledge we have about $\vv_s$)
    \item $\hat v_{s,irm}$ comparison. Show the loss is smallest for $\hat \vv_\inv$!
    \item Experiments: \#1 is a stability issue. Should be able to reproduce MLP results from Zhou 2021. 
    \item Experiments: Set up either FullColoredMNIST or MNIST-CIFAR for the meta experiments, otherwise it's too easy.
    \item Hyperparameter search: see \url{https://github.com/facebookresearch/InvariantRiskMinimization/blob/main/code/colored_mnist/reproduce_paper_results.sh} for the original IRM setup.
    
    % \item How do we 
    % - \hl{resolve $[\gamma, 0, 0] \notin \argmin_{\vv} \cR(\vv) $}
    % \item Multitask (L21) analysis - problem setting
    % \item Optimization results -- does the algorithm minimize (theirs)
    % $\hat \cL_{\text{IRM}} (\Phi_\inv)$ (ours) $\hat \cL_{\text{IRM}} (\vv_\inv)$? 

    % \item How to take theoretical results from linear (linear $\Phi$) to DNN setting?
    % \item Directly analyze the original IRMv1 penalty
\end{enumerate}


\subsection{bookkeeping: errors in / changes from Sparse IRM 2022}

These are not in any order. 

\begin{itemize}

    \item Corrections for Step 2)
    \begin{observation}
    The following equation is from the beginning of Step 2) and is used to derive the first equality in the evaluation of $\hat \cL (\Phi_{inv,r}) - \hat\cL(\Phi_{inv})$.
    \begin{equation}
        \sum_e \mathbb{E}^e\left(y-\left\langle\beta, \Phi_{i n v, r}(\boldsymbol{x})\right\rangle\right)^2-\mathbb{E}^e\left(y-\left\langle\beta, \Phi_{i n v}(\boldsymbol{x})\right\rangle\right)^2=\sum_{i \in\left[d_r(\Phi)\right]} r_i \mathbb{V}\left(x_{i n v, i}\right)
    \end{equation}
    First, I cannot find the definition of $r_i$, and the description mentions ``replacing'' a feature instead of just adding one. 

    Second, it's all in population loss and cannot be immediately substituted into the derivation, and I don't think this is solved by a clever definition of $r_i$.

    Third, a naive interpretation of this equation would assume that both featurizers select the same invariant features. In that case, it does not work with Step 2) which needs to hold for all featurizers $\Phi$ that have invariant and random features only, at most.
    
    \end{observation}
    \item Corrections for \cref{eqn: inv,r,s vs inv}
    \begin{observation}
    I find that some of the coefficients and signs in Step 1) (right below equation (29)) of \cite{zhouSparseInvariantRisk2022} are incorrect.

    For reference, $\xi_b$ is defined at the top of the proof B.6.3, in the expansion of the penalty term. The sign preceding $\xi_b$ is consistently flipped in Step 1) and Step 2), which means that the definition (using under braces) in the top of B.6.3 should include the $-$ sign in $\xi_b$.

    Additionally, their $\lambda$ in Step 1) (renamed to $\eta$ in my analysis)
        
    \end{observation}
    \item Corrections for $\xi_c$ analysis:
    \begin{observation}
    \label{obs: xic lambdamax}
    The original bounds $\lambda_i(\frac{1}{\lambda_i^e} - \frac{1}{\lambda_i^e})^2$ from below by $\frac{d_s(\Phi) C^2 \lambda_{\min } \Delta^2}{\lambda_{\max }^4} \geq \frac{C^2 \lambda_{\min } \Delta^2}{\lambda_{\max }^4}$. When I expanded it, I got the expression written in \cref{eqn: xic final} instead; the $\lambda_i$ cancels out. I am not sure whether I did some extra cancellation.

    Additionally, because we do not use a separate $\Phi$, the new definition of $\xi_c$ looks at all entries of $\vx$. Thus, it always falls into the second case as written in equation (29) of \citep{zhouSparseInvariantRisk2022}.
    \end{observation}
    \item Algorithm 1: $\Phi$ here appears to be a feature mask applied to the deep neural network \textit{everywhere except the last layer}. Is this correct? If so, this does match equation (7), but it conflicts with the IRM paradigm conceptually. In IRM, we want to find invariant features in the last layer.

    It is possible that the sparsification occurs only in the last layer, and their notation might also imply this: according to their ProbMask paper \citep{zhou2021effective}, $\cL_\cB$ takes two parameters of matching dimension. The first is the set of parameters to be masked/sparsified, which is $\vv$ here (last layer only), and the second is the mask, approximated using Gumbel-Softmax. In that case, Gumbel-Softmax in \citep{zhouSparseInvariantRisk2022} is applied incorrectly.

    \item Corrections for definition of $\alpha_i$:
    \begin{observation}
    \label{obs: alpha def}
    Assumption 4 (theirs): changed out for \Cref{assn: spurious diff} on my side. The original is:
    \[\alpha_i = \sqrt{\sum_e (\alpha^e_i)^2}\]
    which is replaced (as in \Cref{assn: spurious diff}) by:
    \[\alpha_i = \sqrt{\frac{1}{\ds{\cE}} \sum_e (\alpha_i^e)^2}\]

    The original holds trivially, and their analysis only works with the mean instead of the sum. 
    \end{observation}

    \item Between (21) and (22), the switch from $\hat \Sigma^e$ to $\hat \Sigma^{e,-1}$ is incorrect. I am not sure if their subsequent application of Lemma 2 is still valid. Either way, we expect to see some $\lambda_{\max}(\Sigma^e)$ term appear within this expression.
    \item Equation (25) switches back to $\Sigma^e$, which means (26) is correct.
    \item Equation (28) directly replaces $\sum_e \EE^e (y-\langle\hat{\vv}, \vx\rangle)^2$ with $\EE(y-\langle\hat{\vv}, \vx\rangle)^2$, which is wrong. While $\vv$ might be the same minimizer, this is no longer the IRM-minimax penalty. This does not violate the conclusion that $\ds{\xi} = O(\ds{\xi_b})$, but the sample complexity of (28) is too small by $\ds{\cE}^2$.
    \item Step 1) has a sign error (should be $-\xi_b(\Phi_{\inv, r, s})$).
    \item Step 1): I don't see why they need $-(4\lambda + 1)$ as a coefficient instead of $(2\lambda + 1)$. A constant difference doesn't really matter, but it's odd that they used $A \ge -|A|$ instead of $A \ge 0$ for positive $A$.
    % \item (not an error) I thought $\EE[y\vx] = \EE^e[y\vx]$ was a bad assumption, but it's correct.
    \item Step 2) also has sign errors. The first occurrence of $\xi_b(\Phi_{\inv, r})$ should be negative, and the appearance of $-\lambda \xi_b$ in the second equality should be positive.
\end{itemize}
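
As a quick sanity check on \cref{obs: alpha def} (a sketch using only the two definitions quoted there, and assuming $\ds{\cE}$ is the number of environments), the sum-based and mean-based versions of $\alpha_i$ differ by exactly a factor of $\sqrt{\ds{\cE}}$:
\[
\sqrt{\sum_e (\alpha_i^e)^2} = \sqrt{\ds{\cE}} \cdot \sqrt{\frac{1}{\ds{\cE}} \sum_e (\alpha_i^e)^2}
\]
So a bound stated under one definition carries over to the other only after rescaling by $\sqrt{\ds{\cE}}$, which is why the choice between the two matters for how the result depends on the number of environments.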

\subsection{Other proofs for myself}
Taken from \cite{jin2019short}. \hl{Not completely sure why $d$ is showing up in RHS}
\begin{proposition}
For a sub-Gaussian vector $X\in \RR^d$, the expected $\ell_2$ norm satisfies
\begin{equation}
\EE \Ds{X}_2 \le 4\sigma \sqrt d
\end{equation}
Moreover, with probability at least $1-\delta$,
\begin{equation}
    \Ds{X}_2 \le 4\sigma \sqrt d + 2 \sigma \sqrt{\log (1/\delta)}
\end{equation}
where $\sigma^2 = \max_i \Var(X_i)$.
% Either that or $\sigma^2 = \max_i \EE \ds{X_i}^2_2$.
\end{proposition}
\begin{proof}
A vector of random variables $X = [X_1, X_2, \cdots, X_d]$ is sub-Gaussian if $\va^\top X$ is sub-Gaussian for all $\Ds{\va}_2 = 1$, that is, for every $\va$ on the unit sphere $S^{d-1}$. This is satisfied, for example, when the $X_i$ are independent and sub-Gaussian.

% Let $\vmu = \EE[X]$. 
$\EE[\Ds{X_i}^2] = \Var (X_i) + (\EE[X_i])^2 \le 2\Ds{X_i}^2_{\psi_2}$, by the definition of the sub-Gaussian norm $\Ds{X_i}_{\psi_2}$ from \citep{vershyninIntroductionNonasymptoticAnalysis2011}.

We will make use of an $\epsilon$-cover: let $N_\epsilon$ satisfy $\forall \vv \in B_d, \exists \vz \in N_\epsilon \text{ s.t. } \Ds{\vv - \vz}_2 \le \epsilon$ for $\epsilon \in (0,1)$. Such a cover can be chosen with $\ds{N_\epsilon} \le (1+\frac{2}{\epsilon})^d = b^d$ for the constant $b = 1 + \frac{2}{\epsilon} > 1$. 
% $\epsilon = 1/4$ gets us $9^d$, 
For $\epsilon = 1/2$ we get $\ds{N_\epsilon} \le 5^d$.

We want to bound $\Ds{X}_2 = \max_{\Ds{\vv}_2\le 1 } \vv^\top X$.

Given this, 
\[
\max_{\vv \in B_d} \vv^\top X 
% = \max_{\vz \in N_\epsilon} \ps{\vz + (\vv-\vz)}^\top X 
\le  \max_{\vz \in N_\epsilon} \vz^\top X + \max_{\Ds{\vw}_2\le \epsilon} \vw ^\top X
\le \max_{\vz \in N_\epsilon} \vz^\top X + \epsilon \max_{\Ds{\vw}_2\le 1} \vw^\top X 
\]
which shows that 
\[\max_{\Ds{\vv}_2 \le 1} \vv^\top X \le \frac{1}{1-\epsilon} \max_{\vz \in N_\epsilon} \vz^\top X\]
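
To spell out the rearrangement (a short sketch; $M$ is shorthand introduced only for this step): write $M = \max_{\Ds{\vv}_2 \le 1} \vv^\top X$, which is the same quantity as $\max_{\Ds{\vw}_2\le 1} \vw^\top X$. The chain of inequalities reads $M \le \max_{\vz \in N_\epsilon} \vz^\top X + \epsilon M$, so
\[
(1-\epsilon) M \le \max_{\vz \in N_\epsilon} \vz^\top X
\]
and dividing by $1-\epsilon > 0$ gives the claim. With $\epsilon = 1/2$ this is $M \le 2 \max_{\vz \in N_\epsilon} \vz^\top X$.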

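The high-probability statement follows from a union bound over $N_\epsilon$:
\[
\Pr \ps{ \Ds{X}_2\ge t}  
\le \Pr \ps{ \max_{\vz \in N_\epsilon} \vz^\top X \ge (1-\epsilon) t}
\le \ds{N_\epsilon} \exp(1-\frac{c'(1-\epsilon)^2t^2}{\Ds{\vz^\top X}_{\psi_2}^2})
\le b^d\exp(1-\frac{ct^2}{\Ds{ X_i}_{\psi_2}^2})
\]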
for some constant $c>0$. 

If $t \le $

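Setting the right-hand side equal to $\delta$ and solving for $t$ gives, with probability at least $1-\delta$ and up to constant factors depending on $b$ and $c$: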
\[\Ds{X}_2\le \Ds{X_i}_{\psi_2}\sqrt{\log (1/\delta) + d +1} 
=O(\Ds{X_i}_{\psi_2} \sqrt{\log (1/\delta) + d})
 \]
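
One way to see where the $d$ comes from (a sketch; $K$ is shorthand introduced here for $\Ds{X_i}_{\psi_2}$, and $b$, $c$ are the constants from the tail bound): set the tail bound equal to $\delta$ and solve for $t$.
\[
b^d \exp(1 - \frac{c t^2}{K^2}) = \delta
\iff \frac{c t^2}{K^2} = d \log b + 1 + \log (1/\delta)
\iff t = K \sqrt{\frac{d \log b + 1 + \log (1/\delta)}{c}}
\]
The $d \log b$ term is the logarithm of the covering number bound $\ds{N_\epsilon} \le b^d$ picked up in the union bound, which is how the dimension enters the right-hand side.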

\end{proof}
