\begin{enumerate}
    \item Finish current proof writeup -- clarify notation
    % - \hl{resolve $[\gamma, 0, 0] \notin \argmin_{\vv} \cR(\vv) $}
    \item Multitask (L21) analysis - problem setting

    \item \hl{
    Optimization results for linear setting} -- does the algorithm minimize 
    (theirs) $\hat \cL_{\text{IRM}} (\Phi_\inv)$ 
    (ours) $\hat \cL_{\text{IRM}} (\vv_\inv)$? 
    
    % -- does the algorithm minimize (theirs) $\hat \cL_{\text{IRM}} (\Phi_\inv)$ (ours) $\hat \cL_{\text{IRM}} (\vv_\inv)$? 

    \begin{enumerate}
        \item (theirs) Does Algorithm 1 of \cite{zhouSparseInvariantRisk2022} optimize $\hat \cL_{\text{IRM}} (\Phi_\inv)$ (eq. 8)? No. Their SparseIRM algorithm is a reskinned version of Algorithm 1 (ProbMask) of their prior work \cite{zhou2021effective}, which solves a \textbf{relaxed problem} (eq. (3) in \cite{zhou2021effective}):
        \[\begin{gathered}
        \min_{\boldsymbol{w}, \boldsymbol{s}} \mathbb{E}_{p(\boldsymbol{m} \mid \boldsymbol{s})} \mathcal{L}(\boldsymbol{w}, \boldsymbol{m}) \\
        \text{s.t. } \boldsymbol{w} \in \mathbb{R}^n, \ \mathbf{1}^{\top} \boldsymbol{s} \leq K \text{ and } \boldsymbol{s} \in [0,1]^n .
        \end{gathered}
        \]
        \item (ours) Do we have an algorithm that solves \cref{eq:basic-zsetup}? PGD will work, but there is no longer any intuitive connection between the invariant dimension $d_\inv$ and the constraint $K'$. (This is not an issue in practical applications.)

        \item I don't think solving these optimization problems is guaranteed to retrieve $\gamma$ in either case.
    \end{enumerate}

    \item \hl{How to take theoretical results from linear to DNN setting?}

    DNNs are highly nonconvex. Even so, the experiments in \cite{zhouSparseInvariantRisk2022,zhang2023missing} show decent generalization (roughly 70\% accuracy) with good sparsity on highly nonconvex problems such as modified CIFAR-MNIST.

    Theoretical results: we're only interested in the sparsity of $\vv$, not of the DNN part $\Phi$. Is there some way to claim, if $\vz = \Phi(\vx)$,

    \[y = \gamma \vz_\inv + \epsilon_\inv\]

    and so forth? Of course, if the data were generated like this in the first place, and the observable features were defined as $\vx = \Phi^{-1}(\vz)$, we could directly apply our linear results to $\vz$. This is akin to the perspective of the nonlinear analysis in \citet{rosenfeld2021risks}. That paper lists many issues/traps with this approach.

    Question: in order to learn both $\Phi$ and $\vv$, how do we guarantee we get the desired $\Phi$? Does sparsity guarantee this? It might, given the right assumption on the spurious variables. How do we prevent $\Phi$ from blowing up on near-zero elements of $\vv$?

    \item Directly analyze the original IRMv1 penalty 

    \item Ignoring for this paper: a new problem, which I don't think the original paper discusses: the empirical loss is over a subset of environments $\cE_{tr} \subset \cE$, which kills some of the $\xi_a$, $\xi_b$ analysis from here down. We probably need to add an extra term to account for the quality of the training environments.

% does their algorithm minimize the loss that we promise? What is the proof for this?
\end{enumerate}
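
A minimal sketch of a projected-gradient step on the constraint set of the relaxed problem displayed in the list above, to make the ``PGD will work'' remark concrete. The step size $\eta > 0$, the (stochastic) gradient estimate $\boldsymbol{g}_t$ with respect to $\boldsymbol{s}$, the set name $\mathcal{C}$, and the shift $\lambda$ are symbols introduced here for illustration only; they do not come from the cited papers. Writing $\mathcal{C} = \{\boldsymbol{s} \in [0,1]^n : \mathbf{1}^{\top} \boldsymbol{s} \leq K\}$,
\[
\boldsymbol{s}_{t+1} = \Pi_{\mathcal{C}}\left(\boldsymbol{s}_t - \eta\, \boldsymbol{g}_t\right),
\qquad
\Pi_{\mathcal{C}}(\boldsymbol{u}) = \operatorname{clip}\left(\boldsymbol{u} - \lambda \mathbf{1},\, 0,\, 1\right),
\]
where $\operatorname{clip}$ acts elementwise, $\lambda = 0$ if $\mathbf{1}^{\top} \operatorname{clip}(\boldsymbol{u}, 0, 1) \leq K$, and otherwise $\lambda > 0$ is any solution of
\[
\mathbf{1}^{\top} \operatorname{clip}\left(\boldsymbol{u} - \lambda \mathbf{1},\, 0,\, 1\right) = K .
\]
The left-hand side is continuous and nonincreasing in $\lambda$, so for $K \geq 0$ such a $\lambda$ exists and can be found by bisection. This only describes the Euclidean projection onto $\mathcal{C}$; it says nothing about whether the iterates recover $\gamma$.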