\label{sec:glm} 

We start by restating the original data generation model:
\begin{equation}
\label{eqn:problem-setting-glmrestate}
\begin{aligned}
& y=\gamma^{\top} \vx_{\inv}+\epsilon_{\inv}~, \\
& \vx_s^e=y \vzeta_s+\valpha^e \odot \vepsilon_{s}~, \\
& \vx_r=\vzeta_r \odot \vepsilon_{r}~.
\end{aligned}
\end{equation}

GLMs are based on exponential family distributions 
\citep{brown1986book,Barndorff-Neils2014-hc,banerjee2015estimation},
% \citep{banerjee2015estimation},
where we assume that the conditional distribution of a response $y_i$ given covariates $\vx^e_i$ has an exponential-family density: 
\begin{equation}
\label{eqn:cond-distr-glm}
    P(y_i |\vx^e_i, \beta^*_\inv) = \exp \{ y_i \langle \vx^e_i, \beta^*_\inv \rangle  - \varphi (  \langle \vx^e_i, \beta^*_\inv \rangle  ) \}
     = \exp \{ y_i \langle \vx_{\inv,i}, \gamma \rangle  - \varphi (  \langle \vx_{\inv,i}, \gamma \rangle  ) \},
\end{equation}
for log-partition function $\varphi\left(\left\langle \vx^e_i, \beta_\inv^*\right\rangle\right)=\log \left(\int_{y_i} \exp \left\{y_i\left\langle \vx^e_i, \beta_\inv^*\right\rangle\right\} d y_i\right) $. 
For notational simplicity, we write the natural parameter as $\eta_i = \langle \vx_{\inv,i}, \gamma \rangle $.
The new environmental risk is then the negative log-likelihood of this conditional density. Let $D_n = \bigcup_{e\in \cE}\{(\vx^e_i, y_i)\}_{i=1}^{n_e} $ denote the entire dataset pooled across environments. The gradient of the risk with respect to $\beta$, evaluated at $\beta^*_\inv$, is
\begin{align}
    \label{eqn:loss-glm}
    \nabla_\beta \cR^e (\beta^*_\inv, D_n) 
    &= -\frac{1}{n} \sum_{e\in \cE}\sum_{i=1}^{n_e} y_i \vx^e_i
    +\frac{1}{n} \sum_{e\in \cE} \sum_{i=1}^{n_e} \vx^e_i
    \frac{\partial \varphi\left(\left\langle\beta_\inv^*, \vx^e_i\right\rangle\right)}
    {\partial \eta_i}
    %
    \nonumber \\
    &=\frac{1}{n} \sum_{e\in \cE}
    \sum_{i=1}^{n_e} 
    \vx^e_i\left(E\left[y_i \mid \vx^e_i\right]-y_i\right)
    \nonumber
    \\
    &=\frac{1}{n} X^\top \left(E\left[y_i \mid \vx^e_i\right]-y_i\right),
\end{align}
where, in the last line, $X$ stacks the covariates $\vx^e_i$ as rows and the bracketed term denotes the vector of residuals over all samples.
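The step in \eqref{eqn:loss-glm} that replaces $\partial \varphi / \partial \eta_i$ by the conditional mean follows from a standard exponential-family identity. As a brief sketch, assuming that differentiation under the integral sign is permitted,
\begin{equation*}
    \frac{\partial \varphi(\eta_i)}{\partial \eta_i}
    = \frac{\int_{y_i} y_i \exp\{y_i \eta_i\}\, d y_i}{\int_{y_i} \exp\{y_i \eta_i\}\, d y_i}
    = \int_{y_i} y_i \exp\{y_i \eta_i - \varphi(\eta_i)\}\, d y_i
    = E\left[y_i \mid \vx^e_i\right].
\end{equation*}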
As an example, consider the conditional Bernoulli distribution \citep{banerjee2015estimation,Dunn2018-tj}. 
Writing $p_i$ for the conditional mean,  
\begin{align}
    P(y_i \mid p_i) &=  p_i^{y_i}\left(1-p_i\right)^{\left(1-y_i\right)}
    \nonumber \\
    % &=\exp \left(y_i \log p_i+\left(1-y_i\right) \log \left(1-p_i\right)\right)
    &=\exp \left(y_i \log \left(\frac{p_i}{1-p_i}\right)+\log \left(1-p_i\right)\right).
\end{align}
The natural parameter is therefore the log-odds, $\eta_i =\langle \vx_{\inv,i}, \gamma \rangle = \log \ps { \frac{p_i}{1-p_i}} $. Inverting this relation yields logistic regression, where 
\begin{equation}
\label{eqn:logistic-reg}
p_i = \frac{\exp(\langle \vx_{\inv,i}, \gamma \rangle) }{ 1 + \exp(\langle \vx_{\inv,i}, \gamma \rangle)}.
\end{equation}
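In this case the link function is the logit, $\log \left(\frac{p_i}{1-p_i}\right)$, while the remaining term $\log (1-p_i)$ is the negative of the log-partition function $\varphi(\eta_i)$.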
We emphasize that, in this setting, the conditional distribution depends only on the invariant features $\vx_\inv^e$ and not on the spurious ones. 
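To make the Bernoulli case concrete, the log-partition function can be computed in closed form. Here the normalization is a sum over $y_i \in \{0,1\}$ rather than an integral, so
\begin{equation*}
    \varphi(\eta_i) = \log\left(1 + \exp(\eta_i)\right), \qquad
    \frac{\partial \varphi(\eta_i)}{\partial \eta_i} = \frac{\exp(\eta_i)}{1+\exp(\eta_i)} = p_i,
\end{equation*}
which agrees with \eqref{eqn:logistic-reg}. Each summand $\vx^e_i\left(E\left[y_i \mid \vx^e_i\right]-y_i\right)$ in \eqref{eqn:loss-glm} then reduces to $\vx^e_i\left(p_i - y_i\right)$.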

The corresponding IRM penalty term is then 
\begin{equation}
\label{eqn:ber-penalty}
    \cJ (\beta) = \sum_{e\in \cE} \max_{\beta^e_S \in \Sp (S)} [\cR^e (\beta_S) - \cR^e(\beta^e_S) ].
\end{equation}

By showing RSC and RSS for the loss function $\sum_{e\in \cE}\cR^e(\beta) + \rho \cJ(\beta)$, we can recover \Cref{thm:info-theory}. 
