\documentclass[accepted]{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{ bbold }

\usepackage{ amssymb }
\usepackage{bbm}
\usepackage{bm}

\usepackage{amsmath,amsthm}
\usepackage[utf8]{inputenc}
\usepackage{ulem}
% % \usepackage{colortbl,arydshln} % color and dashed lines in arrays
% %\usepackage[charter,cal,scr]{mathdesign}
% \usepackage[mathscr]{euscript}
% \usepackage[all]{xy}
% \usepackage[x11names,dvipsnames]{xcolor}
\usepackage{tikz}
\usetikzlibrary{matrix, positioning}
\usetikzlibrary{automata,arrows}
\usepackage{listings}
\usepackage{verbatim}
\usepackage{subcaption}
\usepackage{algorithmicx}
\usepackage{algorithm}
\usepackage{algpseudocode}

\title{Causal Information Splitting: \\ Engineering Proxy Features for Robustness to Distribution Shifts}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{https://bijanmazaheri.com/}{Bijan Mazaheri}}
\author[2]{Atalanti Mastakouri}
\author[2]{Dominik Janzing}
\author[2]{Michaela Hardt}
% Add affiliations after the authors
\affil[1]{%
    California Institute of Technology\\
    Pasadena, CA, USA
}
\affil[2]{%
    Amazon Causality Lab\\
    Tübingen, Germany
}

\newcommand{\zset}{\mathbb Z}
\newcommand{\fset}{\mathbb F}
\newcommand{\nset}{\mathbb N}
\newcommand{\rset}{\mathbb R}
\newcommand{\cset}{\mathbb C}
\newcommand\flowsfrom{\mathrel{\reflectbox{$\leadsto$}}}
\def\Real{\mathbb{R}}
\def\Proj{\mathbb{P}}
\def\Hyper{\mathbb{H}}
\def\Integer{\mathbb{Z}}
\def\Natural{\mathbb{N}}
\def\Complex{\mathbb{C}}
\def\Rational{\mathbb{Q}}
\def\Field{\mathbb{K}}
\newcommand{\ep}{{\varepsilon}}

\let\K\Field
\let\N\Natural
\let\Q\Rational
\let\R\Real
\let\C\Complex
\let\Z\Integer

% ---- OPERATORS (requires amsmath) ----
\def\argmax{\operatornamewithlimits{arg\,max}}
\def\argmin{\operatornamewithlimits{arg\,min}}
\def\rank{\operatorname{rk}}
\def\nnrank{{\operatorname{rk}_+}}
\newcommand{\indep}{\perp \!\!\! \perp}
\newcommand{\G}{\mathcal{G}}
\newcommand{\Xin}{\vec{X}_{\mathrm{in}}}
\def\De{\operatorname{\mathbf{DE}}}
\def\Err{\operatorname{Err}}
\def\An{\operatorname{\mathbf{AN}}}
\def\Ch{\operatorname{\mathbf{CH}}}
\def\ch{\operatorname{\mathbf{ch}}}
\def\Pa{\operatorname{\mathbf{PA}}}
\def\CoAn{\operatorname{\mathbf{COAN}}}
\def\Fb{\operatorname{\mathbf{FB}}}
\def\Fm{\operatorname{\mathbf{FM}}}
\def\LF{\operatorname{\mathbf{LF}}}
\def\pa{\operatorname{\mathbf{pa}}}
\def\Mb{\operatorname{\mathbf{MB}}}
\def\Nb{\operatorname{\mathbf{NB}}}
\def\mb{\operatorname{\mathbf{mb}}}
\def\Nd{\operatorname{Nd}}
\def\Pr{\operatorname{Pr}}
\def\I{\operatorname{\mathcal{I}}}
\def\H{\operatorname{\mathcal{H}}}
% ---- RELATORS ----
\def\deq{\stackrel{\scriptscriptstyle\triangle}{=}}	% Use := instead.
\def\into{\DOTSB\hookrightarrow}		% = one-to-one
\def\onto{\DOTSB\twoheadrightarrow}
\def\inonto{\DOTSB\lhook\joinrel\twoheadrightarrow}
\def\from{\leftarrow}
\def\tofrom{\leftrightarrow}
\def\mapsfrom{\mathrel{\reflectbox{$\mapsto$}}}
\def\longmapsfrom{\mathrel{\reflectbox{$\longmapsto$}}}

% ---- DELIMITER PAIRS ----
\def\floor#1{\lfloor #1 \rfloor}
\def\ceil#1{\lceil #1 \rceil}
\def\seq#1{\langle #1 \rangle}
\def\set#1{\{ #1 \}}
\def\abs#1{\mathopen| #1 \mathclose|}			% use instead of $|x|$ 
\def\norm#1{\mathopen\| #1 \mathclose\|}		% use instead of $\|x\|$ 
\def\indic#1{\big[#1\big]}		% indicator variable; Iverson notation
								% e.g., Kronecker delta = [x=0]

% --- Self-scaling delimiter pairs ---
\def\Floor#1{\left\lfloor #1 \right\rfloor}
\def\Ceil#1{\left\lceil #1 \right\rceil}
\def\Seq#1{\left\langle #1 \right\rangle}
\def\Set#1{\left\{ #1 \right\}}
\def\Abs#1{\left| #1 \right|}
\def\Card#1{\left| #1 \right|}
\def\Norm#1{\left\| #1 \right\|}
\def\Paren#1{\left( #1 \right)}		% need better macro name!
\def\Brack#1{\left[ #1 \right]}		% need better macro name!
\def\Indic#1{\mathbbm{1}\left[ #1 \right]}		% indicator variable; Iverson notation


%
%  Macros to typeset sets like {foo|bar} with all three delimiters
%  correctly scaled to fit.  What I *really* want is a \middle macro
%  that acts just like \left and \right.  Grumble.
%
\makeatletter
\def\Bigbar#1{\mathrel{\left|\vphantom{#1}\right.\n@space}}
\def\Setbar#1#2{\Set{#1 \Bigbar{#1 #2} #2}}
\def\Seqbar#1#2{\Seq{#1 \Bigbar{#1 #2} #2}}
\def\Brackbar#1#2{\Brack{#1 \Bigbar{#1 #2} #2}}
\makeatother 

\def\complement#1{\overline{#1}}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{example}{Example}
\theoremstyle{definition}
\newtheorem{definition}{Definition}
\newtheorem{observation}{Observation}
\newtheorem{claim}{Claim}
\newtheorem{corollary}{Corollary}

\renewcommand{\vec}{\bm}
\newcommand{\dmax}{d_{\mathrm{max}}}

\def\nbor{\mathcal{N}}
\newcommand{\suppress}[1]{}

\def\eps{\varepsilon}
\def\given{\mid}
\def\ggiven{\mid\mid}
\newcommand{\DKL}[2]{\mathrm{D}_{\mathrm{KL}}(#1 || #2)}


\newcommand\bijan[1]{\textcolor{red}{B: #1} }%

\newcommand\mila[1]{\textcolor{blue}{Mila: #1} }% Notes for Mila

\newcommand\dominik[1]{\textcolor{brown}{Dominik: #1}}% Notes for Dominik

\newcommand\atalanti[1]{\textcolor{magenta}{Atalanti: #1} }% Notes for Atalanti

\newcommand\leena[1]{\textcolor{orange}{Leena: #1} }% Notes for Leena
% \newcommand\bijan[1]{}%

% \newcommand\mila[1]{}% Notes for Mila

% \newcommand\dominik[1]{}

% \newcommand\atalanti[1]{}

% \newcommand\leena[1]{}% Notes for Leena
% Commands for types of active paths

% General Active path - label over path gives restricted set the path can go through

\usetikzlibrary{decorations,decorations.pathmorphing, decorations.pathreplacing, decorations.shapes, arrows, arrows.meta}

\pgfdeclaremetadecoration{middlezigzag}{straight}{
    \state{straight}[switch if less than=\pgfmetadecorationsegmentlength to final,
                  width=\pgfmetadecoratedpathlength/2 - \pgfmetadecorationsegmentlength/2 ,
                  next state=zigzag] {
        \decoration{curveto}
    }
    \state{zigzag}[width=\pgfmetadecorationsegmentlength,
                   next state=final] {
        \decoration{zigzag}
    }
    \state{final}{
        \decoration{curveto}
        \beforedecoration{\pgfpathmoveto{\pgfpointmetadecoratedpathfirst}}
    }
}


\tikzset{
    middle zigzag/.style={
        decorate,
        decoration={
            middlezigzag,
            meta-segment length=.36cm,
            segment length=0.12cm,
            amplitude = .05cm,
        }
    }
}


\newcommand\activepathNN{
\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [middle zigzag] (-.25,0) -- (.25,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}](.25,0) -- (.37,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}](-.25,0) -- (-.37,0);
\end{tikzpicture}}}
\newcommand\activepathLN{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [middle zigzag] (-.25,0) -- (.25,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}](.25,0) -- (.37,0);
    \draw [-{to [width=1.2mm, length=1.2mm]}](-.25,0) -- (-.33,0);
\end{tikzpicture}}}
\newcommand\activepathRN{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [middle zigzag] (-.25,0) -- (.25,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}](.25,0) -- (.37,0);
    \draw [-{to [reversed, width=1.2mm, length=1.2mm]}](-.25,0) -- (-.33,0);
\end{tikzpicture}}}

\newcommand\activepathNR{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [middle zigzag] (-.25,0) -- (.25,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}](-.25,0) -- (-.37,0);
    \draw [-{to [width=1.2mm, length=1.2mm]}](.25,0) -- (.33,0);
\end{tikzpicture}}}

\newcommand\activepathLR{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [middle zigzag] (-.25,0) -- (.25,0);
    \draw [-{to[width=1.2mm, length=1.2mm]}](.25,0) -- (.33,0);
    \draw [-{to [width=1.2mm, length=1.2mm]}](-.25,0) -- (-.33,0);
\end{tikzpicture}}}
\newcommand\activepathRR{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [middle zigzag] (-.25,0) -- (.25,0);
    \draw [-{to[width=1.2mm, length=1.2mm]}](.25,0) -- (.33,0);
    \draw [-{to [reversed, width=1.2mm, length=1.2mm]}](-.25,0) -- (-.33,0);
\end{tikzpicture}}}

\newcommand\activepathNL{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [middle zigzag] (-.25,0) -- (.25,0);
    \draw [-{to[reversed, width=1.2mm, length=1.2mm]}](.25,0) -- (.33,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}](-.25,0) -- (-.37,0);
\end{tikzpicture}}}
\newcommand\activepathLL{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [middle zigzag] (-.25,0) -- (.25,0);
    \draw [-{to[reversed, width=1.2mm, length=1.2mm]}](.25,0) -- (.33,0);
    \draw [-{to [width=1.2mm, length=1.2mm]}](-.25,0) -- (-.33,0);
\end{tikzpicture}}}
\newcommand\activepathRL{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [middle zigzag] (-.25,0) -- (.25,0);
    \draw [-{to[reversed, width=1.2mm, length=1.2mm]}](.25,0) -- (.33,0);
    \draw [-{to [reversed, width=1.2mm, length=1.2mm]}](-.25,0) -- (-.33,0);
\end{tikzpicture}}}

\newcommand\activepathNB{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [decorate, decoration={zigzag, segment length=.12cm, amplitude=.05cm}] (-.25,0) -- (-.1,0);
    \draw [-{Bar[width = 2mm]}](-.1,0) -- (-.05,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}](-.25,0) -- (-.44,0);
\end{tikzpicture}}}

\newcommand\activepathBN{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [decorate, decoration={zigzag, segment length=.12cm, amplitude=.05cm}] (.25,0) -- (.1,0);
    \draw [-{Bar[width = 2mm]}](.1,0) -- (.05,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}](.25,0) -- (.44,0);
\end{tikzpicture}}}

\newcommand\activepathNC{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [decorate, decoration={zigzag, segment length=.12cm, amplitude=.05cm}] (-.25,0) -- (-.1,0);
    \draw [-{to[width = 2mm]}](-.1,0) -- (-.05,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}](-.25,0) -- (-.44,0);
\end{tikzpicture}}}

\newcommand\activepathCN{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [decorate, decoration={zigzag, segment length=.12cm, amplitude=.05cm}] (.25,0) -- (.1,0);
    \draw [-{to[width = 2mm]}](.1,0) -- (.05,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}](.25,0) -- (.44,0);
\end{tikzpicture}}}

\newcommand\activepathCR{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [decorate, decoration={zigzag, segment length=.12cm, amplitude=.05cm}] (.25,0) -- (.1,0);
    \draw [-{to[width = 2mm]}](.1,0) -- (.05,0);
    \draw [-{to[width=1.2mm]}](.25,0) -- (.44,0);
\end{tikzpicture}}}

\newcommand\activepathCL{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [decorate, decoration={zigzag, segment length=.12cm, amplitude=.05cm}] (.25,0) -- (.1,0);
    \draw [-{to[width = 2mm]}](.1,0) -- (.05,0);
    \draw [-{to[, reversed, width=1.2mm]}](.25,0) -- (.44,0);
\end{tikzpicture}}}

\newcommand\doubleactivepathNB{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [decorate, decoration={zigzag, segment length=.12cm, amplitude=.05cm}, double distance=.2mm] (-.25,0) -- (-.1,0);
    \draw [] (-.242,-.02) -- (-.218,0.02);
    \draw [] (-.24,-.017) -- (-.25, -.017);
    \draw [-{Bar[width = 2mm]}, double distance=.2mm](-.1,0) -- (-.05,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}, double distance=.2mm](-.25,0) -- (-.44,0);
\end{tikzpicture}}}
\newcommand\doubleactivepathBN{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [decorate, decoration={zigzag, segment length=.12cm, amplitude=.05cm}, double distance=.2mm] (.25,0) -- (.1,0);
    \draw [] (.242,.02) -- (.218,-0.02);
    \draw [] (.24,.017) -- (.25, .017);
    \draw [-{Bar[width = 2mm]}, double distance=.2mm](.1,0) -- (.05,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}, double distance=.2mm](.25,0) -- (.44,0);
\end{tikzpicture}}}

\newcommand\doubleactivepathNC{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [decorate, decoration={zigzag, segment length=.12cm, amplitude=.05cm}, double distance=.2mm] (-.25,0) -- (-.1,0);
    \draw [] (-.242,-.02) -- (-.218,0.02);
    \draw [] (-.24,-.017) -- (-.25, -.017);
    \draw [-{to[width = 2mm]}](-.055,0) -- (-.05,0);
    \draw [-, double distance=.2mm](-.08,0) -- (-.1,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}, double distance=.2mm](-.25,0) -- (-.44,0);
\end{tikzpicture}}}
\newcommand\doubleactivepathCN{\mathrel{
\begin{tikzpicture}[baseline={([yshift=-.8ex]current bounding box.center)}]
    \draw [decorate, decoration={zigzag, segment length=.12cm, amplitude=.05cm}, double distance=.2mm] (.25,0) -- (.1,0);
    \draw [] (.242,.02) -- (.218,-0.02);
    \draw [] (.24,.017) -- (.25, .017);
    \draw [-{to[width = 2mm]}](.055,0) -- (.05,0);
    \draw [-, double distance=.2mm](.08,0) -- (.1,0);
    \draw [-{Circle[open, width=1.2mm, length=1.2mm]}, double distance=.2mm](.25,0) -- (.44,0);
\end{tikzpicture}}}

\makeatletter
\newcommand{\bigcomp}{%
  \DOTSB
  \mathop{\vphantom{\sum}\mathpalette\bigcomp@\relax}%
  \slimits@
}
\newcommand{\bigcomp@}[2]{%
  \begingroup\m@th
  \sbox\z@{$#1\sum$}%
  \setlength{\unitlength}{0.9\dimexpr\ht\z@+\dp\z@}%
  \vcenter{\hbox{%
    \begin{picture}(1,1)
    \bigcomp@linethickness{#1}
    \put(0.5,0.5){\circle{1}}
    \end{picture}%
  }}%
  \endgroup
}
\newcommand{\bigcomp@linethickness}[1]{%
  \linethickness{%
      \ifx#1\displaystyle 2\fontdimen8\textfont\else
      \ifx#1\textstyle 1.65\fontdimen8\textfont\else
      \ifx#1\scriptstyle 1.65\fontdimen8\scriptfont\else
      1.65\fontdimen8\scriptscriptfont\fi\fi\fi 3
  }%
}

\makeatother


\def\tpose{\top}
\def\vv{\mathbf{v}}
\def\TT{\mathbf{T}}
\def\G{\mathcal{G}}
\def\Tin{\vec{T}_{\mathrm{in}}}
\def\Tout{\vec{T}_{\mathrm{out}}}
\def\tin{T_{\mathrm{in}}}
\def\tout{T_{\mathrm{out}}}
\def\Ain{\vec{A}_{\mathrm{in}}}
\def\Aout{\vec{A}_{\mathrm{out}}}
\def\Vmissing{\vec{V}^{\mathrm{MISS}}}
\def\Vextra{\vec{V}^{\mathrm{EXTRA}}}
\def\Ugood{\vec{U}^{\mathrm{GOOD}}}
\def\Ugoodsap{\vec{U}^{\mathrm{GOOD}}_{\mathrm{SAP}}}
\def\Ubadsap{\vec{U}^{\mathrm{BAD}}_{\mathrm{SAP}}}
\def\Ubad{\vec{U}^{\mathrm{BAD}}}
\def\Uperp{\vec{U}^\perp}
\def\Ustar{\vec{U}^*}
\def\Vgood{\vec{V}^{\mathrm{GOOD}}}
\def\Vbad{\vec{V}^{\mathrm{BAD}}}
\def\Vambig{\vec{V}^{\mathrm{AMBIG}}}
\def\Ssub{\vec{S}^{\mathrm{SUB}}}
\def\Ssup{\vec{S}^{\mathrm{SUP}}}
\def\E{\mathbb{E}}
\def\Vsap{V_{\mathrm{SAP}}}
\newcommand{\Fisolate}[2]{F_{\mathrm{ISO}(#1)}(#2)}
\newcommand{\TFisolate}[2]{\tilde{F}_{\mathrm{ISO}(#1)}(#2)}
\newcommand{\Fremove}[2]{F_{\mathrm{REM}(#1)}(#2)}

\usepackage{subfiles} % Best loaded last in the preamble

\begin{document}
\maketitle
\begin{abstract}
Statistical prediction models are often trained on data that is drawn from different probability distributions than their eventual use cases. One approach to proactively prepare for these shifts harnesses the intuition that causal mechanisms should remain invariant between environments. Here we focus on a challenging setting in which the causal and anticausal variables of the target are unobserved. Leaning on information theory, we develop feature selection and engineering techniques for the observed downstream variables that act as proxies. We identify proxies that help to build stable models and moreover utilize auxiliary training tasks to extract stability-enhancing information from proxies. We demonstrate the effectiveness of our techniques on synthetic and real data.
\end{abstract}


\section{Introduction}
The principal assumption when building any (not necessarily causal) prediction model is access to relevant data for the task at hand. When predicting a label $Y$ from inputs $\vec{X}$, this assumption states that the data is drawn from a (training) probability distribution over $(\vec{X}, Y)$
that is identical to the distribution that will generate the model's use cases (the target distribution).

Unfortunately, the dynamic nature of real-world systems
makes obtaining perfectly relevant data difficult. Data-gathering mechanisms can introduce sampling bias, yielding distorted training data. Even in the absence of sampling bias, changing populations, environments, and interventions give rise to distribution shifts in their own right. For example,~\cite{chest_xray_spurious} found that convolutional neural networks trained to detect pneumonia from chest radiographs often relied on site-specific features, including the metallic tokens indicating laterality and image processing techniques. This resulted in poor generalization across sites. Understanding these inter-site breakdowns in performance is essential in safety-critical domains such as healthcare.

\paragraph{Transportability and Domain Generalization}
The first attempts at handling the dissociation between training and target distributions involved gathering unlabeled samples from the target distribution. Within domain generalization (DG), covariate shift methods handle a shift in the distribution of $\vec{X}$ \citep{shimodaira2000improving} and label shift methods handle a shift in $\Pr(Y)$ \citep{schweikert2008empirical}. DG often assumes a stationary label function $\Pr(Y \given \vec{X})$, which is extremely limiting in real-life applications.


To address these limitations, one can assume the label function is stationary for a \textit{subset} of the covariates in $\vec{X}$, called an \textbf{invariant set} in \cite{muandet2013domain} and \cite{rojas2018invariant}. The \textbf{transportability} problem concerns itself with finding such an invariant set $\vec{X}$. 

One approach to transportability has been to capture shifting information from a collection of datasets \citep{rojas2018invariant, invariant_feature_selection}. Such techniques require access to a comprehensive set of datasets that represent all possible shifts.
A causal perspective developed in \cite{storkey2009training} and \cite{pearl2011transportability} instead uses graphical modeling via \textbf{selection diagrams} to model shifting mechanisms. This approach requires access to multiple datasets to learn these mechanisms, but does not require that those datasets span the entire space of possible shifts. Such approaches also allow the use of domain expert knowledge when building selection diagrams. A detailed comparison of stability in the causal and anticausal scenarios is given in \cite{scholkopf2012causal}.

% Leena suggested example

\paragraph{Contributions}
The causal perspective on distribution shift is obscured when we lack direct measurements of the causes and effects of $Y$. Such settings arise from noisy measurements, privacy concerns, and abstract concepts that cannot be easily quantified (such as ``work ethic'' or ``interests''). We therefore focus on a setting where we only measure \textit{proxies} for the causes and effects of $Y$; see Fig.~\ref{fig:example context} for an example. All of these proxies are descendants of the unobserved causes and effects of $Y$, which we denote by $\vec{U}$. This case is common in medicine, where the measured variables are often blood markers (or other tests) that are indicative of an underlying condition.

The proxy setting is difficult to address in the standard transportability framework. While previous approaches to partially observed systems suggest restricting model inputs to those on stable paths \citep{subbaswamy2018counterfactual}, no observed proxies satisfy this condition in our setting. That is, even if the probability of $Y$ given its unobserved causes is invariant, the probability of $Y$ given the observed proxies may vary, along with the marginal probability of those proxies.

We will use concepts from causal inference and information theory to define and study the \textbf{proxy-based transportability} problem. Our framework will demonstrate that the perfect is indeed the enemy of the good: some variables, despite having an unstable relationship to the target, should still be included as features to build a model with improved stability.

A primary goal of this paper will be to distinguish between proxies that are ``helpful'' or ``hurtful'' for stability, a property that they inherit from the causal and anticausal variables whose information they contain. The stability of these unobserved variables depends on the transportability of their causal structure, which is unobserved. We will present a strategy for feature selection based on properties that propagate from the underlying causal structure to its observed proxies.
Specifically, we will build on the observation that post-selecting on a single value of the prediction label $Y$ induces a special independence structure, which the proxies for the causes and effects of $Y$ also inherit. We use this to classify proxies from partial knowledge of a few ``seeds,'' a technique we call \textbf{proxy bootstrapping}.

It is possible that some proxy variables will contain information about both stable and unstable hidden variables. We call these \textbf{ambiguous proxies} because it is unclear whether they will improve or worsen the model's transportability. Inspired by node splitting~\citep{subbaswamy2018counterfactual}, we introduce a method we call \textbf{causal information splitting (CIS)}, which can improve stability of our models at no cost (and even some benefit) to the distribution shift robustness. Again exploiting the inherited independence structure from post-selecting on $Y$, CIS isolates stabilizing information using seemingly unrelated auxiliary prediction tasks on the covariates. While theoretical guarantees require a number of assumptions, we demonstrate the surprising ability of CIS to separate stabilizing information from ambiguous variables on synthetic data experiments with relaxed assumptions. Furthermore, we demonstrate CIS's potential on U.S. Census data which were strongly shifted due to the COVID-19 pandemic. While plenty of experiments have confirmed that techniques for robust models do not consistently provide benefits over empirical risk minimization~\citep{gulrajani2021in}, our proposed technique provides benefits for an income prediction task in the majority of tested states.

\section{Related Work}\label{sec:related work}
Apart from work on transportability, there is an increasing body of work on domain generalization, see~\cite{quinonero2008dataset} for an overview. 
While we focus on proactively modeling shifts, work on invariant risk minimization~\citep{arjovsky2019invariant,bellot2020generalization} has approached this problem when given access to the shifted data on which the models will be used. Recent work further generalizes to unseen environments constituting mixtures~\citep{sagawa2019distributionally} and affine combinations~\citep{krueger2021out}. Data from multiple environments can also be used for causal discovery~\citep{invariant_prediction, heinze2018invariant, peters2016causal}.

Another line of work seeks robustness to small adversarial changes in the input that should not change the output (with attacks, e.g.~\cite{Croce_adv_eval_attacks} and defenses, e.g.~\cite{sinha2018certifiable}). 
% approach transportability from the \textbf{adversarial robustness} point of view, where the goal is to preserve the prediction despite small adversarial changes to the input, with a long line of attacks and defenses respectively. 
Moving from small changes to potentially bigger interventions, work on counterfactual robustness and invariance introduces additional regularization terms~\citep{veitch2021counterfactual, CounterfactuallyInvPredictors}.
Our work differs by allowing for interventions that change the label.

We do not address the tradeoffs associated with robustness and model accuracy in this paper. Such tradeoffs are a natural consequence of restricting the input information for our model, since unstable information is still useful in unperturbed cases. This problem is generally addressed by \cite{oberst2021regularizing} via regularization. 

\section{Background}

\paragraph{General Notation}
Uppercase letters denote random variables, while lowercase letters denote assignments to those random variables. Bold letters denote sets/vectors. 
The paper will use concepts from information theory, with $\H(A)$ indicating the \textbf{entropy} of $A$, $\I(A : B)$ indicating the \textbf{mutual information} between $A, B$, and $\I(A:B:C)$ indicating the \textbf{interaction information} between $A, B, C$. A short summary of key ideas (including the data processing inequality (DPI) and chain rule) is given in Appendix B (see \cite{cover1999elements} for more details).
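
As a brief reminder, and only as a sketch that defers to the conventions fixed in Appendix B, these quantities satisfy the following standard identities, where $\H(A \given C)$ is assumed here to denote conditional entropy:
\begin{align*}
 \I(A : B) &= \H(A) + \H(B) - \H(A, B), \\
 \I(A : B \given C) &= \H(A \given C) - \H(A \given B, C), \\
 \I(A : B, C) &= \I(A : C) + \I(A : B \given C).
\end{align*}
The last line is the chain rule for mutual information. The DPI states that processing a variable cannot increase the information it carries, i.e., $\I(A : F(B)) \leq \I(A : B)$ for any function $F$.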

\paragraph{Causal Graphical Models}
Graphically modeling distribution shift makes use of causal DAGs. For a causal DAG $\G = (\vec{V}, \vec{E})$, the joint probability distribution factorizes according to the local Markov condition,
\begin{equation*}
 \Pr(\vec{v}) = \prod_{v \in \vec{v}} \Pr(v \given \pa_{\vec{v}}^\G(V)).
\end{equation*}

$\Pa^{\G}(V)$ and $\Ch^\G(V)$ denote the parents and children of $V$ in $\G$. Following the uppercase/lowercase convention, $\pa_{\vec{v}}(V)$ is an assignment to $\Pa(V)$ using the values in $\vec{v}$.\footnote{$\Pa(V) \subseteq \vec{V}$} $\De^{\G}(V)$ and $\An^{\G}(V)$ denote the descendants and ancestors of $V$, respectively. $\Fm(V) = \Pa(V) \cup \Ch(V)$ denotes the ``family.''
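
For illustration, consider a hypothetical three-vertex chain $A \rightarrow B \rightarrow C$, which is not one of the graphs studied in this paper. Here $\Pa(A) = \emptyset$, $\Pa(B) = \{A\}$, and $\Pa(C) = \{B\}$, so the factorization above reads
\begin{equation*}
 \Pr(a, b, c) = \Pr(a) \, \Pr(b \given a) \, \Pr(c \given b).
\end{equation*}
In the same graph, $\De(A) = \{B, C\}$, $\An(C) = \{A, B\}$, and $\Fm(B) = \{A, C\}$.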

We will rely on the concepts of $d$\textbf{-separation} and \textbf{active paths} to discuss the independence properties of Bayesian networks; both concepts are reviewed in Appendix A. See \cite{pearl2009causality} for a more extensive study.

\paragraph{Active Path Notation}
In addition to using $A \indep_d B \given C$ to indicate $d$-separation conditioned on $C$, we will develop a notation to refer to sets of variables that act as ``switches'' for $d$-separation. $\vec{A} \doubleactivepathNB \vec{C} \doubleactivepathBN \vec{B}$ means that we have both $\vec{A} \not \indep_d \vec{B}$ and $\vec{A} \indep_{d} \vec{B} \given \vec{C}$. Conversely, we have $\vec{A} \doubleactivepathNC \vec{C} \doubleactivepathCN \vec{B}$ if $\vec{A} \indep_d \vec{B}$, but $\vec{A} \not \indep_d \vec{B} \given \vec{C}$ (i.e. conditioning on $\vec{C}$ renders $\vec{A}$ and $\vec{B}$ $d$-connected).
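
As a small illustration with hypothetical singleton sets, consider three vertices $A$, $B$, $C$ with no other paths between $A$ and $B$. In the chain $A \rightarrow C \rightarrow B$ and in the fork $A \leftarrow C \rightarrow B$, the vertices $A$ and $B$ are $d$-connected marginally and $d$-separated given $C$, so
\begin{equation*}
 A \doubleactivepathNB C \doubleactivepathBN B.
\end{equation*}
In the collider $A \rightarrow C \leftarrow B$, the vertices $A$ and $B$ are $d$-separated marginally and $d$-connected given $C$, so
\begin{equation*}
 A \doubleactivepathNC C \doubleactivepathCN B.
\end{equation*}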

%We will use $A \activepathNN B$ to denote the statement that $A$ and $B$ are connected by at least one active path. We can specify the conditioning set $\vec{Z}$ in this notation using $A \activepathNN B \given \vec{Z}$.

%When composing active paths, $A \activepathNN C$ and $C \activepathNN B$, the formation of $A \activepathNN B$ is dependent on the direction of the edges to $C$ in both paths. For this reason, we will sometimes specify the direction of the final edge of a path. For example, $A \activepathLN B$ means there exists an active path from $A$ to $B$ where first edge of the active path points into $A$. $A \activepathRN B$ or $B \activepathNL A$ means that the first edge of the active path points out of $A$. $A \activepathNB C \activepathBN B$ indicates that an active path exists which can be blocked by conditioning on $C$. $A \activepathNC C \activepathCN B$ states that there exists an inactive path which is unblocked via conditioning on $C$. When reasonable, this notation can be extended to sets of vertices. The interpretation for sets is $\vec{A} \activepathNN \vec{B}$ if and only if $\exists A \in \vec{A}$ and $\exists B \in \vec{B}$ such that $A \activepathNN B$.

%Finally, it will be useful to denote sets of variables which act as ``switches'' for $d$-separation. $\vec{A} \doubleactivepathNB \vec{C} \doubleactivepathBN \vec{B}$ means that we have both $\vec{A} \activepathNN \vec{B}$ and $\vec{A} \indep_{d} \vec{B} \given \vec{C}$. Conversely, we have $\vec{A} \doubleactivepathNC \vec{C} \doubleactivepathCN \vec{B}$ if $\vec{A} \indep_d \vec{B}$, but $\vec{A} \activepathNN \vec{B} \given \vec{C}$ (i.e. conditioning on $\vec{C}$ renders $\vec{A}$ and $\vec{B}$ $d$-connected).

\paragraph{Graphically Modeling Distribution Shift}
Borrowing terms from \cite{invariant_feature_selection}, we will begin with a graphical model $\G=(\vec{V} \cup \vec{U}, \vec{E})$, calling $\vec{U} \cup \vec{V}$ the \textbf{system variables}, with $\vec{V}$ observed and $\vec{U}$ unobserved.
We are also given a set of \textbf{context variables} $\vec{M}$, which model the mechanisms that shift our distribution.
The augmentation of $\G$ with $\vec{M}$ gives what we call the \textbf{distribution shift diagram} (DSD), $\G^+ = (\vec{V} \cup \vec{U} \cup \vec{M}, \vec{E} \cup \vec{E}_{\vec{M}})$, of which $\G$ is a subgraph, with the additional vertices $\vec{M}$ introducing shifts along the edges $\vec{E}_{\vec{M}}$.
The transportability problem \citep{pearl2011transportability} involves finding an input set $\vec{X} \subseteq \vec{V}$ such that $ \Pr(Y \given \vec{X}) = \Pr(Y \given \vec{X}, \vec{M})$.
Such a set $\vec{X}$, called an ``invariant set'' in \cite{invariant_feature_selection}, blocks all possible influence from the mechanisms of the dataset shift. \cite{pearl2011transportability} shows this framework is capable of modeling sampling bias and population shift.

\section{Setting}
This paper will consider the \textbf{proxy-based transportability (PBT)} setting. PBT focuses on the role of proxy variables in feature selection by assuming \emph{all} of the causes and effects $\vec{U} = \Fm(Y)$ are unobserved.\footnote{This assumption is not necessary but allows us to focus on more difficult questions that have not been answered by previous work. Namely, direct causes and effects can be visible or have perfect proxies without changing the results of the paper.} We are given access to a list of ``visible proxy variables'' $\vec{V} \setminus \{ Y \}$, each of which is a descendant of at least one $U \in \vec{U}$.
Hence, $\vec{V}$ can be thought of as the union of overlapping subsets $\Ch(U)$ for each $U \in \vec{U}$. %\leena{Shouldn't $V$ be a \textit{subset} of the union of $\Ch(U)$ for each $U \in \vec{U}$?}.
%\mila{addressed in next sentence - removing for now}

We will assume that there are no edges directly within $\vec{U}$ or within $\vec{V}$, which we call \textbf{systemic sparsity}. See Figure~\ref{fig:example context} for an example of this setting. This assumption enforces two useful independence properties: (1) $V_i \indep V_j \given U$ for $V_i, V_j \in \Ch(U)$ and (2) $U_i \indep U_j \given Y$ for $U_i \neq U_j \in \vec{U}$. Systemic sparsity guarantees that a discoverable causal structure exists within the unobserved variables and simplifies the interactions between the proxies.

We will build our theory on distribution shift diagrams $\G^+ = (\vec{V} \cup \vec{U} \cup \vec{M},  \vec{E} \cup \vec{E}_{\vec{M}})$  with one $M_i \in \vec{M}$ connected to a corresponding $U_i \in \vec{U}$. 
Each $M_i$ models a different shifting mechanism for each unobserved cause and effect of $Y$. It is common to assume there is no direct shifting mechanism acting on $Y$, which comes without loss of generality since such a mechanism can be thought of as another unobserved cause \citep{pearl2011transportability, invariant_prediction}.

In this setting, a perfect invariant set $\vec{X}$ in which $Y \indep_d \vec{M} \given \vec{X}$ does not exist. Proxy-based transportability will instead seek to minimize the influence of the context variables on our label function. Borrowing concepts from information theory, the task in the proxy-based transportability problem corresponds to finding a set of features $\vec{X}$ that minimizes the conditional mutual information between the label and the environment. We call this quantity, $\I(Y:\vec{M} \given \vec{X})$, the \textbf{context sensitivity}. To allow for feature engineering, we define these features to be the output of a function, $\vec{X} = F(\vec{V} \setminus \{Y\})$ which can capture higher-level representations of $\vec{V} \setminus \{Y\}$.

\begin{figure*}[tb]
 \centering
 (a)\scalebox{.5}{
 \begin {tikzpicture}[-latex ,auto ,node distance =2 cm and 2 cm ,on grid , thick, state/.style ={ circle, draw, minimum width =.85 cm}, cstate/.style ={ circle, draw, minimum width =.8 cm, ultra thick}]
  \node[state, accepting, color=green!50!black] (MU1) [] {$M_{1}$};
  \node[state, dashed, color=blue] (U1) [below right=of MU1] {$U_1$};
  \node[state] (V1)[below left = 3cm and 2cm of U1] {$V_1$};
  \node[state] (V2)[right = of V1] {$V_2$};
  \node[state] (V3)[ right = of V2] {$V_3$};
  \node[state] (V4)[ right = of V3] {$V_4$};
  \node[state] (V5)[ right = of V4] {$V_5$};
  \node[state] (V6)[ right = of V5] {$V_6$};
  \node[state] (V7)[ right = of V6] {$V_7$};
  \node[state, dashed, color=blue] (U2) [right = 4cm of U1] {$U_2$};
  \node[state, accepting, color=green!50!black] (MU2) [above left = of U2] {$M_{2}$};
  \node[state, dashed, color=blue] (U3) [right = 4cm of U2] {$U_3$};
  \node[state, accepting, color=green!50!black] (MU3) [above right = of U3] {$M_{3}$};
  \node[state] (Y) [ above = 4cm of U2] {$Y$};
  \path[very thick](U1) edge (V1) (U1) edge (V2) (U1) edge (V3) (U2) edge (V3) (U2) edge (V4) (U2) edge (V5) (U3) edge (V5) (U3) edge (V6) (U3) edge (V7);
  \path[very thick](U1) edge (V6) (U3) edge (V2) (U2) edge (V6);
  \path[very thick](U1) edge[bend left=30] (Y) (Y) edge (U2) (Y) edge[bend left=30] (U3);
  \path[very thick, color=green!50!black] (MU1) edge (U1) (MU2) edge (U2) (U3) edge (MU3);
 \end{tikzpicture}
 }
 (b)\scalebox{.67}{
 \begin {tikzpicture}[-latex ,auto ,node distance =2 cm and 2 cm ,on grid , thick, state/.style ={ rectangle, draw, minimum width =1 cm, thick, minimum height = .8cm}]
  \node[state] (Y) {Income};
  \node[state, green!50!black, double] (COVID) [below left = 1.5cm and 2.5cm of Y] {Pandemic};
  \node[state, dashed, blue] (Interests)[below left = 3cm and 4.5cm of Y] {Interests};
  \node[state, dashed, blue, minimum width =.8 cm] (Employment)[below left = 3cm and 1.5cm of Y]  {Employment};
  \node[state, dashed, blue, minimum width =.8 cm] (Residence)[below right = 3cm and 1.5cm of Y]  {Residence};
  \node[state, dashed, blue] (Eligibility)[below right = 3cm and 4.5cm of Y]  {Medicaid Eligibility};
  \node[state] (Commute)[below = 5cm of Y]  {Commute};
  \node[state] (MedicaidStatus)[below right = 5cm and 3cm of Y]  {Medicaid Status};
  \node[state] (Education)[below left = 5cm and 3cm of Y]  {Education};
  \path (Interests) edge[bend left = 30] (Y) (Employment) edge[bend left =10] (Y);
  \path (Y) edge[bend left = 10] (Residence) (Y) edge[bend left = 30] (Eligibility);
  \path (Interests) edge (Education);
  \path (Employment) edge (Commute) (Residence) edge (Commute);
  \path (Employment) edge (MedicaidStatus) (Eligibility) edge (MedicaidStatus);
  \path[green!50!black] (COVID) edge (Interests) (COVID) edge (Employment) (COVID) edge (Residence) (COVID) edge (Eligibility);
 \end{tikzpicture}}
 \caption{Examples of the $\G^+$ considered for the paper. (a) shows a generic setup where $U_1$ is a hidden cause of $Y$, and $U_2, U_3$ are hidden effects. (b) shows a \textit{plausible} model explaining the success of our real-data experiment in Section~\ref{sec:real_data}.}
 \label{fig:example context}
\end{figure*}


\paragraph{Challenges in PBT}
The PBT setting is difficult to address using existing methods for transportability. Building a model on the causes $\Pa(Y)$ as in \cite{scholkopf2012causal} is impossible because all of the causes are unobserved. Furthermore, finding a separating set as in \cite{invariant_feature_selection, pearl2011transportability} is also impossible for the same reason.
Proxies can contain combinations of both stable and unstable information when they are connected to multiple $U \in \vec{U}$. Introduced in \cite{subbaswamy2018counterfactual}, ``node splitting'' requires knowledge of the structural equations that govern a vertex  to remove unstable information from ambiguous variables, which can only be learned if the causes of the split node are observed. This requirement limits node splitting's power in the proxy setting. 

\subsection{Invertible Dropout Functions}
We will demonstrate the failure of existing transportability approaches in this setting using a counterexample built on structural equation models with cleanly interpretable entropic relationships. This construction will show the cost of restricting features to those with stable paths to the prediction variable $Y$, and serve as a framework for understanding the problem in general. For a discussion of relaxations, see Sec.~\ref{sec:conclusion}, and for a demonstration that our method can work in real-world settings (where the assumption does not hold), see Sec.~\ref{sec:aux}.

Our restricted structural equations give edges from $A$ to $B$ described by an invertible function with ``dropout'' noise,
\begin{equation} \label{eq: components structural eq}
    B^{(A)}(A) = \begin{cases}
        \mathcal{T}_{A,B}(A) &\text{with probability } \alpha_{A, B}\\
        \phi &\text{with probability } 1 - \alpha_{A,B}
    \end{cases}.
\end{equation}
$\mathcal{T}_{A,B}(\cdot)$ is a function that is invertible, with $\mathcal{T}_{A,B}(\phi) = \phi$. The probability that information from the parent is preserved is given by $\alpha_{A,B} \in [0, 1]$. We will refer to $B^{(A)}(A) \neq \phi$ as ``transmission,'' and $\alpha_{A,B}$ as the ``probability of transmission.''\footnote{The direction of the edge for these $\alpha_{A,B}$ will sometimes be arbitrary, in which case the ordering of the vertices is unimportant.} $\phi$, called ``null'', is a value that represents the dropout, or the failure of the edge to ``transmit''. 

The structural equation for a vertex $B$ given its parents is a deterministic function of these $B^{(A)}$,
\begin{equation}
    B = \mathcal{T}_B(\{B^{(A)}(A) \text{ for } A \in \Pa(B)\}),
\end{equation}
where $\mathcal{T}_B$ is not necessarily an invertible function.

For vertices $A$ with many children, the probability that the edge to at least one child transmits is
\begin{equation}
    \alpha_{A,\Ch(A)} := 1 - \prod_{B \in \Ch(A)}(1- \alpha_{A,B}).
\end{equation}
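As a quick numerical illustration (the transmission probabilities here are hypothetical and chosen only for arithmetic convenience), a vertex $A$ with two children $B_1, B_2$ and $\alpha_{A,B_1} = 0.5$, $\alpha_{A,B_2} = 0.6$ has
\begin{equation*}
    \alpha_{A,\Ch(A)} = 1 - (1 - 0.5)(1 - 0.6) = 1 - 0.2 = 0.8,
\end{equation*}
so adding children can only increase the chance that some copy of $A$ survives dropout.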

\paragraph{Separability and Faithfulness}
If $\mathcal{T}_B$ is invertible, we say that $B$ is a separable variable, which means that a child $B$ with more than one parent can be split into separate disconnected vertices $B^{(A)}$ for $A \in \Pa(B)$, each with the structural equation given by Equation~\ref{eq: components structural eq} (see Figure~\ref{fig:example splitting}). Separable variables make up a special violation of faithfulness in that conditioning on separable colliders no longer opens up active paths, as illustrated by Lemma~\ref{lem: d conn but not dependent}.

\begin{lemma}[Separability violates faithfulness] \label{lem: d conn but not dependent}
If $U_1 \doubleactivepathNC V \doubleactivepathCN U_2$ and $V$ is separable, then $U_1 \not \indep_d U_2 \given V$, but $U_1 \indep U_2 \given V$.
\end{lemma}
The proof follows from the definition of mutual information and the fact that $U_1 \indep U_2 \given V$.

\begin{figure}[t]
 \centering
  \scalebox{.5}{
 \begin {tikzpicture}[-latex ,auto ,node distance =2 cm and 2 cm ,on grid , thick, state/.style ={ circle, draw, minimum width =.85 cm}, cstate/.style ={ circle, draw, minimum width =.8 cm, ultra thick}]
\filldraw[color=black, fill=black!5, very thick](-.8,-1) rectangle (2.8, -3)
            node[right] {$B$};
  \node[state] (A1) {$A_1$};
  \node[state] (A2)[right=of A1] {$A_2$};
  \node[state] (BA1)[below = of A1] {$B^{(A_1)}$};
  \node[state] (BA2)[below = of A2] {$B^{(A_2)}$};
  \path[very thick] (A1) edge (BA1) (A2) edge (BA2);
  \node[state] (A1a)[left = 6cm of A1] {$A_1$};
  \node[state] (A2a)[right=of A1a] {$A_2$};
  \node[state, fill=black!5] (B)[below right = 2cm and 1cm of A1a] {$B$};
  \path[very thick] (A1a) edge (B) (A2a) edge (B);
 \end{tikzpicture}
 }
 \caption{A diagram showing separability.} \label{fig:example splitting} 
\end{figure}

Our setting will rely on the assumption of faithfulness of the sub-graph on the $\vec{U} \cup \{Y\}$ vertices for proxy bootstrapping, as is the case for algorithms attempting any degree of structure learning. Specifically, we will require that any active path between two proxies $V_i, V_j$ that does not travel through any other vertices in $\vec{V}$ must imply statistical dependence (we call this ``partial faithfulness''). When we move to causal information splitting, we will allow \emph{specific} violations of faithfulness that come from separable proxies $\vec{V}$ in order to illustrate an ideal use-case of our method. This does not contradict partial faithfulness.

\paragraph{Transmitting Active Paths}
A convenient aspect of these structural equations is that $\alpha_{A,B}$ controls the mutual information between $A$ and its child $B^{(A)}$,
\begin{equation*}
\begin{aligned}
    \I(A:B^{(A)}) =& \H(A) - \H(A \given B^{(A)})\\
    =& \H(A) - \Pr(B^{(A)} = \phi) \H(A \given B^{(A)}= \phi)\\
    &- \Pr(B^{(A)} \neq \phi) \H(A \given B^{(A)} \neq \phi)
\end{aligned}
\end{equation*}
An important insight is that $\H(A \given B^{(A)}= \phi) = \H(A)$, since a dropped-out edge reveals nothing about $A$, and $\H(A \given B^{(A)} \neq \phi) = 0$, since $\mathcal{T}_{A,B}$ is invertible. Applying this gives,
\begin{equation*}
\begin{aligned}
    \I(A:B^{(A)}) = \H(A) - (1-\alpha_{A,B})\H(A) = \alpha_{A,B} \H(A).
\end{aligned}
\end{equation*}
This aspect generalizes to active paths. For a length-2 path $A \rightarrow B \rightarrow C$, $
    \I(A:C) = \I(A:C^{(B)}) = \H(A) - \H(A \given C^{(B)})$.
Again, we can break up $\H(A \given C^{(B)})$ into $\H(A \given C^{(B)} = \phi) = \H(A)$ and $\H(A \given C^{(B)} \neq \phi) = 0$. Hence, reasoning about mutual information reduces to the task of determining the probability that one of the endpoints is null. In our setup, the dropout events of different edges are independent events, so the path transmits with probability $\alpha_{A,B}\alpha_{B,C}$ and $\I(A:C) = \alpha_{A,B}\alpha_{B,C} \H(A)$.
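As a small worked instance (the numbers are illustrative assumptions, not values used elsewhere in the paper), let $A$ be a uniform binary variable that is never null, so $\H(A) = 1$ bit, and take $\alpha_{A,B} = 0.8$ and $\alpha_{B,C} = 0.5$. Both edges must transmit for $C$ to carry information about $A$, which gives
\begin{equation*}
    \I(A:C) = \alpha_{A,B}\,\alpha_{B,C}\,\H(A) = 0.8 \times 0.5 \times 1 = 0.4 \text{ bits}.
\end{equation*}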

Conditioning adds an additional complication. Notice that transmitting active paths can ``transfer'' a conditioning. That is, $\H(A \given x) = 0$ when there is only one active path between $A$ and $X$ (or $X$ to $A$) and it transmits. In the next section, we will study two cases that emerge in the PBT problem: colliders and non-colliders.

\section{Context Sensitivity}
\label{sec:context_sensitivity}
We quantify robustness through the dependence between the environmental mechanisms and the label function.

\begin{definition}[Context sensitivity]
Context sensitivity of a mechanism $M \in \vec{M}$  is defined as $\I(Y:M \given \vec{X})$. 
\end{definition}
If $\vec{X}$ $d$-separates $\vec{M}$ from $Y$, the context sensitivity is 0 and training on $\vec{X}$ to predict $Y$ yields a model that is robust across environments $\vec{M}$.

We are usually most concerned with the success of our prediction models, something that is limited by the ``relevance'', $\I(Y:\vec{X})$, of our input. This concept is related to context sensitivity, and we can rewrite the sensitivity in terms of the expected relevance across environments.
\begin{equation*}
\begin{split}
    \I(Y:M \given \vec{X}) = \I(Y:M) - \I(Y:M :\vec{X}) \\
    = \I(Y:M) - \I(Y:\vec{X}) + \I(Y:\vec{X} \given M) .
\end{split}
\end{equation*}
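For readers checking the second equality, it may help to spell out the intermediate step. Under the sign convention implied by the first line, the interaction information is symmetric in its arguments, so it can be expanded around $\vec{X}$ instead of $M$:
\begin{equation*}
    \I(Y:M:\vec{X}) = \I(Y:\vec{X}) - \I(Y:\vec{X} \given M).
\end{equation*}
Substituting this expression into the first line yields the second.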

\subsection{Redundancy}
Recall that in our setting we assume that all direct causes and effects are unobserved. This unobserved set
gives rise to an invariant set $\vec{S} \subseteq \vec{U}$.\footnote{The Markov boundary of $Y$ would also give an invariant set, but could include vertices in $\vec{M}$ that are parents of effects of $Y$.} We seek to identify a subset of visible proxies $\vec{X} \subseteq \vec{V}$ to extract information about $\vec{S}$.

\begin{definition}
    For a specific $U$, we call $\I(U:\vec{X}) = \H(U) - \H(U \given \vec{X})$ the \textbf{redundancy} between $U$ and $\vec{X}$.
\end{definition}
\begin{lemma} \label{lem: setting redundancy}
In the dropout function setting,
let $\Ch_{\vec{X}}(U) := \Ch(U) \cap \vec{X}$. Then
    \begin{equation*}
        \I(U:\vec{X}) = \alpha_{U,\Ch_{\vec{X}}(U)} \H(U).
    \end{equation*}
\end{lemma}
Redundancy in the dropout function setting is controlled by our choice of $\vec{X}$ via $\alpha_{U,\Ch_{\vec{X}}(U)}$, the probability of transmission to at least one child. 

Our graphical assumptions ensure that only one potential active path exists between each $M \in \vec{M}$ and $Y$; hence each vertex acts as either a collider or a non-collider in the interaction of $M$ and $Y$ (and does not do both). We now demonstrate that redundancy with stable (non-collider) variables generally improves our context sensitivity, whereas redundancy with unstable (collider) variables worsens it.
 
\paragraph{``Good'' $\vec{U}$} If $M_i$ and $Y$ do not form a collider at $U_i \in \vec{U}$, we say $U_i \in \Ugood$. From $d$-separation, we have that $M_i \indep_d Y \given U_i$ for all $U_i \in \Ugood$. For example, $\Ugood = \{U_1, U_3\}$ in Figure~\ref{fig:example context}. Let $\Ch_{\vec{X}}(U_i) = \Ch(U_i) \cap \vec{X}$.

\begin{lemma}[Redundancy with $\Ugood$] \label{lem: new applied DPI}
In the dropout function setting, for some $U_i \in \vec{U}$, if corresponding $M_i \doubleactivepathNB U_i \doubleactivepathBN Y$, then
\begin{equation*}
    \I(M_i:Y \given \vec{X}) = \alpha_{M_i,U_i}(1 - \alpha_{U_i, \Ch_{\vec{X}}(U_i)})\alpha_{U_i,Y} \H(M_i).
\end{equation*}
\end{lemma}
Lemma~\ref{lem: new applied DPI} comes from multiplying the probability of transmission of each edge along the path $M_i, U_i, Y$. We also pick up the factor $(1 - \alpha_{U_i, \Ch_{\vec{X}}(U_i)})$, which requires that none of the edges from $U_i$ to $\vec{X}$ transmit; if any of them did, conditioning on $\vec{X}$ would reduce the entropy of $U_i$ to nothing and close off the path.
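As a hedged numerical illustration of Lemma~\ref{lem: new applied DPI} (the transmission probabilities below are hypothetical), suppose $\alpha_{M_i,U_i} = 0.9$, $\alpha_{U_i,Y} = 0.8$, and $\vec{X}$ contains two children of $U_i$ with transmission probabilities $0.5$ and $0.6$, so that $\alpha_{U_i, \Ch_{\vec{X}}(U_i)} = 1 - (0.5)(0.4) = 0.8$. Then
\begin{equation*}
    \I(M_i:Y \given \vec{X}) = 0.9 \times (1 - 0.8) \times 0.8 \times \H(M_i) = 0.144\, \H(M_i),
\end{equation*}
compared with $0.9 \times 0.8 \times \H(M_i) = 0.72\, \H(M_i)$ when $\vec{X}$ contains no children of $U_i$.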

\paragraph{``Bad'' $\vec{U}$} The inclusion of $\Ch(U_i)$ in $\vec{X}$ could open up active paths via colliders of the form $M_i \rightarrow U_i \leftarrow Y$. We call the set of these variables $\Ubad$. For example, $\Ubad = \{U_2\}$ in Figure~\ref{fig:example context}.
\begin{lemma}[Redundancy with $\Ubad$]
\label{lem: new Collider DPI}
In the dropout function setting, for $U_i \in \vec{U}$ and $\vec{X} \subseteq \vec{V}$, if $M_i \doubleactivepathNC U_i \doubleactivepathCN Y$ then
\begin{equation*}
     \I(M_i:Y \given \vec{X}) = \alpha_{U_i, \Ch_{\vec{X}}(U_i)} \I(M_i:Y \given U_i).
\end{equation*}
\end{lemma}
Lemma~\ref{lem: new Collider DPI} demonstrates that there are still proxies for which inclusion hurts our model's robustness. Similar concepts can be demonstrated via upper bounds when we allow arbitrary sets of structural equations (given in Appendix C). Optimizing these upper bounds does not give a guarantee of optimality, but can still point towards a general improvement.

\subsection{Feature Selection Implications}
The proxy graphical setup requires $\vec{X} \doubleactivepathNB \vec{U} \doubleactivepathBN Y$, meaning the relevance of our input is upper bounded by the redundancy with $\vec{U}$,
$\I(\vec{X}: Y) \leq \I(\vec{U} : \vec{X})$.

Lemma~\ref{lem: new applied DPI} shows that proxies of $\Ugood$ help build accurate and universal models, while Lemma~\ref{lem: new Collider DPI} shows that proxies of $\Ubad$ can trade universality for domain-specific accuracy. Of course, proxies need not lie neatly in these two classes - many proxies contain a combination of universally-relevant and domain-relevant features. This suggests multiple classes of proxy variables.
\begin{definition}
    \begin{align}
        \Vgood &:= \Ch(\Ugood) \setminus \Ch(\Ubad)\\
        \Vbad &:= \Ch(\Ubad) \setminus \Ch(\Ugood)\\
        \Vambig &:= \Ch(\Ubad) \cap \Ch(\Ugood)
    \end{align}
\end{definition}
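
For a concrete instance (our own reading of the edges drawn in Figure~\ref{fig:example context}(a), where $\Ugood = \{U_1, U_3\}$ and $\Ubad = \{U_2\}$), the children sets are $\Ch(U_1) = \{V_1, V_2, V_3, V_6\}$, $\Ch(U_2) = \{V_3, V_4, V_5, V_6\}$, and $\Ch(U_3) = \{V_2, V_5, V_6, V_7\}$. Applying the definition gives
\begin{align*}
    \Vgood &= \{V_1, V_2, V_7\}, &
    \Vbad &= \{V_4\}, &
    \Vambig &= \{V_3, V_5, V_6\}.
\end{align*}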

The behavior of $\Vgood$ in the dropout function setting shows how restricting models to invariant features fails; a high redundancy with $\Ugood$ is beneficial for the context sensitivity even though the paths from the proxies are unstable. Inclusion of $\Vgood$ in $\vec{X}$ improves context sensitivity even though $\Vgood$ is not made up of direct causes (as suggested by \cite{scholkopf2012causal}) or invariant features (as suggested by \cite{invariant_feature_selection} and \cite{subbaswamy2018counterfactual}).

For feature selection, an obvious strategy is to choose $\vec{X} = \Vgood$, avoid $\Vbad$, and potentially try using some elements in $\Vambig$. In the next section we will explore how we can use non-invertible functions to transform these $\Vambig$ into $\Vgood$.

\subsection{Proxy Bootstrapping}\label{sec:bootstrapping}
Given the robustness implications of the different classes of $V$, partitioning the proxies into good, bad, and ambiguous classes will be important. We will now demonstrate how to harness partial information to determine these partitions and classify proxies. This step is optional if the role of each proxy is already understood (as is the case when the DAG is known). The results in this subsection will \emph{only} require the graphical assumptions of the PBT setting, i.e., systemic sparsity, partial faithfulness, and an independent shifting mechanism $M_i$ for each $U_i \in \vec U$.

We begin with an observation about the independence structure of the conditional probability distribution on $Y$.
\begin{lemma}[Linking related proxies]\label{lem:conditioning on y separates sets}
Within the graphical constraints of PBT, if $V_i \not \indep_d V_j \given Y$, then either they have a shared parent ($\Pa(V_i) \cap \Pa(V_j) \neq \emptyset$) or they both have at least one parent that is a cause of $Y$ (i.e. $\Pa(V_i) \cap \Pa(Y) \neq \emptyset$ and $\Pa(V_j) \cap \Pa(Y) \neq \emptyset$).
\end{lemma}

\begin{definition}\label{def: DependenceGraph}
    For a DSD $\G^+ = (\vec{V} \cup \vec{U} \cup \vec{M}, \vec{E} \cup \vec{E}_{\vec{M}})$, define the dependence graph $\G_Y = (\vec{V}, \vec{E}_Y)$ to be an undirected graph with edges $(V_i, V_j) \in \vec{E}_Y$ iff $V_i \not \indep_d V_j \given Y$.
\end{definition}

Lemma~\ref{lem:conditioning on y separates sets} tells us that $\G_Y$ will have a clique on the sets $\Ch^\G(U)$ for $U \in \vec{U}$. Furthermore, conditioning on $Y$ links its causes, so $\G_Y$ has one large clique on $\Ch^\G(\Pa(Y))$. This clique structure can be utilized to enhance partial knowledge of $\Ch(\Ugood)$ and $\Ch(\Ubad)$. In this sense, ``birds of a feather flock together'' -- information about each clique's proxies can be determined from understanding a single member of that clique.
 \begin{lemma}[Information about seed proxies spreads]\label{lem: bootstraping} If $V_i \in \Vgood$ then all neighbors $V_j \in \Nb^{\G_Y}(V_i)$ are not in $\Vbad$, i.e., $V_j \in \Vgood \cup \Vambig$. If $V_i \in \Vbad$ then all neighbors $V_j \in \Nb^{\G_Y}(V_i)$ are not in $\Vgood$, i.e., $V_j \in \Vbad \cup \Vambig$.
 \end{lemma}
 
 Lemma~\ref{lem: bootstraping} suggests a simple algorithm for bootstrapping the sets $\Vgood, \Vbad, \Vambig$ from a set of ``seed'' vertices $\vec{V}^* \subseteq \vec{V}$ with known assignments to $\Vgood, \Vbad, \Vambig$.

\begin{enumerate}
    \setlength{\itemsep}{0pt}
     \item Construct $\G_Y$ according to Definition~\ref{def: DependenceGraph} using conditional independence tests.
     \item For each $V^* \in \vec{V}^*$, if $V^* \in \Vgood$ then add a ``good'' label to $\Nb(V^*)$. If $V^* \in \Vbad$ then add a ``bad'' label to $\Nb(V^*)$.
     \item All $V \in \vec{V} \setminus \vec{V}^*$ with both ``good'' and ``bad'' labels receive an ``ambiguous'' label instead.
\end{enumerate}
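
As a worked illustration (our own walk-through of the generic graph in Figure~\ref{fig:example context}(a), with a hypothetical choice of seeds and assuming perfect conditional independence tests), take $\vec{V}^* = \{V_1, V_4, V_7\}$ with $V_1, V_7$ known to be good and $V_4$ known to be bad. Conditioning on $Y$ blocks every path through $Y$, and the unconditioned proxies act as closed colliders, so in this particular graph two proxies are adjacent in $\G_Y$ exactly when they share a parent:
\begin{align*}
    \Nb(V_1) &= \{V_2, V_3, V_6\}, \\
    \Nb(V_4) &= \{V_3, V_5, V_6\}, \\
    \Nb(V_7) &= \{V_2, V_5, V_6\}.
\end{align*}
Step 2 gives $V_2$ only ``good'' labels, while $V_3$, $V_5$, and $V_6$ each receive both a ``good'' and a ``bad'' label. Step 3 therefore marks $V_3, V_5, V_6$ as ambiguous and leaves $V_2$ as good, which agrees with reading $\Vgood$, $\Vbad$, and $\Vambig$ directly off the edges of the figure.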
 
 \begin{theorem}[Proxy bootstrapping works]\label{thm: proxy bootstrapping works}
 Upon termination of proxy bootstrapping, all vertices with a single label are correctly described if:
 \begin{enumerate}
    \setlength{\itemsep}{0pt}
    \item Partial faithfulness holds.
     \item $\vec{V}^*$ has at least one $V^* \in \vec{V}^* \cap \Ch(U)$ for each $U \in \Ugood \cap \Ch(Y)$.
     \item $\vec{V}^*$ has at least one $V^* \in \vec{V}^* \cap \Ch(\Pa(Y))$.
     \item $\vec{V}^*$ has at least one $V^* \in \vec{V}^* \cap \Ch(U)$ for each $U \in \Ubad$.
 \end{enumerate}
 \end{theorem}
 
 
\section{Causal Information Splitting}
This section will expand our theory into \textbf{feature engineering}, which allows us to build inputs on functions of $\vec{V}$. A main takeaway from Section~\ref{sec:context_sensitivity} was that we should build models using proxies for $\Ugood$ and avoid using features that are proxies for $\Ubad$. The extension of this to engineered features is to build a model on \emph{functions} of proxies for which the output of those functions \emph{is} related to $\Ugood$ and \emph{not} related to $\Ubad$. We present two lemmas to formalize this notion.


Let $\widetilde{\Ch}_{\vec{X}}(U_i)$ be the children or functions of children of $U_i$ in $\vec{X}$. Lemma~\ref{lem: new applied DPI engineering} shows that building models with more redundancy with $\Ugood$ (i.e., lower $\H(U_i \given \widetilde{\Ch}_{\vec{X}}(U_i))$) improves our context sensitivity in the dropout function setting.\footnote{Appendix C shows that redundancy with $\Ugood$ lowers an upper bound on context sensitivity in more general cases.}

\begin{lemma}[Engineering redundancy for $\Ugood$] \label{lem: new applied DPI engineering}
In the dropout function setting, if $U_i \in \Ugood$ then
\begin{equation*}
    \I(M_i:Y \given \vec{X}) = \alpha_{M_i,U_i}\alpha_{U_i,Y}\H(U_i \given \widetilde{\Ch}_{\vec{X}}(U_i)).
\end{equation*}
\end{lemma}

Of course, even good proxies are related to $\Ubad$ through their connection to $Y$, so $\vec{X} \indep \Ubad$ is impossible. Instead, Lemma~\ref{lem: if we dont include information with bad u, then we are happy} tells us that if we avoid redundancy with $\Ubad$ after conditioning on $Y$, we do not pick up any context sensitivity from the associated shifting mechanisms.

\begin{lemma}[Avoiding redundancy with $\Ubad$] \label{lem: if we dont include information with bad u, then we are happy}
For some $U_i \in \Ubad$, if we maintain $\I(U_i: \vec{X} \given Y) = 0$, then $\I(M_i:Y \given \vec{X}) = 0$.
\end{lemma}

Recall that ambiguous proxies contain information about both $\Ugood$ and $\Ubad$. The inclusion of an ambiguous proxy $V_A$ improves context sensitivity because of its redundancy with $\Ugood$ via Lemma~\ref{lem: new applied DPI engineering}. This section will develop a technique for filtering $V_A$ into $F(V_A)$, which will satisfy the conditions in Lemma~\ref{lem: if we dont include information with bad u, then we are happy}. To do this, we will require separability.

\paragraph{Separable Ambiguous Proxies}
\begin{figure}[h]
    \centering
    \scalebox{.45}{
    \begin {tikzpicture}[-latex ,auto ,node distance =2 cm and 1.5 cm ,on grid , thick, state/.style ={ circle, draw, minimum width =.85 cm}, cstate/.style ={ circle, draw, minimum width =.8 cm, ultra thick}]
            \filldraw[color=black, fill=black!5, very thick](-1.4,-3.6) rectangle (1.4, -5.4)
            node[right] {$V_A$};
            \node[state] (Y) [] {$Y$};
            \node[state, dashed, color=blue] (UG) [below left = 3cm and 2 cm of Y] {$U_G$};
            \node[state, dashed, color=blue] (UB) [below right=3cm and 2cm of Y] {$U_B$};
            \node[state, accepting, color=green!50!black] (MUG) [above left = 1.5cm and 2cm of UG] {$M_{G}$};
            \node[state, accepting, color=green!50!black] (MUB) [above left =1.5cm and 2cm of UB] {$M_{B}$};
            \node[state] (VG) [below right = 1.5cm and 1.35cm of UG] {$V_A^{(G)}$};
            \node[state] (V1) [below left = 1.5cm and 2cm of UG] {$V_G$};
            \node[state] (V2) [below right = 1.5cm and 2cm of UB] {$V_B$};
            \node[state] (VB) [below left = 1.5cm and 1.35cm of UB] {$V_A^{(B)}$};
            \path[very thick] (UG) edge (VG) (UG) edge (Y) (Y) edge (UB) (UB) edge (VB);
            \path[very thick] (MUG) edge (UG) (MUB) edge (UB);
            \path[very thick] (UG) edge (V1) (UB) edge (V2);
    \end{tikzpicture}
    }
    \caption{$V_G \in \Vgood$, $V_B \in \Vbad$. $V_A \in \Vambig$ is a linear transformation of two components, $V_A^{(G)},V_A^{(B)}$, which are good and bad respectively.\label{fig:ambiguous and splitable}} 
\end{figure}
Consider the setup in Figure~\ref{fig:ambiguous and splitable}, where $V_G \in \Vgood$, $V_B \in \Vbad$, and $V_A \in \Vambig$. $V_A$ is generated by invertible $\mathcal{T}_A$, making it a \textbf{separable ambiguous proxy (SAP)}.\footnote{While we may still be able to gain useful information from non-separable proxies, the tradeoffs are difficult to quantify and hence beyond the scope of this paper.}
Splitting $V_A$ into components allows us to isolate the origins of its ambiguity - the mixing of good information from $V_A^{(G)}$ and bad information from $V_A^{(B)}$.

\subsection{Isolation Functions}\label{ssec:splitting}
We would like to isolate $V_A^{(G)}$ from $V_A$ to avoid paying the penalty for $V_A^{(B)}$. We will do this using \textbf{isolation functions}.

\begin{definition}%[Isolation]
We define an \textbf{isolation function} of $V_i$ on $V_A$, with optional conditioning on $y$, to be
\begin{equation}
    \begin{split}
    &\Fisolate{V_i}{V_A \given y}:=\argmin_F \H(F(V_A) \given y)\\
    &\text{such that } \I(F(V_A): V_i \given y) = \I(V_A : V_i \given y).
\end{split}
\end{equation}
$\Fisolate{V_i}{V_A \given Y}$ gives a vector of functions with an entry for each $y \in Y$.
\end{definition}
Note that isolation functions are sufficient statistics for $V_i$ \citep{cover1999elements}. Isolation involves maintaining the information about $V_i$ while removing excess noise.

Recall from Lemma~\ref{lem: if we dont include information with bad u, then we are happy} that in order to avoid worsening context sensitivity, we want to ensure $\I(F(V_A): \Ubad \given Y) = 0$. Isolation functions on SAPs are well designed for this purpose, because they enforce the independence properties of the isolated vertex on their outputs. In order to achieve $\I(F(V_A): \Ubad \given Y) = 0$ while preserving as much information about $\Ugood$ as possible, an optimal isolation function would be to isolate $\Ugood$ using $\Fisolate{\Ugood}{V_A \given Y}$. 

Of course, we do not have access to $\Ugood$, so our next best option is to isolate $\Vgood$ using $\Fisolate{\Vgood}{V_A \given Y}$, since $\Ubad \indep \Vgood \given Y$. Lemma~\ref{lem: isolation bound 2} shows that the output of $\Fisolate{V_G}{V_A \given Y}$ behaves like a good proxy if $V_G \in \Vgood$ and $V_A$ is a SAP.
\begin{lemma}[Isolating $\Vgood$ behaves like $\Vgood$]\label{lem: isolation bound 2}
For $V_G \in \Vgood$ and $U_B \in \Ubad$ and an isolation function $\Fisolate{V_G}{V_A \given Y}$, 
\[
   \I(U_B : \Fisolate{V_G}{V_A \given Y} \given Y) = 0.
\]
\end{lemma}
The benefit from $\Fisolate{V_G}{V_A\given Y}$'s information about $\Ugood$ is difficult to quantify for use with Lemma~\ref{lem: new applied DPI engineering}, but lower bounds are obtained in Appendix D.

Even without a quantification of improvement, Theorem~\ref{thm: when help and do no harm} shows that isolation functions can avoid worsening the context sensitivity, while certain conditions can guarantee relevance gains for predicting $Y$.
\begin{theorem}[CIS costs and benefits]\label{thm: when help and do no harm}
Consider $V_G \in \Vgood$ and $V_A \in \Vambig$ where $V_A$ is a SAP. Also consider the isolation function $\Fisolate{V_G}{V_A \given Y}$. We will compare the context sensitivity of inputs $\vec{X} := \{V_G\}$ and $\vec{X}^+ := \{V_G, \Fisolate{V_G}{V_A \given Y}\}$. We claim that $\I(M:Y \given \vec{X}^+) \leq \I(M:Y \given \vec{X})$ for all $M \in \vec{M}$.
Furthermore, if 
\begin{equation}\label{eq: thm2}
\begin{aligned}
    \I(\Fisolate{V_G}{V_A \given Y}:V_G) &<\\ \I(\Fisolate{V_G}{V_A \given Y}&:V_G \given Y),
\end{aligned}
\end{equation} then the relevance improves: $\I(Y:\vec{X}^+) > \I(Y:\vec{X})$.
\end{theorem}
Theorem~\ref{thm: when help and do no harm} tells us that using an isolation function helps when the function is more predictive of the isolated variable in the post-selected $Y$ distribution than it is in the full distribution. This condition is sufficient but loose  because it does not take into account direct effects from $\I(Y:\Fisolate{V_G}{V_A \given Y})$ (for which we have no guaranteed bounds). The proof is given in Appendix E.

\subsection{Auxiliary Training Tasks} \label{sec:auxtask}
In the infinite sample regime, consider an ``optimal'' model $F(\cdot)$ that predicts $V_i$ using input $V_A$. Optimal models should utilize all of the information available for prediction in their inputs, meaning $\I(F(V_A): V_i) = \I(V_A: V_i)$. Information theoretically, minimizing $\H(F_{V_i}(V_A))$ corresponds to reducing the outputs of $F_{V_i}(V_A)$ to equivalence classes wherein $\Pr(V_A \given F_{V_i}(V_A) = f)$ is constant. This minimization corresponds to ensuring $F_{V_i}(V_A)$ does not over-fit to the empirical values of $V_A$ using noise that is orthogonal to $\Pa(V_A)$.

Auxiliary training tasks can therefore be used in place of isolation functions: we can get an approximate isolation function, $\TFisolate{V_i}{\Vsap}$, by training a model to predict $V_i$ using input $\Vsap$. We do not give any theoretical results beyond intuition for this interpretation, but will support our claims with experiments in the next section.

Equation~\ref{eq: thm2} in Theorem~\ref{thm: when help and do no harm} also has a nice interpretation within the training context -- the accuracy of the predictor must degrade when moving from the post-selected data to the full dataset. More precisely, the conditions for improvement now translate to
\begin{equation}
\begin{split}
    \min_F &\E[\mathrm{Error}(F(V_A), V_G)] \\
    &> \sum_{y} \Pr(y) \min_F(\E[\mathrm{Error}(F(V_A), V_G) \given y]),
\end{split}
\end{equation}
which can easily be checked on our training data.

\subsection{Suggested Overall Procedure}
We propose the following procedure for building robust (low context-sensitivity) models in the PBT problem.
\begin{enumerate}
    \setlength{\itemsep}{0pt}
    \item Partition the data into subsets with constant $Y=y$ and determine cliques of dependence within each subset.
    \item Using domain knowledge, identify seeds in $\Vgood, \Vbad$ for proxy bootstrapping (Sec.~\ref{sec:bootstrapping}).
    \item Perform CIS on $\Vambig$  (Sec.~\ref{sec:auxtask}).
    \item Build a prediction model for $Y$ using $\Vgood$ and the CIS-engineered $\Vambig$.
\end{enumerate}


\section{Experiments} \label{sec:aux}
We now demonstrate the effectiveness of these methods on synthetic and real-world data. Full code for both experiments is available at \url{https://zenodo.org/badge/latestdoi/651823136}.

\subsection{Experiments on Synthetic Data}
We generate data for the DAG in Figure~\ref{fig:ambiguous and splitable} based on normal distributions; details of the setup are given in Appendix F. We vary the standard deviations of the normally distributed $M_G$ and $M_B$.
% are drawn from normal distributions with mean $0$ and variable standard deviations. All other vertices (other than $Y$) are the average of their parents plus additional Gaussian noise $N(0, .2)$. $T_A\in \R^2$ is generated by applying a rotation matrix to $(T_A^{(G)}, T_A^{B})^T$\footnote{Many rotations were tried in our experiments with identical results, so we display results from a $45$ degree rotation.}. $Y$ indicates whether its parents  sum to a positive number with a $5\%$ probability of flipping randomly.
The training data is drawn from $\sigma(M_G) = \sigma(M_B) = 1$, while the testing data varies both quantities and thus the influence of the context. We measure the accuracy of our CIS-based feature engineering, $\hat{Y}^{(3)}(V_G, \TFisolate{V_G}{V_A})$, which uses the auxiliary task approximation to isolate $V_A$'s predictive information about $V_G$. We compare it to $\hat{Y}^{(1)}(V_G, V_A)$, trained on $\Vgood \cup \Vambig$, and $\hat{Y}^{(2)}(V_G)$, trained on only $\Vgood$. As a theoretical limit of CIS, we also compare to $\hat{Y}^{(4)}(V_G, V_A^{(G)})$, although access to $V_A^{(G)}$ is usually not possible.
% \begin{enumerate}
%     \setlength{\itemsep}{0pt}
%     \item $\hat{Y}^{(1)}(V_G)$ is trained on only $\Vgood$.
%     \item $\hat{Y}^{(2)}(V_G, V_A)$ is trained on $\Vgood \cup \Vambig$.
%     \item $\hat{Y}^{(3)}(V_G, \TFisolate{V_G}{V_A})$ utilizes the auxiliary task approximation for CIS to isolate $V_A$'s predictive information about $V_G$.
%     \item $\hat{Y}^{(4)}(V_G, V_A^{(G)})$ represents the theoretical limit of CIS if we were given direct access to $V_A^{(G)}$.
% \end{enumerate}


\begin{figure}[ht]
    \centering
    (a)\includegraphics[width = .4\textwidth]{experiment_plots/SigmaMG.pdf}
    (b)\includegraphics[width = .4\textwidth]{experiment_plots/SigmaMB.pdf}
    \caption{Results from our experiments on synthetic data. Single standard deviation confidence intervals are shaded in the corresponding colors.
    %In (b), $\hat{Y}^{(3)}$ matches up almost exactly with $\hat{Y}^{(4)}$. 
    }
    \label{fig:experiments_proxies}
\end{figure}


\paragraph{Results} When comparing feature selection approaches, we observe in Figure~\ref{fig:experiments_proxies} that including $V_A$ results in higher accuracy of $\hat{Y}^{(1)}$ over $\hat{Y}^{(2)}$ when the shift acts on $\Ugood$ (a) or is small for $\Ubad$ (b). However, the accuracy of $\hat{Y}^{(1)}$ deteriorates with bigger shifts in $\Ubad$.

Our proposed method based on causal information splitting offers a middle ground. $\hat{Y}^{(3)}$ is able to maintain the same robustness as $\hat{Y}^{(2)}$ while taking advantage of some of the gains enjoyed by $\hat{Y}^{(1)}$ in (a). In fact, $\hat{Y}^{(3)}$ performs very similarly to $\hat{Y}^{(4)}$, which had a priori knowledge of the SAP components and used only $V_A^{(G)}$. These improvements were achieved despite not meeting the sufficient condition for increasing relevance in Theorem~\ref{thm: when help and do no harm}.

\subsection{Experiments on Census Data} \label{sec:real_data}
We use US Census data processed through folktables~\cite{ding2021retiring} to predict whether the income of a person exceeds \$50k, following~\cite{Dua:2019}. %It contains information about the income, education, commute distance, health insurance (among other features) of people. 
To test out-of-domain generalization, prediction models were built on 2019 pre-pandemic data and evaluated on 2021 data collected during the pandemic.\footnote{We ignored the experimental release of 2020 data to ensure a starker distribution shift.}
As model inputs, we consider commute time (coded as JWMNP in the dataset), a flag indicating whether the person received Medicaid, Medical Assistance, or any kind of government-assistance plan for those with low incomes or a disability (coded as HINS4), and education level (SCHL).
This small feature set was purposefully selected to make the effect of including or excluding individual features more visible. It contains one feature with relatively stable predictive power (education level) and two features heavily affected by the pandemic through increased work-from-home and Medicaid's continuous enrollment provision.

Our auxiliary task from Sec.~\ref{sec:auxtask}, referred to as \emph{engineered features}, does not use HINS4 and JWMNP directly as input features to predict the income level. Instead, it uses HINS4 and JWMNP to train two models predicting the education level: one trained on examples with high income and one trained on examples with low income. These predictions based on HINS4 and JWMNP, together with the actual education level, serve as input features to the final model.
We compare the model built on these engineered features to ones using all three features directly (\emph{all features}) or using just the stable education feature (\emph{limited features}).
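
Under one reading of this construction, the engineered feature vector for a person can be written as below. The symbols $F_{\mathrm{hi}}$, $F_{\mathrm{lo}}$, and $x_{\mathrm{eng}}$ are hypothetical and are introduced here only for illustration:
\begin{equation*}
\begin{split}
    x_{\mathrm{eng}} = \big(&\mathrm{SCHL},\; F_{\mathrm{hi}}(\mathrm{HINS4}, \mathrm{JWMNP}), \\
    &F_{\mathrm{lo}}(\mathrm{HINS4}, \mathrm{JWMNP})\big),
\end{split}
\end{equation*}
where $F_{\mathrm{hi}}$ and $F_{\mathrm{lo}}$ denote the education-level predictors trained on the high-income and low-income training examples, respectively. Neither predictor takes income as input, so both can be evaluated on any example without access to the label at test time. In the same notation, the all-features and limited-features baselines use $(\mathrm{SCHL}, \mathrm{HINS4}, \mathrm{JWMNP})$ and $(\mathrm{SCHL})$ as inputs.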
 
We use logistic regression from sklearn with $\ell_1$ regularization to build models on the feature sets produced by the three methods; $\ell_1$ regularization yielded better generalization than $\ell_2$ regularization.

\begin{table}[h!]
    \centering
    \small
    \caption{Out-of-domain (2021) accuracy, reported as mean $\pm$ standard deviation over 10 test splits.}
    \scalebox{.9}{
    \begin{tabular}{|c | l l l |  }
    %  & \multicolumn{3}{|c|}{2021 (out-of-domain) accuracy}  \\
     \hline
     State & All Features & Engineered Features & Limited Features   \\ 
    \hline
    % evaluate.report_accuracy_with_std_dev(across_repetitions, big_states,  years=[2021], split='test')
CA  & \textbf{0.712} $\pm$ 0.0011 & \textbf{0.711} $\pm$ 0.0014 & 0.692 $\pm$ 0.0014 \\
FL  & \textbf{0.683} $\pm$ 0.0012 & 0.678 $\pm$ 0.0018 & 0.680 $\pm$ 0.0013 \\
GA  & 0.689 $\pm$ 0.0025 & \textbf{0.707} $\pm$ 0.0055 & \textbf{0.709} $\pm$ 0.0029 \\
IL  & 0.662 $\pm$ 0.0026 & \textbf{0.689} $\pm$ 0.0033 & 0.684 $\pm$ 0.0019 \\
NY  & \textbf{0.707} $\pm$ 0.0022 & \textbf{0.702} $\pm$ 0.0025 & 0.687 $\pm$ 0.0080 \\
NC  & \textbf{0.691} $\pm$ 0.0031 & \textbf{0.684} $\pm$ 0.0034 & \textbf{0.683} $\pm$ 0.0030 \\
OH  & 0.689 $\pm$ 0.0022 & \textbf{0.703} $\pm$ 0.0040 & \textbf{0.696} $\pm$ 0.0029 \\
PA  & 0.672 $\pm$ 0.0017 & \textbf{0.695} $\pm$ 0.0023 & 0.688 $\pm$ 0.0022 \\
TX  & 0.690 $\pm$ 0.0029 & \textbf{0.712} $\pm$ 0.0028 & \textbf{0.712} $\pm$ 0.0027 \\
\hline
avg & 0.688 & \textbf{0.698}  & 0.692 \\
\hline
    \end{tabular}}
    \label{tab:test_real_results}
\end{table}

\noindent\textbf{Results} Table~\ref{tab:test_real_results} reports the mean and standard deviation of accuracies for 10 different test splits. For the F1 scores of the same experiment, see Appendix F. Using all features leads to the best in-domain performance (see Appendix F), but not necessarily the best out-of-domain performance. Dropping the ambiguous features hurts predictive power in the limited-feature models, and its effect on robustness varies across states: in five of the nine states (GA, IL, OH, PA, TX), the limited models even outperform the all-features models on 2021 data.
Our proposed feature engineering using CIS achieves the best of both worlds, with the best mean out-of-domain accuracy of 0.698. It also achieves close to the best out-of-domain accuracy for 8 out of 9 states. 
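
As a check on the averages in Table~\ref{tab:test_real_results}, the mean for the engineered features can be recomputed from the nine per-state accuracies:
\begin{equation*}
\begin{split}
    \tfrac{1}{9}(&0.711 + 0.678 + 0.707 + 0.689 + 0.702 \\
    &+ 0.684 + 0.703 + 0.695 + 0.712) \\
    &= \tfrac{6.281}{9} \approx 0.698.
\end{split}
\end{equation*}
The same computation gives $6.195/9 \approx 0.688$ for all features and $6.231/9 \approx 0.692$ for limited features, which matches the reported row up to rounding of the per-state values.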

\section{Discussion}\label{sec:conclusion}
In this paper we studied the challenging problem of building models that are robust to distribution shift when causes and effects of the target variable are unmeasured. Among the observed noisy proxies, we showed how to perform feature selection based on conditional independence tests and knowledge about some seed nodes. 

After bootstrapping, we often have a significant number of ambiguous proxies, which have components that are both helpful and hurtful to our model's robustness. Through CIS, however, we showed how to isolate robust predictive power from these ambiguous proxies using auxiliary learning tasks. We proved that including these engineered features safely increases robustness in our setting, while also improving accuracy. In our experiments on real census data under shifts due to the pandemic, we showed that the engineered features provided benefits for most states over using the ambiguous features directly or completely ignoring them. While our theoretical framework is involved, these experiments demonstrate improvements outside of our assumptions.

\paragraph{Relaxation of Assumptions}
A number of our assumptions can be softened. One softening of systemic sparsity would involve allowing edges within $\vec{U}$ so long as their dependence is relatively weak. Such a relaxation would involve using mutual information (or correlation) thresholds instead of independence tests. Sparsity assumptions may also be relaxed by building on ideas from mixtures of DAG structures, as in~\citep{gordon2021identifying}.

The strongest assumption is that of separable ambiguous proxies. Under a softening of the separability assumption, we cannot guarantee that we have isolated only robust information from our ambiguous proxy -- some unstable information associated with $\Ubad$ may slip through. However, degrees of separability may still guarantee the benefit of the engineered feature.

While separability corresponds to invertibility for linear functions, there are many examples of nonlinear functions that are separable. For example, when the effects of two causes have significantly different magnitudes, they can be easily disentangled, as with fine and hyperfine structures in atomic energy levels. Work on data fission \citep{leiner2022data} may provide valuable insights to help understand the degrees of separability for different choices of functions.


\begin{acknowledgements}
    This work was completed at the Amazon Causality Lab in Tübingen, Germany. We would like to thank Dr. Leena Chennuru Vankadara for providing valuable feedback on the paper.
\end{acknowledgements}
% References
\bibliography{biblio}
\end{document}
