%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% If you use BibTeX in apalike style, activate the following line:
% \bibliographystyle{apalike}

\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{bbm}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{comment}  
\usepackage{graphicx}
\usepackage{subcaption}

\usepackage[ruled,vlined]{algorithm2e}
            
\usepackage{cleveref}
\crefformat{section}{\S#2#1#3} % see manual of cleveref, section 8.2.1
\crefformat{subsection}{\S#2#1#3}
\crefformat{subsubsection}{\S#2#1#3}

% LOCAL MACRO DEFS ----------------------
\newtheorem*{theorem*}{Theorem}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{corollary}{Corollary}[theorem]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}


\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

% Data
\newcommand{\mP}{\mathbf{P}}
\newcommand{\cP}{\mathcal{P}}

\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{y}
\newcommand{\X}{\mathbf{X}}
\newcommand{\Y}{\mathbf{Y}}
\newcommand{\z}{\mathbf{z}}
\newcommand{\e}{\mathbf{e}}

\newcommand{\h}{f_h}
\newcommand{\s}{f_m}
\newcommand{\dom}{\textnormal{dom}}

\newcommand{\tr}{\textnormal{train}}
\newcommand{\te}{\textnormal{test}}

% Model
\newcommand{\ts}{\Theta^\star}
\newcommand{\wt}{\widehat{\theta}}
\newcommand{\cT}{\Theta}
\newcommand{\tT}{\theta \in \Theta}

\newcommand{\hs}{\theta^\star}
\newcommand{\wh}{\widehat{\theta}}
\newcommand{\cH}{\Theta}
\newcommand{\hH}{\theta\in \Theta}
\newcommand{\mI}{\mathbb{I}}
\newcommand{\eptt}{\epsilon_{\mP_t}(\theta)}
\newcommand{\wepst}{\widehat{\epsilon}_{\mP_s}(\theta)}


\newcommand{\hl}[1]{\textbf{\color{blue}{\bf\sf [Hanlin: #1]}}}

\usepackage{listings}
\usepackage{fancyvrb}

\title{Toward Learning Human-aligned Cross-domain Robust Models by Countering Misaligned Features}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Haohan Wang}
\author[2]{Zeyi Huang}
\author[1]{Hanlin Zhang}
\author[2]{Yong Jae Lee}
\author[1,3,4]{Eric P. Xing}
% Add affiliations after the authors
\affil[1]{%
    School of Computer Science\\
    Carnegie Mellon University\\
    Pittsburgh, PA, USA
}
\affil[2]{%
    Department of Computer Sciences\\
    University of Wisconsin-Madison\\
    Madison, WI, USA
}
\affil[3]{%
    Mohamed bin Zayed University of Artificial Intelligence\\
    Abu Dhabi, United Arab Emirates
  }
\affil[4]{%
    Petuum, Inc.\\
    Pittsburgh, PA, USA
  }
  
  \begin{document}
\maketitle



\begin{abstract}
Machine learning has demonstrated 
remarkable prediction accuracy over \textit{i.i.d.} data, 
but the accuracy 
often drops when tested 
with data from another distribution. 
In this paper, 
we offer another view of this problem, 
assuming that 
the reason behind this accuracy drop
is the reliance of models 
on features that are not aligned with what a data annotator considers similar across the two datasets. 
We refer to these features as misaligned features. 
We extend the conventional generalization error bound 
to a new one 
for this setup, 
using knowledge of how the misaligned
features are associated with the label.
Our analysis offers a set
of techniques for this problem, 
and these techniques are naturally linked to many previous methods in the robust machine learning literature. 
We also compare the empirical strength of these methods and demonstrate the performance when these previous techniques are combined. An implementation is available \href{https://github.com/OoDBag/WR}{here}. 
\end{abstract}

\section{Introduction}
\begin{figure*}[t]
    % \centering
    % \subfigure[the major challenge of learning robust models] {\includegraphics[width=0.4\textwidth]{figs/intro.pdf}} 
    % \quad\quad\quad 
    % \subfigure[a toy example explaining the challenge] {\includegraphics[width=0.4\textwidth]{figs/example.pdf}} 
    % \caption{The main problem focused: (a) we argue the main challenge of learning robust models is the correlation between the actual "semantic" information and the "bias" information due to finite samples. (b) A toy example showing that, as a result of the former argument, an ERM model will not learn a decision boundary generalizes to another distribution even when the marginal distributions are aligned and the oracle labelling functions are the same. }
    \centering 
    \includegraphics[width=0.9\textwidth]{intro.pdf}
    \caption{An illustration of the main problem addressed in this paper. 
    %as we aim to classify triangles vs. circles, the spurious correlation between color and shape in the training distribution will likely mislead the model to learn a spurious decision boundary (the distribution-specific labelling function), which may not be effective even if the marginals are aligned, but there exists another decision boundary (labeling function), which can classify the target distribution data correctly even if the marginals are not aligned.
    }
    \label{fig:intro}
\end{figure*}


Machine learning, 
especially deep neural networks, 
has demonstrated remarkable empirical successes
over various applications. 
Models have even
occasionally achieved
results beyond human-level performance 
on benchmark datasets
\citep[\textit{e.g.,}][]{he2015delving}. 
However, whether it is desirable 
for a model to outperform humans on benchmarks has remained 
an open discussion in recent years: 
indeed,
a model can create more application opportunities when it surpasses
human-level performance,
but the community has also noticed that the performance gain
is sometimes due to the model's exploitation of features 
that are meaningless to a human, 
which may lead to unexpected performance drops when the models
are tested with other datasets in practice 
that a human considers similar to the benchmark \citep{christian2020alignment}. 

One of the most famous examples of the model's exploitation of 
non-human-aligned features is probably the 
usage of snow background in ``husky vs. wolf'' image classification \citep{ribeiro2016should}. 
Briefly, 
when the model is trained to classify ``husky vs. wolf,''
it notices that wolf images usually 
have a snow background and learns to use the background features. 
% Multiple practical challenges of image background serving as data artifacts
% have also been discussed recently \citep{WangGLX19,?,?,?}. 
% Recent studies also broadened this topic by showing that 
This example is only one of many similar discussions 
of models that use features considered futile by humans \citep[\textit{e.g.,}][]{WangGLX19,SunGTHEZMBCW19}, 
and, sometimes, the features used 
are not even perceptible to a human 
\citep{geirhos2018imagenettrained,ilyas2019adversarial,wang2020high,hermann2020origins}. 
The usage of these features might lead to a misalignment between the human and the models' understanding of the data, 
leading to a potential performance drop when the models are applied to other data that a human considers similar. 

We illustrate this challenge with a toy example in Figure~\ref{fig:intro},
where the model is trained on the source domain data 
to classify triangle vs. circle
and tested on the target domain data 
with a different marginal distribution. 
However, the color coincides 
with the shape on the source domain. 
As a result, the model might learn either the shape function
or the color one. 
The color function 
will not classify the target domain data correctly 
while the shape function can, 
but the empirical risk minimizer (ERM) cannot differentiate them 
and might learn either one, 
leading to potentially degraded performance at test time. 
As one might expect, 
whether shape or color is considered human-aligned 
is subjective depending on the task or the data 
and, in general, irrelevant to the statistical nature of the problem. 
Therefore, our remaining analysis will depend on such knowledge. 
% so that the conclusion will apply to a broader scope of problems
% free of the constraint of the statistical property of the data. 

In this paper, we aim to formalize the above challenge 
to study the learning of human-aligned models. 
In particular, we derive a new generalization error bound 
when a model is trained on one distribution
but tested on another one that a human considers similar. 
As discussed previously, 
one potential challenge for this scenario
is that the model may learn 
to use some features, 
which we refer to as \emph{misaligned features},
that a human considers irrelevant. 
Corresponding to this challenge, 
our analysis will be built upon the knowledge of 
how misaligned features are associated with the label. 

\section{Related Work}
\label{sec:related}
There is a recent proliferation of methods 
aiming to learn robust models 
by forcing the models to disregard certain features. 
We consider these works direct precedents of our discussion
because these features are usually 
defined when comparing the model's performances to a human's. 
For example, the texture and background of images are probably 
the most discussed misaligned features for image classification. 
We briefly discuss these works, grouped into two main strategies. 

\paragraph{Data Augmentation}
With the knowledge of the misaligned features, 
the most effective solution is probably to augment the data
by perturbing these misaligned features.
Some recent examples of the perturbations used to train robust models 
include style transfer of images \citep{geirhos2018imagenettrained}, 
naturalistic augmentation (color distortion, noise, and blur) of images \citep{hermann2020origins}, 
other naturalistic augmentations (texture, rotation, contrast) of images \citep{wang2020squared},
interpolation of images \citep{hendrycks2019augmix}, 
syntactic transformations of sentences \citep{MahabadiBH20}, 
and augmentation across data domains \citep{ShankarPCCJS18,huang2020self,lee2021removing,huang2022two}. 

Further, as recent studies suggest that 
one reason for the adversarial vulnerability \citep{szegedy2013intriguing,goodfellow2015explaining} 
is the existence of imperceptible features correlated with the label \citep{ilyas2019adversarial,wang2020high}, 
improving adversarial robustness may also be about 
countering the model's tendency toward learning these features. 
Currently, one of the most widely accepted methods to improve adversarial robustness 
is to augment the data along the training process to maximize the training loss 
by perturbing these features within predefined robustness constraints 
(\textit{e.g.}, within $\ell_p$ norm ball) \citep{MadryMSTV18}.
While this augmentation strategy is widely referred to as adversarial training, 
for the convenience of our discussion, 
we refer to it as the
worst-case data augmentation, following the naming conventions of \citet{FawziSTF16}. 

\paragraph{Regularizing Hypothesis Space}
Another thread is to introduce inductive bias 
(\textit{i.e.}, to regularize the hypothesis space)
to force the model to discard misaligned features. 
To achieve this goal, 
one usually needs to first construct a side component 
to inform the main model about the misaligned features, 
and then to regularize the main model according to the side component. 
The construction of this side component 
usually relies on prior knowledge of what the misaligned features are.
Then, methods can be built accordingly to counter the features such as the texture of images \citep{WangHLX19,bahng2019learning}, 
the local patch of images \citep{WangGLX19}, 
label-associated keywords \citep{he2019unlearn},
label-associated text fragments \citep{MahabadiBH20}, 
and general easy-to-learn patterns of data \citep{nam2020learning}. 

% specific data property
% patterns easy-to-learn 
% patterns hard-to-learn (trailing eigenvalues)

% domain adaptation
In a broader scope, following the argument
that one of the main challenges of domain adaptation is to 
counter the model's tendency to learn domain-specific features \citep[\textit{e.g.},][]{GaninUAGLLML16, li2018domain}, 
some methods contributing to domain adaptation
may have also progressed along the line of our interest. 
The most famous example is probably 
the domain adversarial neural network (DANN) \citep{GaninUAGLLML16}. 
Inspired by the theory of domain adaptation \citep{ben2010theory}, 
DANN trains the cross-domain generalizable neural network with the help of a side component specializing in classifying samples' domains. 
The subtle difference between this work and the ones mentioned previously is that 
this side component is not constructed with a special inductive bias 
but built as a simple network learning to classify domains with auxiliary annotations (domain IDs). 
DANN also inspires a family of methods 
forcing the model to learn auxiliary-annotation-invariant representations 
with a side component \citep[\textit{e.g.},][]{ghifary2016deep,rozantsev2018beyond,motiian2017unified,li2018domain,carlucci2018agnostic}. 

\paragraph{Relation to Previous Works}
The above methods solve the same human-aligned learning problems 
with two different perspectives, but we notice the same 
central theme of forcing the models to \emph{not} learn something 
according to the prior knowledge of the data or the task. 
Although this central theme has been noticed by prior works such as \citep{WangHLX19,bahng2019learning,MahabadiBH20}, 
we notice a lack of formal analysis from a task-agnostic viewpoint. 
Therefore, we continue to investigate whether we can 
contribute a principled understanding of this central theme, 
which serves as a connection of these methods and, 
potentially, a guideline for developing future methods. 
Also, we notice that many works in the domain adaptation literature 
have rigorous statistical analysis \citep{ben2007analysis,ben2010theory,MansourMR09,GermainHLM16,ZhangLLJ19,dhouib2020margin}, 
and these analyses mostly focus on the alignment of the distributions. 
Our study will complement these works by investigating through the perspective of misaligned features. 
The advantages and limitations of our perspective will also be discussed. 

\section{Generalization Understanding of Human-aligned Robust Models}
\label{sec:cua}
\paragraph{Roadmap} We study the generalization error bound of human-aligned robust models in this section. We first set up the problem of studying the generalization of a model across two distributions, whose difference mainly lies in the fact that one distribution has another labeling function (namely, the misaligned labeling function) in addition to the one that is shared across both distributions (\textbf{A2}). 
Then, to help quantify the error bound, we define the active set (the features used by a function) ($\mathcal{A}(f,\x)$ in \eqref{eq:a:def}), 
the difference between two functions ($d(\theta, f, \x)$ in \eqref{eq:d:def}), 
and an additional term that quantifies whether the model learns a function given that the model maps the sample correctly ($r(\theta, \mathcal{A}(f,\x))$ in \eqref{eq:r}). 
With these terms defined, we show a formal result on the generalization error bound, which depends, in addition to the standard terms, on how many training samples are predicted correctly because the model learns the misaligned labeling function. 


\subsection{Notations \& Background}
We consider a binary classification problem 
from feature space $\mathcal{X} \subseteq \mathbb{R}^p$ to 
label space $\mathcal{Y} = \{0, 1\}$. 
The distribution 
over $\mathcal{X}$ is denoted as $\mP$. 
A \emph{labeling function} $f:\mathcal{X} \rightarrow \mathcal{Y}$
is a function that
maps the feature $\x$ to its label $\y$. 
A \emph{hypothesis} or \emph{model} 
$\theta:\mathcal{X} \rightarrow \mathcal{Y}$ is also 
a function that maps the feature to the label. 
The difference in naming is only because 
we want to differentiate 
whether the function 
is a 
natural property of the space or distribution (thus called a labeling function)
or a function to estimate (thus called a hypothesis or model). 
The hypothesis space is denoted as $\Theta$. 
We use $\dom$ to denote the domain (input space) of a function, 
thus $\dom(\theta) = \mathcal{X}$.

This work studies the generalization error 
across two distributions, 
namely source and target distribution, 
denoted as $\mP_s$ and $\mP_t$, respectively. 
We are only interested in the case where these two distributions
are considered by a human to be similar but different: 
being similar means 
there exists a \emph{human-aligned labeling function}, $\h$, 
that maps any $\x \in \mathcal{X}$ to its label
(thus the label $\y := \h(\x)$); 
being different means 
there exists a \emph{misaligned labeling function}, $\s$, distinct from $\h$, 
such that $\s(\x) = \h(\x)$ for any $\x \sim \mP_s$. 
This ``similar but different'' property will 
be reiterated as an assumption (\textbf{A2}) later. 
We use $(\x,\y)$ to denote a sample, and use $(\X, \Y)_\mP$ to denote a finite dataset whose features are drawn from $\mP$ (see \textbf{A2} for how the labels are generated).
We use $\epsilon_\mP(\theta)$ to denote the expected risk of $\theta$ 
over distribution $\mP$,
and use $\widehat{\cdot}$ to denote the estimate of the term $\cdot$
(\textit{e.g.}, the empirical risk is $\widehat{\epsilon}_\mP(\wt)$).
We use $l(\cdot,\cdot)$ to denote a generic loss function. 

For a dataset $(\X, \Y)_\mP$, 
if we train a model with
\begin{align}
    \wt = \argmin_{\theta \in \Theta}\sum_{(\x, \y)\in (\X,\Y)_\mP}l(\theta(\x), \y),
    \label{eq:train}
\end{align}
previous generalization studies suggest that we can expect the error rate to be bounded as 
\begin{align}
    \epsilon_\mP(\wt) \leq \widehat{\epsilon}_\mP(\wt) + \phi(|\Theta|, n,  \delta),
    \label{eq:bound:vanilla}
\end{align}
where $\epsilon_\mP(\wt)$ and $\widehat{\epsilon}_\mP(\wt)$ respectively are 
\begin{align*}
    \epsilon_\mP(\wt) = \mathbb{E}_{\x \sim \mP}|\wt(\x)-\y|=\mathbb{E}_{\x \sim \mP}|\wt(\x)-\h(\x)|
\end{align*}
and
\begin{align*}
    \widehat{\epsilon}_\mP(\wt) = \dfrac{1}{n}\sum_{(\x, \y)\in (\X, \Y)_\mP}|\wt(\x)-\y| ,
\end{align*}
and 
$\phi(|\Theta|, n,  \delta)$ is a function 
of the size of the hypothesis space $|\Theta|$, the number of samples $n$, 
and the confidence parameter $\delta$ (the bound holds with probability at least $1-\delta$). 
This paper expands the discussion with this generic form that 
can relate to several discussions, each with its own assumptions. 
We refer to these assumptions as \textbf{A1}. 
\begin{itemize}
    \item [\textbf{A1}:] basic assumptions needed to derive \eqref{eq:bound:vanilla}, for example,
    \begin{itemize}
    \item when \textbf{A1} is ``$\Theta$ is finite, $l(\cdot, \cdot)$ is a zero-one loss, samples are \textit{i.i.d.}'',  $\phi(|\Theta|, n, \delta)=\sqrt{(\log(|\Theta|) + \log(1/\delta))/(2n)}$
    \item when \textbf{A1} is ``samples are \textit{i.i.d.}'', $\phi(|\Theta|, n, \delta) = 2\mathcal{R}(\mathcal{L}) + \sqrt{(\log(1/\delta))/(2n)}$, where $\mathcal{R}(\mathcal{L})$ stands for Rademacher complexity and $\mathcal{L} = \{l_{\theta} \,|\, \theta \in \Theta \}$, where $l_{\theta}$ is the loss function corresponding to $\theta$. 
\end{itemize}
For more information, 
% or more concrete examples of the generic term, 
we refer interested readers to relevant textbooks such as \citep{bousquet2003introduction} for formal and intuitive discussions.
\end{itemize}
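
As a rough numerical illustration of the first instance of \textbf{A1} (the values below are hypothetical and chosen only for concreteness), take $|\Theta| = 10^3$, $n = 10^4$, and $\delta = 0.05$, with natural logarithms. Then
\begin{align*}
    \phi(|\Theta|, n, \delta) 
    = \sqrt{\dfrac{\log(10^3) + \log(1/0.05)}{2 \cdot 10^4}} 
    \approx \sqrt{\dfrac{6.908 + 2.996}{20000}} 
    \approx 0.022,
\end{align*}
so, with probability at least $0.95$, the expected risk exceeds the empirical risk by at most roughly $2.2$ percentage points in this hypothetical setting.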


\subsection{Generalization Error Bound of Human-aligned Robust Models}
% Our interest lies in more than the expected performance over samples from the same distribution, 
% but over a different distribution that shares the same labeling function. 
% As we argue previously, 
% the key difficulty in learning a robust model 
% is the existence of the extra labeling function for features sampled from $\mP_s$. 
Formally, we state the challenge of our human-aligned robust learning problem as the assumption:
\begin{itemize}
    \item [\textbf{A2}:] \textbf{Existence of Misaligned Features:}
    For any $\x \in \mathcal{X}$, $\y := \h(\x)$. 
    We also have a $\s$ 
    that is different from $\h$, and for $\x \sim \mP_s$, 
    $\h(\x) = \s(\x)$. 
\end{itemize}
Thus, 
the existence of $\s$ is a key challenge for
the small empirical risk over $\mP_s$ 
to be generalized to $\mP_t$, 
because 
$\theta$ that learns either $\h$ or $\s$
will lead to small source error, 
but only $\theta$ that learns $\h$ will 
lead to small target error. 
Note that $\s$ 
may not exist for an arbitrary $\mP_s$. 
In other words, 
\textbf{A2} can be interpreted as ensuring a property  
of $\mP_s$ so that $\s$, while being different from $\h$, exists for any $\x \sim \mP_s$. 

In this problem, 
$\s$ and $\h$ are not the same 
despite $\s(\x)=\h(\x)$ for any $\x \sim \mP_s$, 
and we focus on the case where the differences 
lie in the features they use. 
To describe this difference, 
we introduce the notation $\mathcal{A}(\cdot,\cdot)$, 
which denotes a set parametrized by the labeling function and the sample, 
to describe the \emph{active set} of features used by the labeling function. 
By \emph{active set}, we refer to the minimum set of features that 
a labeling function requires to map a sample to its label. 
Formally, we define 
\begin{align}
\begin{split}
    & \mathcal{A}(f,\x) = \{i \mid \widehat{\z}_i = \x_i\}, \quad \textnormal{where} \\
    & \widehat{\z} = \argmin_{\z \in \dom(f),\, f(\z)=f(\x)} \vert\{i \mid \z_i = \x_i\}\vert, 
    \label{eq:a:def}
\end{split}
\end{align}
and $\vert\cdot\vert$ measures the cardinality. 
Intuitively, $\mathcal{A}(f,\x)$ indexes the features $f$ uses to predict $\x$. 
Although $\s(\x)=\h(\x)$,  
$\mathcal{A}(\s,\x)$ and $\mathcal{A}(\h,\x)$ can be different.
$\mathcal{A}(\s,\x)$ indexes the \emph{misaligned features} 
following our definition. 
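
To make the definition concrete, consider a hypothetical two-feature encoding of the toy example in Figure~\ref{fig:intro} (the encoding is an assumption made here for illustration only): let $\x = (\x_1, \x_2) \in \{0,1\}^2$, where $\x_1$ encodes shape and $\x_2$ encodes color, with $\h(\x) = \x_1$ and $\s(\x) = \x_2$. For the source sample $\x = (1,1)$, the candidates $\z$ with $\h(\z) = \h(\x) = 1$ are $(1,0)$ and $(1,1)$, which agree with $\x$ on one and two coordinates, respectively. Hence
\begin{align*}
    \widehat{\z} = (1,0), \qquad \mathcal{A}(\h,\x) = \{1\}.
\end{align*}
By the same argument applied to $\s$, the minimizer is $(0,1)$ and $\mathcal{A}(\s,\x) = \{2\}$, so the two labeling functions agree on $\x$ while using different features.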
% \begin{align}
%     \mathcal{A}(f,\x) = \argmin_{\z \in \mathcal{X}, f(\z)=f(\x)} \vert\alpha_\x(\z)\vert,
% \end{align}
% where $\alpha_\x(\z) = \{i \vert \z_i = \x_i\}$ is the set of indices by which $\z$ and $\x$ are the same, and $\vert\cdot\vert$ measures the cardinality. 
% Although $\s(\x)=\h(\x)$,  
% $\mathcal{A}(\s,\x)$ and $\mathcal{A}(\h,\x)$ can be different. 

Further, we define a function difference given a sample as
\begin{align}
    d(\theta, f, \x) = \max_{\z \in \dom(f): \z_{\mathcal{A}(f,\x)}=\x_{\mathcal{A}(f,\x)}} |\theta(\z) - f(\z)|,
    \label{eq:d:def}
\end{align}
where $\x_{\mathcal{A}(f,\x)}$ denotes the features of $\x$ indexed by $\mathcal{A}(f,\x)$. 
In other words, 
given a sample $\x$, 
the difference is the maximum disagreement 
between the two functions $\theta$ and $f$ 
over all the other data $\z \in \mathcal{X}$ whose 
features indexed by $\mathcal{A}(f,\x)$ 
are the same as those of $\x$. 
Notice that this difference is not symmetric, as the 
active set is determined by the second function. 
By definition, we have
$d(\theta, f, \x) \geq \vert \theta(\x) -f(\x)\vert$.
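
As a small worked instance (under an assumed encoding, used here only for illustration), let $\x \in \{0,1\}^2$ with $\h(\x) = \x_1$ and $\s(\x) = \x_2$, so that \eqref{eq:a:def} gives $\mathcal{A}(\h,\x) = \{1\}$ and $\mathcal{A}(\s,\x) = \{2\}$ for $\x = (1,1)$. For the hypothesis $\theta = \s$, the constraint $\z_1 = \x_1 = 1$ leaves $\z \in \{(1,0), (1,1)\}$, and
\begin{align*}
    d(\theta, \h, \x) 
    &= \max\{|\theta((1,0)) - \h((1,0))|,\; |\theta((1,1)) - \h((1,1))|\} \\
    &= \max\{|0 - 1|,\; |1 - 1|\} = 1,
\end{align*}
whereas $d(\theta, \s, \x) = 0$ because $\theta$ and $\s$ coincide everywhere. In this instance $\theta(\x) = \h(\x)$, so $d(\theta, \h, \x)$ is strictly larger than $\vert \theta(\x) - \h(\x) \vert = 0$.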

Also, please notice that when we use expressions such as $\z_{\mathcal{A}(f,\x)}=\x_{\mathcal{A}(f,\x)}$, 
we imply that $\mathcal{A}(f,\x)$ is the same in both LHS and RHS.
Under this premise of the notation, 
whether \eqref{eq:a:def} has a unique solution or not will not affect our main conclusion. 

In addition, one may notice the connection between $\mathcal{A}(f,\x)$ and the minimum sufficient explanation discussed previously \citep[\textit{e.g.,}][]{camburu2020struggles,yoon2018invase,carter2019made,ribeiro2018anchors}. 
While $\mathcal{A}(f,\x)$ is conceptually the same as the minimum set of features for a model to predict, we define it differently in mathematical terms.  

To continue, we introduce the following assumption:
\begin{itemize}
    % \item [\textbf{A3}:] \textbf{Separable Labeling Functions:} 
    % For any $\x \in \mathcal{X}$, $\mathcal{A}(\h,\x) \cap \mathcal{A}(\s,\x) = \emptyset$
    \item [\textbf{A3}:] \textbf{Realized Hypothesis:} 
    Given a large enough hypothesis space $\Theta$, for any sample $(\x, \y)$, 
    for any $\theta \in \Theta$, 
    which is not a constant mapping, 
    if $\theta(\x)=\y$, then 
    $d(\theta, \h, \x)d(\theta, \s, \x)=0$
\end{itemize}

Intuitively, 
\textbf{A3} assumes that $\theta$
learns at least one labeling function
for the sample $\x$
if $\theta$ maps $\x$ correctly. 

Finally, to describe how $\theta$ depends on the active set of $f$, we introduce the term 
\begin{align}
    r(\theta, \mathcal{A}(f,\x)) = \max_{\z_{\mathcal{A}(f,\x)} \in \dom(f)_{\mathcal{A}(f,\x)}} |\theta(\z) - \y|,
    \label{eq:r}
\end{align}
where $\z_{\mathcal{A}(f,\x)} \in \dom(f)_{\mathcal{A}(f,\x)}$ 
denotes that the features of $\z$ indexed by $\mathcal{A}(f,\x)$ are searched in the input space $\dom(f)$. 
Notice that $r(\theta, \mathcal{A}(f,\x))=1$ alone does not mean $\theta$ depends on the active set of $f$; 
it only means so when we also have $\theta(\x)=\y$ (see the formal discussion in Lemma~\ref{lemma:iff}).
In other words, $r(\theta, \mathcal{A}(f,\x))=1$ alone may not have an intuitive meaning, 
but given $\theta(\x)=\y$, $r(\theta, \mathcal{A}(f,\x))=1$ intuitively means $\theta$ learns $f$. 

With all above, we can extend the conventional generalization error bound with a new term as follows:
\begin{theorem}[The Curse of Universal Approximation]
Under Assumptions \textbf{A1}--\textbf{A3}, if $l(\cdot, \cdot)$ is a zero-one loss, then with probability at least $1 - \delta$, we have 
\begin{align}
    \eptt \leq \wepst + c(\theta) + \phi(|\Theta|, n,  \delta)
\end{align}
where 
\begin{align*}
c(\theta) =  \dfrac{1}{n}\sum_{(\x, \y) \in (\X, \Y)_{\mP_s}} \mathbb{I}[\theta(\x)=\y]r(\theta, \mathcal{A}(\s,\x)).
\end{align*}
\label{thm:cua}
\end{theorem}

$\mathbb{I}[\cdot]$ is a function that returns $1$ if the 
condition $\cdot$ holds and 0 otherwise. 
As $\theta$ may learn $\s$, 
$\wepst$ is not representative of $\eptt$; 
thus, we introduce $c(\theta)$ to account for the discrepancy. 
Intuitively, $c(\theta)$ quantifies 
the samples that are correctly predicted, 
but only because the $\theta$ learns $\s$ for that sample. 
$c(\theta)$
depends on the knowledge of 
$\s$. 
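
For intuition, consider a hypothetical source dataset under an assumed two-feature encoding (both are illustrative assumptions): $\x \in \{0,1\}^2$, $\h(\x) = \x_1$, $\s(\x) = \x_2$, and $n = 4$ source samples $(0,0), (1,1), (0,0), (1,1)$ with labels $0, 1, 0, 1$, so that $\mathcal{A}(\s,\x) = \{2\}$ for each sample. Reading \eqref{eq:r} as varying only the coordinates indexed by $\mathcal{A}(\s,\x)$ while holding the remaining coordinates at their values in $\x$, the hypothesis $\theta = \s$ predicts every sample correctly, yet flipping $\z_2$ flips its prediction, so $r(\theta, \mathcal{A}(\s,\x)) = 1$ for every sample and
\begin{align*}
    c(\theta) = \dfrac{1}{4}\left(1 + 1 + 1 + 1\right) = 1,
\end{align*}
which makes the bound in Theorem~\ref{thm:cua} vacuous, as one would expect for a model that relies entirely on the misaligned feature. In contrast, $\theta = \h$ ignores $\z_2$, giving $r(\theta, \mathcal{A}(\s,\x)) = 0$ and $c(\theta) = 0$, so the bound takes the same form as \eqref{eq:bound:vanilla}.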

We name Theorem~\ref{thm:cua}
\emph{the curse of universal approximation}
to highlight the fact 
that the existence of $\s$ is not always obvious, 
but the models can usually learn it nonetheless \citep{wang2020high}. 
% For example, 
% \citet{ilyas2019adversarial} suggest the root to the
% performance drop over adversarial examples are spurious features, and
% \citet{wang2020high} demonstrate the existence 
% of human-imperceptible high-frequency spurious signals 
% in image datasets, 
% which may explain several generalization issues of the models. 
% In other words, 
Even in a well-curated dataset
that does not seemingly have misaligned features, 
modern models 
might still use some features not understood by humans.
% leading to non-robust behaviors 
% when tested over other datasets
% that human consider similar. 
This argument may also align with
recent discussions suggesting 
that
reducing the model complexity 
can improve cross-domain generalization \citep{chuang2020estimating}. 

\subsection{In Comparison to the View of Domain Adaptation}
\label{sec:cua:da}

We continue to compare Theorem~\ref{thm:cua} with 
understandings of domain adaptation. 
Conveniently, several domain adaptation analyses \citep{ben2007analysis,ben2010theory,MansourMR09,GermainHLM16,ZhangLLJ19,dhouib2020margin} can be sketched in the following form:
\begin{align}
    \eptt \leq \wepst + D_\Theta(\mP_s, \mP_t) + \lambda + \phi'(|\Theta|, n, \delta)
\end{align}
where $D_\Theta(\mP_s, \mP_t)$ quantifies the differences between the two distributions; 
$\lambda$ describes the nature of the problem 
and usually involves non-estimable terms about the problem. 

For example, \citet{ben2010theory} formalized the difference as the $\Theta$-divergence and described the corresponding empirical term as (with $\Theta\Delta\Theta$ denoting the set of disagreement between two hypotheses in $\Theta$): 
\begin{align}
\begin{split}
    D_\Theta(\mP_s, \mP_t) = & 
    1 - \min_{\theta \in \Theta\Delta\Theta}\Bigl(\dfrac{1}{n}\sum_{\x:\theta(\x)=0}\mathbb{I}[\x \in (\X, \Y)_{\mP_s}] \\
    &+ \dfrac{1}{n}\sum_{\x:\theta(\x)=1}\mathbb{I}[\x \in (\X, \Y)_{\mP_t}]\Bigr).
    \label{eq:h-divergence}
\end{split}
\end{align}
%where $m$ denotes the number of unlabelled samples in $\mP_s$ and $\mP_t$ each. 
Also, \citet{ben2010theory} formalized
$\lambda = \epsilon_{\mP_t}(\theta^\star) + \epsilon_{\mP_s}(\theta^\star)$, 
where 
$\theta^\star = \argmin_{\theta\in\Theta} \left(\epsilon_{\mP_t}(\theta) + \epsilon_{\mP_s}(\theta)\right)$. 

In our discussion, as we assume that $\h$ applies to any $\x \in \mathcal{X}$ (according to \textbf{A2}), $\lambda=0$ as long as the hypothesis space is large enough. Therefore, the comparison reduces to comparing $c(\theta)$ and $D_\Theta(\mP_s, \mP_t)$.

To compare them, 
we need an extra assumption:
\begin{itemize}
    \item [\textbf{A4}:] \textbf{Sufficiency of Training Samples:}
    For the two finite datasets in the study, 
    \textit{i.e.}, $(\X,\Y)_{\mP_s}$ and $(\X,\Y)_{\mP_t}$, 
    and for any $\x \in (\X,\Y)_{\mP_t}$, 
    there exist one or more $\z \in (\X,\Y)_{\mP_s}$ such that 
    \begin{align}
        \x \in \{\x'| \x' \in \mathcal{X} \; \textnormal{and} \; 
        \x'_{\mathcal{A}(f_h,\z)} = \z_{\mathcal{A}(f_h,\z)}
        \}
    \end{align}
\end{itemize}

\textbf{A4} intuitively means
the finite training dataset needs to be diverse enough to 
describe the concept that needs to be learned. 
For example, imagine building a classifier to classify mammals \textit{vs.} fish that must transfer from the distribution of photos to that of sketches: 
we cannot expect the classifier to perform well on dolphins if dolphins only appear in the test sketch dataset. 
\textbf{A4} intuitively requires that,
if dolphins appear in the test sketch dataset, 
they must also appear in the training dataset. 

Now, 
in comparison to \citep{ben2010theory},
we have
\begin{theorem}
With Assumptions \textbf{A2}-\textbf{A4}, 
and if $1 - f_h \in \Theta$, 
we have
\begin{align}
\begin{split}
  c(\theta) \leq & D_\Theta(\mP_s, \mP_t) \\& + \dfrac{1}{n}\sum_{(\x,\y) \in (\X, \Y)_{\mP_t}} \mathbb{I}[\theta(\x)=\y]r(\theta, \mathcal{A}(f_m,\x))
\end{split}
\end{align}
where 
\begin{align*}
    c(\theta) =  \dfrac{1}{n}\sum_{(\x,\y) \in (\X, \Y)_{\mP_s}} \mathbb{I}[\theta(\x)=\y]r(\theta, \mathcal{A}(f_m,\x))
\end{align*}
and $D_\Theta(\mP_s, \mP_t)$ is defined as in~\eqref{eq:h-divergence}. 
\label{thm:comparison}
\end{theorem}

% The comparison %between $c(\theta)$ and $D_\Theta(\mP_s, \mP_t)$
% involves an extra term, 
The comparison involves an extra term, 
$q(\theta) := \frac{1}{n}\sum_{(\x,\y) \in (\X, \Y)_{\mP_t}} \mathbb{I}[\theta(\x)=\y] r(\theta, \mathcal{A}(f_m,\x))$, which intuitively measures, 
if $\theta$ learns $f_m$, 
how many samples $\theta$ can coincidentally predict correctly over the finite target set
used to estimate $D_\Theta(\mP_s, \mP_t)$. 
%We name this extra term $q(\theta)$. 
As a sanity check, 
if we replace $(\X, \Y)_{\mP_t}$ with $(\X, \Y)_{\mP_s}$, 
$D_\Theta(\mP_s, \mP_t)$ evaluates to 0 because it cannot differentiate two identical datasets, 
and $q(\theta)$ equals $c(\theta)$, so the bound holds with equality. 
On the other hand, 
if no samples from $(\X, \Y)_{\mP_t}$
can be mapped correctly with $f_m$ (coincidentally), 
$q(\theta)=0$ and 
$c(\theta)$ will be a lower bound of $D_\Theta(\mP_s, \mP_t)$. 
% To sum up, 
% the relationship between $c(\theta)$ 
% and $D_\Theta(\mP_s, \mP_t)$ 
% depends on the finite target dataset used to estimate $D_\Theta(\mP_s, \mP_t)$ as in 
% how many samples can $\theta$ coincidentally predict correctly 
% by learning $f_m$. 

The value of Theorem~\ref{thm:comparison} 
lies in the fact that, 
for an arbitrary target dataset $(\X,\Y)_{\mP_t}$ 
in which no sample can be predicted correctly 
by learning $f_m$ (a situation likely to occur for arbitrary datasets since $f_m$ is unlikely to be shared across the source dataset and any arbitrary target dataset), 
$c(\theta)$ is always a lower bound of $D_\Theta(\mP_s, \mP_t)$. 

Further, 
when Assumption \textbf{A4} does not hold, 
we are unable to derive a clear relationship between 
$c(\theta)$ and $D_\Theta(\mP_s, \mP_t)$. 
The difference mainly arises from the fact that, 
intuitively, 
we are only interested in problems that are ``solvable'' 
(\textbf{A4}, \textit{i.e.}, the hypothesis used to reduce the test error on the target distribution can be learned from the finite training samples) 
but ``hard to solve'' 
(\textbf{A2}, \textit{i.e.}, another labeling function, namely $f_m$, 
exists for features sampled from the source distribution only), 
while $D_\Theta(\mP_s, \mP_t)$ estimates the divergence of two arbitrary distributions. 


\subsection{Estimation of the Discrepancy}
The estimation of $c(\theta)$ mainly 
involves two challenges: 
the requirement of the knowledge of $\s$ 
and the computational cost to search over the entire space $\mathcal{X}$. 

The first challenge is unavoidable by definition
because human-aligned learning has to be built upon 
prior knowledge of what labeling function a human considers similar (what $\h$ is)
or its opposite (what $\s$ is). 
Fortunately, 
as discussed in Section~\ref{sec:related}, 
existing methods are usually developed with prior knowledge of what the misaligned features are, 
suggesting that this knowledge is often directly available.
% of $\mathcal{A}(\s,\x)$.

The second challenge is about the computational cost to search, 
and the community has several techniques to help reduce the burden. 
For example, the search can be terminated 
once $r(\theta, \mathcal{A}(\s,\x))$ is evaluated as $1$ 
(\textit{i.e.}, once we find a perturbation of misaligned features that alters the prediction).
This procedure is similar to how adversarial attacks \citep{goodfellow2015explaining} 
are used to evaluate the robustness of models. 
To further reduce the computational cost, 
one can also generate out-of-domain data by perturbing misaligned features beforehand
and use these fixed data to test models. 
Using fixed data to evaluate might not be as accurate as 
using a search process, 
but sometimes, it can be good enough to reveal some interesting properties of the models \citep{Jo2017, geirhos2018imagenettrained, wang2020high}. 
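
As a hedged illustration of these two shortcuts, 
assume that predictions and labels are binary, so that $|\theta(\z) - \y| \in \{0,1\}$, 
and let $\hat{\mathcal{Q}}(\x)$ (our notation, introduced for this sketch only) denote a fixed, pre-generated subset of the set $\mathcal{Q}(\x)$ of perturbations of the misaligned features of $\x$ over which the search ranges (defined formally in Section~\ref{sec:robust}). 
Reading $r(\theta, \mathcal{A}(\s,\x))$ as the per-sample maximum on the RHS of \eqref{eq:method:worst}, we have
\begin{align*}
    \max_{\z \in \hat{\mathcal{Q}}(\x)} |\theta(\z) - \y| 
    \;\leq\; \max_{\z \in \mathcal{Q}(\x)} |\theta(\z) - \y| 
    \;\leq\; 1. 
\end{align*}
The second inequality is why the search can stop at the first $\z$ with $|\theta(\z) - \y| = 1$: the maximum is already attained. 
The first inequality is why evaluating on fixed data can only underestimate $r(\theta, \mathcal{A}(\s,\x))$, and hence the resulting estimate of $c(\theta)$: 
a perturbation that alters the prediction may exist in $\mathcal{Q}(\x)$ but be absent from $\hat{\mathcal{Q}}(\x)$. 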

\section{Methods to Learn Human-aligned Robust Models}
\label{sec:robust}
We continue to study how our analytical results above can lead to 
practical methods to learn human-aligned robust models. 
We first show that our discussion
can naturally connect to existing methods for robust machine learning 
discussed in Section~\ref{sec:related}. 
% Further, as these methods 
% mostly require some prior knowledge of the misaligned features, 
% we continue to explore a new method that does not require so. 

Theorem~\ref{thm:cua} suggests that
training a human-aligned robust model amounts to training 
for small $c(\theta)$ and small empirical error (\textit{i.e.}, $\wepst$). 

\subsection{Worst-case Training}
To simplify the notation, we define $\mathcal{Q}(\x):= \{ \x_{\mathcal{A}(\s,\x)} \in \dom(\s)_{\mathcal{A}(\s,\x)} \}$. 
We can consider an upper bound of $c(\theta)$:
\begin{align}
\begin{split}
    c(\theta) 
    \leq &\dfrac{1}{n}\sum_{(\x,\y)\in (\X,\Y)} r(\theta, \mathcal{A}(\s,\x)) \\
    = & \dfrac{1}{n}\sum_{(\x,\y)\in (\X,\Y)} \max_{\z \in \mathcal{Q}(\x)} |\theta(\z) - \y|,
    \label{eq:method:worst}
\end{split}
\end{align}
which intuitively means that, 
instead of studying only the correct predictions made because $\theta$ learns $\s$, as $c(\theta)$ does, 
we now study any prediction made because $\theta$ learns $\s$. 
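
To make the gap between the two sides concrete, consider a toy example (the numbers are ours and purely illustrative) with $n=3$ binary-labeled samples, 
reading $r(\theta, \mathcal{A}(\s,\x))$ as the per-sample maximum in \eqref{eq:method:worst} and assuming $\x \in \mathcal{Q}(\x)$: 
$\x_1$ is predicted correctly, but some $\z \in \mathcal{Q}(\x_1)$ alters the prediction; 
$\x_2$ is predicted correctly and no $\z \in \mathcal{Q}(\x_2)$ alters the prediction; 
$\x_3$ is predicted incorrectly. 
Then
\begin{align*}
    c(\theta) &= \tfrac{1}{3}\left(1\cdot 1 + 1 \cdot 0 + 0 \cdot 1\right) = \tfrac{1}{3}, \\
    \tfrac{1}{3}\sum_{i=1}^{3} \max_{\z \in \mathcal{Q}(\x_i)} |\theta(\z) - \y_i| &= \tfrac{1}{3}\left(1 + 0 + 1\right) = \tfrac{2}{3},
\end{align*}
where each product in the first line is $\mathbb{I}[\theta(\x_i)=\y_i]\, r(\theta, \mathcal{A}(\s,\x_i))$. 
In this example, the bound is loose only on $\x_3$, whose error is counted by the worst-case term but not by $c(\theta)$. 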

Further, as 
\begin{align*}
    |\theta(\x) - \y| \leq \max_{\z \in \mathcal{Q}(\x)} |\theta(\z) - \y|, 
\end{align*}
the RHS of \eqref{eq:method:worst} upper-bounds the empirical loss, 
so a model with a small value of \eqref{eq:method:worst} also has a small empirical loss. 
Therefore, 
after we replace $|\theta(\x) - \y|$ with a generic loss term $\ell(\theta(\x),\y)$,
we can directly train a model with
\begin{align}
    \min_{\theta \in \Theta} \dfrac{1}{n}\sum_{(\x,\y)\in (\X,\Y)}  \max_{\z \in \mathcal{Q}(\x)} \ell(\theta(\z),\y)
    \label{eq:method:worst-da-result}
\end{align}
to get a model with small $c(\theta)$ and small empirical error. 

The above method augments the data by perturbing the misaligned features to maximize the training loss 
and solves the optimization problem with the augmented data. 
This is the worst-case data augmentation method \citep{FawziSTF16} we discussed previously,
and it is also closely connected to one of the most widely accepted methods 
for adversarial robustness, namely adversarial training \citep{MadryMSTV18}.

While the above result shows that a method for learning human-aligned robust models is mathematically connected to worst-case data augmentation, 
in practice, a general application of this method requires some additional assumptions, 
which are discussed in detail in the appendix. 
\label{sec:worst-da}

We continue from the RHS of \eqref{eq:method:worst} to discuss another reformulation by reweighting sample losses for optimization, which leads to:
\begin{align}
    \dfrac{1}{n}\sum_{(\x,\y)\in (\X,\Y)} \max_{\z \in \mathcal{Q}(\x)} \lambda(\z)|\theta(\z) - \y|
    \label{eq:method:worst2}
\end{align}

The conditions (assumptions) that we need for $c(\theta)$ to be upper-bounded by \eqref{eq:method:worst2} are discussed in the appendix. 
Now, we will continue with  
\begin{align}
    c(\theta) \leq \dfrac{1}{n}\sum_{(\x,\y)\in (\X,\Y)} \max_{\z \in \mathcal{Q}(\x)} \lambda(\z)|\theta(\z) - \y|
    \label{eq:method:worst2:2}
\end{align}

When \eqref{eq:method:worst2:2} holds, replacing $|\theta(\z) - \y|$ with a generic loss $\ell(\theta(\z),\y)$ and minimizing it is another direction of learning robust models, which corresponds to distributionally robust optimization (DRO) \citep{ben2013robust, duchi2021statistics}. 

Further, depending on the implementation of $\lambda(\x)$, 
DRO has been realized with different concrete solutions, 
sometimes with structural assumptions \citep{hu2018does}, 
such as
\begin{itemize}
    \item Adversarially reweighted learning (ARL) \citep{NEURIPS2020_07fc15c9}
    uses another model $\phi: \mathcal{X} \times \mathcal{Y} \rightarrow[0,1]$ to identify samples with misaligned features that cause high losses of model $\theta$ and defines
    \begin{align*}
        \lambda(\x)=1+\vert(\X, \Y)\vert \cdot \frac{\phi\left(\x\right)}{\sum_{(\z,\y) \in (\X, \Y)} \phi\left(\z\right)}
    \end{align*}
    \item Learning from failures (LFF) \citep{nam2020learning} also trains another model $\phi$ by amplifying its early-stage predictions and defines 
    \begin{align}
        \lambda(\x)=\frac{\ell\left(\phi(\x), \y\right)}{\ell\left(\phi(\x), \y\right)+\ell\left(\theta(\x), \y\right)}
    \end{align}
    \item Group DRO \citep{Sagawa*2020Distributionally} assumes the availability of a structural partition of the samples, and defines the weight of samples in partition $\mathbf{g}$ as
    \begin{align}
     \lambda(\x)= \dfrac{\exp \left(\ell\left(\theta(\x), \y\right)\right)}{\sum_{(\z,\y) \in (\X, \Y)_\mathbf{g}} \exp \left(\ell\left(\theta(\z), \y\right)\right)},
    \end{align}
    if $(\x, \y) \in (\X, \Y)_\mathbf{g}$, the samples of partition $\mathbf{g}$.
\end{itemize}
These discussions are expanded in the appendix. 
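
As a hedged numerical illustration of how these weights behave (the loss values below are invented for the example and are not taken from any experiment), 
suppose a sample has $\ell(\phi(\x),\y) = 2.0$ and $\ell(\theta(\x),\y) = 0.5$. 
The LFF weight is then
\begin{align*}
    \lambda(\x) = \frac{2.0}{2.0 + 0.5} = 0.8, 
\end{align*}
whereas a sample with $\ell(\phi(\x),\y) = 0.1$ and the same $\ell(\theta(\x),\y) = 0.5$ receives $\lambda(\x) = 0.1/0.6 \approx 0.17$, 
so samples on which $\phi$ incurs a high loss relative to $\theta$ are emphasized. 
Similarly, for Group DRO, a partition $\mathbf{g}$ with two samples whose losses under $\theta$ are $0$ and $\ln 3$ yields
\begin{align*}
    \lambda(\x_1) = \frac{e^{0}}{e^{0} + e^{\ln 3}} = \frac{1}{4}, 
    \qquad 
    \lambda(\x_2) = \frac{e^{\ln 3}}{e^{0} + e^{\ln 3}} = \frac{3}{4}, 
\end{align*}
so that, within a partition, the weights sum to one and the sample with the larger loss dominates. 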


% \subsubsection{Connections to Distributionally Robust Optimization (DRO)}
% We generalize the above analysis of worst-case data augmentation to a DRO problem \citep{ben2013robust, duchi2021statistics}. Given $n$ data points, consider a perturbation set $\mathcal{Q} := \{ \mathbf{x}_{\mathcal{A}(f_m, \mathbf{x}_i)} \in \operatorname{dom}(f)_{\mathcal{A}(f_m, \mathbf{x}_i)} \}_{i=1}^n$ encoding the features of $\x$ indexed by $\mathcal{A}(f,\x)$ over input space $\dom(f_m)$. Denote $q(\x, \y)$ and $p(\x,\y)$ are densities from the $\mathcal{Q}$ and training distribution $\mathcal{X} \times \mathcal{Y}$, respectively. Then (\ref{eq:method:worst-da-result}) can be rewritten as a DRO problem over a new distribution $\mathcal{Q}$.

% \begin{equation}
%     c(\theta) \le \min_{\theta \in \Theta} \max_{(\x_i,\y_i)\in \mathcal{Q}} \dfrac{1}{n}\sum_{i=1}^{n} \ell(\theta(\x_i),\y_i)
% \end{equation}

% $\mathcal{Q}$ encodes the priors about feature perturbation that model should be robust to. Therefore, choosing $f$-divergence as the distance metric where $f$ is convex with $f(1) = 0$, $\delta > 0$ as a radius to control the degree of the distribution shift, adversarial robustness in Section~\ref{sec:worst-da} can be viewed as an example of DRO on an infinite family of distributions with implicit assumptions that samples in $\mathcal{Q}$ are visually indistinguishable from original ones. For $p$ and $q$ that $p(\x, \y) = 0$ implies $q(\x, \y) = 0$, we arrive at a generic weighted risk minimization (WRM) formulation \citep{NIPS2016_4588e674, duchi2021statistics} when weights (by default as density ratios) $\lambda_\phi = q(\x_i, \y_i)/p(\x_i,\y_i) < 1$ in (\ref{eq:method:wrm_loss}) derived from misaligned functions for 
% \begin{align}
% \begin{split}
%     c(\theta) \le \min _{\theta \in \Theta} \max _{\lambda_\phi \in \mathcal{U}_f} \frac{1}{n} \sum_{i=1}^{n} \lambda_{\phi}\left(\x_{i}, \y_{i}\right) \cdot \ell\left(\theta\left(\x_{i}\right), \y_{i}\right)
%     %\label{eq:method:wrm}
%     \label{eq:method:wrm_loss}
% \end{split}
% \end{align}

% where the uncertainty set $\mathcal{U}_f$ is reformulated as 
% \begin{align}
% %\begin{equation}
% \mathcal{U}_f := \{\lambda_\phi(\x_i,\y_i)| & D_f(q(\x_i,\y_i)||p(\x_i,\y_i)) \le \delta, \\ & \sum_{i=1}^{n}\lambda_\phi(\x_i,\y_i)=1, \\ & \forall \lambda(x_i,y_i)\ge 0 \}
% %\end{equation}
% \end{align}

% \hl{Equivalence}
% %%%%%%%%%%%

% Recall the DRO formulation as
% \begin{align}
%     \min_{\theta \in \Theta} \max_{\z \in \mathcal{Q}(\x)} \dfrac{1}{n}\sum_{(\x,\y)\in (\X,\Y)}  \ell(\theta(\x),\y)
%     \label{eq:method:worst-da-result}
% \end{align}

% To show the equivalence of DRO and WRM, we only need to show the a local minimum
% of an expected risk mixture is a DRO local minimum.

% Assuming 
% Denote $\theta^* \in \Theta$ and  $\z^*$ as the optimal optimal hypothesis of the vanilla minimax DRO problem (\ref{eq:method:worst-da-result}) and its maximin problem (\ref{eq:method:maximin}), respectively.

% \begin{align}
%     \max_{\z \in \mathcal{Q}(\x)} \min_{\theta \in \Theta} \dfrac{1}{n}\sum_{(\x,\y)\in (\X,\Y)}  \ell(\theta(\x),\y)
%     \label{eq:method:maximin}
% \end{align}

% %%%%%%%%%%%%%
% \hl{Equivalence}

% Intuitively, learner $\theta$ and adversary $\phi$ are playing a minimax game where $\phi$ finds worst-case weights and computationally-identifiable regions of errors to improve the robustness of the learner $\theta$. In this scenario, we unify a line of WRM approaches where weights $\lambda_\phi$ are mainly determined by misaligned features $\mathcal{A}(f_m, \mathbf{x})$, either parameterized by a biased model or derived from some heuristic statistics. 

% %We consider an extended active sets $\tilde{\mathcal{A}}$ defined as
% %\begin{equation}
% %    \tilde{\mathcal{A}} = (\mathop{\cup}\limits_{i=1}^{n} \mathcal{A}(f_m,\mathbf{x}_i) ) \bigcup (\mathop{\cup}\limits_{j=1}^{n} \mathcal{A}(f_h,\mathbf{x}_j) )
% %\end{equation}
% \begin{comment}
% Following Eq. (\ref{eq:method:worst-da-result})
% \begin{align}
% \begin{split}
%     c(\theta) 
%     & \leq \min_{\theta \in \Theta} \max_{\mathbf{x}_{\tilde{\mathcal{A}}}} 
%     \dfrac{1}{n}\sum_{i=1}^{n}  \ell(\theta(\x_i),\y_i) \\ & = 
%     \min_{\theta \in \Theta} \dfrac{1}{n}\sum_{i=1}^{n}  \ell(\frac{w_{i} \theta(\x_{i})}{\sum w_{i}},\y_i) \\ & 
%     \leq \min_{\theta \in \Theta} \dfrac{1}{n} \frac{w_{i} \ell(\theta(\x_i),\y_i)}{\sum w_{i}} \\ &
%     \leq \min_{\theta \in \Theta} \dfrac{\lambda_m}{n} \sum_{i=1}^{n} \ell(\theta(\x_i),\y_i)
%     \label{eq:method:wrm_loss}
%     % (\x,\y)\in (\X,\Y) 
% \end{split}
% \end{align}
% \end{comment}
% %Adversarial WRM with f-divergence. \hl{Worst-case parameterization or simply assigns larger weight $w_i$ to data $(x_i, y_i)$ with a larger loss.}
% % where $\lambda_m \ge 1$ is a constant and $w_i$ is a sample-wise weight such that $\lambda_\phi$ in (\ref{eq:method:wrm}) is converted as $w_i/\sum w_i$ with various design choices.

% %The first inequality extends the features by the union of active sets. The second equality follows the assumption that the training data can be approximated by the mixture of other data points, normalized by the sum of weights. Intuitively, the assumption shows that searching the worst-case sample over misaligned and aligned feature spaces are equivalent to WRM, which is a common practice when data contains meta information about subpopulation under mild conditions \citep{rockafellar2000optimization, duchi2020distributionally, Sagawa*2020Distributionally}. Here we consider the most generic case where covariates are form by mixing any pair of data points from raw dataset\footnote{can also be seen as treating each data point as a subpopulation}. By the convexity of cross-entropy and Jensen inequality, we arrive at the third inequality. Then for variants of WRM methods, differences are only the determined of mixing coefficients $w_i$ or $\lambda_m$. Therefore, the learning via WRM is essentially minimizing the upper bound of $c(\theta)$.

% \begin{comment}
% \begin{align}
% \begin{split}
%     c(\theta) 
%     & \leq \lambda_{m} c(\theta) + \lambda_{h} \tilde{c}(\theta) \\ & = \dfrac{\lambda_{m}}{n}\sum_{(\x,\y)\in (\X,\Y)_{\mP_s}} \mathbb{I}[\theta(\x)= \y] r(\theta, \mathcal{A}(\s,\x)) \\ & + \dfrac{\lambda_h}{n}\sum_{(\x, \y) \in (\X, \Y)_{\mP_s}} \mathbb{I}[\theta(\x)\ne \y]r(\theta, \mathcal{A}(\s,\x)) \\
%     \label{eq:method:generic}
% \end{split}
% \end{align}
% \end{comment}

% On one hand, for those $\phi$ is parameterized as a second player, we consider the following two approaches:

% \textbf{Adversarially reweighted learning (ARL) \citep{NEURIPS2020_07fc15c9}} is a variant of DRO that uses an adversary model $f_\phi$ (for a slight abuse of notation) to identify samples with misaligned features and regions where the learner makes significant errors. Similar to (\ref{eq:method:wrm_loss}), it learns worst-case per-example importance weights for training $\theta$ via minimizing the upper bound of $c(\theta)$, where $\lambda_{\phi}\left(\x_{i}, \y_{i}\right)=1+n \cdot \frac{f_{\phi}\left(\x_{i}, \y_{i}\right)}{\sum_{i=1}^{n} f_{\phi}\left(\x_{i}, \y_{i}\right)}$. Note that we only consider the first iteration of ARL and it is straightforward to extend the analysis to an iterative version.

% \textbf{Learning from failures (LFF) \citep{nam2020learning}} trains a biased neural network $\phi$ by amplifying its early-stage predictions and a debiased neural network $\theta$ by focusing on samples that the biased model struggles to learn.Then re-weight training samples using the relative difficulty score based on the loss of the biased model and the debiased model. Assuming $\theta$ and $\phi$ are from the same hypothesis space $\Theta$, we have $\phi$ trained by the generalized cross entropy loss as 

% \begin{align}
% \begin{split}
%     \min_{\phi \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \ell_{\operatorname{gce}}\left(\phi(\x_i), \y_i\right)
%     \label{eq:method:lff_biased} 
% \end{split}
% \end{align}

% LFF serves as a special case of (\ref{eq:method:wrm_loss}) that $\theta$ is learned by optimizing (\ref{eq:method:wrm_loss}) when $\lambda_\phi(\x_i,\y_i)=\frac{\ell_{\operatorname{ce}}\left(\phi(\x_i), \y_i\right)}{\ell_{ce}\left(\phi(\x_i), \y_i\right)+\ell\left(\theta(\x_i), \y_i\right)}$ is the weighted average of the cross entropy prediction of $\phi$ and $\theta$.

% \begin{comment}
% \begin{align}
% \begin{split}
%     c(\theta) \le \min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \mathcal{W}(\x_i) \cdot \ell \left(\theta(\x_i), \y_i\right) \label{eq:method:lff_debiased} 
% \end{split}
% \end{align}

% where \begin{align}
% \begin{split}
% \mathcal{W}(x)=\frac{\ell_{\operatorname{ce}}\left(\phi(\x_i), \y_i\right)}{\ell_{ce}\left(\phi(\x_i), \y_i\right)+\ell\left(\theta(\x_i), \y_i\right)}
% \label{eq:method:lff_weight}
% \end{split}
% \end{align}
% \end{comment}

% On the other hand, in the following special cases, $\phi$ is a constant chosen by some approximate solutions or heuristics.

% \textbf{DRO with structural assumptions \citep{hu2018does}} takes a perspective that the training distribution is a mixture of multiple groups such that the perturbation set $\mathcal{Q}$ is composed of any mixture of these groups. % KL-DRO \citep{hu2013kullback}, and Conditional value at risk (CVaR) \citep{rockafellar2000optimization}. encoding a pre-determined family of distributions or parametric generative models

% We consider the GroupDRO \citep{Sagawa*2020Distributionally} variant, which assumes data can be grouped by available meta-information such as demographics. Intuitively, data within the group are likely to share the same distribution. Defining the perturbation set according to group labels or demographic attributes, GroupDRO re-samples the training data for learning model parameters according to the worst group loss.
% % and CVaR or the average of the $\lceil \alpha N\rceil$ ($\alpha \in [0,1]$, N is the amount of population) largest. , respectively.
% \begin{align}
% \begin{split}
% c(\theta) \le \min_{\theta \in \Theta} \frac{1}{n} \sum_{g=1}^{m} \lambda_{g} \sum_{i=1}^{|P_{g}|}\ell(\theta ;(\x_i, \y_i))
% \end{split}
% \end{align}
% where $P_g$ is indexed by $m$ groups forming the entire dataset as $\mathcal{P}=\left\{P_{g}\right\}_{g=1}^{|\mathcal{P}|}$. In practice, $\lambda_g$ is determined by $\exp \left(\ell\left(\theta(\x_i), \y_i)\right)\right)/ \sum\limits_{j=1}^{n} \exp \left(\ell\left(\theta(\x_j), \y_j)\right)\right)$ for any $(\x_i, \y_i) \in P_g$. Since samples within the same group are assigned weights homogeneously, obviously it is a special case of (\ref{eq:method:wrm_loss}).

% %KL-DRO shows worst-case expectation admits an analytically tractable solution and can be levaeraged to convert KL constrained DRO into per-sample WRM problem, which is essentially optimizing the (\ref{eq:method:wrm_loss}) objective.

% \textbf{Just train twice (JTT) \citep{pmlr-v139-liu21f}} works in a case where failure samples from ERM are resampled to form a two-group optimization problem. Intuitively, JTT upweights the identified failure samples (denoted as $\mathcal{D}_{f}$) with misaligned features $\lambda_{up}$ times and retrain the model for a second time by essentially optimizing the upper bound of $c(\theta)$ in (\ref{eq:method:wrm_loss}) where $\lambda_\phi$ are set to be $(\lambda_{up}+1)$ for failure samples and $1$ for correctly predicted samples in the first round

% %\begin{comment}
% \begin{align}
% \begin{split}
%     c(\theta) 
%     & \leq \min _{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} (\lambda_{up}[\mathbbm{1}[(\x_i,\y_i) \in \mathcal{D}_{f}]]+1)  \ell\left(\theta\left(\x_{i}\right), \y_{i}\right)
%     %c(\theta) + \tilde{c}(\theta)
%     \label{eq:method:jtt}
% \end{split}
% \end{align}
% %\end{comment}

\subsection{Regularizing the Hypothesis Space}
Connecting our theory to the other main thread is a little more involved, 
as we need to extend the model to an encoder/decoder structure, where we use $e_\theta$ and $d_\theta$ to denote the encoder and the decoder, respectively. 
Thus, by definition of classification models, we have $\theta(\x) = d_\theta(e_\theta(\x))$.  
Further,
we define $\s'$ as the equivalent of $\s$, the only difference being that 
$\s'$ operates
on the representations $e_\theta(\x)$. 
With this setup, 
optimizing the empirical loss and $c(\theta)$ leads to (details in the appendix):
\begin{align}
    \min_{d_\theta, e_\theta}\dfrac{1}{n}\sum_{(\x,\y)\in (\X,\Y)}\ell(d_\theta(e_\theta(\x)), \y) - \ell(\s'(e_\theta(\x)), \y), 
    \label{eq:regularization}
\end{align}
which is highly related to methods used to learn auxiliary-annotation-invariant representations, 
and the most popular example of these methods is probably DANN \citep{GaninUAGLLML16}. 

The remaining question is how to obtain $\s'$. 
We can design a specific architecture given prior knowledge of the data, so that $\s'$ can be directly estimated through 
\begin{align}
    \min_{\s'} \dfrac{1}{n}\sum_{(\x,\y)\in (\X,\Y)} \ell(\s'(e_\theta(\x)),\y), 
    \label{eq:regularization:1}
\end{align}
which connects to several methods in Section~\ref{sec:related}, such as \citep{WangGLX19, bahng2019learning}. 
Alternatively, we can estimate $\s'$ with additional annotations (\textit{e.g.}, domain ids, batch ids, \textit{etc.}), in which case we estimate $\s'$ by (with $\mathbf{t}$ denoting the additional annotation) 
\begin{align}
    \min_{\s'} \dfrac{1}{n}\sum_{(\x,\mathbf{t})\in (\X,\mathbf{T})} \ell(\s'(e_\theta(\x)),\mathbf{t}),
    \label{eq:regularization:2}
\end{align}
which connects to methods in domain adaptation literature such as \citep{GaninUAGLLML16,li2018domain}. 

%%% Discussion of the connection (to put in the appendix) %%% 

\subsection{A New Heuristic: Worst-case Training with Regularized Hypothesis Space}

Our analysis showed that optimizing for a small $c(\theta)$ naturally connects to the two mainstream families of methods used to train robust models in the literature, 
which inspires us to design a new method that combines these two directions. 
The rationale is to inherit the empirical strengths of both families by directly combining their major components.

Therefore, we introduce a new heuristic that combines
the worst-case training \eqref{eq:method:worst} 
and the regularization method \eqref{eq:regularization} and \eqref{eq:regularization:2}, where the additional annotation $\mathbf{t}$ indicates whether a sample is originally from $(\X, \Y)$ or generated during training. 

\begin{algorithm}
\small 
\SetAlgoLined
\KwResult{$\theta^{(I)}$}
\textbf{Input:} total iterations $I$, $(\X,\Y)$\; 
 initialize $\theta^{(0)}$, $i=1$\;
 \While{$i \leq I$}{
 \For {sample $(\x,\y)$}{
    assign additional label $\mathbf{t}_\x=0$ for $\x$\;
    sample $\z \in \mathcal{Q}(\x)$ that maximizes $\ell(\theta(\z), \y)$\;
    assign additional label $\mathbf{t}_\z=1$ for $\z$\;
    update $\s'$ with \eqref{eq:regularization:2}\;
    update $\theta$ with \eqref{eq:regularization}\;
    update $\theta$ with $\z$ with the equivalent of \eqref{eq:regularization}, 
    $
        \min_{d_\theta, e_\theta}\ell(d_\theta(e_\theta(\z)), \y) - \ell(\s'(e_\theta(\z)), \y)
    $\;
 }
 $i \leftarrow i+1$\;
 }
 \caption{worst-case training with regularized hypothesis space}
 \label{algorithm:main}
\end{algorithm}  

In particular, our heuristic is illustrated in Algorithm~\ref{algorithm:main}. 
In practice, we also introduce a hyperparameter to balance the two losses in \eqref{eq:regularization}. 
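
One way to write this balance, as a sketch only (the symbol $\gamma \geq 0$ is our assumed name for the hyperparameter, introduced here for illustration), is
\begin{align*}
    \min_{d_\theta, e_\theta}\dfrac{1}{n}\sum_{(\x,\y)\in (\X,\Y)}\ell(d_\theta(e_\theta(\x)), \y) - \gamma\,\ell(\s'(e_\theta(\x)), \y), 
\end{align*}
where $\gamma = 1$ recovers \eqref{eq:regularization} and $\gamma = 0$ reduces to minimizing the empirical loss alone. 
Under this reading, the update on the generated sample $\z$ in Algorithm~\ref{algorithm:main} would use the same $\gamma$. 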

\section{Experiments}
\label{sec:exp}
\begin{table*}[t]
\small 
\centering 
\begin{tabular}{cccccccccccc}
\hline
 & Vanilla & SN & LM & RUBi & ReBias & Mixup & Cutout & AugMix & WT & Reg & WR \\ \hline
Standard Acc. & 90.80 & 88.40 & 67.90 & 90.50 & 91.90 & 92.50 & 91.20 & 92.90 & 92.50 & 93.10 & \textbf{93.30} \\
Weighted Acc. & 88.80 & 86.60 & 65.90 & 88.60 & 90.50 & 91.20 & 90.30 & 91.70 & 91.30 & \textbf{92.20} & 92.00 \\
ImageNet-A & 24.90 & 24.60 & 18.80 & 27.70 & 29.60 & 29.10 & 27.30 & \textbf{31.50} & 28.50 & 30.00 & 29.60 \\
ImageNet-Sketch & 41.10 & 40.50 & 36.80 & 42.30 & 41.80 & 40.60 & 38.70 & 41.40 & 43.00 & 42.50 & \textbf{43.20} \\
Average & 61.40 & 60.03 & 47.35 & 62.28 & 63.45 & 63.35 & 61.88 & 64.38 & 63.83 & 64.45 & \textbf{64.53} \\ \hline
\end{tabular}
\caption{Comparison of results on the nine super-class ImageNet classification task. The best result in each row is in bold.}
\label{tab:miniimagenet}
\end{table*}

We present the experiments that support our theory in the appendix and discuss the performance comparison here. 

% \paragraph{Image classification}
To test the performance of our new heuristic, 
we compare our methods with fairly recent and strong baselines. 
In particular, 
we follow the setup of a direct precedent of our work \citep{bahng2019learning}
to compare the models on the nine super-class ImageNet classification task \citep{ilyas2019adversarial} with class-balanced strategies. 
Also, we follow \citep{bahng2019learning}
to report standard accuracy, 
weighted accuracy (a scenario where
samples with unusual texture are weighted more), 
and accuracy over ImageNet-A \citep{hendrycks2019natural},
a collection of 
failure cases for most ImageNet-trained models. 
Additionally, we report 
the performance over ImageNet-Sketch \citep{WangGLX19}, 
an independently collected ImageNet test set
with only sketch images. 

We test our method with the pipeline made available by \citep{bahng2019learning}, 
and we compare with the vanilla network; 
several methods that are designed to reduce the texture bias, 
including 
StylisedIN (SN) \citep{geirhos2018imagenettrained}, 
LearnedMixin (LM) \citep{clark2019don}, 
RUBi \citep{cadene2019rubi},
and ReBias \citep{bahng2019learning}; 
and several other baselines proven effective in learning robust models, 
such as Mixup \citep{zhang2017mixup},
Cutout \citep{devries2017improved}, 
and AugMix \citep{hendrycks2019augmix}. 
In addition, we compare our worst-case training (WT), 
regularization (Reg), 
and the introduced heuristic (WR). 
For our methods, we follow the observations in \citep{wang2020high} 
suggesting the relationship between frequency-based perturbation and the model's performance,
and design the augmentation of frequency-based perturbation with different radii. 

We report the results in Table~\ref{tab:miniimagenet}. 
Our results suggest that, 
while the augmentation method we used is much simpler 
than the ones used in AugMix, 
our empirical results are fairly strong in comparison. 
With a simple perturbation inspired by \citep{wang2020high}, 
our new heuristic outperforms the other methods on average 
over these four test scenarios. 

\section{Discussion}
\label{sec:diss}
Before we conclude, we would like to devote a section to discuss several topics more broadly related to this paper. 
% We hope the discussion can also help define the scope of our contributions.

\textbf{Human-aligned machine learning may not be solvable in general without prior knowledge.}
Following the notation of this paper, for any two functions $f_1$ and $f_2$, it is the human, rather than any statistical property, who decides whether $\h=f_1$ or $\h=f_2$. 
This remark is a restatement of our motivating example in Figure~\ref{fig:intro}. 
Our proposed method forgoes the requirement of prior knowledge
and is validated empirically on certain benchmark datasets. 

\textbf{Do all the model's understandings of the data have to be aligned with a human's?}
Probably not.
As we have discussed in the preceding sections, 
there are also scenarios where it is beneficial for a model's perception to outperform a human's. 
For example, we may expect models to outperform the human vision system when applied to scientific discovery at the molecular level. 
This paper investigates these questions for scenarios where alignment is essential. 

\textbf{In practice, there is probably more than one source of misaligned features.}
We aim to contribute a principled understanding of 
the problem, starting with its basic form.
The extension of our analysis to multiple sources of misaligned features
is left as a future direction. 

\textbf{Differences between overfitting and learning misaligned features.} 
A critical difference is that 
overfitting can typically be observed empirically 
with a split of train and test datasets, 
while the learning of misaligned features 
usually cannot, because the misaligned features
can hold across both the train and test splits. 
% For example,
% we split $(\X,\Y)$ into the train and test split:
% if both $\h$ and $\s$ exist for $(\X,\Y)$, 
% a vanilla training over the split data is unlikely to be human-aligned;
% on the other hand, if only $\h$ exists for $(\X,\Y)$, 
% a vanilla training over the split data will be human-aligned. 
% However, the model can be overfit or not in both cases. 

\textbf{Other related works.}
There is also a proliferation of works 
that aim to improve the robustness of machine learning methods
from a data perspective, 
such as the methods developed to counter 
spurious correlations \citep{vigen2015spurious}, 
confounding factors \citep{mcdonald2014confounding}, 
or dataset bias \citep{torralba2011unbiased}. 
We believe that the statistical connection between our analysis and these topics
is also an interesting future direction. 
Further, there is an active line of research that aims to align human and model perception of data 
by studying how humans process images \citep{KubiliusSHMRIKB19,MarblestoneWK16,nayebi2017biologically,lindsay2018biological,BWCMRBSPT19,DapelloMSGCD20}. 

In addition, how human annotation can help models 
generalize in non-trivial test scenarios has also been explored. 
For example, \citet{ross2017right} built expert annotation into the model 
to regularize the model's explanations and counter its tendency to learn misaligned features. 
This study has been extended by multiple follow-ups that introduce human annotation into the interpretation of models \citep{schramowski2020making,teso2019explanatory,lertvittayakumjorn2020find}, 
showing that human knowledge helps models learn concepts that generalize 
in non-i.i.d.\ scenarios. 

\section{Conclusion}
\label{sec:con}
In this paper, 
we built upon the importance of learning human-aligned models
and studied the generalization properties of a model
with the goal of aligning the model with the human. 
We extended the widely accepted generalization error bound 
with an additional term for the differences 
between the human and the model; 
this new term depends on 
how the misaligned features are associated with the label. 
Optimizing for both a small empirical loss and a small value of this term 
will lead to a model that is better aligned with humans. 
Thus, 
our analysis naturally offers a set of methods for this problem. 
Interestingly, 
these methods are closely connected to established methods 
in multiple topics regarding robust machine learning. 
Finally, 
noticing that our analysis links to two mainstream families of methods for learning robust models, 
we proposed a new heuristic that combines them. 
In a fairly advanced experiment, we demonstrated the empirical strength of our new method. 

\subsubsection*{Acknowledgement}
This work was supported in part by NIH R01GM114311, NIH P30DA035778, NSF IIS1617583, NSF CAREER IIS-2150012 and IIS-2204808.  

\bibliography{Wang_576}
% \bibliographystyle{plain}

% \appendix
% \newpage 
% % \section{Appendix}
% %\label{sec:con}
% \input{secs/appendix}

\end{document}