% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} 
%% In your camera-ready you should use the 'accepted' parameter. This shows the authors and how an accepted paper will look like. The footer is 'Acccepted for X'. In the final version, the proceedings chairs will add the page numbers for PMLR and the final footer will be 'Proceedings of X'.
%
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

% ****** our packages
\usepackage{xcolor}
% \usepackage[colorlinks=true,allcolors=blue]{hyperref}
\usepackage{amsmath, amssymb, amsthm}
\usepackage{mathtools}
% \usepackage{multibib}
% \newcites{APX}{Appendix}
\newtheorem{theorem}{Theorem}
\newtheorem{assumption}{Assumption}
\newtheorem{corollary}{Corollary}
\newtheorem{proposition}{Proposition}
\newtheorem{lemma}{Lemma}
\newtheorem{definition}{Definition}
\newtheorem{fact}{Fact}

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\indep}{\ \large\perp\!\!\!\!\!\!\perp\ }
\DeclareMathOperator*{\defn}{ \ \overset{\mathrm{def}}{=} \ }

\newenvironment{proofsk}{%
  \renewcommand{\proofname}{Proof Sketch}\proof}{\endproof}


\newcommand\blfootnote[1]{%
  \begin{NoHyper}%
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \end{NoHyper}%
}
% things for xr
\usepackage{xr}
\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}
\myexternaldocument{sicilia_277-supp}
% end things for xr

\title{PAC-Bayesian Domain Adaptation Bounds for Multiclass Learners}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
% 
% Important: in case of equal contributions, we strongly recommend to NOT show it in this part of the paper, but rather describe it in the appropriate section at the end of the paper "Author Contribution", where you have more space to describe how each author contributed.
%
% Add authors
% Remember to use the order convention "First/Given name" "Last/Family name", e.g. John Smith, Hanako Yamada, Marco Rossi, Wei Zhang
\author[1]{\href{mailto:<anthonysicilia@pitt.edu>?Subject=Your UAI 2022 paper}{Anthony Sicilia}{}}
\author[2]{\href{mailto:<kaa139@pitt.edu>?Subject=Your UAI 2022 paper}{Katherine Atwell}{}}
\author[1,2]{\href{mailto:<malihe@pitt.edu>?Subject=Your UAI 2022 paper}{Malihe Alikhani}{}}
\author[3*]{\href{mailto:<seongjae@yonsei.ac.kr>?Subject=Your UAI 2022 paper}{Seong Jae Hwang}{}}
% Add affiliations after the authors
\affil[1]{%
    Intelligent Systems Program\\
    University of Pittsburgh\\
    Pittsburgh, Pennsylvania, USA
}
\affil[2]{%
    Department of Computer Science\\
    University of Pittsburgh\\
    Pittsburgh, Pennsylvania, USA
}
\affil[3]{%
    Department of Artificial Intelligence\\ Yonsei University\\
    Seoul, South Korea
  }
  
  \begin{document}
\maketitle

\begin{abstract}
Multiclass neural networks are a common tool in modern unsupervised domain adaptation, yet an appropriate theoretical description for their non-uniform sample complexity is lacking in the adaptation literature. To fill this gap, we propose the first PAC-Bayesian adaptation bounds for multiclass learners. We facilitate practical use of our bounds by also proposing the first approximation techniques for the multiclass distribution divergences we consider. For divergences dependent on a Gibbs predictor, we propose additional PAC-Bayesian adaptation bounds which remove the need for inefficient Monte-Carlo estimation. Empirically, we test the efficacy of our proposed approximation techniques as well as some novel design-concepts which we include in our bounds. Finally, we apply our bounds to analyze a common adaptation algorithm that uses neural networks.
\end{abstract}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
\label{sec:intro}
\blfootnote{$^*$Corresponding Author}Multiclass neural networks are frequently used in the implementation of many unsupervised domain adaptation algorithms. For example, neural networks are often employed in invariant feature learning algorithms \citep{ganin2015unsupervised, long2017deep, long2018conditional, zhang2019bridging}, importance weighting algorithms \citep{lipton2018detecting}, or combinations of both techniques \citep{tachet2020domain}. While most of these adaptation algorithms are motivated by theoretical bounds, recent literature has paid close attention to the assumptions and failure-cases of some techniques \citep{zhao2019learning, wu2019domain, johansson2019support}. Namely, some learning algorithms \textit{ignore} key terms in the adaptation bounds on which they are based, and as a result, may output solutions (i.e., learned models) that violate assumptions and are \textit{guaranteed} to fail at the adaptation task \citep{zhao2019learning, wu2019domain}. Still, the story is not complete. In particular, there has not been much discussion of the non-uniform sample complexity of these modern adaptation algorithms. Sample complexity, in fact, contributes an additional ``ignored'' term in the theoretical bounds on which modern adaptation algorithms are based. 

In this paper, we propose the first multiclass adaptation bounds which allow us to study this non-uniform sample complexity. Studying sample complexity is important to our understanding of adaptation algorithms because it describes how ``data-hungry'' an algorithm is. When this sample complexity is non-uniform across an algorithm's solution space, it allows us to study properties of a solution as a function of its ``data-hunger.'' This is especially important for adaptation algorithms, which as mentioned, can inadvertently output poor solutions. Identifying a dynamic relationship between the properties of solutions and their non-uniform sample-complexity can provide insight on how to prevent these failure-cases in practice (e.g., by collecting sufficient data for an algorithm).
Non-uniform sample complexity (rather than uniform complexity) can also help us to better quantify implicit regularization inherent to our algorithm \citep{dziugaite2017computing, nagarajan2019uniform}. Accurately describing implicit regularization is especially important when using neural networks \citep{neyshabur2014search, neyshabur2017implicit, keskar2017large, zhang2017understanding}, since similar learning algorithms can lead to solutions with distinct generalization performance and implicit regularization is believed to be the cause of this phenomenon.

Despite the importance of studying non-uniform sample complexity in modern adaptation contexts, we are not aware of any multiclass adaptation bounds with this capability. To fill this gap, we contribute the first PAC-Bayesian adaptation bound for multiclass learners (Thm.~\ref{thm:pb-bound}). While PAC-Bayesian bounds actually control error for \textit{stochastic} models, we choose this framework for its demonstrated empirical accuracy in describing neural network sample complexity \citep{dziugaite2017computing, zhou2018non, jiang2019fantastic, dziugaite2020search, dziugaite2021role, perez2021tighter}. Compared to existing bounds, we design our proposals to be more sensitive to the solution output by our learning algorithm as well as the data sample available for estimating key quantities. The former is vital in studying non-uniform complexity of adaptation algorithms (as discussed), while the latter is important for facilitating empirical study. To make our bound useful in practice, we also propose the first approximation techniques for the divergence terms in our bound. In one case, this involves proposing a novel surrogate for optimizing 01-loss (Thm.~\ref{thm:surrogate_loss}). In another, we show a standard technique for computing divergence fails to generalize to the multiclass setting without additional constraints (Thm.~\ref{thm:mdp_div_red2erm}). Working in the PAC-Bayesian framework, some divergences we study are also expressed as expectations with no known analytic solution. For these, we propose additional bounds (Thm.~\ref{thm:pb-bound-efficient}, Cor.~\ref{cor:pb-bound-efficient}) which allow us to avoid inefficient Monte-Carlo estimation by introducing a new flatness assumption related to the well-known flat-minima hypothesis \citep{hochreiter1997flat}. To conclude, we conduct an extensive empirical study of more than 12K models learned across 5 diverse adaptation datasets. 
% We validate the new design-concepts in our adaptation bounds, our proposed approximation techniques, and our novel flatness assumption. Our last empirical analysis uses the proposed bounds to empirically study sample complexity of a common domain-invariant learning algorithm. Our findings reveal unexpected relationships between sample complexity and important properties of the algorithm we study. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Background}
\label{sec:background}
\subsection{Notation and Assumptions}
Consider the space $\mathcal{X} \times \mathcal{Y}$ for some finite $\mathcal{Y}$ with $|\mathcal{Y}| > 2$ unless otherwise noted. Colloquially, we call $\mathcal{X}$ the feature space and $\mathcal{Y}$ the label space. For a distribution $\mathbb{D}$ over $\mathcal{X} \times \mathcal{Y}$, we are interested in the risk functional $\mathbf{R}_\mathbb{D}: \mathcal{Y}^\mathcal{X} \to [0,1]$ 
\begin{equation}\small
\label{eqn:risk}
\mathbf{R}_\mathbb{D}(h) \defn \mathbf{Pr}(h(X) \neq Y); \qquad (X,Y) \sim \mathbb{D}
\end{equation}
applied to some hypothesis (i.e., model) $h \in \mathcal{H} \subseteq \mathcal{Y}^\mathcal{X}$. The risk functional $\mathbf{R}_\mathbb{D}$ precisely gives the error rate of the hypothesis $h$ when tasked with modelling the relationship between $\mathcal{X}$ and $\mathcal{Y}$ described by $\mathbb{D}$. In PAC-Bayes, we also consider the risk of stochastic (Gibbs) predictors. For a distribution $\mathbb{Q}$ over $\mathcal{H} \subseteq \mathcal{Y}^\mathcal{X}$, the Gibbs risk is the expectation
\begin{equation}\small
\label{eqn:gibbs_risk}
    \mathbf{R}_\mathbb{D}(\mathbb{Q}) \defn \mathbf{E}[\mathbf{R}_\mathbb{D}(H)]; \qquad H \sim \mathbb{Q}.
\end{equation}
For neural networks, a common stochastic formulation is to sample weights from the distribution $\mathbb{Q}$ before inference -- e.g., the Bayesian neural networks of \citet{blundell2015weight}.
%defined for $\mathbb{D}$ over $\mathcal{X} \times \mathcal{Y}$. 

Throughout this paper, we assume a source distribution $\mathbb{S}$ over $\mathcal{X} \times \mathcal{Y}$ and a target distribution $\mathbb{T}$ over $\mathcal{X} \times \mathcal{Y}$. We assume observation of an i.i.d. random sample $S \sim \mathbb{S}^n$ and an i.i.d. random sample $T_X \sim \mathbb{T}_X^m$ where the subscript $X$ denotes the $\mathcal{X}$-marginal of a distribution. In this context, an algorithm for the unsupervised adaptation problem we study is a function $(S, T_X) \mapsto h \in \mathcal{H}$. We are interested in bounds on $\mathbf{R}_\mathbb{T}(h)$ for such algorithms. 

Interchangeably, we think of the sample $S$ as both a random variable with distribution $\mathbb{S}^n$ and a (random) distribution itself, since any observation of a sample $S$ uniquely defines its own empirical distribution over $\mathcal{X} \times \mathcal{Y}$ by the pmf
\begin{equation}\small
\label{eqn:sample_pmf}
    (x,y) \mapsto  n^{-1}\sum\nolimits_{i=1}^n \mathbf{1}_{\{(x_i, y_i)\}}\{(x,y)\}
\end{equation}
where $\mathbf{1}$ is the indicator function. So, $\mathbf{R}_S$ is well-defined by this identification. $\mathbf{R}_S(\mathbb{Q}) = \mathbf{E}_{H \sim \mathbb{Q}}[\mathbf{R}_S(H)]$ is also defined -- the observation of $S$ is used, not integrated out.
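As a concrete illustration of this identification (a worked instance that follows directly from Eqs.~\eqref{eqn:risk} and \eqref{eqn:sample_pmf}, not an additional result), taking $\mathbb{D} = S$ for an observed sample $\{(x_i, y_i)\}_{i=1}^n$ recovers the familiar empirical error rate
\begin{equation*}\small
    \mathbf{R}_S(h) = n^{-1}\sum\nolimits_{i=1}^n \big(1 - \mathbf{1}_{\{h(x_i)\}}\{y_i\}\big),
\end{equation*}
i.e., the fraction of sample points that $h$ misclassifies.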

Finally, we also use distribution divergences based on the $\mathcal{H}$-divergence proposed by \citet{ben2007analysis}. This divergence is a specialization of the $\mathcal{A}$-distance \citep{kifer2004detecting} which relaxes the total variation distance by considering only a subset $\mathcal{A}$ of measurable sets when taking the supremum. In particular, the $\mathcal{H}$-divergence considers sets identifiable by a class $\mathcal{H} \subseteq \{0,1\}^\mathcal{X}$
\begin{equation}\small
\label{eqn:hyp_class_divergence}
    \mathbf{d}_\mathcal{H}(\mathbb{D}_1, \mathbb{D}_2) \defn \sup\nolimits_{h \in \mathcal{H}} \big \lvert \mathbf{E}[h(X_1)] - \mathbf{E}[h(X_2)] \big \rvert
\end{equation}
where $X_i \sim \mathbb{D}_i$. 
% Note, the $\mathcal{H}$-divergence is symmetric and obeys the triangle-inequality for any choice of $\mathcal{H}$. 
While it is typically defined with a factor of 2, we omit this for convenience. Given a class $\mathcal{H} \subseteq \mathcal{Y}^\mathcal{X}$, we first study the $\mathcal{H}\Delta\mathcal{H}$-divergence based on the class 
\begin{equation}\small
\label{eqn:mid_class_defn}
\begin{split}
\mathcal{H}\Delta\mathcal{H} & \defn \big \{x \mapsto 1-\mathbf{1}_{\{h(x)\}}\{h'(x)\} \mid (h,h') \in \mathcal{H}^2 \big \}.  
\end{split}
\end{equation}
This is a multiclass generalization, which simplifies to the original binary definition of Ben-David et al. when $|\mathcal{Y}| = 2$.
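To see the reduction concretely (a short check from Eq.~\eqref{eqn:mid_class_defn}, assuming for illustration the encoding $\mathcal{Y} = \{0,1\}$), note that for any $h, h' \in \mathcal{H}$ and $x \in \mathcal{X}$
\begin{equation*}\small
    1-\mathbf{1}_{\{h(x)\}}\{h'(x)\} = \lvert h(x) - h'(x) \rvert = h(x) \oplus h'(x),
\end{equation*}
where $\oplus$ denotes exclusive-or. Each element of $\mathcal{H}\Delta\mathcal{H}$ is thus the disagreement indicator of a pair of binary hypotheses.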
\subsection{Some Existing Adaptation Bounds}
\label{sec:background_bounds}
In this section, we discuss two adaptation bounds. More detailed knowledge of these bounds will be useful later for comparison with our proposed bounds. First, we discuss the seminal uniform convergence bound proposed by \citet{ben2007analysis, ben2010theory}. Second, we discuss a PAC-Bayesian bound proposed by \citet{germain2020pac}.
\subsubsection{Adaptation Based on Uniform Convergence}
\begin{theorem}\label{thm:ben2010theory}
\citep{ben2010theory}
Let $\mathcal{Y}$ be binary. For all $\delta > 0$, w.p. at least $1-\delta$, for all $h \in \mathcal{H}$
\begin{equation}\small
\begin{split}
    & \mathbf{R}_\mathbb{T}(h) \leq \lambda + \mathbf{R}_S(h) + \mathbf{d}_{\mathcal{H}\Delta\mathcal{H}}(S_X, T_X) \\
    & + 4\sqrt{\tfrac{4 \nu \ln (2m) - \ln (\delta / 4)}{m}} + 2\sqrt{\tfrac{8\nu \ln(em / \nu) - 2 \ln(\delta / 4)}{m}}
\end{split}
\end{equation}
where $\lambda = \min_{\eta \in \mathcal{H}} \mathbf{R}_\mathbb{S}(\eta) + \mathbf{R}_\mathbb{T}(\eta)$ and $\nu = \mathrm{VCDim}(\mathcal{H})$.
\end{theorem}
% \begin{proof}
% This is Thm. 2 of \citet{ben2010theory} with added bound on $\mathbf{R}_S(h) - \mathbf{R}_\mathbb{S}(h)$ by standard uniform convergence arguments; e.g., Ch. 28.1 of \citet{shalev2014understanding}. Boole's Inequality is used to combine bounds. 
% \end{proof}
The seminal result above is the standard adaptation bound on which many newer results are based. Still, this uniform convergence bound is not well-suited for every application. We discuss some limitations below.

%\textbf{Uniform Sample Complexity}\hspace{.5em} 
\paragraph{Uniform Sample Complexity}
Simply put, uniform convergence is too conservative: it assigns the same sample complexity to each outcome of our learning algorithm, regardless of the solution quality. As discussed in Section~\ref{sec:intro}, this prevents us from studying important properties of a model as a function of its sample complexity.
% , and further, prevents us from capturing any implicit regularization inherent to the training process. % Both of these are important when training neural networks via SGD, because solution quality can vary substantially for similar learning algorithms \citep{zhang2017understanding}.

% \textbf{Model-Independent Divergence}\hspace{.5em}
\paragraph{Model-Independent Divergence}
In general, divergence is meant to characterize the similarity in feature distributions under the source $\mathbb{S}$ and the target $\mathbb{T}$. Similar to above, independence of the divergence $\mathbf{d}_{\mathcal{H}\Delta\mathcal{H}}$ and the model $h$ is overly conservative and makes this term insensitive to changes in the outcome of our learning algorithm. For example, when $\mathcal{H}$ is fixed, this divergence cannot distinguish between a random initialization and a carefully trained solution.

%\textbf{Sample-Independent Adaptability}\hspace{.5em}
\paragraph{Sample-Independent Adaptability}
The term $\lambda$ is often called the \textit{adaptability}. It is a measure of similarity in the labeling functions of $\mathbb{S}$ and $\mathbb{T}$, characterizing the extent to which one hypothesis in $\mathcal{H}$ can do well on \textit{both} of these distributions. When no such hypothesis exists, it is unclear how a learner could successfully adapt by minimizing risk on the source distribution \citep{ben2010theory}. Importantly, this term has been central to the theoretical discussion of failure-cases in widely used DA algorithms \citep{johansson2019support, zhao2019learning}. Meanwhile, estimation of $\lambda$ remains an under-studied research area \citep{redko2020ASO}. One problem we observe is the independence of $\lambda$ from the samples $S$ and $T$. In particular, one cannot directly compute the population statistic $\lambda$ in typical circumstances. Instead, one might estimate it using $\min_\eta \mathbf{R}_S(\eta) + \mathbf{R}_T(\eta)$, but this requires verifying generalization of a learned model $h^* \in \argmin_\eta \mathbf{R}_S(\eta) + \mathbf{R}_T(\eta)$ using a holdout set or some other descriptor of generalization performance (e.g., a learning bound). This is undesirable when, as in this paper, we wish to study adaptability in an empirical context. As we show in later experiments (Section~\ref{sec:experiments}), the extra generalization requirement typically inflates our estimate of $\lambda$, and subsequently, mars the results we would like to interpret.

%\textbf{Binary Label Space}\hspace{.5em}
\paragraph{Binary Label Space}
It is also important to note that this bound was designed for binary learners. Computation of the $\mathcal{H}\Delta\mathcal{H}$-divergence is the most concerning issue, since existing algorithms for computation rely on symmetry of $\mathcal{H}$ and ERM over the class $\mathcal{H}\Delta\mathcal{H}$. In Section~\ref{sec:method_approx}, we discuss these issues in detail and present some solutions.
\subsubsection{A PAC-Bayesian Bound for Binary Learners}
% \begin{theorem}\label{thm:germain2020pac}
% \citep{germain2020pac} Let $\mathcal{Y}$ be binary, $\mathbb{P}$ any distribution over $\mathcal{H}$, and $\omega > 0$. For all $\delta > 0$, w.p. at least $1-\delta$, for all distributions $\mathbb{Q}$ over $\mathcal{H}$,
% \begin{equation}\small
% \begin{split}
%     & \mathbf{R}_\mathbb{T}(\mathbb{Q}) \leq \omega' ( \mathbf{R}_S(\mathbb{Q}) + |\mathrm{d}_S(\mathbb{Q}) - \mathrm{d}_T(\mathbb{Q})| ) \\
%     & + |\mathrm{e}_\mathbb{S}(\mathbb{Q}) - \mathrm{e}_\mathbb{T}(\mathbb{Q})| + 2\omega \tfrac{\mathrm{KL}(\mathbb{Q} \mid \mid \mathbb{P})  - \ln ( \delta / 3) }{m\omega'} + 2(\omega' - 1)
% \end{split}
% \end{equation}
% where $\omega' = 2\omega / (1 - \exp(-2\omega))$ and for $H_i \sim (\mathbb{Q})_i$, $(X,Y) \sim \mathbb{S}$ we have
% \begin{equation}\small
% \begin{split}
%     & \mathrm{e}_\mathbb{S}(\mathbb{Q}) \defn \mathbf{E}[(1-\mathbf{1}_{\{H_1(X)\}}\{Y\}) (1-\mathbf{1}_{\{H_2(X)\}}\{Y\})], \\
%     & \mathrm{d}_{\mathbb{S}}(\mathbb{Q}) \defn \mathbf{E} [1 - \mathbf{1}_{\{H_1(X)\}}\{H_2(X)\}].
% \end{split}
% \end{equation}
% \end{theorem}
% % \begin{proof}
% % This is a simplification of Thm.~7 of \citet{germain2020pac}. We set $\omega = a$ in the original notation and use the fact that $\omega / (1 - \exp(-\omega))$ is increasing for $\omega > 0$. \end{proof}
% %In this section, we consider Thm.~\ref{thm:germain2020pac} in Appendix~\ref{sec:germain2020pac}, which is a result of \citet{germain2020pac}.
We give Thm.~\ref{thm:germain2020pac} in Appendix~\ref{sec:germain2020pac}, which is one of the first PAC-Bayesian adaptation bounds. While \citet{germain2013pac, germain2016new, germain2020pac} propose other bounds, we focus on Thm.~\ref{thm:germain2020pac} because it is easiest to compare to the proposal of \citet{ben2010theory}. 
% In particular, the absolute difference in disagreement $\mathrm{d}$ can be viewed as similar to the $\mathcal{H}\Delta\mathcal{H}$-divergence and the absolute difference in joint-error $\mathrm{e}$ can be viewed as similar to the adaptability $\lambda$ \citep{germain2020pac}. 
While tailored to Thm.~\ref{thm:germain2020pac}, the weaknesses discussed below are generally applicable to other bounds of Germain et al. 

%\textbf{Benefits Compared to Thm.~\ref{thm:ben2010theory}}\hspace{.5em} 
\paragraph{Benefits Compared to Thm.~\ref{thm:ben2010theory}}
One benefit of Thm.~\ref{thm:germain2020pac} is that the divergence employed in this bound is \textbf{model-dependent} (rather than independent); namely, it depends on the Gibbs predictor $\mathbb{Q}$, whose target error we bound. As mentioned, model-independence is an overly conservative quality and \citet{germain2020pac} show this formally by proving their divergence actually lower-bounds $\mathbf{d}_{\mathcal{H}\Delta\mathcal{H}}$ for all $\mathbb{Q}$ and $\mathcal{H}$. Another primary benefit is that Thm.~\ref{thm:germain2020pac} employs a \textbf{non-uniform} sample complexity. Specifically, complexity is measured through a KL-divergence $\mathrm{KL}(\mathbb{Q} \mid \mid \mathbb{P})$, which explicitly depends on the outcome of the learning algorithm $\mathbb{Q}$. Simply put, a model is complex if it deviates substantially from our prior knowledge, which is captured in the prior $\mathbb{P}$.
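For intuition, consider a hypothetical parameterization that is not fixed by any of the results discussed here: hypotheses are indexed by weights in $\mathbb{R}^d$, with $\mathbb{Q} = \mathcal{N}(w, \sigma^2 I)$ and $\mathbb{P} = \mathcal{N}(w_0, \sigma^2 I)$ sharing a common variance $\sigma^2 > 0$. Under this assumption, the standard closed form for the KL-divergence between Gaussians gives
\begin{equation*}\small
    \mathrm{KL}(\mathbb{Q} \mid \mid \mathbb{P}) = \tfrac{\lVert w - w_0 \rVert_2^2}{2\sigma^2},
\end{equation*}
so the complexity assigned to a solution grows with the squared distance between the learned weights $w$ and the prior mean $w_0$.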
% Recalling our previous discussion, we expect this non-uniform notion of complexity to better capture the observed sample-efficiency of neural-networks. PAC-Bayesian bounds are actually (exclusively) known for this quality. As we are aware, when computed in practice, non-uniform bounds based on PAC-Bayes are the only learning bounds which have been shown to accurately describe a neural-network's generalization \citep{dziugaite2021role, perez2021tighter, sicilia2021}. In fact, it was only recently that non-vacuous learning bounds have been computed on standard datasets \citep{dziugaite2017computing, zhou2018non}; these also depend on the PAC-Bayesian framework.

% \textbf{Weaknesses Shared with Thm.~\ref{thm:ben2010theory}}\hspace{.5em}
\paragraph{Weaknesses Shared with Thm.~\ref{thm:ben2010theory}}
Despite its benefits over Thm.~\ref{thm:ben2010theory}, Thm.~\ref{thm:germain2020pac} also shares some weaknesses. First, the proposed adaptability term is also \textbf{sample-independent}. Second, the bound is still designed for a \textbf{binary label space} $\mathcal{Y}$. Unlike the case of Thm.~\ref{thm:ben2010theory}, it is not computation of the bound that causes concern, but the \textit{validity} of the bound in multiclass settings. In particular, the problem arises because the proof of Thm.~\ref{thm:germain2020pac} relies on a decomposition of the risk which assumes $|\mathcal{Y}| = 2$. This decomposition does not hold, in general, when $\mathcal{Y}$ is larger. In fact, \citet{germain2020pac} themselves observe that Thm.~\ref{thm:germain2020pac} is not easily extended to multiclass settings, leaving the investigation of such PAC-Bayes bounds as an open problem. For some additional empirical study of Thm.~\ref{thm:germain2020pac}, see Appendix~\ref{sec:comp2germain}.
% Empirically, in Appendix~\textcolor{red}{X} we 
% Finally, we observe that Thm.~\ref{thm:germain2020pac} requires optimization of $\omega$ to be useful in practical settings. Most other PAC-Bayes bounds employed in practical settings either handle the optimization of free constants directly \citep{langford2001not, dziugaite2017computing} or, more recently, rely on variations which remove the dependence on free constants \citep{dziugaite2021role, perez2021tighter}.
\subsection{Other Related Works}
\label{sec:related}
Besides those works discussed above, there are some additional works that propose alternate theories of adaptation. Some theories of adaptation use distinct integral probability metrics in place of the $\mathcal{H}$-divergence \citep{redko2017theoretical, shen2018wasserstein, johansson2019support}, while others have sought to generalize and modify the $\mathcal{H}$-divergence \citep{mansour2009domain, kuroki2019unsupervised, zhang2019bridging}. Meanwhile, others focus on assumptions distinct from small adaptability. These include covariate shift \citep{sugiyama2007covariate, you2019towards}, label shift \citep{lipton2018detecting} and \textit{generalized} label shift \citep{tachet2020domain}. The DA problem can also be modeled through causal graphs \citep{zhang2015multi, magliacane2018domain} and some extensions to DA consider a meta-distribution over targets \citep{blanchard2021domain, albuquerque2019adversarial, deng2020representation}. Notably, most assumptions are untestable in practice and not many works consider such testing, even in controlled research settings where it might be possible. As far as we are aware, we are the first to use a sample-dependent adaptability, which improves estimation in empirical study.

In adaptation, PAC-Bayesian results are almost exclusively due to \citet{germain2013pac, germain2016new, germain2020pac}, although some work does exist in transfer learning \citep{li2007bayesian, mcnamara2017risk}. Most directly, our work employs the PAC-Bayes bound of \citet{maurer2004note} in proofs as well as some techniques of \citet{langford2001not, dziugaite2017computing}, and \citet{perez2021tighter} in empirical study. Most notably, ours is the only PAC-Bayesian work to propose multiclass adaptation bounds. More in-depth coverage of relevant literature -- for adaptation and PAC-Bayes -- is available in Appendix~\ref{sec:ext_related}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Proposed Bounds}
\label{sec:method}
In this section, we give the proposed adaptation bounds for multiclass learners. We also provide novel algorithms for computing two multiclass divergence terms and compare these to existing approaches. Lastly, we give a second adaptation bound which removes the need for inefficient Monte-Carlo estimation of divergences dependent on a Gibbs predictor. Proofs of all results are given in Appendix~\ref{sec:proofs}.
\subsection{A PAC-Bayesian Adaptation Bound for Multiclass Learners}
\label{sec:method_bound}
We begin by introducing a model-dependent class of hypotheses similar to $\mathcal{H}\Delta\mathcal{H}$. Precisely, for $h \in \mathcal{H}$,
\begin{equation}\small
    h\Delta\mathcal{H} \defn \big \{x \mapsto 1-\mathbf{1}_{\{h(x)\}}\{h'(x)\} \mid h' \in \mathcal{H} \big \}.  
\end{equation}
With it, we propose to use the $h\Delta\mathcal{H}$-divergence $\mathbf{d}_{h\Delta\mathcal{H}}$ in our adaptation bounds. This divergence is a \textbf{model-dependent} extension of the $\mathcal{H}\Delta\mathcal{H}$-divergence, applicable in multiclass settings. It is easy to observe from the definitions that this new divergence lower-bounds the $\mathcal{H}\Delta\mathcal{H}$-divergence for all $\mathcal{H}$ and all $h \in \mathcal{H}$. Both \citet{zhang2019bridging} and \citet{kuroki2019unsupervised} study similar divergences for bounding 01-loss in the binary case, but we are the first to use this divergence with non-uniform sample complexity, and also the first to use this divergence in an adaptation bound for multiclass learners.\footnote{We discuss a multiclass proposal of \citet{zhang2019bridging} later, but it is based on margin loss and used for uniform convergence.} Our full proposal requires novel techniques and theoretical study to compute this divergence (see Section~\ref{sec:method_approx}).
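
The lower bound noted above can be checked in a few lines (a sketch using only Eq.~\eqref{eqn:hyp_class_divergence} and the two class definitions). Since $h \in \mathcal{H}$ implies $h\Delta\mathcal{H} \subseteq \mathcal{H}\Delta\mathcal{H}$, a supremum over the smaller class cannot exceed the supremum over the larger one:
\begin{equation*}\small
\begin{split}
    \mathbf{d}_{h\Delta\mathcal{H}}(\mathbb{D}_1, \mathbb{D}_2) & = \sup\nolimits_{g \in h\Delta\mathcal{H}} \big \lvert \mathbf{E}[g(X_1)] - \mathbf{E}[g(X_2)] \big \rvert \\
    & \leq \sup\nolimits_{g \in \mathcal{H}\Delta\mathcal{H}} \big \lvert \mathbf{E}[g(X_1)] - \mathbf{E}[g(X_2)] \big \rvert \\
    & = \mathbf{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathbb{D}_1, \mathbb{D}_2),
\end{split}
\end{equation*}
where $X_i \sim \mathbb{D}_i$.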


Next, we give the proposed adaptation bound. As alluded to above, the bound has a number of notable features and we expand on these in comparison to Thms.~\ref{thm:ben2010theory} and \ref{thm:germain2020pac} after its statement.
\begin{theorem}
\label{thm:pb-bound}
For any $\mathbb{P}$ over $\mathcal{H}$, all $\delta > 0$, w.p. at least $1-\delta$, for all $\mathbb{Q}$ over $\mathcal{H}$
\begin{equation}\small
\begin{split}
    \mathbf{R}_\mathbb{T}(\mathbb{Q}) & \leq \tilde{\lambda}_{S, T} + \mathbf{R}_S(\mathbb{Q}) + \mathbf{E}_{H \sim \mathbb{Q}}[\mathbf{d}_{\mathcal{C}_{H}}(S_X, T_X)] \\
    & + \sqrt{\tfrac{\mathrm{KL}(\mathbb{Q} \mid \mid \mathbb{P}) + \ln \sqrt{4m} - \ln ( \delta ) }{2m}}
\end{split}
\end{equation}
where $\tilde{\lambda}_{S,T} = \min_{\eta \in \mathcal{H}} \mathbf{R}_S(\eta) + \mathbf{R}_T(\eta)$ and we may choose either $\mathcal{C}_h = \mathcal{H} \Delta \mathcal{H}$ for all $h$ as before or $\mathcal{C}_h = h \Delta \mathcal{H}$.
\end{theorem}
% \begin{proofsk}
% The result relies on three main Lemmas. Lemma~\ref{lem:maurer2004note} is the PAC-Bayes bound of \citet{maurer2004note} which lets us bound $\mathbf{R}_T(\mathbb{Q}) - \mathbf{R}_\mathbb{T}(\mathbb{Q})$. Lemma~\ref{lem:ben2010theory_multi} and Lemma~\ref{lem:ben2010theory_stoch} are then used in combination to bound $\mathbf{R}_S(\mathbb{Q}) - \mathbf{R}_T(\mathbb{Q})$. Lemma~\ref{lem:ben2010theory_multi} is a multiclass extension of an important inequality in the proof of Thm.~\ref{thm:ben2010theory}, which also holds in case $\mathcal{C}_h = h \Delta \mathcal{H}$. It simply requires a verification of the triangle-inequality for 01-loss \citep{ben2007analysis, crammer2007learning} in the multiclass case. Lemma~\ref{lem:ben2010theory_stoch} characterizes the inequality of Lemma~\ref{lem:ben2010theory_multi} in expectation over $\mathbb{Q}$. It mostly relies on Jensen's Inequality and a treatment of the random samples $S$ and $T$ as distributions themselves through the identification in Eq.~\eqref{eqn:sample_pmf}.
% \end{proofsk}
% \textbf{Comparison to Thms.~\ref{thm:ben2010theory} and \ref{thm:germain2020pac}}\hspace{.5em} 
\paragraph{Comparison to Thms.~\ref{thm:ben2010theory} and \ref{thm:germain2020pac}}
By design, the bound proposed above resolves the weaknesses mentioned in Section~\ref{sec:background_bounds}. First and foremost, we remove the requirement that $\mathcal{Y}$ is binary. Second, we use a \textbf{non-uniform} notion of sample complexity; i.e., $\mathrm{KL}(\mathbb{Q}\mid\mid\mathbb{P})$. Third, Thm.~\ref{thm:pb-bound} allows for \textit{either} a \textbf{model-dependent} or \textbf{model-independent} notion of data-distribution divergence. While model-independent divergences do have some weaknesses, we retain them in our bound since, as discussed later, they can be more efficient. 
% Therefore, choosing $\mathcal{C}_h = \mathcal{H} \Delta \mathcal{H}$ will be an overly conservative choice when comparing individual models. On the other hand, it can be a more efficient choice when one wishes to compare neural architectures for adaptation. 
Lastly, Thm.~\ref{thm:pb-bound} employs a \textbf{sample-dependent} notion of adaptability. Compared to $\lambda$, the new adaptability $\tilde{\lambda}_{S, T}$ is the smallest error achievable on the \textit{samples}. In research contexts wherein we assume access to target labels for the purpose of studying our assumptions, we later show that this quantity is fairly easy to bound empirically. 
% (typically unobserved) quantities to verify assumptions, this removes the need to verify generalization beyond the samples in hand. Subsequently, the information we care about is isolated as demonstrated, empirically, in Section~\ref{sec:experiments}. As a final observation, compared to Thm.~\ref{thm:germain2020pac}, this bound removes any free-constants which need to be optimized. Barring a some additional details, which will be discussed in the next parts, the bound is ready to compute as presented.
\subsection{Approximating Multiclass Divergence}
\label{sec:method_approx}
\subsubsection{The Multiclass \texorpdfstring{$\mathcal{H}\Delta\mathcal{H}$}{H{Delta}H}-divergence}
\label{sec:method_approx_mid_div} First, since the $\mathcal{H}\Delta\mathcal{H}$-divergence is model-independent, the expectation with respect to the Gibbs predictor $\mathbb{Q}$ simplifies significantly. In particular, if $\mathcal{C}_h = \mathcal{H}\Delta\mathcal{H}$, we have
\begin{equation}\small
\mathbf{E}_{H \sim \mathbb{Q}}[\mathbf{d}_{\mathcal{C}_H}(S_X, T_X)] = \mathbf{d}_{\mathcal{H}\Delta\mathcal{H}}(S_X, T_X).
\end{equation}
Thus, computation of this divergence simplifies to computing the $\mathcal{H}\Delta\mathcal{H}$-divergence for multiclass learners. 

%\textbf{Summary of Approach}\hspace{.5em} 
\paragraph{Summary of Approach}
In general, we take inspiration from the proposal of \citet{ben2010theory} who compute $\mathcal{H}\Delta\mathcal{H}$-divergence when models in $\mathcal{H}$ have binary output. Namely, we frame computation as minimization of error in a specific classification problem. To adapt this strategy to the multiclass setting, we make two main changes. First, we remove the assumption that $\mathcal{H}$ is symmetric. This is important for multiclass settings since we have no reason to believe $\mathcal{H}\Delta\mathcal{H}$ is typically symmetric.
% \footnote{We do not formally prove this in any case. Informally, it appears to be true by a counting argument. There are more ways for multiclass predictors to disagree than agree.}
We replace the symmetry in $\mathcal{H}\Delta\mathcal{H}$ with a symmetry in our classification problems. Second, for score-based classifiers such as neural networks, we give an optimization procedure for approximating ERM over this class based on a surrogate loss function. As far as we are aware, this is the first algorithm for approximating ERM over $\mathcal{H}\Delta\mathcal{H}$ when models in $\mathcal{H}$ have multiclass output.

\paragraph{Reduction to ERM}
\begin{theorem}
\label{thm:mid_div_red2erm}
Let $\mathcal{C} = \mathcal{H}\Delta\mathcal{H}$. Almost surely,
\begin{equation}\small
    \mathbf{d}_{\mathcal{C}}(S_X, T_X) = \max \begin{rcases}
    \begin{dcases}
       1 - \min_ {\varphi \in \mathcal{C}}
       \mathbf{R}_P(\varphi) + \mathbf{R}_Q(\varphi), \\
      1 - \min_ {\varphi \in \mathcal{C}} \mathbf{R}_U(\varphi) + \mathbf{R}_V(\varphi)
    \end{dcases}
  \end{rcases}
\end{equation}
where
\begin{equation}\small
\begin{split}
    P & = \big((X_i, 1) \mid X_i \in S_X \big), \ Q = \big((\tilde{X}_i, 0) \mid \tilde{X}_i \in T_X \big), \\
    U & = \big((X_i, 0) \mid X_i \in S_X \big), \ V = \big ((\tilde{X}_i, 1 ) \mid \tilde{X}_i \in T_X \big).
\end{split}
\end{equation}
\end{theorem}
Notice, pooled samples $P +_\mathrm{c} Q$ and $U +_\mathrm{c} V$ define binary classification problems ($+_\mathrm{c}$ is concatenation). Namely, they represent an identification problem wherein the learner must distinguish between the samples $S_X$ and $T_X$. To compute divergence as above, we need only select $\varphi$ to minimize the sum of class-conditional error rates for these problems. Even in simple cases, risk-minimization can be computationally hard \citep{shalev2014understanding}. Thus, we instead select $\varphi$ by optimizing a surrogate loss.
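
As a sanity check on Thm.~\ref{thm:mid_div_red2erm}, consider two illustrative extremes (hypothetical cases, not results from our experiments). Under the definition in Eq.~\eqref{eqn:mid_class_defn}, $\mathcal{C}$ contains the constant function $\varphi_0 \equiv 0$ (take $h \Delta h$ for any $h \in \mathcal{H}$). This $\varphi_0$ errs on every point of $P$ and on no point of $Q$, so
\begin{equation*}\small
    \min_{\varphi \in \mathcal{C}} \mathbf{R}_P(\varphi) + \mathbf{R}_Q(\varphi) \ \leq \ \mathbf{R}_P(\varphi_0) + \mathbf{R}_Q(\varphi_0) = 1 + 0 = 1,
\end{equation*}
and the first term in the maximum is never negative. At the other extreme, if some $\varphi^\ast \in \mathcal{C}$ outputs $1$ on all of $S_X$ and $0$ on all of $T_X$, then $\mathbf{R}_P(\varphi^\ast) + \mathbf{R}_Q(\varphi^\ast) = 0$ and the first term equals $1$. In this sense, the divergence is larger when the two samples are easier to tell apart.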

% \textbf{Approximate Minimization via Surrogate}\hspace{.5em}
\paragraph{Approximate Minimization via Surrogate}
WLOG, $\mathcal{Y} = \{1, \ldots, C\}$. We consider a score-based class $\mathcal{S}$ written
\begin{equation}\small
    \label{eqn:mcsb_hclass}
    \mathcal{S} \defn \big \{ \Psi_\mathbf{f} \mid \mathbf{f} \in \mathcal{F} \big\}; \quad \Psi_\mathbf{f}(x) \defn  \argmax\nolimits_{\ell \in [C]} \mathbf{f}_\ell(x)
\end{equation}
with $\mathcal{F} \subseteq \{\mathbf{f} \mid \mathbf{f}_\ell : \mathcal{X} \to \mathbb{R}, \ \ell \in [C]\}$ a class of scoring-functions. In case of ties, suppose $\argmax$ returns the least label. Using the na\"ive definition in Eq.~\eqref{eqn:mid_class_defn},
\begin{equation}\small
    \mathcal{S}\Delta\mathcal{S} \defn \big \{x \mapsto 1-\mathbf{1}_{\{\Psi_\mathbf{f}(x)\}}\{\Psi_\mathbf{g}(x)\} \mid (\mathbf{f},\mathbf{g}) \in \mathcal{F}^2 \big \}.
\end{equation}
At first glance, it is unclear how to pick $\varphi \in \mathcal{S}\Delta\mathcal{S}$ to minimize error on a given dataset. So, in place of this obscure definition, the following result gives a surrogate loss which upper-bounds the 01-loss on the original problem. Thus, we indirectly reduce the error by minimizing the surrogate.
\begin{theorem}
\label{thm:surrogate_loss}
% if this happens, h = 0 => L = 1 >= 01-loss
% Let $x \in \mathcal{X}$ and $(\mathbf{f}, \mathbf{g}) \in \mathcal{F}^2$ such that maximum scores in $\mathbf{f}(x)$ and $\mathbf{g}(x)$ are unique. 
Suppose $\tau : \mathbb{R} \to \mathbb{R}_{\geq 0}$ is differentiable and monotone increasing. Let $\mathbf{A} = \big (\tau \circ \mathbf{g}(x) \big) \cdot \big (\tau \circ \mathbf{f}(x)^\mathrm{T} \big )$ with $\tau$ applied element-wise and $\mathbf{f}, \mathbf{g} \in \mathcal{F}$. Set
% \begin{equation}\small
%     1-\mathbf{1}_{\{\Psi_\mathbf{f}(x)\}}\{\Psi_\mathbf{g}(x)\} = \argmax\nolimits_{\ell \in \{0,1\}} \mathbf{h}_\ell(x)
% \end{equation}
% where $\mathbf{h}(x) = (z_x, -z_x)$ and
\begin{equation}\small
\begin{split}
    z(x) & \defn \max\nolimits_{(j,k) \in [C]^2} \mathbf{A}_{jk} - \max\nolimits_{i \in [C]} \mathbf{A}_{ii}, \\
    \mathcal{L}(z, y) & \defn \ln(1 + \exp(-(2y-1) \cdot z)) / \ln(2).
\end{split}
\end{equation}
Then, if $\varphi(x) = 1-\mathbf{1}_{\{\Psi_\mathbf{f}(x)\}}\{\Psi_\mathbf{g}(x)\}$, we have
\begin{equation}\label{eqn:score-based-ub}\small
    \mathbf{R}_\mathbb{D}(\varphi) \leq \mathbf{E}_{(X,Y) \sim \mathbb{D}}  \ \mathcal{L}(z(X), Y)
\end{equation}
for any distribution $\mathbb{D}$ s.t. $\mathbf{f}(X)$ has no repeated scores and $\mathbf{g}(X)$ has no repeated scores, almost surely.\footnote{This stipulation on $\mathbb{D}$ is not too strict. It only assumes ties in the scores of $\mathbf{f}$ or $\mathbf{g}$ are \textit{very} unlikely, so these ties can be ignored.}
\end{theorem}
We point out the log loss $\mathcal{L}(z, y)$ is differentiable with respect to $z$ and $z(x)$ is differentiable with respect to $\mathbf{f}$ and $\mathbf{g}$. In practice, functions in $\mathcal{F}$ -- such as $\mathbf{f}$ and $\mathbf{g}$ -- are typically differentiable with respect to a real-parameter vector, which also defines the function. For example, this is precisely the case for neural networks. In these contexts, since composition preserves differentiability, the output of the surrogate $\mathcal{L}$ is differentiable with respect to the real-parameter vector. So, the RHS of Eq.~\eqref{eqn:score-based-ub} may be approximately minimized using batch SGD. At this point, the proposed algorithm should be familiar to the typical practitioner. It is equivalent to the manner in which we usually optimize a neural network, except for  the new surrogate $(\mathbf{f}, \mathbf{g}, x, y) \mapsto \mathcal{L}(z(x), y)$. 
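
For intuition, the following toy calculation traces the quantities in Thm.~\ref{thm:surrogate_loss}. The scores are hypothetical and $\tau = \exp$ is an illustrative choice that satisfies the stated conditions; it is not necessarily the choice used in our experiments. Let $C = 2$, $\mathbf{f}(x) = (1, 0)$, and $\mathbf{g}(x) = (0, 1)$, so that $\Psi_\mathbf{f}(x) = 1 \neq 2 = \Psi_\mathbf{g}(x)$ and $\varphi(x) = 1$. Then $\mathbf{A}_{jk} = e^{\mathbf{g}_j(x) + \mathbf{f}_k(x)}$ and
\begin{equation*}\small
\begin{split}
    z(x) & = \max\nolimits_{(j,k)} \mathbf{A}_{jk} - \max\nolimits_{i} \mathbf{A}_{ii} = e^{2} - e \approx 4.67, \\
    \mathcal{L}(z(x), 1) & \approx 0.013, \qquad \mathcal{L}(z(x), 0) \approx 6.75.
\end{split}
\end{equation*}
The 01-loss of $\varphi$ at $x$ is $0$ when $y = 1$ and $1$ when $y = 0$, and both values are dominated by the surrogate. If instead $\mathbf{f} = \mathbf{g}$, the maximum of $\mathbf{A}$ lies on its diagonal, so $z(x) = 0$ and $\mathcal{L}(0, y) = 1$ for either label, which again upper-bounds the 01-loss.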
% As a final note, observe that $\mathcal{L}(z, y)$ may be substituted with \textit{any} differentiable function that upperbounds the 01-loss. We choose $\mathcal{L}(z, y)$, specifically, since it is equivalent to the commonly-used binary cross entropy (modulo a constant factor).
\subsubsection{The Multiclass \texorpdfstring{$h\Delta\mathcal{H}$}{h{Delta}H}-divergence}
\label{sec:method_approx_mdp_div}
When $\mathcal{C}_h = h\Delta\mathcal{H}$ the divergence term is model-dependent and the expectation with respect to the Gibbs predictor $\mathbb{Q}$ becomes a challenge. For neural networks, even the Gibbs risk $\mathbf{R}_S(\mathbb{Q})$ does not have a known analytic solution. Instead, it is common to approximate using Monte-Carlo sampling \citep{langford2001not, dziugaite2017computing, dziugaite2021role, perez2021tighter}. By Hoeffding's Inequality, w.p. at least $1-\delta$, we approximate
\begin{equation}\label{eqn:montecarlo}\small
    \underset{H \sim \mathbb{Q}}{\mathbf{E}}[\mathbf{d}_{\mathcal{C}_H}(S_X, T_X)] \leq \frac{1}{k}\sum_{i=1}^k \mathbf{d}_{\mathcal{C}_{H_i}}(S_X, T_X) + \sqrt{\tfrac{\ln (2 / \delta)}{2k}}
\end{equation}
where $(H_i)_{i=1}^k \sim \mathbb{Q}^k$. Using the RHS as an approximation, our computation reduces to computing $\mathbf{d}_{\mathcal{C}_h}$ for any deterministic $h \in \mathcal{H}$. Upon sampling from $\mathbb{Q}$, we can apply the algorithm for computing $\mathbf{d}_{\mathcal{C}_h}$ to each point in the sample. In light of this, the next part focuses on computing $\mathbf{d}_{\mathcal{C}_h}$ for deterministic $h$. We proceed as before, reducing computation to risk-minimization for a specific classification problem.
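
To give a sense of scale for the sampling term in Eq.~\eqref{eqn:montecarlo}, where the logarithm is taken of the ratio $2/\delta$, consider the illustrative values $\delta = 0.05$ and $k \in \{100, 1000\}$ (assumptions chosen only for this example):
\begin{equation*}\small
    \sqrt{\tfrac{\ln 40}{2 \cdot 100}} \approx 0.136, \qquad \sqrt{\tfrac{\ln 40}{2 \cdot 1000}} \approx 0.043.
\end{equation*}
Since the term shrinks at rate $1/\sqrt{k}$, a tenfold increase in $k$ reduces it only by a factor of about $3.2$.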

\paragraph{Reduction to ERM with Constrained Labeling Function}
\begin{theorem}
\label{thm:mdp_div_red2erm}
Almost surely, for all $h \in \mathcal{H}$
\begin{equation}\small
\mathbf{d}_{\mathcal{C}_h}(S_X, T_X) =  \max \begin{rcases}
    \begin{dcases}
       1 - \underset{\bar{h} \in \Upsilon}{\min_ {\varphi \in \mathcal{H},}}
       \mathbf{R}_{P}(\varphi) + \mathbf{R}_{Q}(\varphi), \\
      1 - \underset{\bar{h} \in \Upsilon}{\min_ {\varphi \in \mathcal{H},}} \mathbf{R}_{U}(\varphi) + \mathbf{R}_{V}(\varphi)
    \end{dcases}
  \end{rcases}
\end{equation}
where $\mathcal{C}_h = h\Delta\mathcal{H}$ and
\begin{equation}\small
\begin{split}
    P({\bar{h}}) & = \big ((X_i, \bar{h}(X_i)) \mid X_i \in S_X \big ), \\
    Q & = \big ((\tilde{X}_i, h(\tilde{X}_i)) \mid \tilde{X}_i \in T_X \big ), \\
    U & = \big ((X_i, h(X_i)) \mid X_i \in S_X \big),\\
    V({\bar{h}}) & = \big((\tilde{X}_i, \bar{h}(\tilde{X}_i)) \mid \tilde{X}_i \in T_X \big)
\end{split}
\end{equation}
and $\Upsilon = \{\bar{h} \in \mathcal{Y}^\mathcal{X} \mid \bar{h}(x) \neq h(x), \ \forall x \in \mathcal{X}\}$.
\end{theorem}
As before, the result describes two classification problems. This time, the learner's goal is to agree with $h$ on one sample, while disagreeing with $h$ (in the way specified by $\bar{h}$) on the other sample. We minimize the class-conditional error rates by selecting from the class $\mathcal{H}$ used for the original prediction task, rather than $\mathcal{H}\Delta\mathcal{H}$. So, we make the obvious proposal: re-use whichever approximation technique we used to select $h$ in the first place. In our experiments, since $\mathcal{H}$ is a space of neural networks, we use batch SGD on an NLL loss.
% ; i.e., the weight is defined to balance the importance of the summed risks. 

% \textbf{A Heuristic for Selecting from $\Upsilon$}\hspace{.5em} 
\paragraph{A Heuristic for Selecting from $\Upsilon$}
Reduction to ERM in the multiclass setting also requires specification of $\bar{h} \in \Upsilon$. Specifically, $\bar{h}$ should aid in minimizing the class-conditional error rates. In our experiments, we found a simple strategy to be fairly effective. Namely, we specify $\bar{h}$ by always picking the label with the second-highest confidence according to $h$. So, $\bar{h}$ disagrees with $h$ on all of $\mathcal{X}$, but does so in the ``most reasonable'' way according to the probabilities output by $h$. This approach uses the probabilities output by $h$ to rank similarity of labeling functions and supposes the most ``similar'' labeling function in $\Upsilon$ will be easiest for another hypothesis in $\mathcal{H}$ to simultaneously learn. Mathematically, our solution satisfies
\begin{equation}\small
\bar{h} \in \argmax\nolimits_{\upsilon \in \Upsilon} \sum\nolimits_{x \in \mathcal{X}} h_{\upsilon(x)}(x)
\end{equation}
where $h_\ell(x)$ is the score assigned to the label $\ell$. This heuristic is a practical solution that avoids search over $\Upsilon$, which will typically be unknown, unless we inefficiently enumerate using membership constraints. As far as we are aware, there is no known algorithm to efficiently select minimizers from $\Upsilon$ and $\mathcal{H}$, simultaneously, as called for by Thm.~\ref{thm:mdp_div_red2erm}. Besides the heuristic, we leave this problem as future work.
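
As a small illustration of the heuristic (with hypothetical scores), suppose $C = 3$ and, at some point $x$, the scores are $(h_1(x), h_2(x), h_3(x)) = (0.6, 0.3, 0.1)$. Then
\begin{equation*}\small
    h(x) = \argmax\nolimits_{\ell \in [C]} h_\ell(x) = 1, \qquad \bar{h}(x) = \argmax\nolimits_{\ell \neq h(x)} h_\ell(x) = 2,
\end{equation*}
so $x$ carries the label $2$ in $P(\bar{h})$ or $V(\bar{h})$, whereas it carries the label $1$ in $Q$ or $U$. Because the constraint defining $\Upsilon$ applies pointwise, the maximization over $\Upsilon$ decouples across $x$, which is why picking the second-highest score at each point attains the maximum.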

% \textbf{Comparison to Some Related Approaches}\hspace{.5em}
\paragraph{Comparison to Some Related Approaches}
% \citet{zhang2019bridging} propose a similar approach for approximating a mutliclass divergence which bounds a margin loss. In general, our two techniques require different consideration because of the difference in losses. 
Considering a binary label space $\mathcal{Y}$, \citet{kuroki2019unsupervised} propose a similar algorithm. The multiclass setting we consider does require some differences, primarily related to the distinct degrees of freedom in multiclass and binary classification. First, our proposal removes the requirement that $\mathcal{H}$ is symmetric since, as far as we are aware, this concept is not well-defined for hypotheses with multiclass output. Similar to before, we replace the symmetry required of $\mathcal{H}$ with symmetry in our classification problems. Second, in multiclass settings, the reduction strategy necessitates a new parameter to optimize: the labeling function $\bar{h}$. Besides our proposed heuristic for optimization, proof of this fact is \textit{not} a straightforward extension of the work of \citet{kuroki2019unsupervised}. In fact, it requires a different proof-technique (see Appendix~\ref{sec:mdp_div_red2erm}). In a multiclass setting, \citet{zhang2019bridging} also propose an approach for approximating a multiclass divergence. In general, our two techniques require different considerations because their divergence is based on a margin loss, rather than 01-loss. Notably, the multiclass bounds of \citet{zhang2019bridging} use uniform sample complexity, unlike our proposed non-uniform approach. Further, working directly with 01-loss, as we do, avoids any loosening of the bound via the margin penalty. 
\subsection{Efficiency Through Flat-Minima}
\label{sec:method_efficiency}
In full view of Section~\ref{sec:method_approx_mdp_div}, the reader may rightfully be concerned about the efficiency of the proposed technique for approximating $\mathbf{E}_H[\mathbf{d}_{\mathcal{C}_H}(\cdot, \cdot)]$. In particular, the suggestion requires training $k$ distinct neural networks: one for each $H_i \sim \mathbb{Q}$. Typically, $k$ will be large -- e.g., larger than 100 -- to control the size of the upperbound in Eq.~\eqref{eqn:montecarlo}, so this is not computationally feasible for practical applications. This problem is not totally unique to the model-dependent $h\Delta\mathcal{H}$-divergence, either. Common invariant feature-learning algorithms -- e.g., DANN \citep{ganin2015unsupervised} -- actually modify the feature distribution over which the classifier learns. In these cases, even the $\mathcal{H}\Delta\mathcal{H}$-divergence becomes dependent on the model \citep{johansson2019support}. In particular, supposing every model $h \in \mathcal{H}$ is the composition $c_h \circ f_h$ of a classifier $c_h$ and a feature extractor $f_h$, the modified $\mathcal{H}\Delta\mathcal{H}$-divergence results from the following restriction
\begin{equation}\small
    [\mathcal{H}\Delta\mathcal{H}]_h \defn \{1 - \mathbf{1}_{\{c_p\circ f_h(\cdot)\}}\{c_q \circ f_h(\cdot)\} \mid (p,q) \in \mathcal{H}^2\}.
\end{equation}
A similar restriction can be defined for the class $h\Delta\mathcal{H}$
\begin{equation}\small
    [h\Delta\mathcal{H}]_h \defn \{1 - \mathbf{1}_{\{c_h\circ f_h(\cdot)\}}\{c_q \circ f_h(\cdot)\} \mid q \in \mathcal{H}\}.
\end{equation}
In both cases, due to the dependence on $h$, the expectation over $\mathbb{Q}$ cannot be avoided as in Section~\ref{sec:method_approx_mid_div}. To resolve this frequent issue, we propose a new adaptation bound, which relies on an assumption related to flatness of the 01-loss over a (weighted) region in parameter space defined by $\mathbb{Q}$. Flatness assumptions are not unusual in PAC-Bayes and we develop this connection next.

\subsubsection{Flat-Minima and PAC-Bayes}
\label{sec:method_efficiency_bg}
An SGD solution lies in a flat-minimum if its parameters are robust to perturbation: changing the parameters (slightly) does not change the trained network's already low error rate. To put it another way, all parameter configurations near the SGD solution have identically low error. So, flatness here is an absence of ``elevation'' in error as our model moves about some region of parameter space. %This concept is further illustrated in Figure~\ref{fig:flat_help}. 
\citet{hochreiter1997flat} first discussed ``flatness'' as it relates to neural network generalization, hypothesizing that models lying in a large flat-minimum generalize well. More recently, the idea has been validated empirically at large scale. In particular, notions of the sharpness of minima are often good empirical descriptors of an SGD-trained neural network's generalization performance \citep{jiang2019fantastic, dziugaite2020search}. The motivation for using PAC-Bayes bounds is very often based on the hypothesis that flat-minima generalize well. This is because PAC-Bayes bounds implicitly encode the existence of flat-minima \citep{neyshabur2017exploring, dziugaite2017computing}. In detail, for a bound to be small for some predictor $\mathbb{Q}$, both its Gibbs risk and its KL-divergence with the prior $\mathbb{P}$ must be small. Because the prior $\mathbb{P}$ typically has some variance, we know $\mathbb{Q}$ should have variance too, or else the KL-divergence will be large. Further, the variance of $\mathbb{Q}$ ensures a region of non-zero probability away from the mean. Thus, if the Gibbs risk is also small, it is required that models in this region away from the mean all have identically low error (i.e., form a flat-minimum). Otherwise, the Gibbs risk would be inflated by probable parameter configurations with high error as illustrated in Figure~\ref{fig:flat_help}. In this sense, PAC-Bayes bounds and flat-minima go hand-in-hand. If the former is small, we know the latter exists.\footnote{This argument fails for some pathological cases. It works best with unimodal continuous $\mathbb{Q}$; e.g., the Gaussians used in Section~\ref{sec:experiments}.} 
% Note, the PAC-Bayes bound \textit{will} be small whenever it is acting as a good descriptor of neural network sample efficiency.
\begin{figure}
    \centering
    \includegraphics[width=.75\columnwidth]{figures/flat_help.png}
    \caption{\small Informal illustration of flat-minimum (right) and sharp-minimum (left). For ``flat'' regions in parameter space, a unimodal Gibbs predictor $\mathbb{Q}$ with some variance has consistently low error across probable samples from $\mathbb{Q}$. Otherwise, when a region is ``sharp'', there is non-negligible likelihood of sampling a hypothesis from $\mathbb{Q}$ with high error. The expected error over $\mathbb{Q}$ is thus inflated by these likely regions of high error. }
    \label{fig:flat_help}
\end{figure}
\subsubsection{A More Efficient Adaptation Bound}
% \begin{definition}
% Let $\mathfrak{D}(\mathcal{H})$ be the set of distributions over $\mathcal{H}$. A summary $\mu = \mu(\mathbb{Q}) \in \mathcal{H}$ of $\mathbb{Q}$ is the image of a map $ \mathfrak{D}(\mathcal{H}) \to \mathcal{H}$.
% \end{definition}
\begin{definition}\label{def:flatness} Let $\mathfrak{D}(\mathcal{H})$ be the space of distributions over $\mathcal{H}$ and fix a function $\mu: \mathfrak{D}(\mathcal{H}) \to \mathcal{H}$. The Gibbs predictor $\mathbb{Q}$ is $\rho$-flat on the distribution $\mathbb{D}$ if $\lvert \mathbf{R}_\mathbb{D}(\mathbb{Q}) - \mathbf{R}_\mathbb{D}(\mu(\mathbb{Q}))\rvert \leq \rho.$
\end{definition}
We call the function $\mu: \mathfrak{D}(\mathcal{H}) \to \mathcal{H}$ a \textit{summary function} and call the image of $\mu$ a \textit{summary}. When $\mathbb{Q}$ is implied, we typically abuse notation by writing $\mu = \mu(\mathbb{Q})$. Often, as in the definition above, we will refer to the ``flatness'' of a Gibbs predictor $\mathbb{Q}$ when it would be more precise to refer to the flatness of the (weighted) region in parameter space that this predictor defines (i.e., the region around the summary $\mu$). In this sense, the above definition quantifies the flatness of a region in parameter space by the ability of the error in this region to be represented by a single hypothesis $\mu$ from that region. Intuitively, this echoes physical properties of flatness: a topographic map requires many more numbers to describe a mountainous terrain than a flat prairie. That is, each change in elevation for the mountainous terrain must be demarcated by individual numbers, while the flat prairie may only need one number to summarize the elevation. Similarly, for the region around $\mu$ defined by the predictor $\mathbb{Q}$, a region is only ``flat'' if the error at $\mu$ is a good representative of the error across the whole region. Next, we give the proposed bound.
\begin{theorem}
\label{thm:pb-bound-efficient}
For any $\mathbb{P}$ over $\mathcal{H}$, all $\delta > 0$, w.p. at least $1-\delta$, for all $\mathbb{Q}$ over $\mathcal{H}$ s.t. $\mathbb{Q}$ is $\rho_S$-flat on $S$ and $\rho_T$-flat on $T$
\begin{equation}\small
\begin{split}
    \mathbf{R}_\mathbb{T}(\mathbb{Q}) & \leq \rho + \tilde{\lambda}_{S, T} + \mathbf{R}_S(\mathbb{Q}) + \mathbf{d}_{\mathcal{C}_\mu}(S_X, T_X) \\
    & + \sqrt{\tfrac{\mathrm{KL}(\mathbb{Q} \mid \mid \mathbb{P}) + \ln \sqrt{4m} - \ln ( \delta ) }{2m}}
\end{split}
\end{equation}
where $\mu$ is the summary of $\mathbb{Q}$, $\rho = \rho_S + \rho_T$, and $\mathcal{C}_\mu = \mu\Delta\mathcal{H}$.
\end{theorem}
\begin{corollary}\label{cor:pb-bound-efficient}
To study algorithms like DANN, we can instead choose $\mathcal{C}_\mu = [\mathcal{H}\Delta\mathcal{H}]_\mu$ or $\mathcal{C}_\mu = [\mu\Delta\mathcal{H}]_\mu$ in Thm.~\ref{thm:pb-bound-efficient}. The adaptability $\tilde{\lambda}_{S, T}$ is then dependent on $\mu$ as below 
\begin{equation}\small
    \tilde{\lambda}_{S, T}^\mu = \min\nolimits_{g \in \mathcal{H}} \Big \{ \mathbf{R}_S(c_g \circ f_\mu) + \mathbf{R}_T(c_g \circ f_\mu) \Big \}.
\end{equation}
% \lambda <= 
\end{corollary}
The main bound is identical to Thm.~\ref{thm:pb-bound} except that we assume $\mathbb{Q}$ is flat on both $S$ and $T$, then use this assumption to introduce a deterministic summary $\mu$ in the divergence. This deterministic summary replaces the expectation over $\mathbb{Q}$ whose estimation was inefficient, but the new cost is inflation of the bound by $\rho$. Unfortunately, similar to adaptability terms, we cannot expect to compute $\rho$ outside of controlled research contexts, since labels are required to estimate the flatness of $\mathbb{Q}$ on $T$ (according to Def.~\ref{def:flatness}). Instead, for the bound to be practically useful, we propose to assume $\rho$ is small. The same suggestion is often made for adaptability. That said, the caveats of carelessly making assumptions on adaptation problems should be noted \citep{zhao2019learning, johansson2019support}. 
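
For a rough sense of the magnitude of the complexity term in Thm.~\ref{thm:pb-bound-efficient}, consider the hypothetical values $m = 5000$, $\mathrm{KL}(\mathbb{Q} \mid \mid \mathbb{P}) = 100$, and $\delta = 0.05$ (chosen only for illustration, not taken from our experiments):
\begin{equation*}\small
    \sqrt{\tfrac{100 + \ln \sqrt{20000} - \ln (0.05)}{10000}} \approx \sqrt{\tfrac{100 + 4.95 + 3.00}{10000}} \approx 0.104.
\end{equation*}
In this hypothetical regime, the KL term dominates the numerator, and a flatness penalty of, say, $\rho = 0.02$ would add comparatively little to the bound.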

We argue the assumption of small $\rho$ is not an overly strong (or careless) assumption to make. To begin with, PAC-Bayes bounds and flat-minima are already related. The only addition we make to the usual connection (see Section~\ref{sec:method_efficiency_bg}) is that flat regions \textit{remain flat} when we transfer across data distributions (or, samples). Note, we do not even require the transferred region to remain a minimum, since the size of $\rho$ is only dictated by the \textit{difference} in the Gibbs risk and the summary risk: the Gibbs risk can be high on $T$ as long as the summary risk is as well. Thus, if one is willing to accept the usual assumptions, then our additional assumption merely raises the question: \textit{Do flat regions transfer?} 

In the next section, we test this question empirically, along with the other proposals given in this text.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Experiments}
\label{sec:experiments}
\begin{figure*}
    \centering
    \includegraphics[width=\linewidth]{figures/all-2.png}
    \caption{\small Adaptability (left) and dependent/independent divergences (right) for \textbf{DANN} on \textbf{Digits}. Solid line is median. Scatter describes unique $(S, T, \mathbb{Q})$, limited to $95\%$ or more data to filter extreme values. $\mathbb{Q}$ is a multivariate Gaussian and $\mu$ is its mean.}
    \label{fig:dann-all}
\end{figure*}
\begin{figure}
    \centering
    \includegraphics[width=0.8\columnwidth]{figures/rho.png}
    \caption{\small Histogram of $\rho$ estimates. Rug plot above $0.04$ displays infrequent occurrences. Each datum describes unique $(S, T, \mathbb{Q})$. $\mathbb{Q}, \mu$ are defined as in Figure~\ref{fig:dann-all}. See Appendix \ref{sec:exp_details_flatness} for details.}
    \label{fig:rho-hist}
\end{figure}
\begin{figure}
    \centering
    \includegraphics[width=.8\columnwidth]{figures/adaptability_small-2.png}
    \caption{\small Sample-dependent (left) and sample-independent (right) adaptability. Each datum is for unique $(S, T, \mathcal{H})$. See Appendix~\ref{sec:exp_details_ada} for details.}
    \label{fig:lambda}
\end{figure}
\subsection{Setup}
%\textbf{Datasets}\hspace{.5em}
\paragraph{Datasets}
We use a wide array of datasets from vision and NLP: \textbf{Digits} \citep{ganin2015unsupervised}, \textbf{PACS} \citep{li2017deeper_PACS}, \textbf{Office-Home} \citep{venkateswara2017deep}, \textbf{Amazon Reviews} \citep{blitzer2007biographies}, and \textbf{Discourse} sense classification datasets \citep{prasad2008penn, ramesh2010identifying, zeyrek2020ted}. 
%The \textbf{Discourse} datasets, in particular, pose challenging real-world problems \citep{atwell-etal-2021-discourse}.  
For \textbf{Digits}, we use the image as feature, while for \textbf{Amazon Reviews}, we use uni-gram and bi-gram features. For other datasets, we use pre-trained ResNet-50 \citep{he2016deep} or BERT \citep{devlin2018bert} features.
% On these fixed features, we learn linear models or dense 4-layer neural networks. See Appendix~\textcolor{red}{X} for details.

% \textbf{Models}\hspace{.5em} 
\paragraph{Models}
For \textbf{Digits}, we use a 4-layer CNN. For all other datasets, we use both a linear model and a 4-layer fully-connected network. For simplicity, our larger scale experiments use a simple adaptation algorithm (\textbf{SA}) which optimizes models to minimize risk on $S$. On \textbf{Digits}, we also study the \textbf{DANN} algorithm proposed by \citet{ganin2015unsupervised}, modified to train Gibbs predictors with varied regularization using \textbf{PBB} \citep{perez2021tighter}. 
%While training is more time-consuming and involved, this allows us to apply our bounds to study sample complexity.
We pick \textbf{Digits}, specifically, because it exhibits shift in the marginal label distributions, which can cause \textbf{DANN} to fail \citep{zhao2019learning}.
More training details are given in Appendix~\ref{sec:exp_details}.

% \textbf{Experiments}\hspace{.5em}
\paragraph{Experiments}
Each data point in our results is an individual experiment done on a source dataset $S$ and target dataset $T$ using a classifier $h$ or Gibbs predictor $\mathbb{Q}$. The pair $S$ and $T$ are taken from a set of data splits using the datasets discussed above (details in Appendix~\ref{sec:exp_details}). Across these splits, we consider various scenarios including: \textbf{single-source}, \textbf{multi-source}, and \textbf{within-distribution} adaptation (i.e., $\mathbb{S} = \mathbb{T}$) using multiple random data splits. On \textbf{Digits}, we also consider \textbf{natural shifts} (i.e., noise and rotation) and \textbf{unnatural shifts} (i.e., transfer to random data). In general, we restrict the pair $(S, T)$ to have a common label space. 
%Typically, we pick $\mathbb{Q}$ to be a multi-variate normal distribution (i.e., over the parameters) with diagonal covariance matrix. The prior $\mathbb{P}$ is also a multi-variate normal distribution with diagonal covariance matrix $\sigma \mathbf{I}$. ($\sigma=0.01$ unless otherwise noted). Training of the model $h$ or $\mathbb{Q}$ differs slightly depending on context. For simplicity, our larger scale experiments use a simple adaptation algorithm (\textbf{SA}) which optimizes models to minimize risk on $S$ without using data from $T$. On \textbf{Digits}, we also study a common invariant feature-learning algorithm (\textbf{DANN}) prposed by \citet{ganin2015unsupervised}. We modify this algorithm to train Gibbs predictors using the technique of \citet{perez2021tighter}. While training is more time-consuming and involved, this allows us to apply our bounds to study sample complexity. Additional details on both approaches are available in Appendix~\textcolor{red}{X} with some further discussion on \textbf{DANN} in the next part. Overall, accounting for each $(S,T)$ pair and each random seed for model training, the number of $(S,T,h)$ triples and $(S,T,\mathbb{Q})$ triples we study totals more than \textcolor{red}{XXXX}.
%\textbf{Additional Details for DANN Experiments}\hspace{.5em} In Figure~\ref{fig:dann-all}, we use Cor.~\ref{cor:pb-bound-efficient} to analyze properties of the \textbf{DANN} algorithm as a function of sample complexity. In particular, we train $\mathbb{Q}$ with different sample complexity by varying the degree of regularization in the techniuqe of \citet{perez2021tighter}. This exploration adds significant time to each experiment, so we limit our investigation to a downsampled version of the \textbf{Digits} dataset with about 5K examples per sample. We pick \textbf{Digits}, specifically, because it exhibits \textit{label shift} which is one property that can cause \textbf{DANN} to fail \citep{zhao2019learning}. To create Figure~\ref{fig:dann-all}, we report adaptability and model-dependent/independent divergence when using our stochastic variant of \textbf{DANN}. Specifically, these are given by the quantities in Cor.~\ref{cor:pb-bound-efficient}. For comparison, we also report adaptability and model-dependent divergence without use of \textbf{DANN} as given by the quantities in Thm.~\ref{thm:pb-bound-efficient}. Notice, $\rho$ is independent of the application of DANN, but Figure~\ref{fig:dann-all} does show how $\rho$ changes as a function of the variance parameter $\sigma$ for the prior $\mathbb{P}$. The mean of $\mathbb{P}$ in these experiment is selected by minimizing risk on $S$ (e.g., as in \textbf{SA}).
\subsection{Results}
% Rank Correlation
% H Div Corr:  0.537905744399355
% Our H Div Corr:  0.5843544383404338
% MMD BBSD 0.28800042368837947
% digits H Div Corr:  0.15098017471421185
% digits Our H Div Corr:  0.22586119022074952
% digits MMD BBSD:  0.630225649470099
% disc H Div Corr:  0.6977063610103937
% disc Our H Div Corr:  0.7827306031831462
% disc MMD BBSD:  0.397454209298847
% images H Div Corr:  0.40514741234072643
% images Our H Div Corr:  0.13947922846565158
% images MMD BBSD:  0.4266832357175109
% amaz H Div Corr:  -0.047245476553481984
% amaz Our H Div Corr:  0.40656775610827284
% amaz MMD BBSD:  0.47813771894382073
% Linear
% H Div Corr:  0.4724081637715984
% Our H Div Corr:  0.6126433103953133
% MMD BBSD 0.3656041862184042
% digits H Div Corr:  -0.0747099368251497
% digits Our H Div Corr:  0.4611444304459641
% digits MMD BBSD:  0.5342168010924534
% disc H Div Corr:  0.5547267015489086
% disc Our H Div Corr:  0.7333759038597105
% disc MMD BBSD:  0.367938256248469
% images H Div Corr:  0.22462854324044457
% images Our H Div Corr:  0.3036549293174042
% images MMD BBSD:  0.5144914811121526
% amaz H Div Corr:  -0.06869271520178998
% amaz Our H Div Corr:  0.3027999738886635
% amaz MMD BBSD:  0.4161227468781624
% Rank
% Linear
% \begin{table}\small
%     \centering
%     \begin{tabular}{c|c|c|c|c|c}
%     &  \textbf{All} & \textbf{Digi.} & \textbf{Disc.} & \textbf{PACS+OH} & \textbf{Amaz.}\\\hline
%     \textbf{model-ind.} & 0.47 & -0.07 & 0.56 & 0.23 & -0.07 \\\hline
%     \textbf{model-dep.} & 0.61 & 0.46 & 0.73 & 0.30 & 0.30  \\\hline
%     \textbf{bbsd} & 0.37 & 0.53 & 0.37 & 0.52 & 0.42
%     \end{tabular}
%     \caption{Correlation of $h$-dependent and -independent divergence with $|\mathbf{R}_h(S) - \mathbf{R}_h(T)|$. \textbf{bbsd} is the shift-detection approach of \citet{rabanser2018failing} for reference. Columns delineate data subsets. Each datum describes unique $(S, T, h)$.}
%     \label{tab:divapprox}
% \end{table}
% \begin{figure}
%     \centering
%     \includegraphics[width=\columnwidth]{figures/div.png}
%     \caption{\small Model-independent and -dependent divergence  for DANN experiments on \textbf{Digits}. Solid lines show median and scatter shows each datum -- a unique $(S, T, \mathbb{Q})$ triple.}
%     \label{fig:dann-div}
% \end{figure}
% \begin{figure}
%     \centering
%     \includegraphics[width=\columnwidth]{figures/rho-2.png}
%     \caption{\small Upperbounds for $\rho$ in DANN experiments using different prior variance. Solid lines show median and scatter shows each datum -- a unique $(S, T, Q)$ triple with $Q \sim \mathbb{Q}^k$.}
%     \label{fig:dann-rho}
% \end{figure}
% \textbf{Sample-Dependent Adaptability}\hspace{.5em}
\paragraph{Sample-Dependent Adaptability}
As mentioned, estimation of sample-independent adaptability (e.g., $\lambda$ in Thm.~\ref{thm:ben2010theory}) requires verification of generalization. In particular, to estimate $\lambda$ one can learn $\eta \in \mathcal{H}$ which has a small sum of risks over the observed samples $S$ and $T$. In our experiments, we do so using batch SGD on a weighted NLL loss -- a common surrogate. Because $\lambda$ is a population statistic for $\mathbb{S}$ and $\mathbb{T}$, we cannot directly report the errors on $S$ and $T$; doing so would be incorrect, much like using training error as a validation metric. Instead, we should check the performance of $\eta$ on, for example, a heldout data subset. This is the strategy we take in Figure~\ref{fig:lambda}, using Hoeffding's Inequality to produce a valid upperbound on $\lambda$. By comparison, estimating sample-dependent adaptability is much easier. By design, we \textit{can} report error on the samples $S$ and $T$ used for training $\eta$. Doing so produces a valid upperbound:
\begin{equation}\small\label{eqn:lambda_ub}
    \forall \eta \in \mathcal{H} \ : \ \tilde{\lambda}_{S,T} \leq \mathbf{R}_S(\eta) + \mathbf{R}_T(\eta) \quad\text{(by definition).}
\end{equation}
As is visible in Figure~\ref{fig:lambda}, this strategy for estimating $\tilde{\lambda}$ is much more effective than the sample-independent strategy in revealing important information. We see from the histogram of upperbounds on $\tilde{\lambda}$ that adaptability is very often small and concentrated near 0, although this is not always the case. Comparatively, upperbounds for $\lambda$ are spread out with notable mass at large values; with these bounds alone, we would miss the interpretation that adaptability is very often small (as we might like to assume, in practice). In the rest of our discussion, all adaptability will be sample-dependent. Additional experiments on adaptability are available in Appendix~\ref{sec:adaptability_boxp}.
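
For intuition, one standard way to form a heldout bound on $\lambda$ is sketched here. The heldout samples $S'$ and $T'$, their sizes $m$ and $n$, the population-risk notation $\mathbf{R}_{\mathbb{S}}$ and $\mathbf{R}_{\mathbb{T}}$, and the confidence level $\delta$ are introduced only for this illustration and may differ from the exact setup behind Figure~\ref{fig:lambda}. Suppose $\eta$ is chosen independently of i.i.d. heldout samples $S' \sim \mathbb{S}^m$ and $T' \sim \mathbb{T}^n$, and the loss takes values in $[0,1]$. Under the usual definition of $\lambda$ as the smallest sum of population risks achievable in $\mathcal{H}$, applying the one-sided Hoeffding's Inequality to each sample with a union bound gives, with probability at least $1 - \delta$,
\begin{equation*}\small
\begin{split}
    \lambda \ &\leq \ \mathbf{R}_{\mathbb{S}}(\eta) + \mathbf{R}_{\mathbb{T}}(\eta) \\
    &\leq \ \mathbf{R}_{S'}(\eta) + \mathbf{R}_{T'}(\eta) + \sqrt{\tfrac{\ln(2/\delta)}{2m}} + \sqrt{\tfrac{\ln(2/\delta)}{2n}}.
\end{split}
\end{equation*}
The two penalty terms are the price of heldout estimation: they shrink only at rate $1/\sqrt{m}$ and $1/\sqrt{n}$, whereas the bound in Eq.~\eqref{eqn:lambda_ub} carries no such penalty.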
\begin{table}\small
    \centering
    \caption{\small Correlation of $h$-dependent and -independent divergence with $|\mathbf{R}_h(S) - \mathbf{R}_h(T)|$. Columns delineate data subsets. Each datum describes a unique $(S, T, h)$ triple. See Appendix~\ref{sec:exp_details_divapprox} for details.}
    \begin{tabular}{c|c|c|c|c|c}
    &  \textbf{All} & \textbf{Digi.} & \textbf{Disc.} & \textbf{PACS+OH} & \textbf{Amaz.}\\\hline
    \textbf{model-ind.} & 0.54 & 0.15 & 0.70 & 0.41 & -0.05 \\\hline
    \textbf{model-dep.} & 0.58 & 0.23 & 0.78 & 0.14 & 0.41  \\% \hline
    % \textbf{bbsd} & 0.29 & 0.63 & 0.40 & 0.43 & 0.48
    \end{tabular}
    \label{tab:divapprox}
\end{table}

% \textbf{Divergence and Approximation}\hspace{.5em}
\paragraph{Divergence and Approximation}
In Table~\ref{tab:divapprox}, we give results for our approximation techniques applied to the model-dependent $h\Delta\mathcal{H}$-divergence and the model-independent $\mathcal{H}\Delta\mathcal{H}$-divergence. The models used in these experiments are trained using \textbf{SA}. Since there is no ground truth to compare to, we report performance of our approximations on a ranking task. That is, we compare our approximations to the absolute difference in risks on the source and target and compute the Spearman rank correlation. According to our adaptation bounds, smaller divergence should predict smaller difference in risk and larger divergence should predict larger difference in risk, as in the ranking task we study. Any effective approximation of divergence should also mimic this behavior, allowing us to conduct an indirect evaluation. In aggregate, we observe both divergences are capable of ranking performance similarity on the source and target, which validates our approximations to some extent. For reference, a recent statistic designed for shift-detection \citep{rabanser2018failing} achieves correlation \textbf{0.29} on all data. We also observe the model-dependent divergence typically ranks ``better'' than the model-independent divergence. This, too, is to be expected according to our theory, since the model-independent divergence does not account for variation in $h$ and should thus perform worse. Overall, the nuanced agreement of our approximations with our theoretical expectations suggests that these techniques are effective.
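
For reference, the ranking statistic can be written out as follows, in notation introduced only for this illustration. Let $d_i$ denote the approximated divergence and $g_i = |\mathbf{R}_h(S) - \mathbf{R}_h(T)|$ the risk gap for the $i$-th of $n$ triples $(S, T, h)$, and let $\mathrm{rk}(\cdot)$ give the rank of a value among the $n$ values of its kind. The Spearman rank correlation, written $r_s$ here to avoid confusion with the flatness value $\rho$, is the Pearson correlation between the two lists of ranks. When there are no ties, it reduces to
\begin{equation*}\small
    r_s \ = \ 1 - \frac{6 \sum_{i=1}^{n} \big(\mathrm{rk}(d_i) - \mathrm{rk}(g_i)\big)^2}{n(n^2 - 1)}.
\end{equation*}
Thus, $r_s = 1$ when the divergence orders the triples exactly as the risk gap does, and $r_s = -1$ when the order is exactly reversed. Only the ordering matters, so the approximation need not match the scale of the risk gap.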
% Note, our approximations would be \textit{exact} if efficient (non-approximate) ERM was a solved problem for neural-networks.

%\textbf{Do Flat Regions Transfer?}\hspace{.5em}
\paragraph{Do Flat Regions Transfer?}
As noted, one stipulation of practical use for Thm.~\ref{thm:pb-bound-efficient} is a small flatness value $\rho$. This is not unlike the common assumption that $\lambda$ is small and, as discussed, is related to the flat-minima hypothesis. To estimate $\rho$ and test our assumption, we select $\rho_S$ and $\rho_T$ to be the smallest values so that Def.~\ref{def:flatness} is satisfied on $S$ and $T$ using a Monte-Carlo estimate for the Gibbs Risk.\footnote{A penalty based on Hoeffding's Inequality could be added to this estimate to create a valid upperbound. We do not consider this since a penalty is also added if we use the strategy in Eq.~\eqref{eqn:montecarlo}.} We train $\mathbb{Q}$ using a variant of \textbf{SA} based on the technique of \citet{perez2021tighter}. Our results indicate $\rho$ is typically small, as desired, with mean \textbf{0.007} and SD \textbf{0.01} across 4K+ experiments. See Figure~\ref{fig:rho-hist} for a visualization.

%\textbf{Analysis of Assumptions after DANN}\hspace{.5em}
\paragraph{Analysis of Assumptions after DANN}
Our results in Figure~\ref{fig:dann-all} show an interesting relationship between the sample complexity of $\mathbb{Q}$ -- as measured by $\mathrm{KL}(\mathbb{Q} \mid\mid \mathbb{P})$ -- and our assumption on adaptability. Namely, we can be more confident in the assumption that $\tilde{\lambda}$ is small when the sample complexity of our solution increases. A similar observation holds for the flatness term $\rho$ (see Appendix~\ref{sec:ext_dann_res}, Figure~\ref{fig:rho-dann}). Our analysis suggests \textbf{DANN} may be a data-hungry algorithm, since solutions with the properties we desire have large sample complexity. The practical suggestion is to use large quantities of unlabeled data when applying \textbf{DANN}, which is reasonable since unlabeled data can be ``cheap'' to acquire.
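
To illustrate the complexity measure, suppose for this sketch that $\mathbb{Q}$ and $\mathbb{P}$ are multivariate normal distributions over $d$ parameters with diagonal covariance, say $\mathbb{Q} = \mathcal{N}(\mu_{\mathbb{Q}}, \mathrm{diag}(s_1^2, \ldots, s_d^2))$ and $\mathbb{P} = \mathcal{N}(\mu_{\mathbb{P}}, \sigma \mathbf{I})$. The symbols $\mu_{\mathbb{Q}}$, $\mu_{\mathbb{P}}$, $s_j$, and $d$ are assumptions made only for this illustration. In this case, the divergence has the closed form
\begin{equation*}\small
    \mathrm{KL}(\mathbb{Q} \mid\mid \mathbb{P}) = \frac{1}{2} \sum_{j=1}^{d} \left[ \frac{s_j^2 + (\mu_{\mathbb{Q},j} - \mu_{\mathbb{P},j})^2}{\sigma} - 1 + \ln \frac{\sigma}{s_j^2} \right].
\end{equation*}
Under this form, the divergence grows as the mean of $\mathbb{Q}$ moves away from the mean of $\mathbb{P}$ or as its variances depart from $\sigma$, and a smaller prior variance $\sigma$ penalizes the same displacement of the mean more heavily.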

% \textbf{Analysis of Divergence after DANN}\hspace{.5em}
\paragraph{Analysis of Divergence after DANN}
% To understand our interpretation, it is import to note that the point of the \textbf{DANN} algorithm is reduce the divergence between the source and target distributions, since this commonly appears in upperbounds on the target risk. 
In Figure~\ref{fig:dann-all}, according to the (more sensitive) model-dependent divergence, \textbf{DANN} reduces data-distribution divergence as it is designed to do. Still, it does not reduce divergence to the degree one might expect and, as the sample complexity of the solution increases, the gap between divergences -- before and after \textbf{DANN} -- begins to wane.
%\footnote{This may feel counter-intuitive, since an unconstrained solution should better optimize divergence; it is import to remember a second objective of \textbf{DANN} is to reduce risk on the source, so other factors are contributing to the complexity of a solution.} 
This is interesting because it shows reduction of divergence and reduction of adaptability / flatness may be competing objectives. Further, this finding echoes theoretical hypotheses in recent literature \citep{zhao2019learning, wu2019domain, johansson2019support}, while also revealing the role of sample complexity in this story. To meet our assumptions when using \textbf{DANN}, we should use large amounts of unlabeled data and allow an unconstrained solution, but to ensure \textbf{DANN} reduces distribution divergence significantly, we should instead constrain our solution to lower complexity (e.g., via regularization). Depending on problem context, there may be some optimum between these extremes, but in any case, these opposing relationships are an interesting take-away from the application of our theory.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusion}
\label{sec:conclusion}
In this work, we proposed the first adaptation bounds capable of studying the non-uniform sample complexity of adaptation algorithms using multiclass neural networks. Empirically, we validated the novel design concepts in our adaptation bounds and showed our approximation techniques for some multiclass divergences were effective. Finally, we applied our bounds to study the sample complexity of a common domain-invariant learning algorithm. Our findings revealed unexpected relationships between sample complexity and important properties of the algorithm we studied. Code for reproducing our experiments is publicly available at \url{https://github.com/anthonysicilia/pacbayes-adaptation-UAI2022}.

Besides what has been done in this work, we also identify some areas of potential future work:
\paragraph{Assumptions and Heuristics}
As with previous adaptation bounds, the nature of the adaptation problem requires us to be imprecise in some cases. For one, we make a number of assumptions on adaptability and flatness. Also, our divergence computation does require some heuristics. While we study these imperfections empirically with promising results, we anticipate both shortcomings can be improved. In particular, restriction of scope to specific domains or hypothesis classes should reveal exploitable problem structure. 
\paragraph{Generalized Loss}
While we have focused on multiclass learners in this work, a PAC-Bayesian adaptation bound for general learners (e.g., with bounded loss functions) remains an open problem. Possibly, applying our strategies to the more general framework of \citet{mansour2009domain} would be fruitful. However, since algorithms for computing divergence have traditionally been loss-specific, we expect additional theoretical derivation to be required for each new loss.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{acknowledgements} We thank the anonymous reviewers for helpful feedback. 

S. Hwang was supported by Institute of Information \& communications Technology Planning \& Evaluation (IITP) grant funded by the Korea government (MSIT), Artificial Intelligence Graduate Program, Yonsei University (2020-0-01361-003), and the Yonsei University Research Fund of 2022 (2022-22-0131).
\end{acknowledgements}
% \clearpage
\bibliography{sicilia_277}
\end{document}
