\documentclass[accepted]{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{booktabs} % for professional tables
\usepackage{hyperref}
\newcommand{\theHalgorithm}{\arabic{algorithm}}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}
% if you use cleveref.
\usepackage[capitalize,noabbrev]{cleveref}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
\usepackage[textsize=tiny]{todonotes}
\usepackage{dsfont}

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 
\usepackage{xr} 



\externaldocument{nguyen_632}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Efficient Failure Pattern Identification of Predictive Algorithms\\(Supplementary material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,2]{Bao Nguyen}
\author[3]{Viet Anh Nguyen}

\affil[1]{%
    School of Information and Communication Technology\\
    Hanoi University of Science and Technology\\
    Vietnam
}
\affil[2]{%
    College of Engineering \& Computer Science\\
    VinUni-Illinois Smart Health Center\\
    VinUniversity\\
    Vietnam
}
\affil[3]{%
    The Chinese University of Hong Kong
}

\input{comments.tex}
\begin{document}
  
\onecolumn 
\maketitle 

\renewcommand\thefigure{\thesection.\arabic{figure}}    

\appendix
\section{Additional Experiments}
\label{Additional Experiments}
\subsection{Motivation for the User-Defined Failure Mode Definition}
\label{sec:specific-user}

In this section, we present the motivation for, the theoretical justification of, and the practical effectiveness of the failure mode definition based on the mutual nearest neighbor graph\footnote{A mutual $k_{nn}$-nearest neighbor graph is a graph in which there is an edge between $x_i$ and $x_j$ if $x_i$ is one of the $k_{nn}$ nearest neighbors of $x_j$ and $x_j$ is one of the $k_{nn}$ nearest neighbors of $x_i$.} on the embedding space.
\begin{itemize}
    \item We make an assumption similar to that of \citet{ref:d2022spotlight} and \citet{ref:sohoni2020no}: the classifier's activation layer contains essential information about the semantic features used for classification. Proximity between two points in this embedding space can therefore indicate their semantic similarity. Hence, an edge between two points in the mutual nearest neighbor graph suggests that the two connected points are more semantically similar than other pairs, which promotes semantic cohesion among the points within a failure mode under our definition.
    \item On the theoretical side, we use the mutual nearest neighbor graph, which is effective for clustering and outlier detection (see \citet{ref:song2022graph}, \citet{ref:song2022survey}, and \citet{ref:brito1997connectivity}). Moreover, \citet[Theorem~2.2]{ref:brito1997connectivity} states that, with a reasonable choice of $k_{nn}$, the connected components (a.k.a.~maximally connected subgraphs) of a mutual $k_{nn}$-graph are consistent for identifying the clustering structure.
    \item For a visual illustration, Figure~\ref{fig:gt} shows images from four failure patterns of dataset id\_1, which indicates that this definition detects semantically cohesive clusters. Each failure pattern has a common concept recognizable by humans and consists entirely of misclassified images. The top-left mode includes images of blonde-haired girls with tanned skin. The top-right mode includes images of girls wearing earrings. The bottom-left mode contains photos with tilted angles, and the bottom-right mode contains images with dark backgrounds.
\end{itemize}
\begin{figure}[h]
    \centering
    \includegraphics[scale=0.475]{sss.png}
    \caption{Failure patterns existing in dataset id\_1. One can observe four distinct failure patterns in this dataset.}
    \label{fig:gt}
\end{figure}

\subsection{Datasets and Implementation Details}
\label{sec:dataset}

We describe the fifteen datasets used in our work in Table~\ref{tab:dataset}.

\begin{table}
\centering
\caption{The description of 15 datasets that are used in the numerical experiments.}
\begin{tabular}{cccccccc}
\hline
Dataset & DcBench   & Noise Magnitude       & SNR  & $M$  & $k_{nn}$ & Sample size & Number of misclassified samples \\ \hline
id\_1   & p\_72799  & Low     & 0.15 & 10 & 7     & 6088                  & 572                                \\ \hline
id\_2   & p\_122144 & Low     & 0.22 & 10 & 7     & 6103                  & 1076                               \\ \hline
id\_3   & p\_121880 & Low    & 0.38 & 10 & 7     & 5969                  & 1259                               \\ \hline
id\_4   & p\_122653 & Low    & 0.47 & 10 & 7     & 6019                  & 1088                               \\ \hline
id\_5   & p\_118660 & Low    & 0.47 & 10 & 8     & 5994                  & 1019                               \\ \hline
id\_6   & p\_122145 & Medium & 0.69 & 10 & 11    & 6135                  & 1141                               \\ \hline
id\_7   & p\_121753 & Medium & 0.96 & 10 & 10    & 6138                  & 1612                               \\ \hline
id\_8   & p\_122406 & Medium & 1.17 & 10 & 16    & 6072                  & 937                                \\ \hline
id\_9   & p\_118049 & Medium & 1.38 & 10 & 12    & 5979                  & 1051                               \\ \hline
id\_10  & p\_122150 & Medium & 1.39 & 10 & 10    & 6107                  & 1304                               \\ \hline
id\_11  & p\_121948 & High   & 1.75 & 10 & 15    & 6027                  & 1438                               \\ \hline
id\_12  & p\_122417 & High   & 1.85 & 10 & 19    & 6035                  & 1096                               \\ \hline
id\_13  & p\_122313 & High   & 1.91 & 10 & 15    & 6048                  & 1011                               \\ \hline
id\_14  & p\_121977 & High   & 2.19 & 10 & 17    & 6117                  & 1153                               \\ \hline
id\_15  & p\_121854 & High   & 3.78 & 10 & 24    & 6017                  & 1554                               \\ \hline
\end{tabular}
\label{tab:dataset}
\end{table}


\textbf{Preprocessing}: We select 15 datasets from \citet{ref:eyuboglu2022domino}, each of which includes three features: Activation, True Label, and Pseudo Label. We then preprocess the Activation feature using a standard scaler.

\textbf{Ground truth generation}: Each preprocessed dataset requires values for $k_{nn}$ and $M$. The value of $M$ expresses the level of evidence required to confirm a failure pattern. A higher value of $M$ places greater emphasis on the patterns that occur most frequently in the dataset. As $M$ decreases to 1, the problem reduces to identifying misclassified data points, where each misclassified data point constitutes a pattern. Moreover, the users choose $M$ so that they can perceive the concept shared by $M$ samples. If $M$ is too small, the concept may not be distinctive enough between clusters; if $M$ is too large, the users may struggle to identify the shared concept.

The value of $k_{nn}$ controls the coherence required of the data points within a pattern. The users choose a smaller $k_{nn}$ if they need the samples in a failure mode to be tightly grouped. \citet{ref:brito1997connectivity} recommended choosing $k_{nn}$ of order $\log(N)$ for consistent identification of the clustering structure. A smaller value of $k_{nn}$ imposes a more stringent condition for creating an edge in the mutual $k_{nn}$-graph. When $k_{nn} = 0$, each data point is connected only with itself. If $k_{nn}$ is sufficiently high, all misclassified data points merge into a single failure pattern. Figure~\ref{fig:datasets} shows that, as $k_{nn}$ and the SNR increase, large patterns containing many data points tend to appear. A possible explanation is as follows: increasing $k_{nn}$ creates additional edges, which can connect initially separate patterns or attach more data points to existing patterns. In practical applications of this problem, the two parameters $k_{nn}$ and $M$ depend heavily on the users, the machine learning task, and the nature of the dataset.

In this study, we fix $M = 10$ for all datasets and vary $k_{nn}$ to generate diverse Signal-to-Noise Ratio (SNR) scenarios. Given the value of $k_{nn}$, we construct the mutual $k_{nn}$-graph on the re-scaled Activation feature. Subsequently, we run a simple Depth First Search on the sub-graph induced by the misclassified data points to collect all maximally connected components with cardinality greater than $M$. These components are the patterns that the recommending algorithms aim to find. We add one feature, named Pattern, to each data point; it indicates the pattern to which the data point belongs, or $-1$ if the data point does not belong to any pattern.

Finally, the complete dataset for our problem consists of four fields: Activation, True Label, Pseudo Label, and Pattern.
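
To make the construction above concrete, the ground-truth patterns can be written in set notation. The symbols $\mathrm{NN}_{k_{nn}}(\cdot)$, $\mathcal{I}_{\mathrm{mis}}$, $\mathcal{E}$, $C_\ell$, and $p_i$ are introduced here purely for illustration; this is a sketch of the procedure as described, not a restatement of the actual implementation. Let $\mathrm{NN}_{k_{nn}}(x_i)$ denote the set of the $k_{nn}$ nearest neighbors of $x_i$ among all re-scaled activations, and let $\mathcal{I}_{\mathrm{mis}}$ denote the index set of the misclassified samples. The edge set of the mutual $k_{nn}$-graph restricted to the misclassified samples is
\[
\mathcal{E} = \big\{ (i, j) \in \mathcal{I}_{\mathrm{mis}} \times \mathcal{I}_{\mathrm{mis}} : i \neq j,~ x_i \in \mathrm{NN}_{k_{nn}}(x_j),~ x_j \in \mathrm{NN}_{k_{nn}}(x_i) \big\}.
\]
If $C_1, \ldots, C_L$ denote the connected components of the graph $(\mathcal{I}_{\mathrm{mis}}, \mathcal{E})$ with $|C_\ell| > M$, then the Pattern feature of sample $i$ is
\[
p_i =
\begin{cases}
\ell & \text{if } i \in C_\ell \text{ for some } \ell \in \{1, \ldots, L\}, \\
-1 & \text{otherwise.}
\end{cases}
\]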
\begin{figure}
	\begin{minipage}[t]{0.5\textwidth}
		\centering
		\includegraphics[scale = 0.65]{id_2.pdf}
        \caption*{id\_2 dataset (SNR = 0.22)}
	\end{minipage}
	\begin{minipage}[t]{0.5\textwidth}
		\centering
		\includegraphics[scale = 0.65]{id_5.pdf}
        \caption*{id\_5 dataset (SNR = 0.47)}
	\end{minipage}
	\begin{minipage}[t]{0.5\textwidth}
		\centering
		\includegraphics[scale = 0.65]{id_8.pdf}
        \caption*{id\_8 dataset (SNR = 1.17)}
	\end{minipage}
	\begin{minipage}[t]{0.5\textwidth}
		\centering
		\includegraphics[scale = 0.65]{id_12.pdf}
        \caption*{id\_12 dataset (SNR = 1.85)}
	\end{minipage} 
   \caption{Two-dimensional visualization of the Activation feature in four datasets. To reduce the 512-dimensional vectors to two dimensions, we use the supervised dimension reduction technique introduced by \citet{ref:mcinnes2018umap}.}
    \label{fig:datasets}
\end{figure}


\subsection{Additional Numerical Results}
In the main paper, we present the numerical results for groups categorized into three levels of Signal-to-Noise Ratio (SNR). In this section, we offer a comprehensive breakdown of the results for each individual dataset in Tables~\ref{tab:0.1detailed},~\ref{tab:0.2detailed}, and~\ref{tab:sensitivity}.

We also provide charts that illustrate the progress of the algorithms over iterations on dataset id\_10 in Figure~\ref{fig:convergence}. The blue line represents the percentage of queried samples, which is linear because the size of the queried batch is fixed at each iteration. The orange line indicates the percentage of detected misclassified samples out of the total misclassified ones in the dataset. The green line represents the percentage of detected failure modes out of the total number of failure modes in the dataset. For the methods that incorporate our exploitation component (the Gaussian process component), namely DS\_0.0, DS\_0.25, DS\_0.5, and DS\_0.75, the orange line lies consistently and substantially above the blue line. This trend demonstrates the effectiveness of our exploitation term in identifying misclassified samples.

However, DS\_0.0 performs poorly at detecting failure modes: its green line stays below the blue line throughout the iterations, despite its effectiveness in identifying misclassified samples. In contrast, DS\_0.25, DS\_0.5, and DS\_0.75 detect all failure patterns within approximately 100 iterations (40\% of the dataset samples). This difference can be attributed to the absence of the exploration term in DS\_0.0 when dealing with the high SNR level of dataset id\_10.

\begin{table}
\centering
\caption{Benchmark of Effectiveness (at 10\% of sample size) on different noise magnitudes. Larger values are better. Bold values indicate the best methods for each dataset.}
\begin{tabular}{ccccccccc}
\hline
Dataset & US & DS\_0.0 & DS\_0.25 & DS\_0.5 & DS\_0.75 & DS\_1.0 & Coreset & BADGE \\ \hline
id\_1 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_2 & 0.00±0.00 & \textbf{0.25±0.00} & 0.00±0.00 & \textbf{0.25±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_3 & 0.00±0.00 & \textbf{0.14±0.00} & \textbf{0.14±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_4 & 0.00±0.00 & 0.00±0.00 & \textbf{0.33±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_5 & 0.00±0.00 & \textbf{0.12±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_6 & 0.00±0.00 & \textbf{0.33±0.00} & 0.17±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_7 & 0.00±0.00 & \textbf{0.25±0.00} & \textbf{0.25±0.00} & \textbf{0.25±0.00} & \textbf{0.25±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_8 & 0.01±0.03 & \textbf{0.17±0.00} & 0.00±0.00 & \textbf{0.17±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_9 & 0.00±0.00 & \textbf{0.20±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_10 & 0.00±0.00 & 0.25±0.00 & \textbf{0.50±0.00} & \textbf{0.50±0.00} & 0.25±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_11 & 0.00±0.00 & \textbf{0.33±0.00} & 0.00±0.00 & 0.00±0.00 & \textbf{0.33±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_12 & 0.01±0.04 & \textbf{0.25±0.00} & \textbf{0.25±0.00} & \textbf{0.25±0.00} & \textbf{0.25±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_13 & 0.00±0.00 & \textbf{0.33±0.00} & \textbf{0.33±0.00} & \textbf{0.33±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_14 & 0.01±0.04 & \textbf{0.50±0.00} & 0.00±0.00 & 0.25±0.00 & \textbf{0.50±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
id\_15 & 0.07±0.17 & \textbf{0.50±0.00} & \textbf{0.50±0.00} & \textbf{0.50±0.00} & \textbf{0.50±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
Overall & 0.01±0.05 & \textbf{0.24±0.14} & 0.17±0.18 & 0.17±0.18 & 0.14±0.18 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 \\ \hline
\end{tabular}
\label{tab:0.1detailed}
\end{table}

\begin{table}
\caption{Benchmark of Effectiveness (at 20\% of sample size) on different noise magnitudes. Larger values are better. Bold values indicate the best methods for each dataset.}
\centering
\begin{tabular}{ccccccccc}
\hline
Dataset & US & DS\_0.0 & DS\_0.25 & DS\_0.5 & DS\_0.75 & DS\_1.0 & Coreset & BADGE \\ \hline
id\_1 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_2 & 0.01±0.04 & \textbf{0.25±0.00} & 0.00±0.00 & \textbf{0.25±0.00} & \textbf{0.25±0.00} & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_3 & 0.00±0.00 & \textbf{0.29±0.00} & 0.14±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_4 & 0.01±0.06 & \textbf{0.67±0.00} & 0.33±0.00 & 0.33±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_5 & 0.00±0.00 & \textbf{0.25±0.00} & 0.00±0.00 & 0.12±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_6 & 0.00±0.00 & \textbf{0.67±0.00} & 0.17±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_7 & 0.02±0.08 & \textbf{0.25±0.00} & \textbf{0.25±0.00} & \textbf{0.25±0.00} & \textbf{0.25±0.00} & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_8 & 0.01±0.03 & \textbf{0.17±0.00} & \textbf{0.17±0.00} & \textbf{0.17±0.00} & 0.00±0.00 & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_9 & 0.02±0.06 & \textbf{0.20±0.00} & \textbf{0.20±0.00} & \textbf{0.20±0.00} & \textbf{0.20±0.00} & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_10 & 0.05±0.10 & 0.25±0.00 & 0.50±0.00 & \textbf{0.75±0.00} & \textbf{0.75±0.00} & 0.25±0.00 & 0.00±0.00 & 0.005±0.100 \\ \hline
id\_11 & 0.06±0.12 & 0.33±0.00 & 0.33±0.00 & \textbf{0.67±0.00} & 0.33±0.00 & 0.00±0.00 & 0.00±0.00 & 0.003±0.100 \\ \hline
id\_12 & 0.12±0.12 & 0.25±0.00 & \textbf{0.50±0.00} & 0.25±0.00 & \textbf{0.50±0.00} & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_13 & 0.04±0.11 & 0.33±0.00 & \textbf{0.67±0.00} & \textbf{0.67±0.00} & 0.33±0.00 & 0.00±0.00 & 0.00±0.00 & 0.000±0.000 \\ \hline
id\_14 & 0.11±0.12 & \textbf{0.50±0.00} & \textbf{0.50±0.00} & \textbf{0.50±0.00} & \textbf{0.50±0.00} & 0.25±0.00 & 0.25±0.00 & 0.100±0.120 \\ \hline
id\_15 & 0.48±0.09 & \textbf{0.50±0.00} & \textbf{0.50±0.00} & \textbf{0.50±0.00} & \textbf{0.50±0.00} & \textbf{0.50±0.00} & 0.00±0.00 & 0.150±0.230 \\ \hline
Overall & 0.06±0.14 & \textbf{0.33±0.18} & 0.28±0.21 & 0.31±0.24 & 0.24±0.23 & 0.07±0.14 & 0.02±0.06 & 0.020±0.090 \\ \hline
\end{tabular}
\label{tab:0.2detailed}
\end{table}

\begin{table}
\caption{Benchmark of Sensitivity on different noise magnitudes. Smaller values are better. Bold values indicate the best methods for each dataset.}
\centering
\begin{tabular}{ccccccccc}
\hline
Dataset & US & DS\_0.0 & DS\_0.25 & DS\_0.5 & DS\_0.75 & DS\_1.0 & Coreset & BADGE \\ \hline
id\_1 & 0.55±0.00 & 0.76±0.00 & \textbf{0.23±0.00} & 0.57±0.00 & 0.44±0.00 & 0.62±0.00 & 0.76±0.00 & 0.660±0.009 \\ \hline
id\_2 & 0.35±0.00 & 0.09±0.00 & 0.23±0.00 & \textbf{0.05±0.00} & 0.14±0.00 & 0.51±0.00 & 0.60±0.00 & 0.600±0.100 \\ \hline
id\_3 & 0.59±0.00 & \textbf{0.05±0.00} & 0.07±0.00 & 0.42±0.00 & 0.31±0.00 & 0.49±0.00 & 0.55±0.00 & 0.550±0.007 \\ \hline
id\_4 & 0.44±0.00 & 0.13±0.00 & \textbf{0.06±0.00} & 0.11±0.00 & 0.22±0.00 & 0.57±0.00 & 0.60±0.00 & 0.530±0.006 \\ \hline
id\_5 & 0.53±0.00 & \textbf{0.08±0.00} & 0.21±0.00 & 0.19±0.00 & 0.23±0.00 & 0.33±0.00 & 0.57±0.00 & 0.470±0.006 \\ \hline
id\_6 & 0.48±0.00 & \textbf{0.03±0.00} & 0.04±0.00 & 0.47±0.00 & 0.29±0.00 & 0.32±0.00 & 0.55±0.00 & 0.400±0.007 \\ \hline
id\_7 & 0.37±0.00 & \textbf{0.03±0.00} & 0.10±0.00 & 0.07±0.00 & 0.07±0.00 & 0.35±0.00 & 0.37±0.00 & 0.340±0.004 \\ \hline
id\_8 & 0.30±0.00 & 0.08±0.00 & 0.12±0.00 & \textbf{0.03±0.00} & 0.30±0.00 & 0.46±0.00 & 0.45±0.00 & 0.400±0.006 \\ \hline
id\_9 & 0.28±0.00 & \textbf{0.03±0.00} & 0.14±0.00 & 0.13±0.00 & 0.15±0.00 & 0.45±0.00 & 0.48±0.00 & 0.330±0.005 \\ \hline
id\_10 & 0.20±0.00 & \textbf{0.02±0.00} & 0.05±0.00 & 0.04±0.00 & 0.04±0.00 & 0.14±0.00 & 0.36±0.00 & 0.280±0.006 \\ \hline
id\_11 & 0.26±0.00 & \textbf{0.01±0.00} & 0.13±0.00 & 0.15±0.00 & 0.06±0.00 & 0.31±0.00 & 0.32±0.00 & 0.300±0.005 \\ \hline
id\_12 & 0.13±0.00 & \textbf{0.02±0.00} & 0.04±0.00 & 0.04±0.00 & 0.07±0.00 & 0.21±0.00 & 0.22±0.00 & 0.290±0.003 \\ \hline
id\_13 & 0.22±0.00 & 0.05±0.00 & 0.05±0.00 & \textbf{0.04±0.00} & 0.19±0.00 & 0.29±0.00 & 0.40±0.00 & 0.340±0.006 \\ \hline
id\_14 & 0.18±0.00 & \textbf{0.03±0.00} & 0.11±0.00 & 0.10±0.00 & \textbf{0.03±0.00} & 0.11±0.00 & 0.18±0.00 & 0.230±0.004 \\ \hline
id\_15 & 0.23±0.00 & \textbf{0.02±0.00} & 0.05±0.00 & 0.05±0.00 & 0.04±0.00 & 0.19±0.00 & 0.26±0.00 & 0.210±0.003 \\ \hline
Overall & 0.34±0.14 & \textbf{0.10±0.18} & 0.11±0.07 & 0.16±0.17 & 0.17±0.12 & 0.36±0.15 & 0.44±0.16 & 0.400±0.150 \\ \hline
\end{tabular}
\label{tab:sensitivity}
\end{table}

\begin{figure}
	\begin{minipage}[t]{0.49\textwidth}
		\centering
		\includegraphics[scale = 0.43]{id_10_ufs.pdf}
            \caption*{US}
	\end{minipage}
	\hfill
	\begin{minipage}[t]{0.49\textwidth}
		\centering
		\includegraphics[scale = 0.43]{id_10_0.pdf}
            \caption*{DS\_0.0}
	\end{minipage}
	\hfill
	\begin{minipage}[t]{0.49\textwidth}
		\centering
		\includegraphics[scale = 0.43]{id_10_0.25.pdf}
            \caption*{DS\_0.25}
	\end{minipage}\hfill
	\begin{minipage}[t]{0.49\textwidth}
		\centering
		\includegraphics[scale = 0.43]{id_10_0.5.pdf}
            \caption*{DS\_0.5}
	\end{minipage}\hfill
	\begin{minipage}[t]{0.49\textwidth}
		\centering
		\includegraphics[scale = 0.43]{id_10_0.75.pdf}
            \caption*{DS\_0.75}
	\end{minipage}
    \hfill
	\begin{minipage}[t]{0.49\textwidth}
		\centering
		\includegraphics[scale = 0.43]{id_10_1.0.pdf}
            \caption*{DS\_1.0}
	\end{minipage}
	\hfill
 	\begin{minipage}[t]{0.49\textwidth}
		\centering
		\includegraphics[scale = 0.43]{BADGE.pdf}
            \caption*{BADGE}
	\end{minipage}
    \hfill
	\begin{minipage}[t]{0.49\textwidth}
		\centering
		\includegraphics[scale = 0.43]{Coreset.pdf}
            \caption*{Coreset}
	\end{minipage}
	\hfill
    \caption{The percentage of detected misclassified samples, the percentage of detected patterns, and the percentage of queried samples over the query iterations on dataset id\_10.}
    \label{fig:convergence}
\end{figure}

\subsection{Analysis of sampling complexity}
Each iteration in our framework consists of two main phases. The first phase determines which samples to label next; the most costly computations in this phase are the matrix inversion and the matrix determinant. The matrix has size at most $N \times N$, so the time complexity is $O(N^3)$. If we use an optimized Coppersmith--Winograd-like (CW-like) algorithm for matrix inversion, then the complexity can be as low as $O(N^{2.373})$. The second phase updates information and confirms detected failure modes. Updating information involves matrix inversions and multiplications, with cost $O(N^{2.373})$. A low-cost Depth First Search checks for detected failure modes, which costs $O(N)$. In conclusion, the cost of an iteration is $O(N^{2.373})$.
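
The per-iteration cost stated above can be tallied as follows. This is only an illustrative operation count: the symbol $\omega$ is shorthand introduced here for the exponent $2.373$, constant factors are ignored, and $N = 6000$ is an assumed round number close to the sample sizes in Table~\ref{tab:dataset}.
\[
\underbrace{O(N^{\omega})}_{\text{sample selection}} + \underbrace{O(N^{\omega})}_{\text{information update}} + \underbrace{O(N)}_{\text{Depth First Search}} = O(N^{\omega}), \qquad \omega \approx 2.373.
\]
For $N = 6000$, we have $N^3 \approx 2.2 \times 10^{11}$, whereas $N^{2.373} \approx 9 \times 10^{8}$, so the asymptotic saving is roughly two orders of magnitude before constant factors are taken into account.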

\subsection{Principal hyper-parameters and user-defined hyper-parameters}

Our proposed framework targets human-machine cooperation systems. Therefore, some components depend on the user, such as the failure mode definition, which is determined by two factors: (i) how to decide whether two samples share a common concept; (ii) what the structure of a failure pattern is. In our experiments, we consider the case where the user defines an edge (common concept) using the mutual $k_{nn}$-graph under the Euclidean distance on the embedding space. The connectivity criterion is maximally connected subgraphs (a.k.a.~connected components). With this specification, the user also provides two hyper-parameters, $k_{nn}$ and $M$, whose meanings are discussed in Appendix~\ref{sec:dataset}. From the algorithmic viewpoint, our approach depends mainly on one hyper-parameter, $\vartheta$, which regulates the exploration-exploitation trade-off in the sampling procedure ($\vartheta = 0$ means pure exploitation, $\vartheta = 1$ means pure exploration). We experimented with five values of $\vartheta$ throughout the paper.

\section{Proofs}
\subsection{Proof of Proposition 6.1}
\begin{proof}[Proof of Proposition 6.1]
We first show that the value of $\delta$ must be upper-bounded by $\sqrt{N-1}$. To see this, note that $K(h_{\mc X}, h_{\mc Y})$ is a Gram matrix whose diagonal elements are all ones and whose off-diagonal elements lie in the range $(0, 1]$. This yields the upper bound
\[
\| K(h_{\mc X}, h_{\mc Y}) - I_N \|_F \leq \sqrt{N(N-1)}.
\]
To ensure the existence of $h_{\mc X}, h_{\mc Y}$, the value of $\delta$ must fulfill:
\[
\delta \| I _N \|_F < \sqrt{N(N-1)}
 \implies 
\delta < \sqrt{N-1}.
\]
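The two Frobenius norms used in this step can be computed entry by entry. As a short worked check (writing $K_{ij}$, a notation introduced here, for the $(i,j)$ entry of $K(h_{\mc X}, h_{\mc Y})$), the unit diagonal and the bound $K_{ij} \leq 1$ give
\[
\| K(h_{\mc X}, h_{\mc Y}) - I_N \|_F^2 = \sum_{i \neq j} K_{ij}^2 \leq N(N-1),
\qquad
\| I_N \|_F^2 = N,
\]
so that $\delta \sqrt{N} < \sqrt{N(N-1)}$ is the same as $\delta < \sqrt{N-1}$. For instance, with $N = 5$ the admissible range is $\delta < 2$.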
Next, we derive the condition on $h_{\mc X}$ and $h_{\mc Y}$. Squaring both sides of~\eqref{eq:hyper-condition1} gives
\[
    \| K(h_{\mc X},  h_{\mc Y}) - I_N \|_F^2 \geq \delta^2 \| I _N \|_F^2 = \delta^2 N. 
\]
Because the diagonal elements of $K(h_{\mc X}, h_{\mc Y})$ are all ones, the above condition is equivalent to
\begin{align}
        \label{eq:hyper-condition2}
      &\sum_{i > j} \exp\big( -\frac{\| x_i - x_j \|_2^2}{h_{\mc X}^2} -\frac{\| \msa_{\hat{y}_i} - \msa_{\hat{y}_j} \|_2^2 + \| \covsa_{\hat{y}_i} - \covsa_{\hat{y}_j}\|_F^2}{h_{\mc Y}^2} \big)
      \geq \frac{\delta^2 N}{2}.
\end{align}
Using Jensen's inequality for the exponential function, which is convex, we obtain the following lower bound:
\begin{align*}
&\frac{1}{{N \choose 2}}\sum_{i > j} \exp \big( -\frac{\| x_i - x_j \|_2^2}{h_{\mc X}^2} -\frac{\| \msa_{\hat{y}_i} - \msa_{\hat{y}_j} \|_2^2 + \| \covsa_{\hat{y}_i} - \covsa_{\hat{y}_j}\|_F^2}{ h_{\mc Y}^2} \big)
\\
& \qquad \geq \exp\big(-\frac{\sum_{i > j}\| x_i - x_j \|_2^2}{h_{\mc X}^2 {N  \choose 2}}  - \frac{\sum_{i > j} \big( \| \msa_{\hat{y}_i} - \msa_{\hat{y}_j} \|_2^2 + \| \covsa_{\hat{y}_i} - \covsa_{\hat{y}_j}\|_F^2 \big)}{h_{\mc Y}^2 {N \choose 2}} \big).
\end{align*}
Therefore, if $h_{\mc X}$ and $h_{\mc Y}$ satisfy
\begin{align*}
\exp\big(-\frac{ \sum_{i > j}\| x_i - x_j \|_2^2}{h_{\mc X}^2 {N  \choose 2}} - \frac{ \sum_{i > j} \big( \| \msa_{\hat{y}_i} - \msa_{\hat{y}_j} \|_2^2 + \| \covsa_{\hat{y}_i} - \covsa_{\hat{y}_j}\|_F^2 \big)}{h_{\mc Y}^2 {N \choose 2}}\big)
\geq \frac{\delta^2}{N - 1} ,
\end{align*}
then they also satisfy the condition~\eqref{eq:hyper-condition2}. Defining the quantities $D_{\mc X}$ and $D_{\mc Y}$ as in the statement of the proposition and taking logarithms on both sides, we find that it suffices for $h_{\mc X}$ and $h_{\mc Y}$ to satisfy
\[
\frac{D_{\mc X}}{h_{\mc X}^2} + \frac{D_{\mc Y}}{h_{\mc Y}^2} \leq \ln{\frac{N-1}{\delta^2}}.
\]
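The passage from the sufficient condition to this final inequality can be spelled out as follows. This sketch assumes that $D_{\mc X}$ and $D_{\mc Y}$ denote the averaged pairwise quantities appearing in the exponent above; that reading is inferred from the derivation and should be checked against the statement of the proposition. The threshold arises from dividing the right-hand side of~\eqref{eq:hyper-condition2} by the number of pairs,
\[
\frac{\delta^2 N / 2}{\binom{N}{2}} = \frac{\delta^2 N}{N(N-1)} = \frac{\delta^2}{N-1},
\]
and taking logarithms of $\exp\big( - D_{\mc X} / h_{\mc X}^2 - D_{\mc Y} / h_{\mc Y}^2 \big) \geq \delta^2 / (N-1)$ gives
\[
- \frac{D_{\mc X}}{h_{\mc X}^2} - \frac{D_{\mc Y}}{h_{\mc Y}^2} \geq \ln \frac{\delta^2}{N-1}
\quad \Longleftrightarrow \quad
\frac{D_{\mc X}}{h_{\mc X}^2} + \frac{D_{\mc Y}}{h_{\mc Y}^2} \leq \ln \frac{N-1}{\delta^2}.
\]
The right-hand side is positive because $\delta < \sqrt{N-1}$.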
This completes the proof.
\end{proof}

\subsection{Taylor Expansion for the Value-of-Interest (VoI)}

We first use a second-order Taylor expansion to approximate $f(X) = \VoI(X) = (1 + \exp(- g(X)))^{-1}$ around the point $X=\mu$:
\begin{align*}
    f(X) 
    &= f(\mu) + (X - \mu)^\top \nabla f(\mu) + \frac{1}{2} (X - \mu)^\top \nabla^2 f(\mu) (X - \mu) + \mathcal{O}(\| \Delta_X \|^3) \\
    &= f(\mu) + (X - \mu)^\top \nabla f(\mu) + \frac{1}{2} \mathrm{Tr}[\nabla^2 f(\mu) (X - \mu) (X - \mu)^\top] + \mathcal{O}(\| \Delta_X \|^3),
\end{align*}
where $\Delta_X = X - \mu$. Setting $\mu = \EE[X]$ and taking expectations on both sides of the above equation gives
\begin{align*}
    \EE[f(X)] 
    &= \EE\big[f(\mu)\big] + \EE\big[(X - \mu)^\top \nabla f(\mu)\big] + \frac{1}{2} \EE\big[\mathrm{Tr}[\nabla^2 f(\mu) (X - \mu) (X - \mu)^\top]\big] + \mathcal{O}(\| \Delta_X \|^3) \\
    &= f(\mu) + \frac{1}{2} \cov_{t, i}^* \nabla^2 f(\mu) + \mathcal{O}(\| \Delta_X \|^3),
\end{align*}
where the second equality follows from the relationship
\[
\EE\big[(X - \mu)^\top \nabla f(\mu)\big] = \EE\big[(X - \mu)\big]^\top \nabla f(\mu) = (\EE[X] - \mu)^\top \nabla f(\mu) = 0,
\]
and from the definition of the covariance matrix
\[
\EE\big[(X - \mu) (X - \mu)^\top\big] = \cov_{t, i}^*.
\]
It now suffices to verify the expressions for $\alpha_i$ and $\beta_i$. Note that $\alpha_i = f(\mu) = (1 + \exp(- \mu))^{-1}$ and $\beta_i$ is the second-order derivative 
\begin{align*}    
    \beta_i &= \nabla^2 f(\mu) = \alpha_i(1-\alpha_i)(1-2\alpha_i) ,
\end{align*}
where the second equality follows from the property of the sigmoid function.
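
The sigmoid property invoked above can be verified directly. Writing $\sigma(z) = (1 + \exp(-z))^{-1}$ (the symbol $\sigma$ is introduced here for convenience, and the computation assumes $f = \sigma$, as suggested by the expression for $\alpha_i$), we have
\begin{align*}
    \sigma'(z) &= \frac{\exp(-z)}{(1 + \exp(-z))^2} = \sigma(z) \big(1 - \sigma(z)\big), \\
    \sigma''(z) &= \sigma'(z) \big(1 - \sigma(z)\big) - \sigma(z) \sigma'(z) = \sigma'(z) \big(1 - 2\sigma(z)\big) = \sigma(z) \big(1 - \sigma(z)\big) \big(1 - 2 \sigma(z)\big).
\end{align*}
Evaluating at $z = \mu$ recovers $\beta_i = \alpha_i (1 - \alpha_i)(1 - 2\alpha_i)$. In particular, $\beta_i = 0$ when $\alpha_i = 1/2$, that is, at $\mu = 0$.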

\section{Social Impact}

One important social impact of this research lies in its potential to improve the accuracy and reliability of machine learning classifiers. By identifying misclassification patterns, the framework enables the refinement of classifiers, reducing the likelihood of wrong predictions in various domains. This can have wide-ranging implications, such as improving the performance of automated systems in critical areas where accurate classification is of utmost importance, for example healthcare diagnosis~\citep{shaban2021guest, rudin2018optimized, albahri2023systematic} or autonomous vehicles~\citep{glomsrud2019trustworthy, wagner2015philosophy}.

Another significant social impact of this research is its potential to address biases and fairness issues in machine learning systems~\citep{caton2020fairness, mehrabi2021survey, pessach2022review}. By identifying misclassification patterns, the framework can shed light on potential biases in the data or algorithmic models. This knowledge is crucial for developing fairer and more equitable machine learning systems, which is a prerequisite for deploying machine learning models in practice.

Moreover, the collaborative nature of the framework promotes human-machine interaction, fostering a symbiotic relationship that combines human expertise and algorithmic capabilities. This approach not only empowers human annotators by involving them in the decision-making process but also allows them to contribute their domain knowledge and intuition~\citep{wu2022survey, xin2018accelerating}.

\bibliography{nguyen_632}
\end{document}
