
\documentclass{article} % For LaTeX2e
\usepackage{iclr2024_conference,times}

% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}
% \ificlrfinal
\usepackage{hyperref}
\usepackage{url}
\usepackage{graphicx} 
\usepackage{subcaption}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{multicol}
\usepackage{wrapfig}
\usepackage{caption}


\newenvironment{thisnote}{\par\color{black}}{\par}
% \newcommand{\ms}[1]{\textcolor{purple}{Mehrdad: #1}}
% \newcommand{\mm}[1]{\textcolor{red}{Mazda: #1}}
% \newcommand{\kr}[1]{\textcolor{orange}{Keivan: #1}}
% \newcommand{\SF}[1]{\textcolor{blue}{SF: #1}}

\newcommand{\method}{\texttt{PRIME}}
% \newcommand{\method}{\textsc{PRIME}}

\title{PRIME: Prioritizing Interpretability in Failure Mode Extraction}
% \title{Tag2Failure: Putting Interpretability First for Diagnosing Failure Modes in Human Terms}

% Authors must not appear in the submitted version. They should be hidden
% as long as the \iclrfinalcopy macro remains commented out below.
% Non-anonymous submissions will be rejected without review.

% \centering
% \author{
% Keivan Rezaei\\
% University of Maryland\\
% College Park, MD\\
% \texttt{krezaei@umd.edu}
% \And
% Mehrdad Saberi\\
% University of Maryland\\
% College Park, MD\\
% \texttt{msaberi@umd.edu}
% \And
% Mazda Moayeri\\
% University of Maryland\\
% College Park, MD\\
% \texttt{mmoayeri@umd.edu}
% \And
% Soheil Feizi\\
% University of Maryland\\ 
% College Park, MD\\
% \texttt{sfeizi@cs.umd.edu}
% }

% \author{Francisco Vargas$^{1}$\thanks{Work done while at DeepMind}~~, Will Grathwohl$^2$ \& Arnaud Doucet$^2$ \\
% $^{1}$ University of Cambridge, $^{2}$ DeepMind
% }

% \author{%
%     Keivan Rezaei\thanks{Equal contribution.}, Mehrdad Saberi\footnotemark[1], Mazda Moayeri, \& Soheil Feizi\\
%     \texttt{\{krezaei,msaberi,mmoayeri,sfeizi\}@umd.edu}\\
%     University of Maryland
% }

\author{Keivan Rezaei$^1$\thanks{Equal contribution.}\ \ , Mehrdad Saberi$^{1*}$, Mazda Moayeri$^{1}$, Soheil Feizi$^{1}$
\vspace{1.5mm} \\
$^1$Department of Computer Science, University of Maryland
\vspace{1.5mm} \\
\small \texttt{\{krezaei,msaberi,mmoayeri,sfeizi\}@umd.edu}\\
}

% \newcommand{\SF}[1]{\textcolor{blue}{SF: #1}}
% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to \LaTeX{} to determine where to break
% the lines. Using \AND forces a linebreak at that point. So, if \LaTeX{}
% puts 3 of 4 authors names on the first line, and the last on the second
% line, try using \AND instead of \And before the third author name.

% \newcommand{\fix}{\marginpar{FIX}}
% \newcommand{\new}{\marginpar{NEW}}

\iclrfinalcopy % Uncomment for camera-ready version, but NOT for submission.
\begin{document}


\maketitle

\begin{abstract}

% \SF{I rewrote the abstract; your version was commented out for the reference; perhaps we can follow the same flow in introduction?}
In this work, we study the challenge of providing human-understandable descriptions for failure modes in trained image classification models.
Existing works address this problem by first identifying clusters (or directions) of incorrectly classified samples in a latent space and then aiming to provide human-understandable text descriptions for them.
We observe that in some cases, the describing text does not match the identified failure modes well,
partially because shared interpretable attributes of failure modes may not be captured by clustering in the feature space.
To improve on these shortcomings, we propose a novel approach that prioritizes interpretability in this problem: we start by obtaining human-understandable concepts (tags) of images in the dataset and
then analyze the model's behavior based on the presence or absence of combinations of these tags.
Our method also ensures that the tags describing a failure mode form a~minimal set,
avoiding redundant and noisy descriptions.
Through several experiments on different datasets, we show that our method successfully identifies failure modes and generates high-quality text descriptions associated with them.
These results highlight the importance of prioritizing interpretability in understanding model failures. 



%In this study, we address the challenge of detecting failure modes in trained models and providing human-understandable descriptions for these challenging subpopulations. Previous methods often require human intervention or lack interpretable descriptions. To improve on these shortcomings, we propose a novel approach that prioritizes interpretability. We start by obtaining human-understandable concepts (tags) of images in the dataset and analyze the model's behavior based on the presence or absence of combinations of these tags. Our method successfully identifies failure modes and generates high-quality descriptions. To evaluate its generalizability, we examine the model's performance on images associated with the detected failure modes. Additionally, we introduce a systematic method to measure the quality of descriptions, which demonstrates the effectiveness of our approach in providing accurate and specific captions for challenging subpopulations. We also compare our method with existing techniques, showing its superiority, and argue for the importance of prioritizing interpretability in understanding model failure.
\end{abstract}

\section{Introduction}
% \mm{can we say something about the danger of having undiagnosed failure modes?}
A plethora of reasons (spurious correlations, imbalanced data, corrupted inputs, etc.) may lead a model to underperform on a specific subpopulation;
we term this a \emph{failure mode}. Failure modes are challenging to identify due to the black-box nature of deep models, and further,
they are often obfuscated by common metrics like overall accuracy, leading to a false sense of security.
However, these failures can have significant real-world consequences, such as perpetuating algorithmic bias \citep{buolamwini2018gender} or unexpected catastrophic failure under distribution shift.
Thus, the discovery and description of failure modes are crucial in building reliable AI, as we cannot fix a problem without first diagnosing it.
% \kr{Mazda can you add more classic papers here?}

% Training dataset plays a pivotal role in shaping the behavior of a model.
% Spurious correlations, underrepresented subpopulations, or corrupted inputs in the training dataset can
% result in the model exhibiting superior or inferior performance on specific subpopulations of inputs. 
% Hence, identifying these subpopulations allows us to diagnose models and collect more data to obtain more accurate models
% that can generalize better.

Detection of failure modes or biases within trained models has been studied in the literature.
Prior work \citep{tsipras2020imagenet, vasudevan2022does} requires humans in the loop to get a sense of biases or subpopulations on which a model underperforms.
Other methods \citep{sohoni2020no, nam2020learning, kim2019multiaccuracy, liu2021just} capture and intervene on hard inputs without providing \textit{human-understandable} descriptions for challenging subpopulations.
Providing human-understandable and \textit{interpretable} descriptions for failure modes not only enables humans to easily understand hard subpopulations,
but also enables the use of text-to-image methods \citep{ramesh2022hierarchical, rombach2022high, saharia2022photorealistic, kattakinda2022invariant}
to generate relevant images corresponding to failure modes and thereby improve the model's accuracy on them.
% \mm{odd citation - there are no generative models in FOCUS, do you mean d3s?}

% to refine training and train more accurate and generalizable models.
% \SF{not the strongest opening paragraph; how about we explain why interpretable failure models are important: For example, we can say they are critical for model diagnosis and also we can use them to collect more data about those underperforming subpopulations or we can generate data for them using text-to-image models. For the latter part, you can cite our D3S and Madry's related work}

% \SF{we need to add citations here; also be careful if we are negatively describing a work}
Recent work \citep{eyuboglu2022domino, jain2022distilling, kim2023biastotext, deon2021spotlight} takes an important step in improving failure mode diagnosis
by additionally finding natural language descriptions of detected failure modes, namely via leveraging modern vision-language models.
% With the advent of vision-language models that can map images and text to the same representation space, 
% recent approaches 
% aim to assign human-understandable descriptions to hard subpopulations or \textit{failure modes} \mm{we already used this word; i think its best to define it earlier} of a model trained on a specific dataset \mm{can we just say `recent approaches improve upon prior work by additionally assigning natural language descriptions of detected failure modes'}.
These methodologies leverage the shared vision-language latent space, discerning intricate clusters or directions within this space,
and subsequently attributing human-comprehensible descriptions to them. 
However, questions have been raised regarding \textit{the quality of the generated descriptions}, i.e., 
there is a need to ascertain whether the captions produced genuinely correspond to the images within the identified subpopulation.
Additionally, it is essential to determine whether these images convey shared semantic attributes that can be effectively articulated through textual descriptions.

% \mm{a bit vague -- you mean, is distance in the representation space a good proxy for semantic similarity; We observe that two samples sharing many semantic attributes may in fact lie far away in representation space, while nearby instances may not share any semantics. I'd reference experiments that we have on this;
% \mm{a qualitative example could be insightful as well.}. \kr{suggestions?}
In this work, we investigate whether the latent representation space is a good proxy for semantic space.
To this end, we consider two attributed datasets, CelebA~\citep{liu2015faceattributes} and CUB-200~\citep{WahCUB_200_2011},
and observe that two samples sharing many semantic attributes may lie far apart in latent space, while nearby instances may not share any semantics (see Section~\ref{subsec:rev}).
Hence, existing methods may suffer from relying on the representation space, as clusters and directions found in this space
may contain images with different semantic attributes, leading to less coherent descriptions.
% \kr{does this para make sense?}

% \SF{it is a strong negative claim that I think we should avoid; first, most likely these authors will be our reviewers, second, I think such strong claims need a bit more evidence}

% \SF{I'd suggest to write this in a way that I wrote in the abstract; first describe some of these methods; say they are elegant! and then explain that the bottelneck here is working in the representation space of these models, etc. you can even talk about Table 3 results to highlight this}

% \ms{Are there any other benefits to finding \textbf{more human-understandable} failure modes? If refining the training is the only benefit, shouldn't one of our main goals be to provide better results on this task using our failure modes, compared to previous work?}
\input{figures/first_figure/figure}

Inspired by this observation and the significance of faithful descriptions, we propose \method.
Our method reverses the prevailing paradigm in failure mode diagnosis:
we put \textit{interpretability first}.
We start by obtaining human-understandable concepts (tags) of images using a pre-trained tagging model and
examine the model's behavior conditioned on the presence or absence of a~combination of those tags.
In particular, we
consider different groups of tags and check whether
(1) there is a significant drop in the model's accuracy over images that contain all tags in the group
and (2) that group is minimal, i.e., images having only some of those tags are easier for the model.
When a group of tags satisfies both of these conditions, we identify it as a failure mode that can be effectively described by these tags. Figure~\ref{fig:main_fig} shows an overview of our approach
and compares it with existing methods.
\begin{wrapfigure}{r}{0.35\textwidth}
  \begin{center}
    \includegraphics[width=0.33\textwidth]{figures/intro/digram_new.pdf}
  \end{center}
  \vspace{-10pt}
  \caption{\method{} illustration.} 
  \vspace{-20pt}
  \label{fig:main_fig}
\end{wrapfigure}


% \mm{i tried improving the second condition; i think it can still be made better. I'd think about it more.} \kr{I made a minor change, how does it look?}
% \kr{use group instead of subset}

% \SF{add more details, how do we find combinations; what are the hyperparameters; are these tags essential, etc.}

As an example, by running \method{} on a~trained model over Living17,
we find that images where a~\textbf{black} ape is \textbf{hanging} from a \textbf{tree branch} form a hard subpopulation on which the model's accuracy drops from $86.23\%$ to $41.88\%$.
Crucially, the presence of all $3$ of these tags is necessary, i.e.,
when we consider images that have only $1$ or $2$ of these $3$ tags, the model's accuracy is higher.
Figure~\ref{fig:num-of-tags} illustrates these failure modes. We further study the effect of the number of tags in Section~\ref{subsec:num-of-tags}.

% \SF{I'd instead mention the Fox or Ape example with three tags; mention accuracy when any of the tags are removed to highlight the essense of our results}

% \input{figures/method/figure}

% \mm{first motivate generalizability, introduce it, then say that we have it: A key advantage/utility/impact of distilling a failure mode into a faithful description is that the resultant description itself can be used to obtain more data that can be used to fix the failure mode. However, the text description must \textit{generalizes} well, in the sense that new images that match the tags in a failure mode must be similarly challenging for the model, so that training on this curated subset results in improved performance.}
To further validate our method, we examine data unseen during the computation of our failure mode descriptions.
We observe that the model similarly struggles on unseen images that match a failure mode.
% images who match group of tags identified as a failure mode
That is, we demonstrate the \textit{generalizability} of our failure modes, crucially, directly from the succinct text descriptions.
Besides reflecting the quality of our descriptions, this generalizability makes it possible to use generative models.
We validate this claim by generating hard images from some of the failure mode descriptions
and comparing the model's accuracy on them with its accuracy on generated images that correspond to easier subpopulations.
% Furthermore, our method extends its utility to leverage the \textit{descriptions of detected failure modes} for the purpose of identifying failure modes in an unseen dataset.
% Unlike existing methods, which may struggle with this task, our approach offers \textit{generalizability}.
% Additionally, our method enables us to use \textit{descriptions of detected failure modes} to obtain failure modes on \textit{unseen dataset}.
% In fact, in existing work, by only having descriptions of detected failure modes, it is unclear how to obtain hard groups over unseen images.
% However, our method is generalizable.
% We accomplish this by obtaining tags over unseen images and by considering tags associated with a failure mode, we collect images representing those tags over unseen dataset,
% thus, obtaining a group of hard images. \kr{it is generalizable from text. collecting more data for failure modes}\mm{rewrite; unclear}
% We empirically show that our method generalizes well, i.e., description of detected failure modes identify hard subpopulations over unseen data.
% Moreover, these descriptions possess the ability to \textit{generate challenging images} on which the model's accuracy significantly deteriorates. 

% \SF{It is a bit hard to digest different evaluatiion criteria we have. Perhaps we can make it a bit more organized? will discuss it in our meeting}
% \mm{transition is very abrupt.}
% \kr{similarity, coherence and specificity}
% \kr{better w.r.t. than prior work that do not put inter first ...}
Finally, we show that \method{} produces better descriptions for detected failure modes in terms of \textit{similarity}, \textit{coherency}, and \textit{specificity} of descriptions,
compared to prior work that does not prioritize interpretability. 
Evaluating description quality is challenging and typically requires human assessment, which can be impractical for extensive studies.
To mitigate that, inspired by CLIPScore \citep{Hessel2021CLIPScoreAR}, we present a suite of three automated metrics that harness vision-language models to evaluate the quality.
These metrics quantify both the intra-group image-description similarity and coherency, while also assessing the specificity of descriptions to ensure they are confined to the designated image groups.
We observe that putting interpretability first and considering different combinations of tags (concepts)
improves the quality of the generated descriptions.
We discuss \method{}'s limitations in Appendix~\ref{app:limitation}.

% where we mainly focus on 
% (1) generalizability of the approach and (2) quality of outputted descriptions.
% We evaluate our proposed approach using this framework and compare the method with other existing ones. \SF{I don't quite understand what you mean here}
% We are also able to evaluate its generalizability by providing images of bears having those tags.


% In order to examine the \textit{generalizability} of a human-understandable failure mode detector,
% we split the dataset into train and test and run the method to extract failure modes on the train split.
% We then utilize generated captions and find group of images in the test dataset described by those captions and evaluate model's performance on them.
% A good description for a failure mode should lead to a hard group of unseen images. 
% We use this idea to compare generalizability of our approach and other comparable existing work.


% Furthermore, we rely on descriptions we obtain for detected failure modes to generate images with the same attributes.
% We then expect the model to underperform on images generated with the help of failure modes' captions than normally generated images.
% This not only shows that failure modes generalize well but also indicates that descriptions are accurate enough to generate hard inputs.



% we compare our method with the prominent recent method DOMINO\citep{eyuboglu2022domino} that improves over similar approaches. DOMINO detects failure modes using the latent representation of inputs in a vision-language space and assign human-understandable meaning to them. We provide a systematic method to measure the quality of descriptions and see improvements over existing work in our method, i.e., when we put inpterpretability first.

% In evaluating the ability of our generated captions to describe challenging images, we compare our method with the approach proposed by \cite{jain2022distilling}. Their method focuses on detecting hard directions in the latent space of a vision-language model and assigning meaningful descriptions to identified directions. While their method may provide more detailed descriptions, it may not perform well in cases where there is not a single failure direction, or when failure modes exist in multiple clusters. In contrast, our approach takes into account the diverse nature of failure modes by considering combinations of tags rather than a single failure direction. This allows us to capture a broader range of failure modes that may exist within the dataset. By adopting a comprehensive perspective, we aim to generate descriptions that accurately represent the complex nature of failure modes in a more generalized manner.

% Finally,
% we motivate that putting interpretability first could be a wise choice as models that detect tags, concepts, or objects in the images continue to evolve.
% Also, we empirically show that the reverse direction used in recent approaches where
% the latent representation of a~vision-language model is utilized to
% detect hard clusters inevitably generates low-quality descriptions as those latent spaces are not a good proxy for human-understandable (semantic) features in images.
% By focusing on interpretability and extracting human-understandable concepts directly from the images,
% we ensure that the descriptions we generate are accurate and meaningful.
% This approach allows us to effectively capture and describe failure modes in a manner that aligns with human comprehension.
% \SF{I'd somehow merge this paragraph to the first part when we motivate our approach}

\textbf{Summary of Contributions.}  %In this paper, we address the challenge of explaining a model's failure modes in human understandable terms: 
\begin{enumerate}
    \item We propose \method{} to extract and explain failure modes of a model in human-understandable terms by prioritizing interpretability.
    \item
    Using a suite of three automated metrics to evaluate the quality of generated descriptions, we observe improvements in our method compared to strong baselines such as \cite{eyuboglu2022domino} and \cite{jain2022distilling} on various datasets.
    \item We advocate for the concept of putting interpretability first by providing empirical evidence derived from latent space analysis, suggesting that distance in latent space may at times be a misleading measure of semantic similarity for explaining model failure modes.
\end{enumerate}
    
\input{figures/ape/figure}

\section{Literature Review}
\textbf{Failure mode discovery.}
The exploration of biases or challenging subpopulations within datasets,
where a model's performance significantly declines,
has been the subject of research in the field.
Some recent methods for detecting such biases rely on human intervention, which can be time-consuming and impractical for routine usage.
For instance, recent works \citep{tsipras2020imagenet, vasudevan2022does} depend on manual data exploration to identify failure modes in widely used datasets like ImageNet.
Another line of work uses crowdsourcing \citep{nushi2018towards, idrissi2022imagenet, plumb2021finding} or simulators \citep{leclerc20223db} to label visual features, but these methods are expensive and not universally applicable. 
Some researchers utilize feature visualization \citep{engstrom2019adversarial, olah2017feature} or saliency maps \citep{selvaraju2017grad, adebayo2018sanity}
to gain insights into the model's failure, but these techniques provide information specific to individual samples and lack aggregated knowledge across the entire dataset.
Some other approaches \citep{NEURIPS2020bias1, nam2020learning, liu2021just, hashimoto2018fairness} automatically identify failure modes of a model but do not provide human-understandable descriptions for them.

Recent efforts have been made to identify difficult subpopulations and assign human-understandable descriptions to them
\citep{eyuboglu2022domino, jain2022distilling, kim2023biastotext}.
DOMINO \citep{eyuboglu2022domino} uses the latent representation of images in a vision-language model to cluster difficult images and then assigns human-understandable descriptions to these clusters. \cite{jain2022distilling} identify a failure direction in the latent space and assign descriptions to images aligned with that direction.
\textcolor{black}{\cite{Hoffmann2021ThisLL} shows that there is a semantic gap between similarity in latent space and similarity in input space, which can corrupt the output of methods that rely on assigning descriptions to latent embeddings.}
\cite{kim2023biastotext} identify concepts whose presence in images leads to a substantial decrease in the model's accuracy.
\textcolor{black}{Recent studies \citep{johnson2023does, gao2023adaptive} highlight the challenge of producing high-quality descriptions in the context of failure mode detection.}
% \kr{feel free to add any more relevant citations (possibly your papers)!}

% \mm{I think bias2text warrants more discussion, but this is a minor point}
% \kr{can we cite text-to-concept and other of our work somewhere?}

% Prior work 
% Detecting hard subpopulations without providing human-understandable methods. \citep{NEURIPS2020bias1, nam2020learning, liu2021just, hashimoto2018fairness}.
% Human-in-the-loop 
% Crowdsourcing \citep{nushi2018towards, idrissi2022imagenet, plumb2021finding}.
% Simulators \citep{leclerc20223db},
% Feature Visualization \citep{engstrom2019adversarial, olah2017feature}.
% Saliency Map \citep{selvaraju2017grad, adebayo2018sanity}
% MILAN and other \citep{hernandez2021natural, wiles2022discovering}

% \textbf{Image generation.}
% \kr{Mehrdad please add some content here.}

\textbf{Vision-Language and Tagging models.}
Vision-language models have achieved remarkable success through pre-training on large-scale image-text pairs \citep{radford2021learning}.
These models provide a shared vision-language space that can be used to evaluate captions generated to describe images.
Recently, \cite{moayeri2023text, li2023blip} bridged the modality gap, enabling off-the-shelf vision encoders to access the shared vision-language space.
Furthermore, in our method, we utilize models capable of generating tags for input images \citep{huang2023tag2text, zhang2023recognize}.

% \newcommand{\Dtrain}{D_{\text{train}}}
\newcommand{\Dtrain}{\mathcal{D}}
\newcommand{\Dtest}{\mathcal{D'}}

\section{Extracting Failure Modes by Conditioning on Human-understandable Tags}
\label{sec:method}


Undesirable patterns or spurious correlations within the training dataset can lead to performance discrepancies in the learned models.
For instance, in the Waterbirds dataset \citep{waterbirds}, images of landbirds are predominantly captured in terrestrial environments such as forests or grasslands.
Consequently, a model can heavily rely on the background and make a prediction based on that.
Conversely, the model may also rely on cues such as the presence of the ocean, sea, or boats to identify the input as waterbirds.
This can result in performance drops for images where a waterbird is photographed on land or a landbird is photographed at sea.
Detecting failure modes involves identifying groups of inputs where the model's performance significantly declines.
While locating failure inputs is straightforward, \textit{categorizing} them into distinct groups characterized by \textit{human-understandable concepts} is a challenging task.
To explain failure modes, we propose \method. 
Our method consists of two steps: (I) obtaining relevant tags for the images, and (II) identifying failure modes based on extracted tags.

\subsection{Obtaining Relevant Tags}
\label{sec:rel-tags}
% \mm{not perfectly clear what is in this section aside from just applying RAM. the reliability only comes from having a size requirement? seems like making sure the set is minimal is in the next subsection}
% \kr{I changed the title, does it now make sense?}
We start our method by collecting concepts (tags) over the images in the dataset.
For example, for a photo of a fox sampled from ImageNet \citep{deng2009imagenet},
we may collect tags ``orange'', ``grass'', ``trees'', ``walking'', ``zoo'', and others.
To generate these tags for each image in our dataset, we employ the state-of-the-art \textit{Recognize Anything Model (RAM)} \citep{zhang2023recognize, huang2023tag2text},
which is a model trained on image-caption pairs to generate tags for the input images.
RAM makes a substantial step for large models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy.
% We also propose another way of attaining tags in the images using vision-language models such as CLIP in the Appendix. \kr{should we keep this part?}

Let $\Dtrain$ be the set of all images. We obtain tags for all images in $\Dtrain$.
Then, we analyze the effect of tags on prediction in a class-wise manner,
because the effect of tags and patterns on the model's prediction depends on the main object in the images,
e.g., the presence of water in the background improves performance on images labeled as waterbird while degrading performance on landbird images.
For each class $c$ in the dataset, we take the union of tags generated by the tagging model over images of class $c$.
Subsequently, we eliminate tags that occur less frequently than a predetermined threshold. This threshold varies with the dataset size and is set to $50$, $100$, or $200$ in our experiments.
Removing these rare (irrelevant) tags yields a set of tags $T_c$ for each class $c$ in the dataset,
e.g., $T_c = \{\textnormal{``red''}, \textnormal{``orange''}, \textnormal{``snow''}, \textnormal{``grass''}, \dots\}$.
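
As a sketch, this filtering step can be written in set notation. The symbols $\Dtrain_c$ (the images of class $c$ in $\Dtrain$), $\mathrm{tags}(x)$ (the tags that the tagging model assigns to image $x$), and $\tau$ (the frequency threshold) are introduced here only for illustration, and we assume that a tag's frequency is counted as the number of images of class $c$ that carry it:
\[
T_c = \Bigl\{\, t \in \bigcup_{x \in \Dtrain_c} \mathrm{tags}(x) \;:\; \bigl|\{x \in \Dtrain_c : t \in \mathrm{tags}(x)\}\bigr| \geq \tau \,\Bigr\}.
\]
Under this reading, with $\tau = 50$, a tag that appears in $49$ images of class $c$ is discarded, whereas a tag that appears in $50$ images is kept.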

\subsection{Detecting Failure Modes}
\label{sec:detection}

After obtaining tags, we focus on tags whose presence in an image leads to a drop in the model's performance.
For each class $c$, we pick a subset $S_c \subseteq T_c$ of tags and evaluate the model's performance on the images of class $c$ that contain all tags in $S_c$.
We denote this set of images by $I_{S_c}$. $I_{S_c}$ is a coherent image set in the sense that its images share at least the tags in $S_c$.
% Images in $I_{S_c}$ look similar in terms of human recognition \mm{odd wording; maybe say } as they all share tags in $S_c$.


% We note that when $|S_c|$ becomes higher, we get images with more common attributes, thus, we obtain more detailed captions for corresponding group of images.

For $I_{S_c}$
% the set images that have all tags in $S_c$,
to be a \textit{failure mode}, we require that the model's accuracy over the images of $I_{S_c}$ drops significantly:
denoting the model's accuracy over the images of $I_{S_c}$ by $A_{S_c}$
and the model's overall accuracy over the images of class $c$ by $A_c$,
we require $A_{S_c} \leq A_c - a$.
The parameter $a$ determines the severity of the failure modes we aim to detect.
Importantly, we want the tags in $S_c$ to be minimal, i.e., none of them should be redundant.
To ensure this, we require that removing any tag from $S_c$ yields a relatively easier subpopulation.
In essence, the presence of all tags in $S_c$ is essential to obtain that hard subpopulation.
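
As a sketch under assumed notation, the image set and its accuracy can be written explicitly. Here $f$ denotes the trained classifier, $\Dtrain_c$ the images of class $c$, and $\mathrm{tags}(x)$ the set of tags assigned to image $x$; these symbols are introduced only for illustration:
\[
I_{S_c} = \{x \in \Dtrain_c : S_c \subseteq \mathrm{tags}(x)\}, \qquad
A_{S_c} = \frac{1}{|I_{S_c}|} \sum_{x \in I_{S_c}} \mathbf{1}[f(x) = c].
\]
For instance, if the Living17 figures from the introduction are read as $A_c = 86.23\%$ and $A_{S_c} = 41.88\%$ for $S_c = \{\textnormal{``black''}, \textnormal{``hanging''}, \textnormal{``tree branch''}\}$, the drop is $86.23\% - 41.88\% = 44.35\%$, so the condition $A_{S_c} \leq A_c - a$ holds for any $a \leq 44.35\%$.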

% Hence, it is a hyperparameter in our method and we pick this parameter based on the trained model and corresponding dataset. \mm{last sentence should be rewritten or excluded} \kr{why? i think i should mention what this parameter $a$ is.}

More precisely, let $n$ be the cardinality of $S_c$, i.e., $n = |S_c|$.
We require every tag $t \in S_c$ to be necessary,
i.e., if we remove a~tag $t$ from $S_c$, then the resulting group of images should become an easier subpopulation.
Formally, for all $t \in S_c$, $A_{S_c \setminus \{t\}} \geq A_{S_c} + b_n$, where $b_2, b_3, b_4, \dots$
are hyperparameters that determine how necessary the presence of all tags in a~group must be.
We generally pick $b_2 = 10\%$, $b_3 = 5\%$, and $b_4 = 2.5\%$ in our experiments.
% For instance, $b_2 = 10\%$ implies all tags in a group of $2$ tags should bring at least $10\%$ more accuracy drop.
These values help us fine-tune the sensitivity to tag necessity and identify meaningful failure modes. 
% \mm{how is order determined?} \kr{i didn't quite understand your point.}
% \kr{explain one example.}
% Furthermore, we expect the group of images $I_{S_c}$ to have the cardinality of at least $s$ to be reliable and generalizable.
% In fact, we should have a significant number of samples where the model's performance drops to claim the existence of a failure mode.
% By increasing $s$, we find lower number of failure modes while detected failure modes will be more generalizable.
% However, decreasing the value of $s$ yields to more diverse set of failure modes that some of them may not be significant.
Furthermore, we require a minimum of $s$ samples in $I_{S_c}$  for reliability and generalization.
This ensures a sufficient number of instances where the model's performance drops,
allowing us to confidently identify failure modes.
% Adjusting the value of $s$ affects the number and generality of detected failure modes: higher $s$ means fewer but more general failure modes,
% while lower $s$ results in more diverse but potentially less significant failure modes.
Figure~\ref{fig:teaser} shows some of the obtained failure modes.
% \kr{should i give an example with assigning concrete values to hyperparams?} \mm{No i think its ok, you did a good job of explaining it of what the hyperparam does and it is intuitive.}

% As an example, $b_2 = 10\%$, $|S_c| = 2$, $s = 30$, and $a = 40\%$ determines that subset $S_c$ of size $2$ is detected as failure mode if (1) $A_{S_c} \leq A_c - 40\%$,
% (2) $|I_{S_c}| \geq 30$, and (3) for all subsets $S' \subset S_c$ such that $|S'| = 1$, $A_{S'} \geq A_{S_c} + 10\%$.
% This indicates that appearance of both tags in $S_c$ is necessary.

\textbf{How to obtain failure modes.} We generally use \textit{Exhaustive Search} to obtain failure modes.
In exhaustive search, we systematically evaluate combinations of tags to identify failure modes,
employing a brute-force approach that covers all combinations of at most $l$ tags.
More precisely, we consider all subsets $S_c \subseteq T_c$ such that $|S_c| \leq l$ and evaluate the model's performance on $I_{S_c}$.
As mentioned above, we detect $S_c$ as a failure mode if (1) $|I_{S_c}| \geq s$, (2) the model's accuracy over $I_{S_c}$ is at most $A_c - a$,
and (3) $S_c$ is minimal, i.e., for all $t \in S_c$, $A_{S_c \setminus t} \geq A_{S_c} + b_{|S_c|}$.
The final output of the method consists of all sets $I_{S_c}$ that satisfy these conditions; the \textbf{description} of each such group
consists of the \textbf{class name ($c$)} and \textbf{all tags in $S_c$}.
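
As a concrete illustration of these three conditions, consider the following hypothetical instance; the tags $t_1, t_2$ and all accuracy values are invented for illustration and are not taken from our experiments.
Let $S_c = \{t_1, t_2\}$ with $a = 30\%$, $s = 30$, and $b_2 = 10\%$, and suppose that $A_c = 85\%$, $A_{S_c} = 50\%$, $|I_{S_c}| = 42$, $A_{\{t_1\}} = 72\%$, and $A_{\{t_2\}} = 64\%$. Then
\begin{align*}
&(1)\quad |I_{S_c}| = 42 \geq s = 30, \\
&(2)\quad A_{S_c} = 50\% \leq A_c - a = 85\% - 30\% = 55\%, \\
&(3)\quad \min\left(A_{\{t_1\}}, A_{\{t_2\}}\right) = 64\% \geq A_{S_c} + b_2 = 50\% + 10\% = 60\%,
\end{align*}
so $S_c$ would be reported as a failure mode. If instead $A_{\{t_2\}} = 55\%$, condition (3) would fail for $t = t_1$ (since $S_c \setminus t_1 = \{t_2\}$), and $S_c$ would be rejected because $t_1$ is not necessary.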

We note that the aforementioned method runs with a complexity of $O\left(|T_c|^l |\Dtrain|\right)$.
However, $l$ is generally small: for a failure mode to be generalizable, we mainly consider cases where $l \leq 4$.
Furthermore, in our experiments over different datasets, $|T_c| \approx 100$; thus, the exhaustive search is relatively efficient.
% Furthermore, we use branch-and-bound to search through the space of combinations which makes the overall execution more efficient.
For instance, running exhaustive search ($l=4, s=30, a=30\%$) on the Living17 dataset, which has $17$ classes and $88400$ images, yields $132$ failure modes
in under $5$ minutes.
\textcolor{black}{We refer to Appendix~\ref{sec:greedy} for more efficient algorithms and Appendix~\ref{app:hyperparams} for more detailed explanation of \method{}'s hyperparameters.}
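
To give a rough sense of the size of the search space, assume for illustration that $|T_c| = 100$ exactly and $l = 4$. The number of non-empty candidate subsets per class is then
\[
\sum_{k=1}^{4} \binom{100}{k} = 100 + 4950 + 161700 + 3921225 = 4087975 \approx 4.1 \times 10^{6},
\]
which is well below the bound $|T_c|^l = 10^{8}$ that appears in the complexity above. This is only a back-of-the-envelope count; the actual number of tags varies across classes and datasets.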
% \mm{if you can say how long it takes (``for example, we detect X failure modes on a K-class dataset with Y training images in only Z minutes''), this could be very helpful -- complaining ab exhaustive search is an easy thing for reviewer to do. You can also maybe mention that you also developed a greedy search, which we detail in appendix, but ultimately the exhaustive search was sufficiently fast.} \kr{I think now it is okay?}

% However, we also propose another heuristic method that does not cover all possibilities but can explore the space in a more detailed manner by involving more than $l$ tags in the subset of tags it considers. We call this method \textit{Greedy Search}. \kr{Some explanations and some results for Greedy Search ...}

% In the next section, we discuss the generalizability of our approach. We want the failure modes detected by our method to generalize well, i.e., the model's performance significantly drops on unseen images when those images are detected to have a particular subset of tags.


\textbf{Experiments and Comparison to Existing Work.}
We run experiments on models trained on
% some classes of ImageNet \mm{unclear, which classes and why?} \kr{{I think i can remove, i ran experiments on ImageNet just for visualization}} \citep{deng2009imagenet}, 
Living17, NonLiving26, Entity13 \citep{santurkar2020breeds}, 
Waterbirds \citep{waterbirds}, and CelebA \citep{liu2015faceattributes} (for age classification).
We refer to Appendix~\ref{sec:app-training} for model training details and the different hyperparameters we used for failure mode detection.
We refer to Appendix~\ref{app:details} for the full results of our method on different datasets.
We use two recent failure mode detection approaches, DOMINO~\citep{eyuboglu2022domino} and Distilling Failure Directions~\citep{jain2022distilling}, as strong baselines
and compare our approach with them.

\section{Evaluation}


% The output of a~human-understandable failure mode extractor on this dataset is a set of images with their corresponding captions.
% More formally, let $I_1$, $I_2$, ..., $I_m$ be the group of inputs on which the model does not perform well.
% For example, $I_j$ includes some images that usually have similar visual attributes and model's accuracy over these images drop.
% Besides, we let $T_1, T_2, ..., T_m$ to be the corresponding captions for each group of images. Indeed, $T_j$ describes images in $I_j$.
% We note that the value of $m$ (number of detected failure modes) usually depends on the hyperparameters of the approach,
% e.g.,
% minimum accuracy drop ($a$), values for $b_2, b_3, ...$ and group minimum size ($s$) in our method.
Let $\Dtrain$ be the dataset on which we detect failure modes of a~trained model.
The result of a~human-understandable failure mode extractor on this dataset consists of sets of images, denoted as $I_1, I_2, \ldots, I_m$, along with corresponding descriptions, denoted as $T_1, T_2, \ldots, T_m$.
Each set $I_j$ comprises images that share similar attributes and on which the model's accuracy drops noticeably.
The number of detected failure modes, $m$, is influenced by various hyperparameters; in our method, these are the minimum accuracy drop ($a$), the values of $b_2, b_3, \ldots$, and the minimum group size ($s$).

One of the main goals of detecting failure modes in human-understandable terms is to generate high-quality captions for hard subpopulations.
We note that these methods should also be evaluated in terms of coverage, i.e.,
what portion of failure inputs is covered, along with the performance gap in detected failure modes.
However, all these methods extract hard subpopulations on which the model's accuracy significantly drops,
and coverage depends on the dataset and the hyperparameters of the method;
thus, we mainly focus on the generalizability of our approach and the quality of its descriptions.
% \mm{you discuss more than just generalizability in this section (e.g. clip score for description quality)! i'd try to give more of an overview of what is to come in the section so that the flow is better.} \kr{didn't quite get the point.}

\subsection{Generalization on Unseen Data}
\label{subsec:gen_unseed}

In order to evaluate the generalizability of the resulting descriptions,
we take a dataset $\Dtest$ of unseen images and retrieve the images in it that are relevant
to each of the captions $T_1, T_2, \ldots, T_m$, thus obtaining $I'_1, I'_2, \ldots, I'_m$.
That is, $I'_j$ includes the images in $\Dtest$ that are relevant to $T_j$.
If the captions describe hard subpopulations, then we expect $I'_1, I'_2, \ldots, I'_m$ to be hard subpopulations as well.
Additionally, since $\Dtrain$ and $\Dtest$ share the same distribution, we anticipate the accuracy drop in $I_j$ to closely resemble that in $I'_j$.
% To evaluate the generalizability of our approach in this framework, 
% we run the method explained in Section~\ref{sec:method} on $\Dtrain$
% to obtain failure modes and evaluate model's performance on them over images of 
% $\Dtest$.

In our method, for a detected failure mode $I_{S_c}$, we obtain $I'_{S_c}$ by collecting the images of $\Dtest$
that have all tags in $S_c$.
For example, if the appearance of tags ``black'', ``snowing'', and ``forest'' is detected as a failure mode for class ``bear'',
we evaluate the model's performance on images of ``bear'' in $\Dtest$ that include those three tags, expecting a~significant accuracy drop for the model on those images.
As seen in Figure~\ref{fig:generalization}, \method{} shows a good level of generalizability.
% \mm{rewrite these two sentences} \kr{i made a minor change.}
We refer to Appendix~\ref{sec:app-params} for generalization on other datasets with respect to different hyperparameters ($s$ and $a$). While all our detected failure modes generalize well,
we observe stronger generalization when using more stringent hyperparameter values (high $s$ and $a$), though it comes at the cost of detecting fewer modes.
% \kr{we previously mentioned that before.}

\input{figures/generation/figure}

% We note that in other existing methods \citep{eyuboglu2022domino, jain2022distilling}, 
% no way is presented to evaluate the generalization directly from text. Indeed,
% by having access to description of detected failure modes,
% it is unclear how to obtain hard subpopulations over unseen images.
% We refer to Appendix~\ref{app:dom_gen} for more details on that.


In contrast, existing methods \citep{eyuboglu2022domino, jain2022distilling} do not provide a direct way to assess generalization from text descriptions alone. See Appendix~\ref{app:dom_gen} for more details.

% \input{figures/generation/figure}

% \subsection{Image Generation}

% By obtaining detailed tags for each failure mode, we leverage language models  to generate captions that describe objects and other tags found in the images. These captions can serve as prompts for text-to-image generative models, enabling the synthesis of artificial images specific to the failure modes. To accomplish this, we employ the method outlined in \citep{vendrow2023dataset}, which utilizes a denoising diffusion model \citep{ho2020denoising, rombach2022high}. The process involves fine-tuning a text embedding using a collection of input images and generating similar images during inference by employing the trained embedding. For each class within the Living17 dataset, we fine-tune an embedding and replace the text embedding of the class name (e.g., "parrot") in the failure mode description with the fine-tuned embedding.

% Figure~\ref{fig:gen_example_fig_living17} demonstrates that the generated images generally adhere to the patterns and tags present in real images. This result is primarily attributed to the detailed descriptions produced by our method for each failure mode.


% These synthetic images serve as a means to evaluate the failure modes. Typically, we anticipate that the classifier's accuracy on generated images using failure mode descriptions will be lower compared to images generated using the base class description. For both our method and \citep{jain2022distilling}, we generate 50 images per failure mode, as well as for the base descriptions (i.e., "a photo of a $\langle$class\_name$\rangle$") of each class. The accuracy drop of the output failure modes for both methods, w.r.t the accuracies on generated images with the base descriptions, is illustrated in Figure~\ref{fig:gen_drop_acc_fig_living17}. For this figure, among the failure modes that contain at least 30 \ms{change its format} images in the Living-17 training dataset, we filtered out failure modes with at least two common tags to get more diverse failure modes, and chose at most 3 of them with the highest accuracy drops on training data of Living-17. Because of the filtering of failure modes based on the number of their images, some classes can have less than 3 output failure modes.  class. \ms{Why are we not considering domino}
% Note that while our method has a better performance compared to \citep{jain2022distilling}, it also generates more failure modes per
% \kr{we refer to Appendix~\ref{} for more details on Image Generation -:?}

\newcommand{\simi}{\text{sim}}

\subsection{Generalization on Generated Data}
\label{subsec:generated_data}

\textcolor{black}{
In this section, we validate \method{} on synthetic images.
To utilize image generation models, we employ language models to create descriptive captions for objects and tags associated with failure modes in images.
We note that language models are used only for validation on synthetic images; they are not part of the \method{} framework.
We discuss the limitations of using language models in Appendix~\ref{app:limitation}.
}
These captions serve as prompts for text-to-image generative models, enabling the creation of artificial images that correspond to the identified failure modes.
To achieve this, we adopt the methodology outlined in \cite{vendrow2023dataset}, which leverages a denoising diffusion model \citep{ho2020denoising, rombach2022high}.
We fine-tune the generative model on the Living17 dataset to generate images that match the distribution of the data that the classifier is trained on.

For each class in the Living17 dataset, we apply our approach to identify two failure modes (hard subpopulations) and two success modes (easy subpopulations).
We then employ ChatGPT\footnote{ChatGPT 3.5, August 3 version} to generate descriptive captions for these groups.
Subsequently, we generate $50$ images for each caption and assess the model's accuracy on these newly generated images.
We refer to Appendix~\ref{subsec:image-gen} for more details on this experiment and for the average discrepancy in accuracy between the success modes and failure modes, which further validates \method{}.
Figure~\ref{fig:generation_imgs} provides both accuracy metrics and sample images for three hard and three easy subpopulations. 

% Although the generated images may not always precisely match the input caption, and their reliability is not absolute, the substantial accuracy gap between success and failure mode images
% confirms the overall effectiveness of those identified failure modes as well as the ability of our approach to describe detailed images.
% We refer to Appendix~\ref{subsec:image-gen} for more details on this section. \kr{talk about class dog}

\subsection{Quality of Descriptions}
\label{subsec:qual_captions}
\begin{thisnote}

Within this section, our aim is to evaluate the quality of the descriptions for the identified failure modes. In contrast to Section~\ref{subsec:generated_data}, where language models were employed to create sentence descriptions using the tags associated with each failure mode, here we combine tags and class labels in a bag-of-words manner. For instance, when constructing the description for a failure mode in the ``ape'' class with the tags ``black'' + ``branch'', we formulate it as ``a photo of ape black branch''. We discuss this further in Appendix~\ref{app:limitation}.

% In this section, we do not use language models to generate descriptions for failure modes detected by \method{}.
% In fact, we put tags and class label in a bag-of-word manner to obtain and evaluate descriptions.
% For instance, to get the description of a failure mode for class ``ape" with tags ``black" + ``branch",
% we use “a photo of ape black branch” as its description.    
\end{thisnote}

In order to evaluate the quality of descriptions,
we propose a suite of three complementary automated metrics that utilize vision-language models (such as CLIP)
as a proxy to obtain image-text similarity \citep{Hessel2021CLIPScoreAR}. 
Let $t$ be the failure mode's description, $f_{\text{text}}(t)$ denote the normalized embedding of text prompt $t$ and
$f_{\text{vision}}(x)$ denote the normalized embedding of an image $x$. 
The similarity of image $x$ to this failure mode's description $t$ is the dot product of the image and text representations in the shared vision-language space.
More precisely, 
$\simi(x, t) := \langle f_{\text{vision}}(x), f_{\text{text}}(t)\rangle$. 
% \begin{align*}
% \simi(x, t) := .
% \end{align*}

For a high-quality failure mode $I_j$ and its description $T_j$, we wish $T_j$ to be similar to the images in $I_j$; thus,
we consider the average \textit{similarity} between the images in $I_j$ and $T_j$.
We further expect a~high level of \textit{coherency} among all images in $I_j$, i.e., these images should all share multiple semantic attributes described by the text; thus, we wish the standard deviation of the similarity scores between the images in $I_j$ and $T_j$ to be low.
Lastly, we expect generated captions to be \textit{specific}, capturing the essence of the failure mode without including distracting irrelevant information.
That is, caption $T_j$ should describe only the images in $I_j$ and not images outside of it.
As a result, we consider the AUROC between the similarity scores of images inside the failure mode $(I_j)$ and those of some randomly sampled images outside of it.
We note that in existing methods as well as in our method,
all images in a failure mode have the same label,
so we sample from images outside of the group but with the same label.
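
One way to write these three quantities for a single failure mode is sketched below; the symbols $\mu_j$, $\sigma_j$, and $O_j$ (the set of randomly sampled images outside $I_j$ with the same label) are introduced here only for illustration, and other normalizations are possible.
\begin{align*}
\mu_j &= \frac{1}{|I_j|} \sum_{x \in I_j} \simi(x, T_j), \\
\sigma_j &= \sqrt{\frac{1}{|I_j|} \sum_{x \in I_j} \left(\simi(x, T_j) - \mu_j\right)^2}, \\
\text{AUROC}_j &= \Pr_{x \in I_j,\, x' \in O_j}\left[\simi(x, T_j) > \simi(x', T_j)\right],
\end{align*}
where $x$ and $x'$ are drawn uniformly at random and ties, if any, are counted with weight $1/2$. Under this formulation, a good description has high $\mu_j$, low $\sigma_j$, and $\text{AUROC}_j$ close to $1$.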

In Figure~\ref{fig:auroc}, we show (1) the average similarity score, i.e., the mean of $\simi(x, T_j)$ over all $I_j$ and $x \in I_j$,
(2) the standard deviation of the similarity score, i.e., the standard deviation of $\simi(x, T_j)$ over all $I_j$ and $x \in I_j$, and
(3) the AUROC between the similarity scores of images inside failure modes to their corresponding descriptions and
those of some randomly sampled images outside of the failure modes.
As shown in Figure~\ref{fig:auroc}, \method{} improves over DOMINO \citep{eyuboglu2022domino} in terms of all three metrics
(AUROC, average similarity, and standard deviation) on different datasets.
It is worth noting that this improvement comes even though DOMINO chooses a~text caption for the failure mode \textit{\textbf{to maximize the similarity score in latent space}}.
We choose hyperparameters for DOMINO that yield roughly the same number of failure modes as detected by \method{}.
Results in Figure~\ref{fig:auroc} show that \method{} provides better descriptions than DOMINO for the detected failure modes.
In Appendix~\ref{app:qual} we provide more details on these experiments.
% and 
% in Appendix~\ref{app:dom_out}, we present some of DOMINO's outputs for reference.
Due to the limitations of \cite{jain2022distilling} in automatically generating captions, we cannot conduct extensive experiments with it on various datasets.
More details and results on this can be found in Appendix~\ref{subsec:madry-quality}.

% The process involves fine-tuning a text embedding using a collection of input images and generating similar images during inference by employing the trained embedding. For each class within the Living17 dataset, we fine-tune an embedding and replace the text embedding of the class name (e.g., "parrot") in the failure mode description with the fine-tuned embedding.


% In addition to generalization on the test data, we evaluate the quality of our failure mode captions on new generated images from the same distribution. To generate images from the same distribution as our training data using denoising diffusion models \citep{NEURIPS2020_4c5bcfec}, we leverage the method used in \cite{vendrow2023dataset} and fine-tune a pre-trained diffusion model on Living17 images. Afterwards, for each class of Living17, we identify two failure modes and two success modes using our method, and generate a caption 



\section{On Complexity of Failure Mode Explanations}

% \input{figures/num_of_tags/figure}

% \input{figures/num_of_tags/table}

\input{figures/detailed_descriptions/updated_figure}
% \input{figures/detailed_descriptions/figure}

We note that the main advantage of our method is its more faithful interpretation of failure modes.
This stems from
(1) putting interpretability first, i.e., we start by assigning interpretable tags to images and then identify hard subpopulations, and
(2) considering combinations of several tags, which leads to a higher number of attributes (tags) in the description of each group.

\subsection{Do We Need to Consider Combination of Tags?}
\label{subsec:num-of-tags}
We shed light on the number of tags in the failure modes detected by our approach.
Whereas Bias2Text \citep{kim2023biastotext} finds biased concepts on which the model's behavior changes,
we observe that sometimes it is the joint appearance of several tags (concepts) that leads to a severe failure mode.
As an example, we refer to Table~\ref{tab:num-of-tags-exp},
where the appearance of all $3$ tags together leads to a significant accuracy drop, while single tags and pairs of them show relatively better performance.

\input{figures/num_of_tags/table}

% Furthermore, in the procedure we use to obtain failure modes, we require tags to be necessary.
% In fact, for a detected failure mode we know that the removal of any of the tags leads to an easier subpopulation.
% Hence, having more tokens in a failure mode not only brings more detailed description but also describes harder subpopulations.
In \method{}, we emphasize the necessity of tags.
Specifically, for any detected failure mode, the removal of any tag would result in an easier subpopulation.
Consequently, failure modes with more tags not only provide a more detailed description of their images but also characterize more challenging subpopulations.
% \kr{is this misleading?}
% Table~\ref{tab:num-of-tags} shows the average accuracy drop on unseen images for groups identified by $3$ tags
% as well as the average accuracy drop on groups identified by a subset of $2$ tags or a single tag of those failure modes.
% We do see a~significant difference between these numbers which validates the idea that
% involving more tags leads to detecting harder subpopulations.
Table~\ref{table:number-of-tags2} presents the average accuracy drop on unseen images for groups identified by three tags,
compared to the average accuracy drop on groups identified by subsets of two tags or even a single tag
from those failure modes.
These results clearly demonstrate that involving more tags leads to the detection of more challenging subpopulations.
\newpage
% \subsection{Do the Clustering-Based Methods using Vision-Language Feature Space Yield to High-Quality Descriptions?}
\subsection{Clustering-Based Methods May Struggle in Generating Coherent Output}
\label{subsec:rev}
\begin{wraptable}{r}{0.5\textwidth}
\caption{Statistics of the distance between two points in CelebA conditioned on the number of shared tags.
Distances are reported in the CLIP ViT-B/16 representation space.
The last column shows the probability that the distance between two sampled images with at least $d$ common tags
is larger than that of two randomly sampled images.
}
\label{tab:celeba-stats}
% \vskip -0.15in
\centering
\resizebox{1\linewidth}{!}{
\begin{tabular}{c|c|c|c}
\toprule
$\#$ of shared tags $\geq d$ & mean & standard deviation & Probability\\
\midrule
\hline
$d = 0$ & 9.49 & 0.98 & 0.50\\ \hline
$d = 1$ & 9.47 & 1.00 & 0.49\\ \hline
$d = 3$ & 9.23 & 1.00 & 0.42\\ \hline
$d = 5$ & 8.89 & 1.21 & 0.34 \\ \hline
$d = 7$ & 8.32 & 1.80 & 0.25\\ \hline
\bottomrule
\end{tabular}}
\vskip -0.1in
\end{wraptable}

We empirically analyze the reverse direction of detecting human-understandable failure modes.
We note that in recent work where the goal is to obtain interpretable failure modes, %\citep{jain2022distilling, eyuboglu2022domino, deon2021spotlight},
those groups are found by clustering images in the latent space.
Then, when a group of images or a direction in the latent space is found, these methods leverage the shared space of vision-language to find the text that best describes the images inside the group.


% \input{figures/annots/rev}
We argue that these approaches, based on distance-based clusters in the representation space, may produce less detailed descriptions.
This is because the representation space does not always align perfectly with the semantic space. Even points close to each other in the feature space may differ in certain attributes, and conversely, points sharing human-understandable attributes may not be proximate in the feature space.
Hence, these approaches may fail to generate high-quality descriptions, as their detected clusters in the representation space may contain images with differing semantic attributes.

% In order to empirically evaluate this idea, we consider two attributed datasets CelebA \citep{liu2015faceattributes} and CUB-200 \citep{WahCUB_200_2011}. CelebA includes $40$ human-understandable tags for each of the images.
% CUB-200 is a dataset of birds that contains $312$ tags for each of the images.
% We note that all those tags refer to semantic attributes.
To empirically test this idea,
we use two attribute-rich datasets: CelebA \citep{liu2015faceattributes} and CUB-200 \citep{WahCUB_200_2011}.
CelebA features $40$ human-understandable tags per image, while CUB-200, a dataset of birds, includes $312$ tags per image, all referring to semantic attributes.
We use CLIP ViT-B/16 \citep{radford2021learning} and examine its representation space in terms of datasets' tags.
Table~\ref{tab:celeba-stats} shows the statistics of the distance between the points conditioned on the number of shared tags.
As seen in Table~\ref{tab:celeba-stats}, although the average distance between points with more common tags slightly decreases,
the standard deviation of the distance between points is high. In fact, points with many common tags can still be far away from each other.
The last column in Table~\ref{tab:celeba-stats} shows the probability that
the distance between two points with at least $d$ shared tags is larger than the distance between two randomly sampled points.
Even when at least $5$ tags are shared between two points, with probability $0.34$ their distance is larger than that of two random points.
% \mm{great!!!}
Thus, if we plant a failure mode on a group of images sharing a subset of tags,
these clustering-based methods cannot find a~group consisting of \emph{only} those images; they will inevitably include other irrelevant images,
leading to an incoherent failure mode set and, consequently, a low-quality description.
This can be observed in Appendix~\ref{app:dom_out} where we include DOMINO's output.

We also run another experiment to support our hypothesis that distance-based clustering methods cannot fully capture semantic similarities.
We randomly pick an image $x$ and find the $N$ closest images to $x$ in the feature space.
Let $C$ be the set of these images.
We inspect this set in terms of the number of tags that commonly appear in its images, since recent methods \citep{eyuboglu2022domino, deon2021spotlight, jain2022distilling}
take the average embedding of the images in $C$ and then assign a text to describe the images of $C$.
Table~\ref{tab:celeba-stats-rev} shows the average number of tags that appear in at least $\alpha N$ images of the set $C$ (averaged over many different sampled points $x$).
If the representation space is a~good proxy for the semantic space, then we expect a~large number of shared tags in close proximity to the point $x$.
At the same time, for the point $x$, we find the maximum number of tags that appear in $x$ and in at least $N$ other images.
This is the number of shared tags in close proximity to the point $x$, but in the semantic space.
As shown in Table~\ref{tab:celeba-stats-rev}, the average number of shared tags in the semantic space is significantly larger than the average number of shared tags in the representation space.
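
Under one reading of this experiment, the two quantities can be written as follows, where $\mathrm{tags}(x)$ denotes the set of ground-truth tags of image $x$; this notation and the names $n_{\text{rep}}(x)$ and $n_{\text{sem}}(x)$ are introduced here only for illustration:
\begin{align*}
n_{\text{rep}}(x) &= \left|\left\{ t : \left|\{x' \in C : t \in \mathrm{tags}(x')\}\right| \geq \alpha N \right\}\right|, \\
n_{\text{sem}}(x) &= \max\left\{ |S| : S \subseteq \mathrm{tags}(x),\ \left|\{x' \neq x : S \subseteq \mathrm{tags}(x')\}\right| \geq N \right\}.
\end{align*}
Here $n_{\text{rep}}(x)$ counts the tags shared by a large fraction of the neighbors of $x$ in the representation space, while $n_{\text{sem}}(x)$ is the size of the largest tag set of $x$ that is fully shared with at least $N$ other images.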
% We refer to Appendix~\ref{app:cub200} for the same results on CUB-200 dataset.

% This shows that a cluster in the latent space may not contain many images with common semantic attributes. \mm{I think this is a stronger claim than the experiment just showed}
% Hence, methods that find failure modes using clustering in latent space cannot detect high quality human-understandable failure modes.
% Note that tagging models are getting better and better and we can expect superior methods in near future, thus, putting interpretability first leads to better descriptions for a subgroup of images. \mm{a bit underwhelming of an ending; I'd prioritize a sharp conclusion over this last experiment. You can mention the finding of the experiment and then refer to the appendix.}

% for a tag $t$ to be \textit{retrievable} in the text, we can assume it should appear in a~fraction of at least $\alpha$ images in $C$.
% Let $n_{\text{rep}}$ be the number of retrievable tags.
% We also brute-force over a subset of tags $S$ in a way that $x$ has all tags of $S$ and at least $2N$ images have all tags of $S$.
% We denote by $n_{\text{ideal}}$ a lowerbound on the size of $S$.
% Figure~\ref{fig:annots} shows the distribution of $n_{\text{ideal}}$ and $n_{\text{rep}}$ over a group of randomly sampled images $x$.
% We pick $\alpha=0.75$ and $N=100$ in our experiments.

% By comparing $n_{\text{rep}}$ and $n_{\text{ideal}}$, we get the sense over the amount of lost tags. In fact, if there is a huge gap between $n_{\text{rep}}$ and $n_{\text{ideal}}$ and we plant a failure mode on images which include tags of $S$, metric-based approaches can describe the failure mode with around $n_{\text{rep}}$ tags while it could be described by around $n_{\text{ideal}}$ tags if we had the ideal tagging model.


\section{Conclusions}
In this study, drawing from the observation that current techniques in human-comprehensible failure mode detection sometimes produce incoherent descriptions,
along with empirical findings related to the latent space of vision-language models,
we introduced \method{}, a novel approach that prioritizes interpretability in failure mode detection.
Our results demonstrate that, for the detected failure modes, it generates descriptions that are more similar to the images, more coherent, and more specific than those of existing methods.

\section*{Acknowledgments}
    This project was supported in part by a grant from an NSF CAREER AWARD 1942230, ONR YIP award N00014-22-1-2271, ARO’s Early Career Program Award 310902-00001, Meta grant 23010098, HR00112090132 (DARPA/RED), HR001119S0026 (DARPA/GARD), Army Grant No. W911NF2120076, NIST 60NANB20D134, the NSF award CCF2212458, an Amazon Research Award and an award from Capital One.


% \newpage
\bibliography{arxiv}
\bibliographystyle{arxiv}

\newpage
\appendix
\input{appendix}

\end{document}
