\section{Appendix}

\begin{thisnote}

\subsection{Limitations}
\label{app:limitation}
\paragraph{Utilizing pre-trained tagging models}
In this work, we use RAM \citep{zhang2023recognize} to extract tags from images and then use those tags to identify hard subpopulations.
The accuracy of RAM, a state-of-the-art tagging model, is studied in detail in the original work \citep{zhang2023recognize}.
The authors report strong performance on several common datasets and tasks.
We also conducted a small-scale human validation study to measure the accuracy of RAM on the Living17 dataset (see Appendix~\ref{app:ram-eval} for details) and observed strong performance.
However, we acknowledge that \method{}, like other existing methods, relies on an auxiliary model to bridge the gap between the vision and language modalities. Its performance is therefore affected and limited by that auxiliary model.
% e.g., existing works \citep{jain2022distilling, eyuboglu2022domino, deon2021spotlight} highly depend on the performance CLIP.
Notably, our use of the auxiliary (tagging) model is closely aligned with the task it is optimized for: tagging models are trained to detect various objects and concepts in images and to assign tags accordingly. As tagging models advance, we expect \method{}'s effectiveness to improve.

\paragraph{\method{} for specific domains}
\method{} relies on a tagging model to extract informative tags from images, which makes it less effective on specialized domains for which the tagging model is not well suited.
However, we believe it is highly likely that tagging models like RAM fine-tuned on specific domains will appear soon,
similar to how domain-specific counterparts of CLIP exist for medical data, e.g., ConVIRT \citep{zhang2022contrastive}.
With such fine-tuned tagging models, \method{} would remain effective for failure mode extraction on specific domains such as medical imaging.
% We note that this the case in existing work \citep{eyuboglu2022domino, deon2021spotlight, jain2022distilling} as they also need vision-language models finetuned for specific domains to remain effective.

\paragraph{Tag interpretation}
We acknowledge that some tags used by \method{}, particularly adjectives, can be difficult to interpret.
For instance, the tag ``white" may not always refer to the primary object in the image,
as its detection could be influenced by background elements or other objects present.
Nevertheless, \method{} remains useful for highlighting cases where model failures occur at the intersection of specific tags,
even if those tags are not exclusively tied to the main object in the image.
Because of this limitation, we do not use language models to generate descriptions from tags (except when validating \method{} with image generation): language models may associate tags with the class label, which is not necessarily correct. Therefore, when evaluating the quality of descriptions, we use a bag-of-words approach to obtain \method{}'s results.
\end{thisnote}

\subsection{Training Models}
\label{sec:app-training}
We train a model on each dataset we are considering.
\begin{itemize}
    \item For ImageNet, we use the standard pretrained ResNet50 model \citep{he2015resnet}.
    \item For Living17, Entity13, and NonLiving26, we use a DINO self-supervised model \citep{caron2021emerging} with a ResNet50 backbone and fine-tune it over Living17.
        We used SGD with the following hyperparameters to fine-tune the model for $5$ epochs.
        \begin{itemize}
            \item lr $= 0.001$
            \item momentum $= 0.9$
        \end{itemize}
    \item For CelebA (age classification), we use a pretrained ResNet18 model \citep{he2015resnet}, fine-tuned for $5$ epochs using SGD with the following hyperparameters.
        \begin{itemize}
            \item lr $= 0.001$
            \item momentum $= 0.9$
            \item weight decay $= 5 \times 10^{-4}$
        \end{itemize}
        We note that the classifier is trained so that it is biased toward images of young women and old men.
    \item For Waterbirds, we fine-tuned a pretrained ResNet18 model \citep{he2015resnet} for $20$ epochs using SGD with the following hyperparameters.
        \begin{itemize}
            \item lr $= 0.001$
            \item momentum $= 0.9$
            \item weight decay $= 5 \times 10^{-4}$
        \end{itemize}
\end{itemize}
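
For reference, the listed hyperparameters enter the optimizer as sketched below. This assumes the common formulation of SGD with momentum and $L_2$ weight decay (as implemented, e.g., in PyTorch); the symbols $\theta_t$ (parameters), $v_t$ (momentum buffer), $g_t$ (mini-batch gradient), $\eta$ (learning rate), $\mu$ (momentum), and $\lambda$ (weight decay) are notation introduced here for illustration only.
\[
    v_{t+1} = \mu\, v_t + g_t + \lambda\, \theta_t,
    \qquad
    \theta_{t+1} = \theta_t - \eta\, v_{t+1}.
\]
In this notation, $\eta = 0.001$ and $\mu = 0.9$ for all fine-tuned models, and $\lambda = 5 \times 10^{-4}$ for CelebA and Waterbirds; no weight decay is listed for the DINO model.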

\subsection{Detailed Results of Detected Failure Modes}
\label{app:details}
Our method on \textbf{Living17}:\\
\textbf{Results}: $36$ failure modes with $1$ tag, $68$ failure modes with $2$ tags, $24$ failure modes with $3$ tags, and $4$ failure modes with $4$ tags.\\
\textbf{Hyperparameters}: $s=30$, $a=30$, $b_2=10\%$, $b_3=5\%$, and $b_4=2.5\%$.

\begin{itemize}
    \item class ``wolf" (Accuracy: $83.69\%$):     
        
        hide ($54.86\%$); ----
        floor + hide ($38.71\%$); ----
        den ($49.06\%$); ----
        den + hide ($22.22\%$); ----
        lay + red ($41.86\%$); ---- 
        den + lay ($29.27\%$); ----
        hide + lay ($41.51\%$); ----
        night ($44.44\%$); ----
        floor + cub ($48.48\%$); ----
        floor + den + cub ($22.22\%$); ---- 
        grass + stare + hide + red ($66.67\%$); ----
        log + red ($62.86\%$); ----
        grass + tree + brown ($63.64\%$);

    \item class ``cat" (Accuracy: $89.58\%$):
    	enclosure ($52.63\%$); ----
    	zoo ($50.00\%$); ----
    	habitat ($34.09\%$); ----
    	grassy ($63.64\%$); ----
    	tiger + walk ($28.57\%$); ----
    	tiger + grass ($58.54\%$); ----
    	bengal tiger + walk ($28.57\%$); ----
    	bengal tiger + grass ($60.00\%$); ----
    	floor + tree ($57.14\%$); ----
    	log ($56.76\%$); ----
    	white + grass ($63.64\%$); ----
    	tiger + tree ($37.50\%$); ----
    	hide + stand ($82.50\%$);
\end{itemize}

Our method on \textbf{Entity13}:\\
\textbf{Results}: $45$ failure modes with $1$ tag, $45$ failure modes with $2$ tags, and $18$ failure modes with $3$ tags.\\
\textbf{Hyperparameters}: $s=100$, $a=30$, $b_2=10\%$, $b_3=5\%$, and $b_4=2.5\%$.
\begin{itemize}
    \item class ``wheeled vehicle" (Accuracy: $88.05\%$):

    	shopping cart ($53.03\%$); ----
	    floor + cart ($54.17\%$); ----
	    sit + shopping cart ($43.90\%$); ----
	    cage ($44.20\%$); ----
	    basket ($50.45\%$); ----
	    man + pole ($65.79\%$); 

    \item class ``produce, green goods, green groceries, garden truck" (Accuracy: $92.91\%$):
    	floor + food ($74.77\%$);

     \item class ``accessory, accoutrement, accouterment" (Accuracy: $63.98\%$):
        swimwear + pose ($18.18\%$); ----
        stand + pose + black ($31.15\%$); ----
        brunette ($19.23\%$); ----
        swimwear + brunette ($4.20\%$); ----
        person + graduation ($33.92\%$);
\end{itemize}

Our method on \textbf{CelebA} (Young vs. Old classification):\\
\textbf{Results}: $45$ failure modes with $1$ tag, $27$ failure modes with $2$ tags, and $11$ failure modes with $3$ tags.\\
\textbf{Hyperparameters}: $s=100$, $a=30$, $b_2=10\%$, $b_3=5\%$, and $b_4=2.5\%$.

\begin{itemize}
    \item class ``young" (Accuracy: $80\%$):

        beard ($32.34\%$); ----
	    man + laugh ($41.21\%$); ----
	    smile + tie + stand ($58.27\%$); ----
        man + goggles ($41.04\%$); ----
	    man + sunglasses ($36.11\%$); ----
	    man +  white +  stand ($69.32\%$); ----
	    man + sing ($33.78\%$); ----
	    man + microphone ($40.00\%$); ----
	    business suit + smile + stand ($51.11\%$); ----
	    black + goggles ($42.65\%$);
\end{itemize}

Our method on \textbf{Waterbirds}:\\
\textbf{Results}: $4$ failure modes with $1$ tag, $8$ failure modes with $2$ tags, and $9$ failure modes with $3$ tags.\\
\textbf{Hyperparameters}: $s=100$, $a=30$, $b_2=10\%$, $b_3=5\%$, and $b_4=2.5\%$.

\begin{itemize}
    \item class ``landbird" (Accuracy: $87.41\%$):
    
        black + sea ($62.50\%$); ----
	    crow + water ($64.58\%$); ----
	    water + person ($69.86\%$); ----
	    black + water + beak ($71.43\%$); ----
	    water + man ($71.93\%$); ----
	    stand + sea + blue ($65.71\%$); ----
	    sea + sit + boat ($75.68\%$); ----
	    black + ledge + sea ($58.82\%$);
        
     \item class ``waterbird" (Accuracy: $33.00\%$):
        wood ($10.47\%$); ----
	    stem ($6.67\%$); ----
	    stand + tree + pole ($16.67\%$);
\end{itemize}

\subsection{Visualization on Some of the Detected Failure Modes}
\label{sec:app-vis}
We refer to Figures~\ref{fig:teaser-appendix}, \ref{fig:teaser-appendix3}, and \ref{fig:teaser-appendix4} for more visualizations of failure modes detected by our approach on Living17, ImageNet, and Waterbirds.
\input{figures/first_figure/figure2}
\input{figures/first_figure/figure3}
\input{figures/first_figure/figure4}

% \subsection{Running the method using different values of $(s, a)$}
\subsection{Running the method using different values of $(s, a)$}
\label{sec:app-params}
\input{figures/generalization/figure}

In this section, we inspect the effect of the hyperparameters $(s, a)$ on the results of our method.
Increasing $a$ targets harder subpopulations, so the number of detected failure modes decreases.
Increasing $s$ means that a group of images associated with a set of tags is detected as a failure mode only if the group contains a sufficiently large number of images.
This yields failure modes that generalize better, although fewer of them are detected.
Figure~\ref{fig:gen-appendix} shows the generalization plots over different datasets for different hyperparameter values.

\begin{thisnote}
In Table~\ref{tab:corrcoef}, we also report the correlation coefficient between the train drop and the test drop of failure modes over different datasets and values of $s$ and $a$.
\end{thisnote}
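
As a sketch of how such a coefficient can be computed, assume that the Pearson correlation is used (an assumption; other coefficients are possible). Let $x_j$ and $y_j$ denote the accuracy drop of the $j$-th failure mode on the training and test data, respectively, for $j = 1, \dots, m$, where the drop is taken to be the class accuracy minus the accuracy on the failure mode. The symbols $x_j$, $y_j$, and $m$ are introduced here only for illustration. Then
\[
    r = \frac{\sum_{j=1}^{m} (x_j - \bar{x})(y_j - \bar{y})}
             {\sqrt{\sum_{j=1}^{m} (x_j - \bar{x})^2}\; \sqrt{\sum_{j=1}^{m} (y_j - \bar{y})^2}},
    \qquad
    \bar{x} = \frac{1}{m}\sum_{j=1}^{m} x_j,
    \quad
    \bar{y} = \frac{1}{m}\sum_{j=1}^{m} y_j.
\]
A value of $r$ close to $1$ indicates that failure modes that are harder on the training data also tend to be harder on the test data.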



\input{figures/corrcoef/table}

\input{figures/appendix/generalization/figure}

\subsection{Greedy Search}
\label{sec:greedy}
In our experiments, exhaustive search was efficient enough that we did not need to consider other approaches.
We used some heuristics to improve its efficiency, such as eliminating combinations of tags that are represented by only a few images.
Nevertheless, we also developed a greedy search algorithm in which, at each stage, we pick the top tags such that conditioning on them significantly drops model accuracy.
This shrinks the space of candidate combinations and makes the algorithm faster, at the cost of missing some failure modes.
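
A minimal sketch of one such greedy procedure is given below. It is an illustration under stated assumptions and not necessarily the exact procedure used: the beam width $k$, the tag vocabulary $\mathcal{T}$, the notation $I_S$ (images of the class that carry all tags in $S$), and $\mathrm{acc}(\cdot)$ (model accuracy on a set of images) are hypothetical names introduced for this sketch, while $s$ and $l$ are the hyperparameters described in Appendix~\ref{app:hyperparams}.
\begin{enumerate}
    \item Initialize the candidate pool with the empty set, $C_0 = \{\emptyset\}$.
    \item For $i = 1, \dots, l$, extend every candidate by one tag and discard small groups:
    \[
        E_i = \bigl\{\, S \cup \{t\} \;:\; S \in C_{i-1},\; t \in \mathcal{T} \setminus S,\; |I_{S \cup \{t\}}| \geq s \,\bigr\}.
    \]
    \item Keep in $C_i$ the $k$ sets of $E_i$ with the lowest accuracy $\mathrm{acc}(I_S)$.
    \item Report the sets in $C_1 \cup \dots \cup C_l$ that satisfy the failure mode criteria.
\end{enumerate}
Under these assumptions, each stage evaluates at most $k\,|\mathcal{T}|$ tag sets, so at most $l\,k\,|\mathcal{T}|$ overall, compared with $\sum_{i=1}^{l} \binom{|\mathcal{T}|}{i}$ for exhaustive search. The price is that a failure mode is missed whenever none of its subsets survives the pruning in step 3.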

\subsection{DOMINO's output descriptions}
\label{app:dom_out}

To compare the results of DOMINO with \method{}, we picked DOMINO's hyperparameters so that it generates roughly the same number of failure modes.
Some of the outputs on the Living17 dataset are as follows:

\begin{itemize}
    \item class ``salamander": 
    \begin{itemize}
        \item  a photo of the bullet wound.
        \item a photo of a lizard.
        \item a photo of trout fishing.
        \item a photo of a frog.
        \item a photo of a hippo.
        \item a photo of the ventral fin.
        \item ...
    \end{itemize}
    \item class ``fox":
    \begin{itemize}
        \item a photo of a gorillas.
        \item a photo of the titanic sinking.
        \item a photo of the tract.
        \item a photo of a coyote.
        \item a photo of oil shale.
        \item a photo of the desert.
        \item a photo of the antarctic.
        \item a photo of the arctic.
        \item ...
    \end{itemize}
    \item class ``cat":
    \begin{itemize}
        \item a photo of the zoological garden.
        \item a photo of stray dogs.
        \item a photo of two dogs.
        \item a photo of the gardener.
        \item a photo of the tehsil leader.
        \item ...
    \end{itemize}
\end{itemize}

\input{figures/domino_vis/figure}
Figure~\ref{fig:dom_out} shows visualizations of some of these failure modes.
Some of the groups lack coherence among their images, and some of the descriptions are of low quality.

\subsection{Quality of Description}
\label{app:qual}
We note that the failure modes detected by \method{} and DOMINO may differ.
However, our metrics only evaluate how well failure modes are described by their corresponding captions and do not consider which images are assigned to failure modes.
This enables us to compare different methods with each other.
Notably, the hyperparameters of each method affect the number of failure modes it detects.
To ensure a fair comparison, we set the hyperparameters of both methods to yield a similar number of detected failure modes.

Furthermore,
because the similarity score is normalized, it can be compared across different failure modes and different methods. This is why we aggregate the similarity score, standard deviation, and AUROC over all failure modes.

\subsection{DOMINO's generalization}
\label{app:dom_gen}

To compare our results with DOMINO \citep{eyuboglu2022domino},
we note that this method also outputs groups of images together with descriptions for them.
For a failure mode $I_j$ with description $T_j$,
we use the same vision-language model that DOMINO uses to collect the images in $\Dtest$ that are highly similar to the caption $T_j$, obtaining $I'_j$.
We then evaluate the model's accuracy on $I'_j$; if the description generalizes, $I'_j$ should be a hard subpopulation, as hard as $I_j$.
Figure~\ref{fig:dom_generalization} shows the generalization of this method.
We see a lower degree of generalization for this approach than for our method:
the generated captions do not describe subpopulations that are as hard as those detected on $\Dtrain$.
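
The construction of $I'_j$ can be written as follows. This is a sketch in which the similarity function $\mathrm{sim}$ (e.g., cosine similarity between the image and text embeddings of the vision-language model), the retrieval size $k$, the gap $\Delta_j$, and $\mathrm{acc}(\cdot)$ (model accuracy on a set of images) are notation assumed for illustration.
\[
    I'_j = \bigl\{\, x \in \Dtest \;:\; \mathrm{sim}(x, T_j) \text{ is among the } k \text{ largest values} \,\bigr\},
    \qquad
    \Delta_j = \mathrm{acc}(I'_j) - \mathrm{acc}(I_j).
\]
In this notation, good generalization corresponds to $\Delta_j$ close to zero or negative, whereas a large positive $\Delta_j$ means that the caption retrieves images that are easier for the model than the original failure mode.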

We note that the other method \citep{jain2022distilling}
detects only a single failure mode for each class in the input and reports around $10$ images for it; thus,
comparing this method with DOMINO and ours is somewhat unfair, as those methods detect multiple failure modes with significantly more coverage.

\input{figures/domino_gen/figure}

\begin{thisnote}
\subsection{Image Generation}
\label{subsec:image-gen}

In this section, we elaborate on how we generate hard and easy images.
To detect easy subpopulations,
we randomly pick two subsets of tags such that the model's accuracy on images representing those tags is $100\%$.
For the failure modes, we randomly pick two of the detected failure modes such that the two groups do not share any images.
It is worth noting that for classes ``butterfly" and ``dog" we do not report any results, as the model's accuracy for these classes is almost $100\%$.
Figure~\ref{fig:generation_results_bar} shows the accuracy gap for different classes.

Here we provide some of the failure/success modes we selected and the corresponding descriptions given to the generative models.

\begin{itemize}
    \item 
    class: ``bear"
    \begin{itemize}
        \item gray + water $\rightarrow$ “a photo of a gray bear in water”;
        \item river $\rightarrow$ “a photo of a bear in the river”;
        \item cub + climbing + tree $\rightarrow$ “a photo of a bear cub climbing a tree”;
        \item black + cub + branch $\rightarrow$ “a photo of a black bear cub on a tree branch”;
    \end{itemize}
    \item
    class: ``ape"
    \begin{itemize}
        \item black + branch $\rightarrow$ “a photo of a black ape on a tree branch”;
        \item sky $\rightarrow$ “a photo of an ape in the sky”;
        \item gorilla + trunk + sitting $\rightarrow$ “a photo of a gorilla ape sitting near a tree trunk”;
        \item mother $\rightarrow$ “a photo of a mother ape”;
    \end{itemize}
\end{itemize}


\input{figures/generation/acc}

\end{thisnote}

\subsection{Jain et al. Quality of description}
\label{subsec:madry-quality}

We note that \cite{jain2022distilling} detects $10$ hard and $10$ easy images for each class in the dataset and assigns a description to them.
However, the way this method generates captions requires humans in the loop, as it uses a set of dataset-specific words (tokens)
that describe different variants of semantic attributes in the dataset.
Hence, the method does not scale to many different datasets, so we only considered it on Living17 and CelebA.
As datasets become larger and more complex, there will be several failure modes, and \cite{jain2022distilling}, which extracts only a single direction, cannot cover many of the failure inputs.
Figure~\ref{fig:barplots-madry} shows the results on different datasets.
We note that on CelebA, they use a very detailed and manually collected set of tokens to generate outputs.
However, even on that dataset, their performance is only slightly better than ours.

\input{figures/detailed_descriptions/barplots_madry}

\subsection{Number of Tags in the Descriptions}
In Table~\ref{table:number-of-tags-app}, we provide a more detailed version of Table~\ref{tab:num-of-tags-exp} that includes some sampled images of detected failure modes.
\input{figures/num_of_tags/table-full}

% \newpage
\subsection{CUB-200 and CelebA}
\label{app:cub200}

\input{figures/annots/table}

In this section, we report the results of Section~\ref{subsec:rev} on the CUB-200 dataset.
Table~\ref{tab:cub200-rev} reports the shared tags in the proximity of images, and Table~\ref{tab:cub200-stats}
reports statistics of the distance between two images in the latent space.


\begin{table}
\caption{Statistics of the distance between two points in the CUB-200 dataset, conditioned on the number of shared tags. Distances are computed in the CLIP ViT-B/16 representation space.}
\label{tab:cub200-stats}
% \vskip -0.15in
\centering
\resizebox{0.5\linewidth}{!}{
\begin{tabular}{c|c|c|c}
\toprule
$\#$ of shared tags $\geq d$ & Mean & Standard deviation & Probability\\
\midrule
$d = 0$ & 7.37 & 0.92 & 0.50\\ \hline
$d = 3$ & 7.31 & 0.94 & 0.48\\ \hline
$d = 9$ & 7.06 & 1.08 & 0.42\\ \hline
$d = 15$ & 6.29 & 1.88 & 0.30\\ \hline
$d = 24$ & 3.42 & 3.18 & 0.12\\
\bottomrule
\end{tabular}}
\vskip -0.1in
\end{table}

\input{figures/annots/table_app}

\begin{thisnote}
\subsection{RAM evaluation on Living17}
\label{app:ram-eval}
We ran a small-scale human validation study on Living17 (one of the main datasets used in our paper) to evaluate the accuracy of RAM.
We first obtain all tags of class ``butterfly" in Living17 and filter out low-frequency tags as discussed in Section~\ref{sec:rel-tags}.
This leaves $55$ tags, i.e.,
\[
    T_{\text{butterfly}} = \{\text{black}, \text{purple}, \text{red}, \text{flower}, \text{leaf}, \text{sit}, \text{land}, \text{gravel}, \text{stone}, \text{mud}, \text{wildflower}, \text{sky}, \text{grass}, \ldots\}.
\]

We take $100$ random images from the class ``butterfly" in Living17 and evaluate the precision and recall of the tagging model on those images over the tags in $T_{\text{butterfly}}$.
The precision of RAM, averaged over those $100$ images, is $86.85\%$, and the average recall is $81.85\%$.
This shows that RAM is accurate in detecting the tags in $T_{\text{butterfly}}$.
It is worth noting that $T_{\text{butterfly}}$ includes a wide range of objects and attributes, covering many of the concepts in those images.
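
For clarity, the two metrics can be written as follows, assuming that precision and recall are computed per image against the human annotations and then averaged. The symbols $P_i$, $G_i$, and $n$ are introduced here for illustration: $P_i \subseteq T_{\text{butterfly}}$ is the set of tags RAM predicts for image $i$, $G_i \subseteq T_{\text{butterfly}}$ is the set of tags assigned by the annotators, and $n = 100$ is the number of images.
\[
    \text{precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{|P_i \cap G_i|}{|P_i|},
    \qquad
    \text{recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{|P_i \cap G_i|}{|G_i|}.
\]
Under this reading, high precision means that few predicted tags are spurious, and high recall means that few annotated tags are missed.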


\subsection{\method{} hyperparameters}
\label{app:hyperparams}
In this section, we elaborate on each of the hyperparameters of our method.
All of \method{}'s hyperparameters have intuitive definitions, enabling users to calibrate \method{} to their specific preferences.
 
\begin{itemize}
    \item 
    Parameter $a$ controls a trade-off between the difficulty and the quantity of detected failure modes.
    For example, selecting a high value of $a$ results in failure modes that are more difficult but fewer in number.
    Figure~\ref{fig:gen-appendix} shows this trade-off.
    \item
    Parameter $s$ determines the minimum number of images a group must contain to be detected as a failure mode.
    Small groups may not be reliable, so we filter them out. A larger value of $s$ results in more reliable and generalizable failure modes. The choice of $s$ also depends on the number of samples in the dataset: for larger datasets, we can assume that different subpopulations are sufficiently represented, so a larger value of $s$ can be used. We refer to Figure~\ref{fig:gen-appendix} for the effect of $s$.
    \item
    Parameter $l$ determines the maximum number of tags we consider in a combination. In the datasets we considered, a combination of more than $5$ tags would not result in groups with at least $s$ images; thus, we set $l \leq 4$ in our experiments. The choice of $l$ depends on the dataset and the tagging model.
    \item
    Parameters $b_i$ control the degree of necessity of the tags within a failure mode. The current choice of the $b_i$ values is only one example, requiring the presence of each tag to have a significant impact on the difficulty of the detected failure mode. One can adjust these hyperparameters based on their preference.
\end{itemize}
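
Putting these together, one way to formalize how the hyperparameters interact is sketched below. The notation is assumed for illustration: $I_T$ denotes the images of a class that carry all tags in $T$, $\mathrm{acc}(\cdot)$ the model's accuracy on a set of images, and $A$ the accuracy on the whole class. The exact form of the necessity condition is an assumption based on the description of $b_i$ above. A set of tags $T$ with $|T| \leq l$ is reported as a failure mode if
\[
    |I_T| \geq s,
    \qquad
    \mathrm{acc}(I_T) \leq A - a,
    \qquad
    \mathrm{acc}(I_{T \setminus \{t\}}) - \mathrm{acc}(I_T) \geq b_{|T|} \;\; \text{for all } t \in T \text{ when } |T| \geq 2,
\]
where $a$ and $b_i$ are measured in percentage points. The first condition enforces a minimum group size, the second a minimum difficulty, and the third that dropping any single tag makes the group noticeably easier.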
\end{thisnote}

% \subsection{Captions Generated Using Different Methods}
% \label{sec:all-methods}
% We show some of the captions of failure modes detected by our approach, DOMINO\citep{eyuboglu2022domino}, and Distilling Failure Direction\citep{jain2022distilling}
% on Living17 dataset for some of the classes.

% \ms{for domino, mention that some descriptions are not informative at all, and some others are too simple and don't result in high accuracy drop}

% \begin{itemize}
%     \item Class ``Fox": 
%     \begin{itemize}
%         \item DOMINO: ``a photo of a coyote.", ``a photo of oil shale.", ``a photo of the desert.", ``a photo of the antarctic.", ...
%         \item Ours: ``stand; walk; coyote;", ``cage; fence; pen;", ``tree; grassy;", ``grass; walk; tree;", ....
%         \item Distilling: .... \kr{Ask Mehrdad!}
%     \end{itemize}
%     \item Class ``Ape": 
%     \begin{itemize}
%         \item DOMINO: ``a photo of a eucalyptus tree.", ``a photo of carrier pigeons.", ``a photo of wishbone.", ``a photo of the netting.", ...
%         \item Ours: ``hang; branch; black;", ``sky;", ``tree; branch; tree branch; black;", ....
%         \item Distilling: .... \kr{Ask Mehrdad!}
%     \end{itemize}
    
% \end{itemize}