['1c1', '< Title: Spuriosity Rankings: Sorting Data to Measure and Mitigate Biases Mazda Moayeri 1', '---', '> Title: Spuriosity Rankings: A Data-Centric Framework for Measuring and Mitigating Model Biases', '3c3,6', '< Abstract: Figure 1: Spuriosity rankings sort images based on the presence of spurious cues. The rankings reveal hidden spurious features that models rely on, like rippling water, human hands, and flame in the dark (right; top to bottom), and uncover minority subpopulations where spurious cues are absent (left).', '---', '> Abstract: Deep learning models often exhibit biases due to their reliance on spurious features, leading to underperformance on minority subpopulations and hindering robust generalization. Current approaches, such as scaling models or datasets, are costly and fail to address the fundamental long-tail data distribution problem. This paper introduces **Spuriosity Rankings**, a novel, data-centric framework that leverages interpretability tools to better utilize existing data by sorting images within their classes based on the "degree of spuriosity"—the presence of relevant spurious cues.', '> Our method involves: (i) discovering relevant spurious features by extracting neural concept detectors from an interpretable model, (ii) identifying and annotating these spurious cues with minimal human supervision, and (iii) computing a scalar spuriosity score for each image, enabling fine-grained data sorting.', '> This framework yields three significant benefits: (1) it reveals minority subpopulations where spurious correlations are broken, (2) it allows for easy quantification of model bias through a "spurious gap" metric (accuracy drop between high and low spuriosity images), and (3) it provides an efficient, model-agnostic strategy for bias mitigation by finetuning classification heads on low spuriosity images.', '> We demonstrate the efficacy and scalability of Spuriosity Rankings on ImageNet, analyzing 89 diverse models and annotating 
5000 class-neural feature dependencies, including 630 spurious features across 357 classes. Our findings show that all models underperform on low spuriosity images, and importantly, that class-wise spurious gaps are highly correlated across models, suggesting that bias is predominantly influenced by training data rather than model architecture or training procedure. We also introduce a novel extension that removes the requirement of adversarial training, dramatically increasing the framework\'s efficiency and applicability to low-data regimes. Furthermore, Spuriosity Rankings can identify and help resolve label noise, particularly in cases of "spurious feature collision" where a feature is spurious for one class but core for another. By shifting focus from complex training algorithms to deeper data understanding, Spuriosity Rankings offer a practical, interpretable, and efficient path towards more robust and fair AI systems.', '18,19c21,22', '< Section: Discovering Spurious Features in ImageNet and Beyond', "< A key component of our framework for interpreting and improving a model's robustness to spurious correlations is that we first discover relevant spurious features based on the training data, as opposed to data-agnostic approaches. Specifically, we leverage the feature discovery method of [46], performing it at an unprecedented scale (i.e. across all of ImageNet). We now provide a brief overview of the method, details on our expansion, and a novel extension of the method with significant impacts. For complete details on feature discovery and our human studies, we refer readers to Appendix I.", '---', '> Section: 3 Discovering Spurious Features in ImageNet and Beyond', "> A cornerstone of our framework for interpreting and enhancing a model's robustness to spurious correlations is our emphasis on **data-driven feature discovery**. Unlike data-agnostic approaches, we first identify relevant spurious features directly from the training data. 
We achieve this by leveraging and significantly expanding the feature discovery method introduced in [46], applying it at an unprecedented scale across the entirety of ImageNet. This section provides a concise overview of the method, details our substantial expansion, and introduces a novel extension that dramatically broadens its applicability. For a comprehensive exposition of the feature discovery process and our human studies, readers are directed to Appendix I.", '21,24c24,25', '< Section: 5000 Reasons Deep Models Use to Perform ImageNet Classification', '< Despite its ubiquity, ImageNet (and any other large-scale dataset) is opaque in the sense that a human cannot anticipate the patterns a model trained on it will associate with each class. The reason for this is simple: humans cannot process a million (or even 1000) images at once. Moreover, the patterns a human may use will not necessarily align with those a model will use [14]. Nonetheless, understanding the features (especially the spurious ones) that a model will rely upon is instrumental in anticipating and mitigating the biases a model will suffer from.', "< To understand the features any general model may rely upon, we inspect the neural features of a single, interpretable model; namely, an adversarially trained one. Adversarial training leads to perceptually aligned gradients [44], which greatly improve the utility of gradient-based interpretations. Specifically, using the gradient of a neural feature's activation w.r.t. the input image, one can reliably generate a heatmap highlighting the input regions activating the neural feature, and even perturb the input image to visually amplify the cue that the feature detects; the latter method is called a feature attack. 
Thus, given a robust neural feature, a human can annotate its function as core or spurious for a class by inspecting the images within the class that activate the feature most (we use top 5), along with heatmaps and feature attacks for those images (see Figure 3). Note that given a class, one can automatically select important neural features based on the average contribution of the feature to the class logit, which can easily be computed by inspecting feature activations and linear classification head weights. These steps make up the feature discovery framework of [46]; to summarize, (i) adversarially train a model, (ii) automatically select important neural features per class, and (iii) use complementary visualization techniques to annotate a neural feature as core or spurious with minimal human supervision. Singla and Feizi [46] annotated the 5 most relevant neural features for 232 classes of ImageNet. Through a large-scale human study (details in Appendix I), we expand the analysis to all 1000 classes, resulting in 5000 annotated class-feature pairs, of which 630 are spurious over 357 classes. We host a web-UI to view all 5000 pairs, offering a direct visual look into the patterns a neural network sees and uses across ImageNet. We also verify that heatmaps for annotated features localize the same cue in images whose activation on the feature is in the top 20th percentile for the class, successfully validating 95.3% of annotated class-feature pairs (see appendix for details). Thus, we generate 325,000 feature soft segmentations across ImageNet as a bonus. Crucially, the validation confirms that sorting by feature activation is effective in gathering instances where a feature is present.", '< While we use the feature discovery method of [46] directly, we make a number of impactful contributions atop it. Namely, we expand its use cases significantly by removing the requirement of adversarial training. 
Also, we overhaul the original procedure for assessing model reliance on spurious features, making it far more efficient, less biased, and more stable (Appendix I.4). Key to our improvements is the use of lowest activating images as natural counterfactuals to better interpret spurious features and measure their effects, enabling new cross-class and cross-model analyses. Lowest activating images (never used in [46]) are also crucial for computing and closing spurious gaps (Sections 4.2 and 4.3).', '---', '> Section: 3.1 5000 Reasons Deep Models Use to Perform ImageNet Classification', '> Despite its pervasive use, ImageNet (and indeed any large-scale dataset) remains largely opaque; humans cannot readily anticipate the specific visual patterns that a deep model trained on it will associate with each class. This opacity stems from the sheer volume of data—it is impossible for a human to simultaneously process a million (or even a thousand) images. Moreover, the intuitive patterns a human might use for classification do not necessarily align with those a model will learn [14]. Nevertheless, understanding the features, particularly the spurious ones, that a model relies upon is critical for anticipating and effectively mitigating the biases it will inevitably exhibit.', '25a27,36', "> To systematically uncover the features that any general model might depend on, we meticulously inspect the neural features of a single, interpretable model: specifically, an adversarially trained one. Adversarial training has been shown to produce perceptually aligned gradients [44], which significantly enhance the utility of gradient-based interpretations. Utilizing the gradient of a neural feature's activation with respect to the input image, one can reliably generate a heatmap that precisely highlights the input regions responsible for activating that feature. 
Furthermore, the input image can be perturbed to visually amplify the specific cue detected by the feature, a technique known as a feature attack.", '> ', '> Thus, given a robust neural feature, a human can effectively annotate its function as either **core** (essential to the class object) or **spurious** (non-essential, context-dependent) for a given class. This annotation is performed by inspecting the top-5 images within that class that most strongly activate the feature, alongside their corresponding heatmaps and feature attacks (see Figure 3). Crucially, important neural features for a given class can be automatically selected based on their average contribution to the class logit, a value easily computed by analyzing feature activations and the weights of the linear classification head.', '> ', '> These steps form the core of the feature discovery framework introduced in [46]: (i) adversarially train a model, (ii) automatically select important neural features per class, and (iii) use complementary visualization techniques to annotate a neural feature as core or spurious with minimal human supervision. While Singla and Feizi [46] annotated the 5 most relevant neural features for 232 ImageNet classes, we significantly expand this analysis through a large-scale human study (detailed in Appendix I) to cover **all 1000 classes**. This effort resulted in **5000 annotated class-feature pairs**, from which we discovered **630 spurious features across 357 classes**. We host a public web-UI to facilitate the exploration of all 5000 pairs, offering a direct visual look into the specific patterns a neural network perceives and utilizes across ImageNet.', '> ', '> We further validate our annotations: heatmaps for annotated features reliably localize the same cue in images whose activation on the feature is within the top 20th percentile for the class, achieving a successful validation rate of 95.3% (see Appendix I for full details). 
This validation process additionally yields **325,000 feature soft segmentations** across ImageNet as a valuable byproduct. Critically, this confirmation underscores that sorting images by feature activation is an effective mechanism for gathering instances where a specific feature is present.', '> ', '> While our work directly utilizes the feature discovery methodology of [46], we introduce several impactful contributions atop it. Most notably, we significantly broaden its applicability by **removing the necessity of adversarial training**. We also fundamentally overhaul the original procedure for assessing model reliance on spurious features, rendering it far more efficient, less biased, and substantially more stable (Appendix I.4). Central to these improvements is our novel use of **lowest activating images as natural counterfactuals** to better interpret spurious features and precisely measure their effects, thereby enabling new cross-class and cross-model analyses. These lowest activating images (a concept never explored in [46]) are also indispensable for computing and closing spurious gaps (Sections 4.2 and 4.3).', '> ', '328d338', '< ']
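
The automatic feature-selection step described in the section (ranking neural features by their average contribution to a class logit, then sorting a class's images by activation on a chosen feature) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the array layout, and the assumption that a feature's contribution is its activation multiplied by the corresponding linear-head weight are all hypothetical choices made for this example.

```python
import numpy as np


def top_features_for_class(activations, labels, head_weights, class_idx, k=5):
    """Return indices of the k features with the largest average
    contribution to the logit of `class_idx`.

    Assumed (hypothetical) layout:
      activations:  (n_images, n_features) penultimate-layer activations
      labels:       (n_images,) integer class labels
      head_weights: (n_classes, n_features) linear classification head
    """
    # Restrict to images of the class of interest.
    class_acts = activations[labels == class_idx]
    # Average contribution of feature j to the class logit:
    # mean over images of activation_j * weight_{class, j}.
    mean_contrib = class_acts.mean(axis=0) * head_weights[class_idx]
    # Largest contributions first.
    return np.argsort(-mean_contrib)[:k]


def rank_by_activation(activations, labels, class_idx, feature_idx):
    """Return the dataset indices of images in `class_idx`, sorted from
    highest to lowest activation on `feature_idx`."""
    idx = np.flatnonzero(labels == class_idx)
    order = np.argsort(-activations[idx, feature_idx], kind="stable")
    return idx[order]
```

Under these assumptions, the first entries returned by `rank_by_activation` correspond to the highest activating images that annotators inspect, and the last entries to the lowest activating images that the section uses as natural counterfactuals.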
