Abstract: Deep Neural Networks (DNNs) excel at learning complex abstractions within their internal representations. However, the concepts they learn remain opaque, a problem that becomes particularly acute when models unintentionally learn spurious correlations. In this work, we present DORA (Data-agnOstic Representation Analysis), the first data-agnostic framework for analyzing the representational space of DNNs. Central to our framework is the proposed Extreme-Activation (EA) distance measure, which assesses similarities between representations by analyzing their activation patterns on data points that cause the highest level of activation. As spurious correlations often manifest in features of data that are anomalous to the desired task, such as watermarks or artifacts, we demonstrate that internal representations capable of detecting such artifactual concepts can be found by analyzing relationships within neural representations. We validate the EA metric quantitatively, demonstrating its effectiveness both in controlled scenarios and real-world applications. Finally, we provide practical examples from popular Computer Vision models to illustrate that representations identified as outliers using the EA metric often correspond to undesired and spurious concepts.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We have revised the manuscript to address the highlighted concerns. The changes include:
1. The Abstract and Introduction have been simplified for improved clarity and precision.
2. Amendments have been made regarding references to transparency.
3. We have provided comments on the pipeline of representation analysis in Section 4.
4. Definitions 6 and 7 have been refined and made more concise, with additional background information provided on the differences between natural and synthetic Extreme-Activation distance.
5. We have elaborated on the evaluation experiments, including the motivation behind them.
6. A new paragraph discussing the properties and limitations of Extreme-Activation (EA) distance has been incorporated into Section 3.3.
Furthermore, we have corrected minor grammatical and punctuation errors to improve readability, and we have added Figures 4, 5, and 7 to aid understanding. We have also made other minor adjustments, such as relocating the EA$_s$ Angle conservation paragraph to the Evaluation section and improving the writing in several sections of the paper.
Video: https://youtu.be/k2tgN7YsjN8
Code: https://github.com/lapalap/dora
Supplementary Material: zip
Assigned Action Editor: ~antonio_vergari2
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 903