Keywords: Concept Probing, Concept Representation, Explainable AI, Interpretability
TL;DR: We propose a method to automatically identify which representations in a neural network model should be considered when probing for a given human-defined concept of interest.
Abstract: Concept probing has recently gained popularity as a way for humans to peek into what is encoded within artificial neural networks. In concept probing, additional classifiers are trained to map the internal representations of a model into human-defined concepts of interest. However, the performance of these probes is highly dependent on the internal representations they probe from, making the identification of the appropriate layer to probe an essential task. In this paper, we propose a method to automatically identify which layer's representations in a neural network model should be considered when probing for a given human-defined concept of interest, based on how informative and regular those representations are with respect to the concept. We validate our findings through an exhaustive empirical analysis over different neural network models and datasets.
Track: Neurosymbolic Methods for Trustworthy and Interpretable AI
Paper Type: Long Paper
Resubmission: Yes
Changes List: Dear Reviewers,
Thank you for the time and effort you invested in reading and analyzing our submission, as well as for your constructive comments. Your feedback was very helpful in enhancing the quality of our work.
With respect to the previous version, we made the following main changes to our paper:
- Revised the Introduction and Conclusions to more clearly articulate the work's contributions and how they relate to Neuro-Symbolic AI.
- Revised and restructured the experimental parts of the paper: adding a comparison with another existing method, providing the runtimes of each method, adding an ablation of the $\lambda$ parameter, and improving the clarity of each section.
- Added all suggested related work.
- Added the following supplementary material:
* Appendix A - Presents an extended version of our results, including results for each specific concept (instead of only the average for each model) and the layer selected by our method. Additionally, we comment on the results for the more abstract high-level concepts, comparing them to the more concrete low-level concepts.
* Appendix B - Provides relevant details for reproducing the experiments.
* Appendix C - Contains detailed information regarding each probed model.
* Appendix D - Compares our method with the approach of probing a single unit.
Some experimental results changed slightly, as we adjusted the parameters of the method used to estimate mutual information to improve the estimation quality.
Below, we detail the revisions and improvements made in response to each reviewer's specific suggestions.
--- --- ---
Reviewer aGTN
--- --- ---
1. 'The paper proposes a method for probing'
> Please notice that our paper does not directly propose a probing method, but rather a method to identify which layer of a model more directly encodes a given concept of interest, which we then show to improve the performance of concept probing methods.
2. 'The manuscript is clear and easy to follow, although the related work could be improved especially in the light of the recent surge of literature on concept-based XAI methods'
> Thank you for pointing us towards relevant literature; we have expanded the coverage of our 'Related Work' section, citing this and other relevant works on concept-based XAI.
3. 'The manuscript may be of interest to the neuro-symbolic AI community (...) the paper in general could be a better fit for an XAI/computer vision conference.'
> Indeed, as you mention, 'concept extraction is one of the steps of the neuro-symbolic cycle'. Moreover, since concept probing deals with both the symbolic and subsymbolic sides of AI, we take it to be an important topic in neuro-symbolic AI. We have revised our 'Introduction' and 'Conclusions' sections to better reflect why this work is relevant to the neuro-symbolic AI community.
--- --- ---
Reviewer Q58A
--- --- ---
1. 'To me the key issue is that is not clear what is the benefit of the proposed metric compared to using the oracle directly. (...) using a validation split rather than the test split (...) I was thinking that maybe one advantage could be runtime'
> Using a validation split rather than the test split to select the layer/probe combination with the best (validation) accuracy leads to the results shown in column "Best Validation" of Table 1. The results shown under "Oracle" correspond to selecting the layer/probe combination with the best test accuracy.
Indeed, the main advantage of our method over using the validation split is runtime, specifically in avoiding the expensive training and validation of probes for every layer. The oracle should not be regarded as a selection method, as its results would be biased by considering the test accuracy. We revised the whole 'Empirical Evaluation of the Selected Layer' section to clarify this, and added the runtime of each method.
Another advantage that we highlight throughout the paper is that the proposed characterization provides additional knowledge about how a concept is encoded in the model, and thus might help inform what kind of probe should be used (e.g., if the representations are highly regular, a linear probe is likely sufficient).
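To make the runtime contrast concrete, below is a minimal sketch of the validation-based baseline discussed above: it trains one probe per layer and keeps the layer with the highest validation accuracy. The linear probe and the `activations_by_layer` input structure are illustrative assumptions, not our exact experimental setup.

```python
# Sketch of the "Best Validation" baseline: one probe per layer, pick the
# layer whose probe achieves the highest validation accuracy.
# activations_by_layer: {layer_name: (z_train, z_val)} feature matrices
# extracted from the probed model (illustrative structure, assumed here).
from sklearn.linear_model import LogisticRegression

def select_layer_by_validation(activations_by_layer, y_train, y_val):
    best_layer, best_acc = None, -1.0
    for layer, (z_train, z_val) in activations_by_layer.items():
        probe = LogisticRegression(max_iter=1000).fit(z_train, y_train)
        acc = probe.score(z_val, y_val)  # validation accuracy of this layer's probe
        if acc > best_acc:
            best_layer, best_acc = layer, acc
    return best_layer, best_acc
```

The cost of this loop, one probe training per layer, is precisely what our method avoids by scoring the layers directly.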
2. 'The experimental setup (...) does not ablate on $\lambda$.'
> We have added a figure and discussion on the ablation of $\lambda$, as well as an explanation of why fine-tuning it is not very costly (illustrated by the sketch below).
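Purely for illustration, the sketch below shows what such an ablation amounts to: since the per-layer quantities are computed once, sweeping $\lambda$ only re-ranks the layers. The additive form `info + lam * reg` and the toy numbers are assumptions made for this sketch, not necessarily the scoring function defined in the paper.

```python
# Hypothetical lambda ablation: re-rank layers under different lambda values
# using per-layer scores that are computed only once. The combination
# info + lam * reg is an assumption made for illustration purposes.
def ablate_lambda(info, reg, lambdas):
    """info, reg: {layer_name: precomputed score}; returns the layer picked per lambda."""
    selected = {}
    for lam in lambdas:
        scores = {layer: info[layer] + lam * reg[layer] for layer in info}
        selected[lam] = max(scores, key=scores.get)  # layer chosen at this lambda
    return selected

# Toy values only: the selected layer may switch as lambda grows.
choices = ablate_lambda(info={"conv3": 0.8, "conv4": 0.9},
                        reg={"conv3": 0.5, "conv4": 0.2},
                        lambdas=[0.0, 0.5, 1.0, 2.0])
```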
3. 'I suggest the authors to have a look at: McAllester and Stratos. "Formal limitations on the measurement of mutual information." AISTATS, 2020'
> Thank you for pointing us towards this rather interesting paper; we now cite it. We were aware of the formal limitations on measuring mutual information, which is why we employed the method from Noshad et al. (2019), specifically designed for estimating the mutual information of high-dimensional variables.
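For readers unfamiliar with the quantity being estimated, the sketch below shows a simple classifier-based lower bound, $I(Z;C) \geq H(C) - \mathrm{CE}$, where $\mathrm{CE}$ is the cross-entropy of a classifier predicting the concept $C$ from representations $Z$. This is an accessible stand-in for illustration only; it is not the Noshad et al. (2019) estimator we actually use.

```python
# Illustrative stand-in, NOT the estimator used in the paper: a variational
# lower bound on I(Z; C) via a classifier, I(Z; C) >= H(C) - CE.
# z_*: layer activations; c_*: integer concept labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def mi_lower_bound(z_train, c_train, z_test, c_test):
    p = np.bincount(c_train) / len(c_train)           # empirical prior over concept labels
    h_c = -np.sum(p * np.log(p + 1e-12))              # entropy H(C), in nats
    clf = LogisticRegression(max_iter=1000).fit(z_train, c_train)
    ce = log_loss(c_test, clf.predict_proba(z_test))  # cross-entropy, also in nats
    return h_c - ce                                   # lower bound on I(Z; C)
```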
4. 'Achieving high concept predictive accuracy may not be enough to establish that a network has "learned a concept".'
> This is an important observation. We revised the text to make it clear that we do not mean to imply that a network has learned these concepts, but rather that their internal representations encode information that allows for the identification of these concepts.
--- --- ---
Reviewer JqHo
--- --- ---
1. 'a fair comparison with another method should be conducted'
> We have added a comparison with the Input Reduce method from (de Sousa Ribeiro and Leite, 2021). Thus, the paper now compares the results of our approach both with those of selecting the layer based on validation-set accuracy and with those of the Input Reduce method.
2. 'providing a solid defence regarding why understanding the concept in a specific layer rather than a particular neuron is more efficient'
> It is not so much a question of efficiency, but rather a question of necessity. Concept probing is generally performed to probe for human-defined concepts of interest that the model was not trained to identify, and which hence do not necessarily align with any particular neuron of the model. Thus, in the concept probing community, the representations produced by an entire layer of a model are generally considered, as these encode substantially more information than a single unit; often, a concept's representation is spread across the units of a layer. We now discuss this topic from the outset, in the 'Introduction' section. We have also added Appendix D as Supplementary Material, which illustrates this point by comparing the maximum accuracy resulting from probing a single neuron with the results of our method (sketched below).
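A minimal sketch of the comparison reported in Appendix D: probe each unit in isolation, keep the best single-unit accuracy, and compare it with a probe trained on the full layer. The probe choice and function names are illustrative assumptions.

```python
# Sketch of the single-unit vs. whole-layer comparison (cf. Appendix D).
# z_train, z_val: activation matrices of one layer; y_train, y_val: concept labels.
from sklearn.linear_model import LogisticRegression

def unit_vs_layer(z_train, y_train, z_val, y_val):
    # Best accuracy achievable by probing any single unit in isolation.
    best_unit_acc = max(
        LogisticRegression(max_iter=1000)
        .fit(z_train[:, [j]], y_train)
        .score(z_val[:, [j]], y_val)
        for j in range(z_train.shape[1])
    )
    # Accuracy of a probe over the full layer's representation.
    layer_acc = LogisticRegression(max_iter=1000).fit(z_train, y_train).score(z_val, y_val)
    return best_unit_acc, layer_acc
```

When a concept is distributed across units, `layer_acc` is expected to dominate `best_unit_acc`.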
3. 'How do concept probing and concept disentanglement approaches, such as network dissection and clip dissection, differ from each other?'
> These concept disentanglement approaches focus on interpreting what is encoded in a single unit of a model; the focus is on the model's units. Concept probing, instead, focuses on assessing whether a model encodes information regarding specific, human-defined concepts of interest; the focus is on the concepts, which, being human-defined, might not be directly aligned with the model's internal representations. We now discuss these concept disentanglement approaches in the 'Related Work' section.
4. 'Concept Regular Representations - This subsection is difficult to read, and I recommend rewriting it.'
> We rewrote the subsection, and now cite the relevant original work. We hope this version is clearer.
5. 'Page 5: Section 4: The final part of the introductory paragraph should be revised for greater clarity.'
> We have revised this paragraph for clarity.
6. 'It would be pertinent to assess the performance of the approach across different datasets using the same model.'
> Please note that we assess the performance of the approach across different datasets using the same model - the ResNet50 model is used both for the CUB and ImageNet datasets.
7. 'The different details of the VGG models should be presented, at least briefly description.'
> We have added Appendix C as Supplementary Material, providing the relevant details regarding each probed model.
8. 'it would be interesting to observe the behaviour of high and low concepts based on the technique'
> In Appendix A, we provide the results of the method for each individual probed concept of interest. We have added a comment regarding the specific results for each group of concepts.
9. 'It should provide the code for reproducibility'
> The code will be made available upon acceptance. However, we have included Appendix B as Supplementary Material, which provides relevant information for reproducing the experiments.
10. 'Perhaps rather than conducting numerous experiments, focus on two while presenting the entire process and comparing it with another probing approach.'
> We conduct two main groups of experiments across various datasets, models, and concepts of interest. One is to assess how the characterization of different concepts of interest varies throughout each probed model, and another is to evaluate the quality of the resulting probes trained using the layer selected by the proposed method. We have restructured the discussions of both experiments to make them clearer.
We have also added results corresponding to another approach for selecting which representations should be considered when probing for a concept of interest - the Input Reduce method from (de Sousa Ribeiro and Leite, 2021) - and compared them with those of our method.
Publication Agreement: pdf
Submission Number: 38