Reproducibility Study of "Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery"
Abstract: The DN-CBM framework proposed by Rao et al. represents a significant advancement in concept-based interpretability, leveraging Sparse Autoencoders (SAEs) for automatic concept discovery and naming. Our study successfully reproduces DN-CBM’s core findings, confirming its ability to extract meaningful concepts while maintaining competitive classification performance across ImageNet, Places365, CIFAR-10, and CIFAR-100. Additionally, we validate DN-CBM’s effectiveness in clustering semantically related concepts in the latent space, reinforcing its potential for interpretable machine learning.
Beyond replication, our extensions provide deeper insights into DN-CBM’s interpretability and robustness. We show that the discovered concepts are more concrete and less polysemantic, favoring monosemantic representations, and that polysemantic concepts have minimal impact on classification. Our intervention analysis on the Waterbirds100 dataset supports DN-CBM’s interpretability, and a novel loss function improves classification accuracy by reducing reliance on spurious background cues. A user study further demonstrates the advantages of the new loss function for interpretable concept selection on CIFAR-10. While our automatic concept intervention method offers an alternative to manual interventions, human selection remains more effective. These findings affirm DN-CBM’s validity and highlight opportunities for further refinement in interpretable deep learning.
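To make the reproduced pipeline concrete, the following is a minimal sketch of the DN-CBM idea described above: a sparse autoencoder (SAE) over frozen CLIP features, with each learned concept named by the vocabulary word whose CLIP text embedding best matches the concept's decoder direction. The dimensions, the `l1_coef` value, and the helper names (`sae_loss`, `name_concepts`) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE on frozen CLIP image features (illustrative sketch)."""
    def __init__(self, d_feat=512, d_concept=4096):
        super().__init__()
        self.encoder = nn.Linear(d_feat, d_concept)
        self.decoder = nn.Linear(d_concept, d_feat)

    def forward(self, x):
        z = torch.relu(self.encoder(x))  # non-negative, sparse concept activations
        return self.decoder(z), z

def sae_loss(x, x_hat, z, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse activations
    # (l1_coef is a hypothetical value).
    return ((x - x_hat) ** 2).mean() + l1_coef * z.abs().mean()

def name_concepts(decoder_weight, vocab_emb, vocab):
    # Name each concept by the vocabulary word whose CLIP text embedding has
    # the highest cosine similarity with that concept's decoder direction.
    dirs = F.normalize(decoder_weight.T, dim=1)  # (d_concept, d_feat)
    txt = F.normalize(vocab_emb, dim=1)          # (vocab_size, d_feat)
    idx = (dirs @ txt.T).argmax(dim=1)
    return [vocab[i] for i in idx]
```

A downstream linear classifier over the named concept activations `z` then forms the concept bottleneck; interventions amount to zeroing or reweighting individual concepts before classification.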
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We addressed the requested changes and reviewer comments, specifically:
- The introduction was revised to include motivation for the chosen paper and interventions, and we added a brief comparison with the CEIR paper, highlighting DN-CBM's computational and practical advantages over CEIR owing to its task-agnostic nature and fixed vocabulary.
- Section 3.5 was clarified to detail the carbon footprint calculation, specifying it as the total emissions from all training and exploratory phases, estimated with the Machine Learning Emissions Calculator, with the underlying assumptions stated explicitly.
- Updated figures to higher resolution.
- Section 6.5 was clarified by adding explanations and noting that while CLIP ViT-B/16 shows no correlation between concept concreteness and alignment, CLIP ResNet-50 exhibits a positive correlation. This supports the claim that increasing human understandability via concrete concepts does not significantly misalign concepts and may even improve alignment.
- Expanded the user study's sample size to 52.
- The paper now includes a statement regarding potential ethical implications, specifically the risk of biased or sensitive concept labeling from automated vocabulary use.
Assigned Action Editor: ~Shinichi_Nakajima2
Submission Number: 4305