Abstract: Concept Bottleneck Models (CBMs) (Koh et al., 2020) are a class of interpretable deep learning frameworks that improve transparency by mapping input data into human-understandable concepts. Recent advances, including the Discover-then-Name CBM (DN-CBM) proposed by Rao et al. (2024), eliminate reliance on external language models by automating concept discovery and naming using a CLIP feature extractor and a sparse autoencoder. This study focuses on replicating the key findings reported by Rao et al. (2024). We conclude that the core conceptual ideas are reproducible, but not to the extent presented in the original work. Many representations of active neurons appear to be misaligned with their assigned concepts, indicating a lack of faithfulness in the DN-CBM's explanations. To address this, we propose a model extension: an enhanced alignment method that we evaluate through a user study. Our extended model provides more interpretable concepts (with statistical significance), at the cost of a slight decrease in accuracy.
Certifications: Reproducibility Certification
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/daniuyter/DNCBM-repro
Supplementary Material: zip
Assigned Action Editor: ~Sungsoo_Ahn1
Submission Number: 4302