Reproducibility Study of "SLICE: Stabilized LIME for Consistent Explanations for Image Classification"
Abstract: This paper presents a reproducibility study of "SLICE: Stabilized LIME for Consistent Explanations for Image Classification" by Bora et al. (2024). SLICE enhances LIME (Local Interpretable Model-agnostic Explanations) by incorporating Sign Entropy-based Feature Elimination (SEFE) to remove unstable superpixels, together with an adaptive Gaussian-blur perturbation strategy that improves the consistency of feature-importance rankings. The original work claims that SLICE significantly improves explanation stability and fidelity. We systematically verify these claims through extensive experiments on the Oxford-IIIT Pets, PASCAL VOC, and MS COCO datasets. Our results confirm that SLICE achieves higher consistency than LIME, supporting its ability to reduce instability. However, our fidelity analysis challenges the claim of superior performance: LIME often achieves higher Ground Truth Overlap (GTO) scores, indicating stronger alignment with object segmentations. To investigate fidelity further, we introduce an alternative AOPC (Area Over the Perturbation Curve) evaluation that ensures a fair comparison across methods. We also propose GRID-LIME, a structured grid-based alternative to LIME that improves stability while maintaining computational efficiency. Our findings highlight trade-offs in post-hoc explainability methods and underline the need for fairer fidelity evaluations. Our implementation is publicly available in our GitHub repository.
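For intuition, the Ground Truth Overlap idea mentioned above can be sketched as a precision-style score: the fraction of the most important pixels that fall inside the object's segmentation mask. This is a minimal sketch under assumed definitions; the study's exact GTO formulation (e.g. whether it operates on superpixels rather than pixels) may differ, and the function name `ground_truth_overlap` is ours.

```python
import numpy as np

def ground_truth_overlap(importance, gt_mask, k):
    """Fraction of the k most-important pixels that lie inside the
    ground-truth object mask (a precision-style overlap score).

    importance : 2-D array of per-pixel importance scores
    gt_mask    : 2-D boolean array, True inside the object segmentation
    k          : number of top pixels to evaluate
    """
    flat = importance.ravel()
    top_idx = np.argpartition(flat, -k)[-k:]      # indices of the k largest scores
    top_mask = np.zeros(flat.shape, dtype=bool)
    top_mask[top_idx] = True
    top_mask = top_mask.reshape(importance.shape)
    return np.logical_and(top_mask, gt_mask).sum() / k

# Toy example: importance mass concentrated exactly on the object
imp = np.zeros((4, 4))
imp[:2, :2] = 1.0                  # high importance in the top-left 2x2 block
gt = np.zeros((4, 4), dtype=bool)
gt[:2, :2] = True                  # object occupies the same block
print(ground_truth_overlap(imp, gt, k=4))  # -> 1.0
```

A score near 1 means the explanation highlights the annotated object; a score near 0 means the highlighted regions miss it, which is the sense in which higher GTO indicates stronger alignment with object segmentations.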
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We thank the reviewers for their constructive feedback, which has significantly helped improve our manuscript. In response to the reviews, we have made the following key revisions:
1. Writing and Clarity:
1.1 The entire manuscript has been carefully revised for improved clarity, flow, and conciseness.
1.2 The Introduction (Section 1) and Discussion (Section 6) have been substantially updated to:
1.2.1 Provide clearer motivation for the GTO metric and GRID-LIME, explicitly linking them to insights gained during the reproducibility process (e.g., perturbation bias, LIME instability, SLICE computational cost).
1.2.2 Better frame the contributions and trade-offs, particularly regarding GRID-LIME's position as an alternative balancing consistency and efficiency.
1.2.3 Emphasize the critical analysis of SLICE's fidelity claims and highlight methodological findings (e.g., perturbation impact, OOD inputs).
1.3 Specific sections, including Section 4.1.3 (GRID-LIME), Section 4.3 (Hyperparameters), Section 5.4 (AOPC/AUC), and Section 6.2 (Implementation Details), have been clarified regarding formalism, logic, hyperparameter origins, and evaluation procedures.
2. Figures and Tables:
2.1 All figures (Figures 1-14 and Appendix Figures) have been replaced with high-resolution versions for better readability.
2.2 Figure captions (e.g., Figure 2: \sigma=0 meaning; Figure 8: GTO vs k) have been updated for clarity.
2.3 Table captions (Tables 1-4) have been updated to clearly define all columns (e.g., M_{\Delta}, N.C) and ensure completeness.
3. Code Validation and Computational Cost:
3.1 Appendix F has been added, presenting CCM results generated using the original authors' TensorFlow code for SLICE, allowing direct comparison and validation of our PyTorch implementation's consistency results.
3.2 Appendix D now includes Table 4, providing quantitative data on the computational runtime and estimated carbon emissions for LIME, SLICE, and GRID-LIME, supporting claims about GRID-LIME's efficiency relative to SLICE.
4. Methodology and Evaluation Enhancements:
4.1 GRID-LIME (Section 4.1.3): Presentation made more formal with the addition of Equations 2 and 3 and clearer explanation of the underlying logic for consistency improvement.
4.2 AOPC/AUC Analysis (Section 5.4 & Appendix A): Claims based on ECDF plots (Figures 6 & 7) are now statistically supported by Wilcoxon signed-rank test results presented in Tables 1 and 2 (Appendix A). The link between AOPC/AUC/MoRF and the "Insertion/Deletion" evaluation is made explicit. An explanation for p=1 values in Wilcoxon tests is provided in Appendix A.
4.3 Perturbation Methods: Discussion added in Section 6 regarding the inherent assumptions, biases (e.g., OOD inputs), and impact on faithfulness related to different perturbation strategies (masking vs. blur).
4.4 GTO Metric (Section 5.6): Mention added in Section 6 acknowledging the potential value of a complementary recall-based metric.
5. Corrections and Organization:
5.1 Corrected typos (e.g., \sigma, LIME/SLICE references).
5.2 Corrected Equation 1 typo and defined \mathcal{Z} (Section 4.1.1).
5.3 Ensured correct figure numbering and referencing throughout.
5.4 Defined terms like "AdaBlur" on first use.
5.5 Reorganized the Ablation Study (previously Section 5.5), integrating it into Sections 5.2/5.3 where its components (SEFE, AdaBlur) are discussed.
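The Wilcoxon signed-rank comparison added in item 4.2 can be sketched as follows. This is an illustrative, self-contained implementation of the test statistic W only (zero differences dropped, tied magnitudes given average ranks); the study presumably uses a standard library routine such as `scipy.stats.wilcoxon` to obtain p-values, and the function name `wilcoxon_w` and the toy paired scores below are our assumptions.

```python
def wilcoxon_w(x, y):
    """Wilcoxon signed-rank statistic W for paired samples x, y:
    the smaller of the positive and negative rank sums over the
    nonzero paired differences, with average ranks for ties."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    j = 0
    while j < len(order):                             # assign average ranks to ties
        k = j
        while k + 1 < len(order) and abs(diffs[order[k + 1]]) == abs(diffs[order[j]]):
            k += 1
        avg_rank = (j + k) / 2 + 1                    # mean of ranks j+1 .. k+1
        for m in range(j, k + 1):
            ranks[order[m]] = avg_rank
        j = k + 1
    w_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_pos, w_neg)

# Toy paired per-image scores for two methods (values are illustrative)
print(wilcoxon_w([1.0, 2.0, 3.0, 4.0], [0.5, 1.5, 2.0, 5.0]))  # -> 3.5
```

Because the test is paired and non-parametric, it is a natural fit for per-image AOPC/AUC scores, whose distributions need not be normal.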
We believe these comprehensive revisions address the points raised by all reviewers and significantly strengthen the quality, rigor, and clarity of our reproducibility study.
Assigned Action Editor: ~Fernando_Perez-Cruz1
Submission Number: 4308