Keywords: Learning to Reject, Explainable AI, Explanation quality metrics, Human-annotated data
TL;DR: We introduce a framework for learning to reject low-quality explanations, in which predictors are equipped with a rejector that evaluates explanation quality, and propose ULER, which learns a simple rejector that mirrors human judgments.
Abstract: Machine learning predictors are increasingly being employed in high-stakes applications such as credit scoring. Explanations help users unpack the reasons behind their predictions, but are not always "high quality". That is, end-users may have difficulty interpreting or believing them, which can complicate trust assessment and downstream decision-making. We argue that classifiers should have the option to refuse to handle inputs whose predictions cannot be explained properly, and we introduce a framework for learning to reject low-quality explanations (LtX) in which predictors are equipped with a rejector that evaluates the quality of explanations. In this problem setting, the key challenges are how to properly define and assess explanation quality and how to design a suitable rejector. Focusing on popular attribution techniques, we introduce ULER (User-centric Low-quality Explanation Rejector), which learns a simple rejector from human ratings and per-feature relevance judgments to mirror human judgments of explanation quality. Our experiments show that ULER outperforms both state-of-the-art and explanation-aware learning to reject strategies at LtX on eight classification and regression benchmarks and on a new human-annotated dataset, which we publicly release to support future research.
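To make the LtX setup concrete, the minimal sketch below shows one way a predictor could be paired with a rejector that scores explanation quality and abstains below a threshold. This is an illustrative assumption, not the paper's ULER implementation: the `ExplanationRejector` class, the `toy_attribution` function, the logistic-regression rejector, and the placeholder human labels are all hypothetical choices made for the example.

```python
# Hypothetical sketch of learning to reject low-quality explanations (LtX):
# a predictor is wrapped with a rejector trained on human quality labels,
# and the wrapper abstains when an explanation's predicted quality is low.
# This is NOT the paper's ULER method; all names here are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


class ExplanationRejector:
    """Pairs a fitted predictor with a rejector over its explanations."""

    def __init__(self, predictor, explainer, threshold=0.5):
        self.predictor = predictor            # any fitted model
        self.explainer = explainer            # callable: (model, x) -> attribution vector
        self.threshold = threshold            # abstain if quality score < threshold
        self.rejector = LogisticRegression()  # simple, interpretable rejector

    def fit_rejector(self, X, human_quality_labels):
        # Train the rejector on the attributions themselves, using binary
        # human judgments of explanation quality as the targets.
        E = np.vstack([self.explainer(self.predictor, x) for x in X])
        self.rejector.fit(E, human_quality_labels)
        return self

    def predict_or_reject(self, x):
        e = self.explainer(self.predictor, x).reshape(1, -1)
        quality = self.rejector.predict_proba(e)[0, 1]
        if quality < self.threshold:
            return None, quality              # abstain: explanation deemed low quality
        return self.predictor.predict(x.reshape(1, -1))[0], quality


# Toy usage with synthetic data and a stand-in attribution function.
X_train = np.random.randn(200, 5)
y_train = (X_train[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

def toy_attribution(model, x):
    # Hypothetical stand-in for SHAP/LIME-style per-feature attributions.
    return model.feature_importances_ * x

# Placeholder for human ratings of explanation quality (1 = good, 0 = poor).
human_labels = np.random.randint(0, 2, size=len(X_train))

ltx = ExplanationRejector(clf, toy_attribution).fit_rejector(X_train, human_labels)
prediction, quality_score = ltx.predict_or_reject(X_train[0])
```

In this sketch the rejector sees only the attribution vector; a richer design could also use the input, the prediction, or aggregate statistics of the explanation, and the threshold could be tuned to trade off coverage against the fraction of poorly explained predictions that slip through.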
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 25131