The rising need to make Machine Learning (ML) models interpretable, fair and trustworthy has led the research community to come up with better explanations to enable interpretation, validation, and transparency when these models are used in domains such as healthcare or finance. But how do we judge whether one explanation is better than another? Different types of explanations require different evaluation metrics. The most common type is the feature importance explanation, usually presented as a relative ranking of features according to their importance in determining the model’s output. Another common type is the counterfactual explanation, which tells us the minimum changes to the features that would result in a different classification by the model.
For black-box classifiers, feature importance and counterfactuals are estimated using different techniques that involve either training an interpretable classifier to mimic the black box locally, or perturbing the input by inserting and removing features. A set of evaluation metrics is required to capture how faithfully an explanation represents the black-box model, a property usually termed the “robustness” of the explanation.
This blog post describes the contributions of the paper “Evaluations and Methods for Explanation through Robustness Analysis” by Cheng et al., which proposes a novel way of assessing robustness and of producing more robust explanations, specifically for explanations based on the insertion and removal of features, because such explanations face two drawbacks:
To address these drawbacks, the paper makes two contributions:
In order to evaluate such explanations, two key assumptions are made:
Based on these assumptions, a robustness parameter ε<sup>*</sup> is defined by the following equation:
ε<sup>*</sup><sub>x,S</sub> = g(f, x, S) = min<sub>δ</sub> ‖δ‖ s.t. f(x + δ) ≠ y, δ<sub>S<sup>c</sup></sub> = 0
In the above equation, f is the model, x is the input, U is the set of all features and S is a subset of U. The term δ is an adversarial perturbation restricted to the features in S (hence the constraint δ<sub>S<sup>c</sup></sub> = 0), and ε<sup>*</sup> is the norm of the smallest such perturbation that changes the prediction y. This minimum perturbation, when computed over a set of important features, should be small as per assumption 2. Conversely, minimising δ over a set of non-important features S<sup>c</sup> should give a larger value, as large perturbations are required to change the model’s output when only non-important features are perturbed.
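To make this concrete, here is a minimal, self-contained sketch of estimating g(f, x, S) for a black-box classifier. The paper uses proper adversarial-attack machinery for this step; the crude random search below, with the hypothetical names `estimate_g` and `predict` and a toy linear classifier, is only meant to illustrate the definition.

```python
import numpy as np

def estimate_g(predict, x, S, eps_grid=np.linspace(0.01, 5.0, 100),
               n_trials=200, seed=0):
    """Smallest perturbation norm (over eps_grid) for which some random
    perturbation supported only on the features in S flips predict(x)."""
    rng = np.random.default_rng(seed)
    y = predict(x)
    mask = np.zeros_like(x)
    mask[list(S)] = 1.0                    # perturbation allowed only on features in S
    for eps in eps_grid:                   # sweep candidate perturbation sizes
        for _ in range(n_trials):
            delta = rng.normal(size=x.shape) * mask
            delta = eps * delta / (np.linalg.norm(delta) + 1e-12)  # scale to norm eps
            if predict(x + delta) != y:    # prediction flipped, so eps is achievable
                return eps
    return np.inf                          # no flip found within the grid

# Toy usage with a hypothetical linear classifier:
w, b = np.array([2.0, -1.0, 0.1, 0.05]), 0.0
predict = lambda z: int(np.dot(w, z) + b > 0)
x = np.array([1.0, 1.0, 1.0, 1.0])
print(estimate_g(predict, x, S={0, 1}))    # important features -> small eps
print(estimate_g(predict, x, S={2, 3}))    # unimportant features -> large eps (inf here)
```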
Based on this, we can say that R(S), where S is a set of important features, is given by ε<sup>*</sup><sub>x,S</sub>, and R(S<sup>c</sup>), where S<sup>c</sup> is the set of non-important features, is given by ε<sup>*</sup><sub>x,S<sup>c</sup></sub>.
With this evaluation metric R, we can also plot the robustness curve against the top-K features of a given ranking and summarise it by its area under the curve (AUC).
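As an illustration, the following sketch computes R(S) for the top-K features of a given ranking and aggregates the curve into an AUC. The linear model, the closed-form `robustness_linear` helper, and the cap of 10.0 on infinite values are assumptions made for the demo, not part of the paper.

```python
import numpy as np

def robustness_linear(w, b, x, S):
    """Closed-form min L2 perturbation supported on S that flips sign(w.x + b)."""
    if not S:
        return np.inf
    w_S = np.asarray(w, dtype=float)[list(S)]
    norm = np.linalg.norm(w_S)
    return abs(np.dot(w, x) + b) / norm if norm > 0 else np.inf

def robustness_curve(ranking, robustness_fn):
    """R when only the top-K ranked features may be perturbed, for K = 1..d."""
    return [robustness_fn(set(ranking[:k])) for k in range(1, len(ranking) + 1)]

w, b = np.array([2.0, -1.0, 0.1, 0.05]), 0.0
x = np.array([1.0, 1.0, 1.0, 1.0])

good_ranking = [0, 1, 2, 3]   # important features ranked first
bad_ranking  = [3, 2, 1, 0]   # important features ranked last

for name, ranking in [("good", good_ranking), ("bad", bad_ranking)]:
    curve = np.minimum(robustness_curve(
        ranking, lambda S: robustness_linear(w, b, x, S)), 10.0)  # cap for a finite AUC
    auc = float(np.sum((curve[:-1] + curve[1:]) / 2))             # trapezoidal area
    print(name, np.round(curve, 2), "AUC =", round(auc, 2))
# The better ranking makes the R(S) curve drop faster, giving a smaller AUC.
```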
If we change the previous definition of ε<sup>*</sup><sub>x,S</sub> to the following equation:
ε<sup>*</sup><sub>x,S</sub> = g(f, x, S) = min<sub>δ</sub> ‖δ‖ s.t. f(x + δ) = t, δ<sub>S<sup>c</sup></sub> = 0
we can see that if the optimisation searches for perturbations that lead to another desired class t, it gives a counterfactual use of the feature subset S: the perturbed input x + δ serves as a counterfactual example.
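A sketch of this targeted variant is below: the only change from the earlier search is that the success test becomes f(x + δ) = t, and the returned x + δ can be read as a counterfactual example. The three-class linear model and the helper name `counterfactual_on_S` are hypothetical.

```python
import numpy as np

def counterfactual_on_S(predict, x, S, t, eps_grid=np.linspace(0.05, 5.0, 100),
                        n_trials=300, seed=0):
    """Smallest random perturbation supported on S found to reach target class t."""
    rng = np.random.default_rng(seed)
    mask = np.zeros_like(x)
    mask[list(S)] = 1.0                    # perturbation restricted to features in S
    for eps in eps_grid:
        for _ in range(n_trials):
            delta = rng.normal(size=x.shape) * mask
            delta = eps * delta / (np.linalg.norm(delta) + 1e-12)
            if predict(x + delta) == t:    # the constraint is now f(x + delta) = t
                return x + delta, eps      # x + delta is a counterfactual example
    return None, np.inf

# Toy three-class linear model: class = argmax(W @ x)
W = np.array([[ 2.0, -1.0, 0.1],
              [-1.0,  2.0, 0.1],
              [ 0.5,  0.5, 2.0]])
predict = lambda z: int(np.argmax(W @ z))
x = np.array([1.5, 0.5, 0.2])              # predicted class 0
cf, eps = counterfactual_on_S(predict, x, S={0, 1}, t=1)
if cf is not None:
    print("counterfactual:", np.round(cf, 2), "at perturbation size", round(eps, 2))
```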
Based on g(f, x, S), we can extract sets of important and non-important features by solving the following optimisation problems, respectively:
minimise g(f, x, S) s.t. |S| ≤ K

maximise g(f, x, S<sup>c</sup>) s.t. |S<sup>c</sup>| ≤ K
where K is the number of features we intend to analyse or consider.
The above problems can be solved by a greedy approach: we initialise an empty set S (or S<sup>c</sup>) and keep adding the feature that most improves the corresponding objective. However, a purely greedy choice misses interactions among features: two features might be very important together yet unimportant on their own. For this reason, the marginal contribution of a feature is also taken into consideration, by analysing the change in the model’s output when unchosen features are included alongside it. The concept is based on game theory and can be used to attribute each feature’s contribution to the model’s output.
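Below is a simplified sketch of this greedy idea, not the paper’s exact algorithm: each candidate feature is scored by its average marginal reduction of g when added alongside random subsets of the still-unchosen features, which gives credit to features that only matter jointly. The helper `robustness_fn` stands in for any estimator of g(f, x, S), here a closed-form linear toy; all names and constants are illustrative.

```python
import numpy as np

def greedy_important_features(robustness_fn, n_features, K, n_subsets=8, seed=0):
    """Greedily build S by adding the feature with the largest average
    marginal reduction in robustness_fn, sampled over random co-subsets."""
    rng = np.random.default_rng(seed)
    S = []
    for _ in range(K):
        remaining = [j for j in range(n_features) if j not in S]
        scores = {}
        for j in remaining:
            others = [k for k in remaining if k != j]
            vals = []
            for _ in range(n_subsets):
                # random subset of unchosen features included alongside j,
                # so features that only matter jointly still get credit
                extra = [k for k in others if rng.random() < 0.5]
                vals.append(robustness_fn(set(S + [j] + extra))
                            - robustness_fn(set(S + extra)))
            scores[j] = np.mean(vals)          # average marginal change in g
        S.append(min(scores, key=scores.get))  # most negative = biggest reduction
    return S

# Toy usage with a closed-form linear robustness (capped at 10.0):
w, b = np.array([2.0, -1.0, 0.1, 0.05]), 0.0
x = np.array([1.0, 1.0, 1.0, 1.0])

def robustness_fn(S):
    if not S:
        return 10.0                            # no perturbation allowed -> capped value
    w_S = w[list(S)]
    return min(abs(np.dot(w, x) + b) / (np.linalg.norm(w_S) + 1e-12), 10.0)

print(greedy_important_features(robustness_fn, n_features=4, K=2))   # -> [0, 1]
```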
To avoid confusion, this evaluation criterion and the resulting explanations differ from SHAP: SHAP considers the removal of a feature by setting it to a baseline value, whereas here we are interested in capturing the change in the model’s output when the input is shifted slightly from its original values. This change can then be optimised for a larger deflection in the output (for important features) or a smaller one (for non-important features).
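A tiny illustration of this distinction, with made-up values:

```python
import numpy as np

x        = np.array([3.2, 0.7, 1.5])
baseline = np.zeros_like(x)          # e.g. a zero or mean baseline used by removal-style methods
i        = 0

x_removed = x.copy()
x_removed[i] = baseline[i]           # feature i "removed" by resetting it to the baseline

x_perturbed = x.copy()
x_perturbed[i] += 0.1                # feature i nudged slightly away from its original value

print(x_removed, x_perturbed)
```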