Evaluation of Feature-Based Explanations

The rising need to make Machine Learning (ML) models interpretable, fair, and trustworthy has led the research community to develop better explanations that enable interpretation, validation, and transparency when these models are used in domains such as healthcare or finance. But how do we judge whether one explanation is better than another? Different types of explanations require different evaluation metrics. The most common type is the feature-importance explanation, usually presented as a relative ranking of features according to their importance in determining the model’s output. Another common type is the counterfactual explanation, which tells us the minimum changes to the features that would result in a different classification by the model.

For black-box classifiers, feature importance and counterfactuals are estimated using techniques that involve either training an interpretable classifier to mimic the black-box model locally, or perturbing the input by inserting and removing features. A set of evaluation metrics is needed to measure how faithfully an explanation represents the black-box model, a property usually termed the “robustness” of the explanation.

Contribution of the Paper

This blog post describes the contribution of the paper “Evaluations and Methods for Explanation through Robustness Analysis” by Cheng et al., which proposes a novel way of assessing robustness and of producing more robust explanations, specifically for explanations based on the insertion and removal of features. Such explanations face two drawbacks:

  1. When feature importance is estimated by removing a feature, i.e. setting it to a baseline value, features whose values deviate strongly from the baseline tend to receive inflated importance. For example, setting RGB pixels to black will make bright pixels appear more important.
  2. When feature importance is estimated by replacing a feature with a value sampled from the data distribution (using a generative model), the generative model’s own bias leaks into the process, and not all domains have a suitable generative model.

To address this, the paper makes two contributions:

  1. It proposes a novel evaluation criterion for assessing the robustness of feature-based explanations (feature importance and counterfactuals), based on how the model’s output changes under small perturbations applied to different sets of features.
  2. It optimises this evaluation criterion to produce better explanations.

Robustness

To evaluate explanations, two key assumptions are made:

  1. Changing the values of only non-important features has a weak influence on the model’s output.
  2. Changing the values of only important features can easily change the model’s output.

Based on these assumptions, a robustness measure ε* is defined by the following equation:

ε*_xS = g(f, x, S) = min_δ ‖δ‖  s.t.  f(x + δ) ≠ y,  δ_Sc = 0

In the above equation, f is the model, x is the input, y is its predicted class, U is the set of all features, and S is a subset of U. The term δ is the minimum adversarial perturbation, constrained to act only on the features in S (the condition δ_Sc = 0 forces it to zero on the complement Sc). By assumption 2, this minimum perturbation should be small when S is a set of important features. Conversely, minimising δ over a set of non-important features Sc should give a larger value, since large perturbations are required to change the model’s output by perturbing only non-important features.
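To make this concrete, below is a minimal sketch, not the authors’ implementation, of how ε*_xS could be approximated for a PyTorch classifier: a projected-gradient attack whose perturbation is masked to the subset S, wrapped in a binary search over the perturbation radius. The toy model, step size, number of steps, and search range are all illustrative assumptions.

```python
import torch
import torch.nn as nn


def restricted_attack(model, x, y, mask, eps, steps=100, lr=0.05):
    """Search for a perturbation supported only where mask == 1, inside an
    L2 ball of radius eps, that flips the prediction away from y."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        logits = model(x + delta * mask)
        loss = -nn.functional.cross_entropy(logits, y)  # push away from y
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad
            norm = delta.norm()
            if norm > eps:                 # project back onto the L2 ball
                delta *= eps / norm
        delta.grad.zero_()
    pred = model(x + delta.detach() * mask).argmax(dim=1)
    return bool((pred != y).item())


def estimate_g(model, x, y, mask, eps_hi=10.0, tol=1e-2):
    """Binary search for the smallest radius at which the restricted attack
    succeeds; this approximates eps*_xS = g(f, x, S)."""
    lo, hi = 0.0, eps_hi
    if not restricted_attack(model, x, y, mask, hi):
        return float("inf")                # no flip found within the range
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if restricted_attack(model, x, y, mask, mid):
            hi = mid
        else:
            lo = mid
    return hi


# Toy usage: a small untrained 2-class MLP on 10 features (illustrative only)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(1, 10)
y = model(x).argmax(dim=1)
mask = torch.zeros(1, 10)
mask[0, :3] = 1.0                          # S = the first three features
print(estimate_g(model, x, y, mask))
```

In practice the toy model would be replaced by the trained black-box, and one might prefer a finite cap instead of infinity when no flip is found within the search range.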

Based on this, we define R(S), where S is a set of important features, as ε*_xS, and R(Sc), where Sc is the set of non-important features, as ε*_xSc.

With this evaluation metric R, we can also look at the area under the curve (AUC) obtained by plotting R against the top-K features of the ranking.

(Figure: AUC curves of the robustness metric plotted against the top-K features.)
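As a rough illustration, the sketch below (reusing the hypothetical estimate_g helper from the previous snippet) evaluates the robustness value for the top-K features of a given ranking at several values of K and integrates the resulting curve with the trapezoidal rule; the ranking and the choice of K values are placeholders.

```python
def robustness_curve_auc(model, x, y, ranking, ks):
    """ranking: feature indices sorted from most to least important.
    ks: subset sizes at which eps*_xS is evaluated."""
    curve = []
    for k in ks:
        mask = torch.zeros_like(x)
        mask[0, ranking[:k]] = 1.0         # perturb only the top-k features
        curve.append(estimate_g(model, x, y, mask))
    # trapezoidal rule over the (K, eps*) curve
    auc = sum((curve[i] + curve[i + 1]) / 2 * (ks[i + 1] - ks[i])
              for i in range(len(ks) - 1))
    return auc, curve


# Example with the toy model above and a made-up ranking of its 10 features
auc, curve = robustness_curve_auc(model, x, y, ranking=list(range(10)),
                                  ks=[1, 2, 3, 4, 5])
print(auc, curve)
```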

Counterfactual flavor

If we modify the previous equation for ε*_xS into the following:

ε*_xS = g(f, x, S) = min_δ ‖δ‖  s.t.  f(x + δ) = t,  δ_Sc = 0

then the optimisation searches for perturbations that push the prediction to a desired target class t, which gives the subset S a counterfactual interpretation.
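In code, the only change relative to the restricted_attack sketch above is the objective: instead of pushing the prediction away from y, we pull it towards a target class t (again an illustrative sketch, not the paper’s implementation).

```python
def targeted_restricted_attack(model, x, t, mask, eps, steps=100, lr=0.05):
    """Like restricted_attack above, but aims for a target class t."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        logits = model(x + delta * mask)
        loss = nn.functional.cross_entropy(logits, t)  # pull towards t
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad
            norm = delta.norm()
            if norm > eps:
                delta *= eps / norm
        delta.grad.zero_()
    pred = model(x + delta.detach() * mask).argmax(dim=1)
    return bool((pred == t).item())


# Example: can perturbing only the first three toy features reach class 0?
t = torch.tensor([0])
print(targeted_restricted_attack(model, x, t, mask, eps=5.0))
```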

Extracting explanations

Based on g(f, x, S), we can extract sets of important and non-important features by solving the following optimisation problems, respectively:

minimise g(f, x, S)   s.t.  |S| ≤ K
maximise g(f, x, Sc)  s.t.  |Sc| ≤ K

where K is the number of features we intend to select.

The above problems can be solved with a greedy approach: initialise an empty set S (or Sc) and keep adding the feature that most improves the corresponding objective. However, a naive greedy scheme misses interactions among features: two features might be very important together yet unimportant on their own. To account for this, the marginal contribution of a feature is also taken into consideration by analysing how the objective changes when unchosen features are included alongside it. This idea is rooted in game theory and can be used to better estimate each feature’s contribution to the model’s output. A sketch of such a selection procedure is given below.
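The following sketch is one way such a greedy selection might look; it is not the paper’s algorithm, only an illustration built on the hypothetical estimate_g helper from earlier. Each candidate feature is scored by how easily the model can be attacked through the candidate together with a few random subsets of the not-yet-chosen features (a crude stand-in for its marginal contribution), and the candidate with the smallest score, i.e. the easiest to attack, is added to S.

```python
import random


def greedy_important_set(model, x, y, n_features, K, n_subsets=5, subset_size=3):
    """Greedily build a set S of at most K important features; each candidate
    is scored by the average eps* over random context subsets that include it.
    Computationally heavy; for illustration only."""
    S, remaining = [], list(range(n_features))
    for _ in range(K):
        scores = {}
        for j in remaining:
            others = [f for f in remaining if f != j]
            vals = []
            for _ in range(n_subsets):
                ctx = random.sample(others, k=min(subset_size, len(others)))
                mask = torch.zeros_like(x)
                mask[0, S + ctx + [j]] = 1.0   # candidate j plus a random context
                vals.append(estimate_g(model, x, y, mask))
            scores[j] = sum(vals) / len(vals)
        best = min(scores, key=scores.get)     # smallest eps* => most important
        S.append(best)
        remaining.remove(best)
    return S


# Example: pick the 3 "most important" of the 10 toy features
print(greedy_important_set(model, x, y, n_features=10, K=3))
```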

To avoid confusion, this evaluation criterion and the resulting explanations differ from SHAP: SHAP removes features by setting them to a baseline value, whereas here we are interested in how the model’s output changes when the input is perturbed slightly around its original values. This change can then be optimised to produce a larger deflection in the output (for important features) or a smaller one (for non-important features).