IvRA: A Framework to Enhance Attention-Based Explanations for Language Models with Interpretability-Driven Training

Published: 21 Sept 2024, Last Modified: 06 Oct 2024, BlackboxNLP 2024, CC BY 4.0
Track: Full paper
Keywords: Interpretability, Attention, Simulatability, Faithfulness, Consistency
TL;DR: A framework to directly train a language model's attention distribution through regularization to produce attribution explanations that align with interpretability criteria such as simulatability, faithfulness, and consistency
Abstract: Attention has long served as a foundational technique for generating explanations. With recent developments in Explainable AI (XAI), the multi-faceted nature of interpretability has become more apparent. Can attention, as an explanation method, be adapted to meet the diverse needs that our expanded understanding of interpretability demands? In this work, we aim to address this question by introducing \texttt{IvRA}, a framework designed to directly train a language model's attention distribution through regularization to produce attribution explanations that align with interpretability criteria such as simulatability, faithfulness, and consistency. Our extensive experimental analysis demonstrates that \texttt{IvRA} outperforms existing methods in guiding language models to generate explanations that are simulatable, faithful, and consistent alongside their predictions. Furthermore, we perform ablation studies to verify the robustness of \texttt{IvRA} across various experimental settings and to shed light on the interactions among different interpretability criteria.
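
The abstract describes the general recipe of adding an attention regularizer to the task loss during fine-tuning. The sketch below illustrates that idea only; the model choice, the KL-divergence regularizer, the attention aggregation (last layer, mean over heads, [CLS] row), the `target_attrib` tensor, and the `lam` hyperparameter are all assumptions for illustration, not IvRA's actual formulation.

```python
# Illustrative sketch: fine-tuning with an attention-regularization term added to the
# task loss. All specifics (model, regularizer, aggregation, hyperparameters) are assumed.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, output_attentions=True
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
lam = 0.1  # strength of the interpretability regularizer (assumed hyperparameter)


def training_step(texts, labels, target_attrib):
    """One step: task loss + a KL term pulling attention toward target attributions.

    `target_attrib` is a (batch, seq_len) tensor of token-importance scores encoding
    the desired interpretability criterion; how it is produced is the framework-specific
    part and is not shown here.
    """
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    out = model(**enc, labels=labels)

    # Aggregate attention: last layer, mean over heads, [CLS] row over tokens.
    attn = out.attentions[-1].mean(dim=1)[:, 0, :]  # (batch, seq_len)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-12)

    # Normalize the target scores into a distribution and penalize divergence.
    target = target_attrib / target_attrib.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    reg = F.kl_div(attn.clamp_min(1e-12).log(), target, reduction="batchmean")

    loss = out.loss + lam * reg
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In this kind of setup, the choice of target attributions and divergence measure is what ties the regularizer to a specific criterion (e.g., faithfulness or consistency); the sketch leaves that choice open.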
Submission Number: 80