IvRA: A Framework to Enhance Attention-Based Explanations for Language Models with Interpretability-Driven Training

Published: 21 Sept 2024, Last Modified: 06 Oct 2024, BlackboxNLP 2024, CC BY 4.0
Track: Full paper
Keywords: Interpretability, Attention, Simulatability, Faithfulness, Consistency
TL;DR: A framework to directly train a language model's attention distribution through regularization to produce attribution explanations that align with interpretability criteria such as simulatability, faithfulness, and consistency
Abstract: Attention has long served as a foundational technique for generating explanations. With recent developments in Explainable AI (XAI), the multi-faceted nature of interpretability has become more apparent. Can attention, as an explanation method, be adapted to meet the diverse needs that our expanded understanding of interpretability demands? In this work, we aim to address this question by introducing \texttt{IvRA}, a framework designed to directly train a language model's attention distribution through regularization to produce attribution explanations that align with interpretability criteria such as simulatability, faithfulness, and consistency. Our extensive experimental analysis demonstrates that \texttt{IvRA} outperforms existing methods in guiding language models to generate explanations that are simulatable, faithful, and consistent alongside their predictions. Furthermore, we perform ablation studies to verify the robustness of \texttt{IvRA} across various experimental settings and to shed light on the interactions among different interpretability criteria.
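
The abstract describes the general recipe of adding an attention regularizer to the task loss during fine-tuning. The sketch below illustrates that idea only; the model choice, the KL-divergence regularizer, the attention aggregation (last layer, mean over heads, [CLS] row), the `target_attrib` tensor, and the `lam` hyperparameter are all assumptions for illustration, not IvRA's actual formulation.

```python
# Illustrative sketch: fine-tuning with an attention-regularization term added to the
# task loss. All specifics (model, regularizer, aggregation, hyperparameters) are assumed.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, output_attentions=True
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
lam = 0.1  # strength of the interpretability regularizer (assumed hyperparameter)


def training_step(texts, labels, target_attrib):
    """One step: task loss + a KL term pulling attention toward target attributions.

    `target_attrib` is a (batch, seq_len) tensor of token-importance scores encoding
    the desired interpretability criterion; how it is produced is the framework-specific
    part and is not shown here.
    """
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    out = model(**enc, labels=labels)

    # Aggregate attention: last layer, mean over heads, [CLS] row over tokens.
    attn = out.attentions[-1].mean(dim=1)[:, 0, :]  # (batch, seq_len)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-12)

    # Normalize the target scores into a distribution and penalize divergence.
    target = target_attrib / target_attrib.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    reg = F.kl_div(attn.clamp_min(1e-12).log(), target, reduction="batchmean")

    loss = out.loss + lam * reg
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In this kind of setup, the choice of target attributions and divergence measure is what ties the regularizer to a specific criterion (e.g., faithfulness or consistency); the sketch leaves that choice open.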
Submission Number: 80