IFI: Interpreting for Improving: A Multimodal Transformer with an Interpretability Technique for Recognition of Risk Events

Published: 01 Jan 2024, Last Modified: 15 Feb 2025 · MMM (4) 2024 · CC BY-SA 4.0
Abstract: Explainable AI (XAI) methods are widely used to understand the features and decisions of neural networks. Transformers, applied to single modalities such as video, text, or signals as well as to multimodal data, are state-of-the-art models for tasks such as classification, detection, and segmentation, as they generalize better than conventional CNNs. Feature selection driven by interpretability techniques is therefore a promising way to train transformer models. This work proposes an interpretability method based on attention gradients that highlights important attention weights across training iterations, guiding the transformer parameters toward a more optimal direction. We consider a multimodal transformer operating on video and sensor data. First studied on the video modality, the strategy is applied to the sensor modality in the proposed multimodal transformer architecture before fusion. We show that late fusion via a combined loss over both modalities outperforms single-modality results. The target application is Multimedia in Health: the detection of risk situations for frail adults in the @home environment from wearable video and sensor data (BIRDS dataset). We also benchmark our approach on the publicly available video-only Kinetics-400 dataset, where it surpasses the state of the art.
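To make the attention-gradient idea concrete, the sketch below is a minimal, self-contained illustration in PyTorch of a gradient-weighted attention relevance map ("attention x gradient"), which could be used to highlight important attention weights during training. It is an assumption-based toy example, not the authors' implementation: the single-layer attention, the toy classifier, and all tensor shapes are invented here for illustration only.

import torch

torch.manual_seed(0)

heads, tokens, dim, num_classes = 4, 8, 16, 5

# Toy attention weights and values standing in for one self-attention
# layer of a (video or sensor) transformer block.
attn = torch.softmax(torch.randn(heads, tokens, tokens), dim=-1)
attn.requires_grad_(True)
values = torch.randn(heads, tokens, dim)
classifier = torch.nn.Linear(heads * dim, num_classes)

# Forward pass: attend, pool over tokens, classify.
context = attn @ values                    # (heads, tokens, dim)
pooled = context.mean(dim=1).reshape(-1)   # (heads * dim,)
logits = classifier(pooled)
target_class = 2
class_score = logits[target_class]

# Attention-gradient relevance: gradient of the target-class score with
# respect to the attention weights, multiplied elementwise by the weights,
# keeping positive contributions and averaging over heads.
grads, = torch.autograd.grad(class_score, attn, retain_graph=True)
relevance = (grads * attn).clamp(min=0).mean(dim=0)   # (tokens, tokens)
print(relevance.shape)

A relevance map of this kind could, for example, be turned into a mask or re-weighting of the attention matrix at the next training iteration; how exactly the paper injects it into training is described in the full text rather than in this sketch.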