Audio-Visual Sensor Fusion Framework Using Person Attributes Robust to Missing Visual Modality for Person Recognition

Published: 01 Jan 2023, Last Modified: 05 Mar 2025, Venue: MMM (2) 2023, License: CC BY-SA 4.0
Abstract: Audio-visual person recognition is the problem of recognizing an individual, as a person class defined by the training data, from multimodal audio-visual data. It has many applications in security, surveillance, and biometrics. Deep learning-based audio-visual person recognition frameworks report state-of-the-art person recognition accuracy. However, existing audio-visual frameworks require the presence of both modalities, which limits them when one or more of the modalities are missing. In this paper, we formulate an audio-visual person recognition framework in which we define and address the missing visual modality problem. The proposed framework enhances the robustness of audio-visual person recognition even when the visual modality is missing, using audio-based person attributes and a multi-head attention transformer-based network, termed the CNN Transformer Network (CTNet). The audio-based person attributes, such as age, gender and race, are predicted from the audio data using a deep learning model, termed the Speech-to-Attribute Network (S2A network). The attributes are predicted from the audio data, which is assumed to be always available, and provide additional cues for the person recognition framework. The predicted attributes, the audio data and the image data, which may be missing, are given as input to the CTNet, which contains the multi-head attention branch. The multi-head attention branch addresses the problem of missing visual modality by assigning attention weights to the audio features, the visual features and the audio-based attributes. The proposed framework is validated on the public CREMA-D dataset using a comparative analysis and an ablation study. The results show that the proposed framework enhances the robustness of person recognition even when the visible camera data is missing.
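The idea of assigning attention weights across audio features, visual features and audio-based attributes can be illustrated with a small sketch. This is not the CTNet architecture, whose details the abstract does not give. It is a simplified single-head scaled dot-product attention in plain Python, under the assumption that each modality is reduced to one feature vector and that a missing visual input is handled by masking its score. The function names `softmax` and `fuse`, the query vector and all feature values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; entries set to -inf receive weight 0."""
    finite = [s for s in scores if s != -math.inf]
    peak = max(finite)  # assumes at least one modality (audio) is present
    exps = [math.exp(s - peak) if s != -math.inf else 0.0 for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(query, tokens):
    """Single-head scaled dot-product attention over modality tokens.

    tokens maps a modality name to a feature vector, or to None when
    that modality is missing. A missing modality is masked with -inf,
    so it gets zero attention weight and does not affect the output.
    """
    names = list(tokens)
    dim = len(query)
    scale = math.sqrt(dim)
    scores = []
    for name in names:
        vec = tokens[name]
        if vec is None:
            scores.append(-math.inf)  # mask the missing modality
        else:
            scores.append(sum(q * v for q, v in zip(query, vec)) / scale)
    weights = softmax(scores)
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        if tokens[name] is not None:
            for i in range(dim):
                fused[i] += w * tokens[name][i]
    return dict(zip(names, weights)), fused

# Hypothetical feature vectors; the visual modality is missing here.
query = [1.0, 0.0, 1.0, 0.0]
tokens = {
    "audio": [0.5, 0.1, 0.3, 0.2],
    "attributes": [0.2, 0.4, 0.1, 0.0],
    "visual": None,
}
weights, fused = fuse(query, tokens)
```

With the visual token masked, the attention weights are redistributed over the audio and attribute tokens only, which is one simple way a fusion step can stay well defined when the image input is absent.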