Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer

Published: 01 Jan 2023 · Last Modified: 21 May 2025 · INTERSPEECH 2023 · CC BY-SA 4.0
Abstract: This study investigates praise estimation, the task of detecting whether a speaker exhibits preferable behaviors in a conversational video. To estimate praise from multimodal information, it is important to consider behaviors that are synchronized across modalities. Such cross-modal synchronization can be modeled by a conventional multimodal Transformer with a time-axis concatenation architecture, since its attention matrices capture the relevance between all time steps of all input modalities. However, these attention matrices are so high-dimensional that training the model is difficult with a limited amount of training data. To alleviate this problem, we propose a loss function that encodes the prior knowledge that attention should concentrate around synchronized time steps across the input modalities. Experiments on a business negotiation conversation corpus showed that the proposed method improves the macro F1 score of praise estimation.
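The core idea of the proposed loss, penalizing cross-modal attention mass that falls far from synchronized time steps, can be sketched as follows. This is an illustrative reconstruction under assumed details (a Gaussian-shaped tolerance window, linear time alignment between modalities, and a `sigma` width parameter), not the exact formulation from the paper:

```python
import numpy as np

def sync_guided_attention_loss(attn, sigma=2.0):
    """Illustrative synchronization-guided penalty on a cross-modal
    attention matrix.

    attn  : (T_a, T_v) attention weights from audio queries to visual
            keys; each row is assumed to sum to 1 (softmax output).
    sigma : width of the tolerance window around synchronized steps.

    NOTE: a sketch of the general idea (a distance-weighted attention
    prior), with assumed details; not the paper's exact loss.
    """
    T_a, T_v = attn.shape
    # Map each audio step to its synchronized visual step, assuming the
    # two modalities cover the same time span (linear alignment).
    ta = np.arange(T_a)[:, None] * (T_v - 1) / max(T_a - 1, 1)
    tv = np.arange(T_v)[None, :]
    # Penalty is 0 at the synchronized step and grows toward 1 with
    # temporal distance from it.
    penalty = 1.0 - np.exp(-((tv - ta) ** 2) / (2.0 * sigma ** 2))
    # Average penalized attention mass over audio time steps.
    return float((attn * penalty).sum() / T_a)
```

Attention concentrated on the synchronized (diagonal) steps incurs near-zero loss, while attention spread uniformly over all time steps is penalized, so adding this term to the training objective nudges the attention matrices toward cross-modal synchrony.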