Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey

Published: 01 Jan 2024, Last Modified: 13 Nov 2024, Venue: IEEE Access 2024, License: CC BY-SA 4.0
Abstract: The joint analysis of audio and video is a powerful tool that can be applied to various contexts, including action, speech, and sound recognition, audio-visual video parsing, emotion recognition in affective computing, and self-supervised training of deep learning models. Solving these problems often involves tackling core audio-visual tasks, such as audio-visual source localization, audio-visual correspondence, and audio-visual source separation, which can be combined in various ways to achieve the desired results. This paper provides a review of the literature in this area, discussing the advancements, history, and datasets of audio-visual learning methods for various application domains. It also presents an overview of the reported performances on standard datasets and suggests promising directions for future research.