MMAL: Multi-Modal Analytic Learning for Exemplar-Free Audio-Visual Class Incremental Tasks

Published: 20 Jul 2024, Last Modified: 31 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Class-incremental learning poses a significant challenge under an exemplar-free constraint, leading to catastrophic forgetting and sub-par incremental accuracy. Previous attempts have focused primarily on single-modality tasks, such as image classification or audio event classification. However, in the context of Audio-Visual Class-Incremental Learning (AVCIL), the effective integration and utilization of heterogeneous modalities, with their complementary and mutually enhancing characteristics, remains largely unexplored. To bridge this gap, we propose the Multi-Modal Analytic Learning (MMAL) framework, an exemplar-free solution for AVCIL that employs a closed-form, linear approach. Specifically, MMAL introduces a modality fusion module that re-formulates the AVCIL problem from a Recursive Least-Squares (RLS) perspective. Complementing this, a Modality-Specific Knowledge Compensation (MSKC) module is designed to further alleviate the under-fitting limitation intrinsic to analytic learning by harnessing individual knowledge from the audio and visual modalities in tandem. Comprehensive experimental comparisons with existing methods show that our proposed MMAL achieves superior performance, with accuracies of 76.71%, 78.98%, and 76.19% on the AVE, Kinetics-Sounds, and VGGSounds100 datasets, respectively, setting a new state of the art for AVCIL. Notably, compared to memory-based methods, our MMAL, being an exemplar-free approach, preserves data privacy while better leveraging multi-modal information for improved incremental accuracy.
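The abstract's RLS perspective can be made concrete with a minimal sketch. The idea behind analytic class-incremental learning is to maintain a closed-form ridge-regression classifier and update it recursively as each new task arrives, without storing old exemplars. The sketch below is a generic block Recursive Least-Squares update, not the authors' actual MMAL implementation; the function names and the single-modality feature matrix `X` are illustrative assumptions (MMAL additionally fuses audio and visual features and adds the MSKC module).

```python
import numpy as np

def init_analytic(d, c, gamma=1.0):
    """Initialize the analytic classifier.

    R approximates (X^T X + gamma * I)^{-1} (the regularized inverse
    autocorrelation matrix); W is the d x c linear classifier weight.
    """
    return np.eye(d) / gamma, np.zeros((d, c))

def rls_update(R, W, X, Y):
    """Absorb one task's data via a Woodbury-based block RLS update.

    X: (n, d) feature matrix for the new task.
    Y: (n, c) one-hot (or real-valued) targets.
    Only R and W are carried forward, so no old data is revisited
    (exemplar-free), and the update is exact, not approximate.
    """
    n = X.shape[0]
    # Gain matrix K = R_{t-1} X^T (I + X R_{t-1} X^T)^{-1}
    K = R @ X.T @ np.linalg.inv(np.eye(n) + X @ R @ X.T)
    R = R - K @ X @ R                 # updated inverse autocorrelation
    W = W + K @ (Y - X @ W)           # innovation-driven weight update
    return R, W
```

A key property, which underlies the "one epoch of training" claim, is that the recursive solution is algebraically identical to retraining the ridge regressor on all data seen so far, so incremental accuracy does not degrade from the update rule itself.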
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: In the era of large models, we are seeing an exponential increase in both model parameters and training data, particularly in the multimedia field. Amidst massive data and large models, incremental learning emerges as a prime choice for rapid training and adaptation to new and old tasks, avoiding resource-intensive joint training. Audio and video are the most common and frequently used data types in multimedia, serving as the foundational modalities for conveying rich sensory experiences in numerous applications. Our work addresses the challenging audio-visual class-incremental learning (AVCIL) problem by proposing Multi-Modal Analytic Learning (MMAL), which can adapt to new and old tasks with just one epoch of training, without accessing any data from previous tasks. MMAL introduces a modality fusion module that re-formulates the AVCIL problem from a Recursive Least-Squares (RLS) perspective. Complementing this, a Modality-Specific Knowledge Compensation (MSKC) module is further designed to mitigate the under-fitting limitation intrinsic to analytic learning, achieving state-of-the-art performance in AVCIL. Therefore, our work not only tackles the challenge of training large models on massive multimodal datasets but also underscores the importance and effectiveness of incremental learning in multimodal tasks, providing a new paradigm for AVCIL and enabling more robust and adaptable multi-modal systems.
Supplementary Material: zip
Submission Number: 5001