Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation

Published: 09 Mar 2026, Last Modified: 30 Apr 2026 · CC BY 4.0
Abstract: Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound-producing objects in videos by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, causing audio and visual distributions to evolve over time and challenging existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive experiments demonstrate competitive performance across diverse continual scenarios, establishing a foundation for lifelong audio-visual perception.
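The abstract's audio-guided pre-fusion conditioning can be illustrated with a minimal sketch: a pooled audio embedding is projected to per-channel scale and shift terms that modulate the visual feature map before cross-modal attention. This is an illustrative FiLM-style interpretation, not the paper's actual implementation; all module and parameter names below are hypothetical.

```python
import torch
import torch.nn as nn

class AudioGuidedConditioning(nn.Module):
    """Illustrative sketch: modulate visual feature channels with
    projected audio context before cross-modal attention.
    (Names and structure are assumptions, not the ATLAS implementation.)"""

    def __init__(self, audio_dim: int, visual_channels: int):
        super().__init__()
        # Project the pooled audio context to a per-channel scale and shift.
        self.proj = nn.Linear(audio_dim, 2 * visual_channels)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) feature map; audio: (B, audio_dim) pooled embedding
        scale, shift = self.proj(audio).chunk(2, dim=-1)   # each (B, C)
        scale = scale.unsqueeze(-1).unsqueeze(-1)          # (B, C, 1, 1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        # Channel-wise modulation; the result would feed into cross-modal attention.
        return visual * (1 + scale) + shift

visual = torch.randn(2, 64, 16, 16)
audio = torch.randn(2, 128)
out = AudioGuidedConditioning(audio_dim=128, visual_channels=64)(visual, audio)
print(tuple(out.shape))  # (2, 64, 16, 16)
```

The modulated features keep the visual tensor's shape, so the conditioning layer can be dropped in front of any attention block without changing downstream dimensions.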