Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation
Abstract: Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound-producing objects in videos by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, causing audio and visual distributions to evolve over time, which challenges existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive experiments demonstrate competitive performance across diverse continual scenarios, establishing a foundation for lifelong audio-visual perception.