Keywords: Audio-Visual Segmentation, real-time model, efficient architecture
TL;DR: We present AVESFormer, the first real-time transformer for audio-visual segmentation, overcoming attention dissipation and inefficient decoding. AVESFormer significantly improves the trade-off between performance and speed.
Abstract: Recently, Transformer-based models have performed remarkably well in audio-visual segmentation (AVS) tasks.
However, previous methods exhibit abnormal behavior and unsatisfactory results when using cross-attention.
By analyzing attention maps, we identify two primary challenges in existing AVS models: 1) attention dissipation, caused by anomalous attention weights after applying Softmax over a limited number of frames, and 2) narrow attention patterns in early decoder stages, which lead to inefficient use of the attention mechanism.
In this paper, we introduce AVESFormer, the first real-time audio-visual segmentation transformer, which is simultaneously fast, efficient, and lightweight.
We propose an efficient prompt query generator to rectify the behavior of cross-attention.
Moreover, we propose an early focus (ELF) decoder that improves efficiency by incorporating convolution operations tailored for local feature extraction, thus reducing computational overhead.
Extensive experiments demonstrate that AVESFormer effectively mitigates cross-attention issues, substantially improves attention utilization, and outperforms the previous state-of-the-art, achieving a superior trade-off between performance and speed.
The code can be found in the supplementary material.
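The attention dissipation mentioned in the abstract can be illustrated with a toy numpy sketch (not from the paper; all shapes and names here are illustrative assumptions): when cross-attention keys come from only a handful of highly similar audio frames, the Softmax logits are nearly equal, so the attention weights collapse toward uniform and convey little information.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def normalized_entropy(attn):
    # 1.0 = perfectly uniform (fully dissipated) attention weights.
    T = attn.shape[-1]
    return float(-(attn * np.log(attn + 1e-12)).sum(-1).mean() / np.log(T))

rng = np.random.default_rng(0)
d, n_queries, T = 256, 16, 5       # T: limited number of audio frames (assumed)
q = rng.standard_normal((n_queries, d))

# Correlated keys: audio frames from one short clip tend to be similar.
base = rng.standard_normal(d)
k_sim = base + 0.05 * rng.standard_normal((T, d))
# Distinct keys for contrast.
k_dist = rng.standard_normal((T, d))

for name, k in [("similar frames", k_sim), ("distinct frames", k_dist)]:
    attn = softmax(q @ k.T / np.sqrt(d))
    print(f"{name}: normalized attention entropy = {normalized_entropy(attn):.3f}")
```

With similar frames the normalized entropy sits near 1.0 (near-uniform, dissipated weights), while distinct keys yield noticeably lower entropy; this is only a toy model of the effect the paper attributes to Softmax over limited frames.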
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5444