Token Pruning Meets Audio: Investigating Unique Behaviors in Vision Transformer-Based Audio Classification

Published: 22 Jan 2025, Last Modified: 12 Mar 2025 · ICLR 2025 Poster · CC BY 4.0
Keywords: Audio Spectrogram Transformer, Token Pruning
TL;DR: Token pruning in a ViT for audio classification retains both low-intensity background tokens and high-intensity signal tokens, and both contribute to classification accuracy.
Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance across various computer vision tasks. To reduce the high computational cost of ViTs, token pruning has been proposed to selectively remove tokens that are not crucial. While effective in vision tasks, where non-object regions can simply be discarded, applying this technique to audio tasks presents unique challenges: in audio processing, distinguishing relevant from non-relevant regions is less straightforward. In this study, we applied token pruning to a ViT-based audio classification model using Mel-spectrograms and analyzed the trade-offs between model performance and computational cost. We show that the AudioMAE-TopK model can reduce MAC operations by $2\times$ with less than a 1\% decrease in accuracy for both speech command recognition and environmental sound classification. Notably, while many tokens from signal (high-intensity) regions were pruned, tokens from background (low-intensity) regions were frequently retained, indicating the model's reliance on these regions. In the ablation study, forcing the model to focus only on signal (high-intensity) regions led to lower accuracy, suggesting that background (low-intensity) regions contain unique, irreplaceable information for AudioMAE. In addition, we find that when token pruning is applied, the supervised pre-trained AST model emphasizes tokens from signal regions more than AudioMAE does.

**Notice**: This submission is being withdrawn due to errors in our token pruning analysis, which led to inaccurate claims regarding the retention of signal versus background tokens in AudioMAE-TopK. We are currently contacting the program chairs to coordinate the withdrawal process. We deeply apologize for any inconvenience our error may have caused.
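To make the Top-K pruning idea concrete, below is a minimal PyTorch sketch of the general technique the abstract describes: scoring patch tokens and keeping only the highest-scoring ones at an intermediate layer. The paper does not specify its exact scoring rule here, so this sketch assumes a common choice from the token-pruning literature (scoring each patch token by the [CLS] token's attention to it, averaged over heads); the function name `topk_prune` and all shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Top-K token pruning for a ViT-style audio model.
# Assumption: patch tokens are scored by CLS-to-patch attention; the
# paper's actual criterion for AudioMAE-TopK may differ.
import torch


def topk_prune(tokens: torch.Tensor,
               cls_attn: torch.Tensor,
               keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top-k patch tokens by score; the CLS token is always kept.

    tokens:   (B, 1 + N, D) -- CLS token followed by N patch tokens.
    cls_attn: (B, N)        -- attention from CLS to each patch token,
                               e.g. averaged over attention heads.
    """
    B, n_plus_1, D = tokens.shape
    n_patches = n_plus_1 - 1
    k = max(1, int(n_patches * keep_ratio))

    # Indices of the k highest-scoring patch tokens per example.
    topk_idx = cls_attn.topk(k, dim=1).indices  # (B, k)

    # Gather the selected patch tokens and re-attach the CLS token.
    patches = tokens[:, 1:, :]
    kept = patches.gather(1, topk_idx.unsqueeze(-1).expand(-1, -1, D))
    return torch.cat([tokens[:, :1, :], kept], dim=1)  # (B, 1 + k, D)


# Toy usage: 64 Mel-spectrogram patch tokens, half of them pruned.
x = torch.randn(2, 65, 192)       # batch of 2, CLS + 64 tokens, dim 192
scores = torch.rand(2, 64)        # stand-in for CLS attention scores
print(topk_prune(x, scores).shape)  # torch.Size([2, 33, 192])
```

Under this kind of scheme, pruning at `keep_ratio = 0.5` roughly halves the token count in subsequent layers, which is the source of the ~$2\times$ MAC reduction the abstract reports; the paper's finding is about *which* tokens survive such a selection (background versus signal regions), not the mechanism itself.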
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9332