Deep learning models are capable of complex auditory processing tasks such as keyword spotting, genre classification, and audio captioning, yet remain opaque. While several works have explored the interpretability of neural networks for computer vision and natural language processing, the audio modality has been largely overlooked. In this paper, we study the behavior of the audio CNN encoder used in the contrastively trained language-audio model CLAP. In the domains of music and human speech, we localize and identify the layers of the network that perform well on tasks of varying complexity, sometimes even outperforming the model's final outputs. Digging deeper, we also localize specific dataset classes to neuron clusters within a layer and analyze each cluster's contribution to the model's discriminability for that class. To perform these analyses, we propose an automated framework that can leverage a small dataset of a few thousand samples to evaluate and score neuron clusters for their role in classification. Our findings provide insights into the hierarchical nature of representations in audio CNNs, paving the way for improved interpretability of audio models.
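The abstract does not spell out how cluster scores are computed; as a rough illustration of the kind of ablation-based scoring it describes, the sketch below zeroes a hypothetical cluster of channels in one convolutional layer of a stand-in CNN (not the actual CLAP encoder) and measures the resulting drop in a linear probe's logit for a target class. The layer index, cluster members, probe, and input shapes are all assumptions for illustration only.

```python
# Hedged sketch: score a neuron (channel) cluster by ablating it in one layer
# and measuring the drop in a linear probe's mean logit for a target class.
# The tiny CNN below is a hypothetical stand-in for the CLAP audio encoder.
import torch
import torch.nn as nn

encoder = nn.Sequential(                          # stand-in audio CNN
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),   # "layer of interest"
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
probe = nn.Linear(32, 10)                         # linear probe over 10 classes

cluster = [3, 7, 12]                              # hypothetical neuron cluster (channel ids)
ablate = {"on": False}

def hook(module, inputs, output):
    # When ablation is switched on, zero out the cluster's channels.
    if ablate["on"]:
        output[:, cluster] = 0.0
    return output

encoder[2].register_forward_hook(hook)            # attach to the layer of interest

x = torch.randn(8, 1, 64, 64)                     # fake batch of log-mel spectrograms

def class_score(target_class):
    # Mean logit for the target class over the batch, no gradients needed.
    with torch.no_grad():
        logits = probe(encoder(x))
    return logits[:, target_class].mean().item()

base = class_score(target_class=2)
ablate["on"] = True
drop = base - class_score(target_class=2)         # cluster's contribution score
print(f"score drop after ablating cluster: {drop:.4f}")
```

In practice one would run such an ablation over a small labeled dataset (a few thousand samples, as the abstract suggests) and rank clusters by the resulting score drops; the exact scoring criterion used in the paper may differ.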
Keywords: interpretability, audio, audio models, neuron-level
TL;DR: A layer- and neuron-level study of how the CLAP audio CNN encoder behaves and what its intermediate representations encode.
Submission Number: 25