Revisiting transposed convolutions for interpreting raw waveform sound event recognition CNNs by sonification

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Keywords: convolutional neural networks, interpretability, sound event recognition, raw waveform, contrastive learning, self-supervised learning, sound classification, audioset
Abstract: The majority of recent work on the interpretability of audio and speech processing deep neural networks (DNNs) interprets spectral information modelled by the first layer, relying solely on visual means of interpretation. In this work, we propose \textit{sonification}, a method to interpret intermediate feature representations of sound event recognition (SER) convolutional neural networks (CNNs) trained on raw waveforms by mapping these representations back into the discrete-time input signal domain, highlighting substructures in the input that maximally activate a feature map as intelligible acoustic events. We use sonifications to compare supervised and self-supervised feature hierarchies and show how sonifications work synergistically with signal processing techniques and visual means of representation, aiding the interpretability of SER models.
One-sentence Summary: We propose an adaptation of "Visualizing and Understanding Convolutional Networks" by Zeiler and Fergus, ECCV 2014, to interpret raw waveform SER CNNs in the audio input space and use it to compare self-supervised and supervised deep representations.
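The summary above references the deconvnet projection of Zeiler and Fergus adapted to 1-D raw waveforms. A minimal sketch of that idea, not the paper's actual implementation: forward a waveform through a single strided convolutional layer, keep only the most strongly activated feature map, and project it back into the discrete-time input domain with the transposed convolution that shares the forward layer's weights. All sizes (16 filters, kernel 64, stride 4) and function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first layer of a raw-waveform SER CNN: 16 filters, kernel 64, stride 4.
wave = rng.standard_normal(1024)              # discrete-time input signal
filters = rng.standard_normal((16, 64)) * 0.1

def conv1d(x, w, stride):
    """Strided 'valid' 1-D convolution with ReLU -> (n_filters, n_frames) feature maps."""
    n_frames = (len(x) - w.shape[1]) // stride + 1
    frames = np.stack([x[i * stride : i * stride + w.shape[1]] for i in range(n_frames)])
    return np.maximum(w @ frames.T, 0.0)

def conv_transpose1d(a, w, stride, out_len):
    """Transposed convolution: scatter each activation back through its filter."""
    y = np.zeros(out_len)
    for f in range(w.shape[0]):
        for t in range(a.shape[1]):
            y[t * stride : t * stride + w.shape[1]] += a[f, t] * w[f]
    return y

feats = conv1d(wave, filters, stride=4)       # (16, 241) ReLU'd feature maps

# Keep only the feature map with the strongest peak activation, zeroing the
# rest (Zeiler & Fergus style), then map it back into the waveform domain.
top = int(feats.max(axis=1).argmax())
masked = np.zeros_like(feats)
masked[top] = feats[top]
sonified = conv_transpose1d(masked, filters, stride=4, out_len=len(wave))

print(feats.shape, sonified.shape)
```

The projected signal `sonified` lives in the input space, so it can be listened to directly rather than only plotted, which is the point of sonification over purely visual interpretation.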
Supplementary Material: zip