Abstract: In this work, we conduct an in-depth analysis of two
frequency-dependent methods for sound event detection (SED):
FilterAugment and frequency dynamic convolution (FDY conv).
The goal is to better understand their characteristics and behaviors
in the context of SED. While SED has been rapidly advancing
through the adoption of various deep learning techniques from
other pattern recognition fields, such adopted techniques are often
not suitable for SED. To address this issue, two frequency-dependent
SED methods were previously proposed: FilterAugment,
a data augmentation method that randomly weights frequency bands,
and FDY conv, an architecture that applies frequency-adaptive convolution
kernels. These methods have demonstrated superior performance
in SED, and we aim to further analyze their detailed
effectiveness and characteristics. We compare class-wise
performance to identify the specific pros and cons of FilterAugment and
FDY conv. We apply Gradient-weighted Class Activation Mapping
(Grad-CAM), which highlights the time-frequency regions on which
the model bases its inference, to SED models with and without frequency
masking and with two types of FilterAugment to observe their detailed
characteristics. We propose simpler frequency-dependent convolution
methods and compare them with FDY conv to further understand
which components of FDY conv affect SED performance.
Lastly, we apply PCA to show how FDY conv adapts its dynamic kernels
across the frequency dimension for different sound event classes. The
results and discussions demonstrate that frequency dependency
plays a significant role in sound event detection and further confirm
the effectiveness of frequency-dependent methods for SED.
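The core idea behind FilterAugment, randomly weighting frequency bands of the input spectrogram, can be sketched as below. This is a minimal illustration under stated assumptions: the function name, parameter names, and default ranges are chosen for clarity and are not the paper's exact implementation.

```python
import numpy as np

def filter_augment(spec, n_bands=(3, 6), db_range=(-6.0, 6.0), rng=None):
    """Randomly weight contiguous frequency bands of a (freq, time) spectrogram.

    A rough sketch of FilterAugment's idea: split the frequency axis into a
    random number of bands and scale each band by a random gain. Parameter
    defaults here are illustrative assumptions, not the paper's values.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_freq = spec.shape[0]
    # pick how many bands to split the frequency axis into
    n = rng.integers(n_bands[0], n_bands[1] + 1)
    # random interior band boundaries along the frequency axis
    bounds = np.sort(rng.choice(np.arange(1, n_freq), size=n - 1, replace=False))
    bounds = np.concatenate(([0], bounds, [n_freq]))
    out = spec.copy()
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        gain_db = rng.uniform(*db_range)
        out[lo:hi] *= 10.0 ** (gain_db / 20.0)  # apply per-band gain in dB
    return out
```

Applied to a mel spectrogram during training, this perturbs the energy balance across frequency bands while preserving temporal structure, encouraging the model to rely on frequency-robust cues.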