Abstract: Frequency dynamic convolution (FDY conv) has shown
state-of-the-art performance in sound event detection (SED) using
frequency-adaptive kernels obtained by frequency-varying
combination of basis kernels. However, FDY conv lacks an
explicit means to diversify frequency-adaptive kernels, potentially
limiting its performance. In addition, the size of basis
kernels is limited, while time-frequency patterns span a larger
spectro-temporal range. Therefore, we propose dilated frequency
dynamic convolution (DFD conv), which diversifies and
expands frequency-adaptive kernels by introducing different dilation
sizes to basis kernels. Experiments showed the advantages
of varying dilation sizes along the frequency dimension, and analysis
of attention weight variance confirmed that dilated basis kernels
are effectively diversified. By adapting a class-wise median filter
with the intersection-based F1 score, the proposed DFD-CRNN outperforms
FDY-CRNN by 3.12% in terms of polyphonic sound
detection score (PSDS).
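
The core idea can be illustrated with a minimal single-channel sketch in pure Python. It is not the implementation behind the reported results. It assumes 3x3 basis kernels, dilations given as (frequency, time) pairs, and a hypothetical attention function that scores each basis kernel from the time-averaged input at each frequency bin. All function names and the `score_weights` parameter are introduced here for illustration only.

```python
import math


def dilated_conv2d(x, kernel, dilation):
    """Zero-padded, same-size 3x3 cross-correlation of x[f][t] with a dilated kernel."""
    n_freq, n_time = len(x), len(x[0])
    d_f, d_t = dilation  # assumed order: (frequency dilation, time dilation)
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    ff = f + (i - 1) * d_f
                    tt = t + (j - 1) * d_t
                    if 0 <= ff < n_freq and 0 <= tt < n_time:
                        acc += kernel[i][j] * x[ff][tt]
            out[f][t] = acc
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frequency_attention(x, score_weights):
    """Hypothetical attention: one softmax over basis kernels per frequency bin.

    score_weights holds one (weight, bias) pair per basis kernel; the score is
    a linear function of the time-averaged input at that frequency bin.
    """
    attention = []
    for row in x:
        mean = sum(row) / len(row)
        attention.append(softmax([w * mean + b for w, b in score_weights]))
    return attention  # attention[f][i]


def dfd_conv(x, basis_kernels, dilations, score_weights):
    """Frequency-varying weighted sum of basis-kernel outputs, each with its own dilation."""
    attention = frequency_attention(x, score_weights)
    outputs = [dilated_conv2d(x, k, d) for k, d in zip(basis_kernels, dilations)]
    n_freq, n_time = len(x), len(x[0])
    return [
        [
            sum(attention[f][i] * outputs[i][f][t] for i in range(len(outputs)))
            for t in range(n_time)
        ]
        for f in range(n_freq)
    ]
```

Setting every dilation to (1, 1) reduces this sketch to the undilated, FDY-conv-like case, so the only difference between the two variants here is the per-kernel dilation.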