Generating Pseudo-Strong Labels from Weak Labels for Distributed Multi-Microphone Sound Event Detection
Abstract: Annotating frame-level strong labels for training a distributed multi-microphone sound event detection model to both recognize and temporally localize sound events within sequences is difficult and requires considerable time and effort. For given distributed multi-microphone data, strong labels identify the sound event categories in the environment along with their start and end times. Conversely, annotating sequence-level weak labels, which denote only the presence of sound events in the multi-microphone data without temporal information, is simpler. However, employing weak labels to train a distributed multi-microphone sound event detection model presents challenges. In this study, we propose leveraging weak labels within a distributed multi-microphone sound event detection framework to identify and temporally localize sound events across multiple microphones. Our approach first generates pseudo-strong labels for the distributed microphones using the provided weak labels. Subsequently, a latent embedding estimation model for the audio data is learned using the generated pseudo-strong labels. Using transfer learning, the trained latent embedding estimation model is then integrated into a sound event detection model, which identifies and temporally localizes sound events. By integrating the latent embedding estimation model learned from the pseudo-strong labels with the sound event detection model, the proposed framework leverages the knowledge contained in the weak labels and transfers it to the sound event detection task. We evaluated the proposed framework on the MM Office dataset and compared it with state-of-the-art baseline algorithms. The experimental results demonstrate that incorporating weak labels within the sound event detection framework enhances event detection accuracy.
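The sketch below illustrates, under stated assumptions, the two-stage pipeline the abstract describes: weak clip-level labels are converted into frame-level pseudo-strong labels, a latent embedding model is trained on those pseudo-strong labels, and its weights are then transferred into a sound event detection model. The class names, tensor shapes, and the simple "tile the weak label over all frames" pseudo-labelling rule are illustrative assumptions, not the authors' exact method.

```python
import torch
import torch.nn as nn

N_MICS, N_MELS, N_FRAMES, N_EVENTS = 4, 64, 500, 10


def weak_to_pseudo_strong(weak: torch.Tensor, n_frames: int) -> torch.Tensor:
    """Tile a clip-level weak label (batch, events) over every frame,
    giving pseudo-strong labels of shape (batch, frames, events)."""
    return weak.unsqueeze(1).expand(-1, n_frames, -1).contiguous()


class EmbeddingNet(nn.Module):
    """Latent embedding estimator trained on pseudo-strong labels."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(N_MICS * N_MELS, emb_dim,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * emb_dim, N_EVENTS)  # frame-wise classifier

    def forward(self, feats: torch.Tensor):
        # feats: (batch, frames, mics * mels) -- stacked multi-microphone features
        emb, _ = self.rnn(feats)
        return emb, torch.sigmoid(self.head(emb))


class SEDModel(nn.Module):
    """Sound event detector that reuses (transfers) the trained embedding net."""
    def __init__(self, pretrained: EmbeddingNet, emb_dim: int = 128):
        super().__init__()
        self.embedding = pretrained.rnn                  # transferred weights
        self.classifier = nn.Linear(2 * emb_dim, N_EVENTS)

    def forward(self, feats: torch.Tensor):
        emb, _ = self.embedding(feats)
        return torch.sigmoid(self.classifier(emb))       # frame-level event posteriors


# --- Stage 1: train the embedding model on pseudo-strong labels ---
feats = torch.randn(8, N_FRAMES, N_MICS * N_MELS)        # stand-in multi-mic features
weak = (torch.rand(8, N_EVENTS) > 0.7).float()           # clip-level weak labels
pseudo = weak_to_pseudo_strong(weak, N_FRAMES)

emb_net = EmbeddingNet()
optimizer = torch.optim.Adam(emb_net.parameters(), lr=1e-3)
_, frame_probs = emb_net(feats)
loss = nn.functional.binary_cross_entropy(frame_probs, pseudo)
loss.backward()
optimizer.step()

# --- Stage 2: transfer the embedding into the SED model and fine-tune ---
sed = SEDModel(emb_net)
frame_posteriors = sed(feats)                            # (8, N_FRAMES, N_EVENTS)
```

In practice the pseudo-strong labels could instead be obtained by thresholding frame-level posteriors of a weakly supervised model; the tiling rule above is only the simplest stand-in for that step.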