Using Self-Supervised Learning of Birdsong for Downstream Industrial Audio Classification

Published: 02 Jul 2020, Last Modified: 05 May 2023, SAS 2020
Keywords: industrial audio, pitch, self-supervised learning, birdsong audio, deep learning
TL;DR: We investigate self-supervised learning on a dataset of pitch-intensive birdsong, combined with select data augmentation, for downstream motorized audio classification.
Abstract: In manufacturing settings, workers rely on their sense of hearing and their knowledge of what sounds correct to identify machine quality problems from the pitch, rhythm, timbre, and other characteristics of machine operation. Using machine learning to classify these sounds has broad applications for automating the manual quality-recognition work currently being done, including automating machine operator training, quality control detection, and diagnostics across the manufacturing and mechanical service industries. We previously established that models taking input pitch information from music domains can dramatically improve classification performance on industrial machine audio, leveraging the pretrained CREPE pitch model. In this work we explore the use of self-supervised learning on pitch-intensive birdsong rather than the CREPE model. To reduce our reliance on a pretrained pitch model and reduce the quantity of labeled industrial audio required, we implement self-supervised representation learning on plentiful, license-free, unlabeled, pitch-intensive wild birdsong recordings, with audio data augmentation, to perform classification on industrial audio. We show that: (1) we can preprocess the unlabeled birdsong data with unsupervised methods to eliminate low-signal samples and mask low-frequency noise, leaving only desirable chirp-rich samples; (2) we can identify effective representations and approaches for learning birdsong pitch content by comparing select self-supervised pretext tasks of temporal sequence prediction and sequence generation; (3) we can identify effective augmentation methods for learning pitch by comparing the impact of a variety of audio data augmentation methods on self-supervised learning; and (4) downstream fine-tuned models deliver improved performance classifying industrial motor audio.
We demonstrate that motorized sound classification models using self-supervised learning on a dataset of pitch-intensive birdsong, combined with select data augmentation, achieve better results than models using the pretrained CREPE pitch model.
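The preprocessing step described in finding 1 above, masking low-frequency noise and discarding low-signal clips so that only chirp-rich samples remain, can be sketched as follows. This is a minimal NumPy-only illustration, not the paper's actual pipeline: the 2 kHz cutoff, the energy quantile, and the function names are illustrative assumptions.

```python
import numpy as np

def highpass_mask_spectrogram(clip, sr=22050, n_fft=1024, cutoff_hz=2000.0):
    """Magnitude spectrogram with bins below cutoff_hz zeroed.

    cutoff_hz=2000.0 is an assumed threshold: birdsong energy typically
    sits well above low-frequency machine/wind noise.
    """
    hop = n_fft // 2
    n_frames = 1 + (len(clip) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([clip[i * hop:i * hop + n_fft] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * window, axis=1)).T  # (freq_bins, frames)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    spec[freqs < cutoff_hz, :] = 0.0  # mask low-frequency noise bins
    return spec

def keep_chirp_rich(clips, sr=22050, energy_quantile=0.5):
    """Drop low-signal clips: keep only those whose masked-spectrogram
    energy is at or above the given quantile across the batch."""
    energies = [highpass_mask_spectrogram(c, sr=sr).sum() for c in clips]
    thresh = np.quantile(energies, energy_quantile)
    return [c for c, e in zip(clips, energies) if e >= thresh]
```

For example, a batch containing a 4 kHz chirp-like tone and a 100 Hz motor hum would retain only the former, since the hum's energy is almost entirely zeroed by the high-pass mask.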
Double Submission: No