SQS: Speech Quality Assessment in the Data Annotation Context

22 Sept 2023 (modified: 11 Feb 2024), submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Keywords: Speech quality, audio annotation, subjective measurements, speech intelligibility
Abstract: Audio quality plays a crucial role in data annotation because it affects factors that can significantly change the annotation results, including transcription speed, annotator confidence, and the number of audio replays. When audio quality is poor, transcriptions often contain numerous errors and may have blank or incomprehensible sections. Most existing objective measures (e.g., Perceptual Evaluation of Speech Quality (PESQ), Speech Intelligibility Index (SII)), subjective measures (e.g., Mean Opinion Score (MOS)), and speech quality measures (e.g., Word Error Rate (WER)) do not consider factors that could hinder the annotation process, and they correlate poorly with the audio quality perceived by the annotator in the annotation context. We propose a novel subjective speech quality measure within the audio annotation framework, called Speech Quality Score (SQS). This measure encompasses the characteristics most likely to affect transcription performance and, consequently, annotation quality. Additionally, we propose a DNN-based model to predict the SQS measure. Our experiments were conducted on a dataset of 1,020 audio samples with SQS annotations, created specifically for this study using the RTVE2020 Database. The proposed model achieved a linear correlation coefficient of 0.8 between ground-truth and predicted SQS values. In contrast, state-of-the-art MOS prediction models correlated poorly (0.2) with ground-truth SQS values.
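The abstract reports results as a linear correlation coefficient between ground-truth and predicted SQS values. As a minimal sketch of how such a coefficient can be computed, assuming it is the standard Pearson correlation (the abstract does not name the estimator), the snippet below uses a hypothetical helper `pearson_correlation` and made-up score lists that are not taken from the paper.

```python
import math


def pearson_correlation(x, y):
    """Pearson linear correlation between two equal-length score lists.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for illustration only (not data from the paper).
ground_truth = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.4, 3.8, 4.9]
r = pearson_correlation(ground_truth, predicted)
print(f"linear correlation: {r:.3f}")
```

A value near 1 means predictions track the annotated scores closely, while a value near 0 (as reported for the MOS predictors) means they carry little linear information about SQS.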
Submission Number: 5441