Keywords: CNN, convolutional neural network; t-SNE, t-distributed Stochastic Neighbor Embedding; UMAP, Uniform Manifold Approximation and Projection; MSR, masked spectrogram reconstruction; ASR, automatic speech recognition.
Abstract: The increasing reliance on voice-based applications underscores the importance of accent recognition and similarity assessment in enhancing automatic speech recognition (ASR) systems. Despite notable progress, ASR systems face challenges in recognising non-native English accents, especially low-resource varieties such as Ghanaian English, due to data scarcity and limited model representation. These accents often contribute to reduced recognition accuracy and elevated error rates, affecting system inclusivity.
This study introduces a convolutional neural network-based Masked Spectrogram Reconstruction (CNN-MSR) model for accent classification and similarity assessment between native and non-native English speakers. The proposed approach applies random spectrogram masking so that the model learns robust, accent-specific features while reducing overfitting and sensitivity to background noise. A Much Lower Frame Rate (mLFR) strategy is incorporated to improve computational efficiency without compromising acoustic fidelity.
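As a rough illustration of random spectrogram masking, the sketch below hides a fraction of rectangular time-frequency patches and scores a reconstruction only on the hidden cells. It is a minimal sketch, not the CNN-MSR implementation: the function names `mask_spectrogram` and `masked_mse`, the patch size, the mask ratio, the zero fill value, and the choice to score only masked cells are all assumptions made for illustration, and plain Python lists stand in for an array library.

```python
import random

def mask_spectrogram(spec, mask_ratio=0.3, patch=(4, 4), mask_value=0.0, seed=None):
    """Hide randomly chosen patches of a spectrogram (hypothetical helper).

    spec is a list of rows indexed as spec[freq][time].
    Returns (masked_copy, mask), where mask[f][t] is True for hidden cells.
    """
    rng = random.Random(seed)
    n_freq, n_time = len(spec), len(spec[0])
    pf, pt = patch
    # Top-left corners of a grid of non-overlapping patches.
    corners = [(f, t) for f in range(0, n_freq, pf) for t in range(0, n_time, pt)]
    chosen = rng.sample(corners, round(mask_ratio * len(corners)))

    masked = [row[:] for row in spec]  # copy, so the input stays untouched
    mask = [[False] * n_time for _ in range(n_freq)]
    for f0, t0 in chosen:
        for f in range(f0, min(f0 + pf, n_freq)):
            for t in range(t0, min(t0 + pt, n_time)):
                masked[f][t] = mask_value
                mask[f][t] = True
    return masked, mask

def masked_mse(pred, target, mask):
    """Mean squared error over masked cells only (an assumed loss choice)."""
    errors = [
        (p - t) ** 2
        for prow, trow, mrow in zip(pred, target, mask)
        for p, t, m in zip(prow, trow, mrow)
        if m
    ]
    return sum(errors) / len(errors) if errors else 0.0
```

In a training loop, `masked` would be fed to the encoder-decoder and `masked_mse` would compare the network's output with the original spectrogram, so the model is rewarded only for inferring the hidden regions from their context.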
Experimental evaluations on the AccentDB and Ghanaian English datasets demonstrate the model's improved performance: it achieves an accuracy of 90.71%, significantly surpassing the baseline of 81.88%. Furthermore, the model reduces word error rates to 0.00 for native accents and 19.91 for non-native accents. Visualisations using t-SNE and UMAP show clear clustering of native accents and overlapping patterns among non-native speakers, reflecting greater phonetic diversity. Cosine similarity analysis further reveals the model's capacity to capture nuanced intra- and inter-group accent relationships, demonstrating its potential for improving ASR inclusivity.
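The cosine similarity analysis can be sketched as follows, under the assumption that each utterance is represented by a fixed-length embedding vector and that accents are compared through the centroids of their embeddings. The helper names and the toy accent labels (`accent_a`, `accent_b`, `accent_c`) are hypothetical; the abstract does not specify how embeddings are pooled.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def accent_similarity_matrix(embeddings_by_accent):
    """Pairwise cosine similarity between accent centroids (assumed pooling)."""
    names = sorted(embeddings_by_accent)
    centroids = {name: centroid(embeddings_by_accent[name]) for name in names}
    return {
        (a, b): cosine_similarity(centroids[a], centroids[b])
        for a in names
        for b in names
    }

# Toy two-dimensional embeddings, purely illustrative.
toy = {
    "accent_a": [[1.0, 0.0], [1.0, 0.0]],
    "accent_b": [[0.0, 1.0]],
    "accent_c": [[1.0, 1.0]],
}
similarities = accent_similarity_matrix(toy)
```

Values near 1 indicate accents whose centroids point in nearly the same direction in embedding space, and values near 0 indicate largely unrelated directions. Intra-group variation could be examined the same way by comparing individual utterance embeddings within one accent.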
Submission Number: 12