Low-Resourced Phonetic and Prosodic Feature Estimation With Self-Supervised-Learning-based Acoustic Modeling

Kiyoshi Kurihara, Masanori Sano

Published: 2024, Last Modified: 14 May 2026, ICASSP Workshops 2024, CC BY-SA 4.0

Abstract: We propose a method for estimating phonetic and prosodic features from speech that uses self-supervised-learning (SSL)-based acoustic modeling (AM). Because labeled prosodic feature data are scarce, we use SSL to enable few-shot learning-based speech recognition. Prosodic features symbolize accent information in pitch-accent languages, which is important for pronunciation. The method automatically generates labeled text-to-speech data for pitch-accent languages from speech alone. In contrast, conventional methods can recognize only the pitch accents among phonetic and prosodic features and often perform poorly in terms of character error rate. Our method combines wav2vec 2.0, an SSL-based AM method, with the Transformer architecture commonly used in natural language processing, which corrects phonetic-confusion errors. Experiments indicate that the proposed method achieves a 4.7% character error rate using SSL-based acoustic modeling with 5.69 hours of fine-tuning data and a phoneme-error-correction Transformer.
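
The character error rate reported in the abstract is conventionally computed as the edit distance between the reference and predicted symbol sequences, divided by the reference length. The following is a minimal sketch assuming that standard definition. The abstract does not state the exact scoring procedure, and the function names and the sample label strings are hypothetical, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical label strings; the apostrophe stands in for an accent mark.
    reference = "ko'nnichiwa"
    hypothesis = "konnichiwa"
    print(f"CER = {character_error_rate(reference, hypothesis):.3f}")
```

Because the score is normalized by the reference length, a single dropped accent symbol in a short label sequence already contributes several percentage points, which is why accent marks matter as much as phonemes under this metric.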

External IDs: dblp:conf/icassp/KuriharaS24