Keywords: Phone recognition, speech foundation model, speech to text
TL;DR: POWSM is a multitask speech foundation model for phonetic transcription across languages.
Presentation Preference: Open to it if recommended by organizers
Abstract: We present **POWSM**, a multitask speech foundation model for phonetic transcription. Trained from scratch on 17k hours of multilingual speech from the IPAPack++ dataset, POWSM jointly learns tasks including phone recognition, ASR, and audio-guided phoneme-to-grapheme (P2G) and grapheme-to-phoneme (G2P) mappings.
Preliminary results show that training from scratch outperforms fine-tuning Whisper, and that multitask learning lowers both phone error rate (PER) and phonetic feature error rate (PFER), an edit distance computed over articulatory features. Future directions include analyzing the benefits and trade-offs of multitask learning, and scaling to additional tasks to further improve phonetic alignment and generalization.
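For readers unfamiliar with the metric, the following is a minimal sketch of how a phone error rate is commonly computed: the Levenshtein distance between reference and hypothesis phone sequences, summed over utterances and divided by the total number of reference phones. This is an illustration under stated assumptions, not the evaluation code used for POWSM; the function names `edit_distance` and `phone_error_rate` and the example phone sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences (lists of strings)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis is i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(refs, hyps):
    """Corpus-level PER: total edits divided by total reference phones."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must contain the same number of utterances")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference contains no phones")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_ref


# Hypothetical example: one substitution and one deletion over six reference phones.
refs = [["k", "æ", "t"], ["d", "ɔ", "g"]]
hyps = [["k", "a", "t"], ["d", "ɔ"]]
per = phone_error_rate(refs, hyps)
```

PFER follows the same alignment idea but replaces the 0/1 substitution cost with a distance between the articulatory feature vectors of the two phones, so near-misses are penalized less than unrelated phones.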
Submission Number: 27