POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

Published: 26 Aug 2025, Last Modified: 26 Aug 2025
Venue: SpeechAI TTIC 2025 (Oral or Poster)
License: CC BY 4.0
Keywords: Phone recognition, speech foundation model, speech to text
TL;DR: POWSM is a multitasking speech foundation model focused on phonetic transcription across languages
Presentation Preference: Open to it if recommended by organizers
Abstract: We present **POWSM**, a multitask speech foundation model for phonetic transcription. Trained from scratch on 17k hours of multilingual speech from the IPAPack++ dataset, POWSM jointly learns phone recognition, ASR, and audio-guided phoneme-to-grapheme (P2G) and grapheme-to-phoneme (G2P) mapping. Preliminary results show that training from scratch outperforms fine-tuning Whisper, and that multitask learning lowers both phone error rate (PER) and phonetic feature error rate (PFER), an edit distance over articulatory features. Future directions include analyzing the benefits and trade-offs of multitask learning and scaling to additional tasks to further improve phonetic alignment and generalization.
Submission Number: 27
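
The abstract reports results in phone error rate (PER) without defining it. As a minimal sketch, assuming the conventional definition (Levenshtein edit distance between the reference and hypothesized phone sequences, divided by the reference length), PER can be computed as below. The function names `edit_distance` and `phone_error_rate` are illustrative and are not taken from the POWSM codebase.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone symbols."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Assumed definition: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must contain at least one phone")
    return edit_distance(ref, hyp) / len(ref)


# One substitution in a three-phone reference
reference = ["k", "æ", "t"]
hypothesis = ["k", "a", "t"]
per = phone_error_rate(reference, hypothesis)
```

A feature-based variant would replace the 0/1 substitution cost with a distance between articulatory feature vectors for the two phones; that variant is not shown here because the abstract does not specify the feature set.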