Keywords: Vietnamese, Automatic Speech Recognition, Multi-dialect Speech Recognition, Transformer
Abstract: Vietnamese automatic speech recognition (ASR) remains challenging due to systematic dialectal variation across Northern, Central, and Southern regions, where identical lexical items often exhibit substantially different pronunciations. Most existing approaches address this variability primarily at the word level, relying on vocabularies that implicitly assume dialect-invariant mappings between orthography and pronunciation, which is linguistically inappropriate for Vietnamese.
In this work, we propose a dialect-aware phonetic framework that explicitly models Vietnamese phonological structure and dialectal variation at both the vocabulary and decoding levels. We introduce a phonetic vocabulary that decomposes each syllable into structured phonetic components and maps them to dialect-specific IPA representations. Building on this representation, we design a phonetic-structure decoder that jointly predicts these components, enabling consistent and interpretable modeling.
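To make the idea concrete, the decomposition described above can be sketched as a small mapping from structured syllable components to dialect-specific IPA strings. The component inventory, dialect labels, and IPA entries below are illustrative assumptions for a few well-known Northern/Southern onset differences (e.g. orthographic "v" realized as /j/ in Southern speech), not the paper's actual phonetic resources:

```python
from dataclasses import dataclass

# Hypothetical dialect-specific IPA table for a few onsets; the paper's
# phonetic vocabulary is far richer than this sketch.
ONSET_IPA = {
    "v": {"northern": "v", "southern": "j"},  # "v" -> /j/ in Southern speech
    "d": {"northern": "z", "southern": "j"},  # "d" merges with /j/ in the South
    "t": {"northern": "t", "southern": "t"},  # invariant across dialects
}

@dataclass
class Syllable:
    onset: str    # initial consonant
    nucleus: str  # vowel
    coda: str     # final consonant (may be empty)
    tone: str     # tone label, e.g. "huyen"

def to_dialect_ipa(syl: Syllable, dialect: str) -> str:
    """Map structured syllable components to a dialect-specific IPA string
    (tone marks omitted for brevity)."""
    onset = ONSET_IPA.get(syl.onset, {}).get(dialect, syl.onset)
    return f"{onset}{syl.nucleus}{syl.coda}"

# Identical orthography, different surface pronunciation across dialects:
print(to_dialect_ipa(Syllable("v", "a", "", "huyen"), "northern"))  # va
print(to_dialect_ipa(Syllable("v", "a", "", "huyen"), "southern"))  # ja
```

A decoder that predicts each component (onset, nucleus, coda, tone) jointly, rather than whole words, can then share parameters across dialects while still emitting dialect-appropriate pronunciations.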
Experiments on the ViMD dataset demonstrate that the proposed approach consistently outperforms or matches strong pretrained baselines across dialects, achieving a WER of 13.35%, a PER of 8.45%, and dialect identification accuracy exceeding 95%, while using fewer parameters and requiring no external pretraining. We will release the code and phonetic resources to support reproducibility upon acceptance of this paper.
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: model architectures, multi-task learning, dialects and language varieties, linguistic variation, phonology, grapheme-to-phoneme conversion, pronunciation modeling, evaluation methodologies, automatic speech recognition
Contribution Types: NLP engineering experiment
Languages Studied: Vietnamese
Submission Number: 3329