Keywords: Vietnamese, Automatic Speech Recognition, Multi-dialect Speech Recognition, Transformer
Abstract: Vietnamese automatic speech recognition (ASR) remains challenging due to systematic dialectal variation across Northern, Central, and Southern regions, where identical lexical items often exhibit substantially different pronunciations. Most existing approaches address this variability primarily at the word level, relying on vocabularies that implicitly assume dialect-invariant mappings between orthography and pronunciation, which is linguistically inappropriate for Vietnamese.
In this work, we propose a dialect-aware phonetic framework that explicitly models Vietnamese phonological structure and dialectal variation at both the vocabulary and decoding levels. We introduce a phonetic vocabulary that decomposes each syllable into structured phonetic components and maps them to dialect-specific IPA representations. Building on this representation, we design a phonetic-structure decoder that jointly predicts these components, enabling consistent and interpretable modeling.
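To make the idea concrete, the decomposition described above can be sketched as a small mapping from structured syllable components to dialect-specific IPA strings. The component inventory, dialect labels, and IPA entries below are illustrative assumptions for a few well-known Northern/Southern onset differences (e.g. orthographic "v" realized as /j/ in Southern speech), not the paper's actual phonetic resources:

```python
from dataclasses import dataclass

# Hypothetical dialect-specific IPA table for a few onsets; the paper's
# phonetic vocabulary is far richer than this sketch.
ONSET_IPA = {
    "v": {"northern": "v", "southern": "j"},  # "v" -> /j/ in Southern speech
    "d": {"northern": "z", "southern": "j"},  # "d" merges with /j/ in the South
    "t": {"northern": "t", "southern": "t"},  # invariant across dialects
}

@dataclass
class Syllable:
    onset: str    # initial consonant
    nucleus: str  # vowel
    coda: str     # final consonant (may be empty)
    tone: str     # tone label, e.g. "huyen"

def to_dialect_ipa(syl: Syllable, dialect: str) -> str:
    """Map structured syllable components to a dialect-specific IPA string
    (tone marks omitted for brevity)."""
    onset = ONSET_IPA.get(syl.onset, {}).get(dialect, syl.onset)
    return f"{onset}{syl.nucleus}{syl.coda}"

# Identical orthography, different surface pronunciation across dialects:
print(to_dialect_ipa(Syllable("v", "a", "", "huyen"), "northern"))  # va
print(to_dialect_ipa(Syllable("v", "a", "", "huyen"), "southern"))  # ja
```

A decoder that predicts each component (onset, nucleus, coda, tone) jointly, rather than whole words, can then share parameters across dialects while still emitting dialect-appropriate pronunciations.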
Experiments on the ViMD dataset demonstrate that the proposed approach consistently outperforms or matches strong pretrained baselines across dialects, achieving a WER of 13.35%, a PER of 8.45%, and dialect identification accuracy exceeding 95%, while using fewer parameters and requiring no external pretraining. We will release the code and phonetic resources to support reproducibility upon acceptance of this paper.
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: model architectures, multi-task learning, dialects and language varieties, linguistic variation, phonology, grapheme-to-phoneme conversion, pronunciation modeling, evaluation methodologies, automatic speech recognition
Contribution Types: NLP engineering experiment
Languages Studied: Vietnamese
Submission Number: 3329