ViPhoVQA: Toward a Phonemic-Based Method for Mitigating Rare and Out-of-Vocab Words in Vietnamese Text-Based Visual Question Answering
Abstract: Text-based VQA is a challenging task that requires machines to read scene text, which poses additional challenges beyond those of the standard VQA task. Recent advanced Text-based VQA methods use T5 as their backbone and construct answers by generating subwords step by step. In this study, we show that Text-based VQA datasets in Vietnamese suffer from the rare and Out-of-Vocab (OOV) word problem, because most of their answers are proper names, addresses, and numbers. To address this, we introduce ViPhoVQA (Vietnamese Phonemic-based Visual Question Answering), a novel transformer-based architecture that exploits the highly phonetic orthography of Vietnamese to construct answers from phonemes rather than subwords. Accordingly, ViPhoFormer can theoretically construct an unlimited number of Vietnamese words, hence mitigating the rare and OOV word challenges. Our experiments on three large-scale Vietnamese Text-based VQA datasets show that our proposed ViPhoFormer achieves State-of-the-Art (SotA) results in both F1-token and EM scores on all datasets.
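To illustrate why composing answers from sub-syllabic units sidesteps the OOV problem, the following is a minimal sketch, not the model described in the abstract. It assumes a purely orthographic split of a Vietnamese syllable into onset, rime, and tone; the names `TONE_MARKS`, `ONSETS`, and `decompose` are hypothetical, and the actual phoneme inventory used by the authors may differ.

```python
import unicodedata

# Combining marks that encode the five marked tones (assumed labels).
# Circumflex, breve, and horn are NOT tones; they stay with the vowel.
TONE_MARKS = {
    "\u0300": "huyen",  # grave
    "\u0301": "sac",    # acute
    "\u0309": "hoi",    # hook above
    "\u0303": "nga",    # tilde
    "\u0323": "nang",   # dot below
}

# Orthographic onsets, longest first so "ngh" is tried before "ng" and "n".
ONSETS = sorted(
    ["ngh", "ng", "nh", "ch", "gh", "gi", "kh", "ph", "qu", "th", "tr",
     "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r", "s",
     "t", "v", "x"],
    key=len,
    reverse=True,
)

def decompose(syllable):
    """Split one Vietnamese syllable into (onset, rime, tone)."""
    # NFD separates base letters from their combining marks.
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone = "ngang"  # the unmarked level tone
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # Recompose so vowel-quality marks rejoin their letters (e.g. "ê", "ơ").
    base = unicodedata.normalize("NFC", "".join(kept))
    # Take the longest onset that still leaves a non-empty rime.
    onset = next(
        (o for o in ONSETS if base.startswith(o) and len(base) > len(o)), ""
    )
    return onset, base[len(onset):], tone

if __name__ == "__main__":
    for word in ["Nguyễn", "Việt", "đường", "an"]:
        print(word, decompose(word))
```

Under this assumption, a rare proper name is represented by a handful of units drawn from a small closed inventory (a few dozen onsets, a limited set of rimes, and six tones), so a decoder predicting these units never encounters an unseen token even when the full word was absent from training data.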
External IDs:dblp:conf/iccci/NguyenNTDQN25