Sentence-Aware Bahnaric-Vietnamese Lexical Mapping with Contrastive Contextual Representations

Published: 14 Dec 2025 · Last Modified: 24 Dec 2025 · LM4UC@AAAI2026 · CC BY 4.0
Keywords: Bahnaric-Vietnamese lexical mapping, Multilingual Transformer encoder, Community-sourced lexicon, Sentence-aware word representations, Contrastive contextual representations
TL;DR: We show that sentence-aware contrastive representations plus a 10k community lexicon significantly improve Bahnaric-Vietnamese lexical mapping and retrieval-based translation.
Abstract: Underserved and extremely low-resource languages challenge current language technologies, especially when lexical borrowing and synonymy undermine exact-match assumptions. We study Bahnaric-Vietnamese lexical mapping as a step toward meaning-preserving sentence translation. Unlike prior work based on static embeddings and Mean Squared Error (MSE) alignment, we learn sentence-aware word representations with a small multilingual transformer pretrained on Vietnamese, adapt it with Low-Rank Adaptation (LoRA) for parameter efficiency, and align Bahnaric-Vietnamese pairs using a two-layer projection trained with the InfoNCE contrastive loss. We use a new community-sourced lexicon of approximately 10,000 Bahnaric-Vietnamese pairs collected with local partners, capturing one-to-one, one-to-many, and many-to-one anchor relations as well as extensive lexical borrowing. Experiments evaluate retrieval-style alignment with Precision at K (P@K) and Mean Reciprocal Rank (MRR), as well as sentence translation using top-1 accuracy, Bilingual Evaluation Understudy (BLEU), character n-gram F-score (chrF), and embedding-based BERTScore. We also qualitatively analyze cases where n-gram metrics under-credit semantically adequate outputs in synonym-rich settings, and our ablation analysis shows that InfoNCE contrastive training dramatically outperforms MSE regression. On the $\sim$1k lexicon, our best model attains P@1 $\approx 0.53$ and MRR $\approx 0.62$, substantially improving over a static-embedding MSE baseline, while on the richer $\sim$10k community lexicon it reaches comparable sentence-level top-1 accuracy and BERTScore F1 despite slightly lower BLEU and chrF, highlighting both the benefits of the expanded resource and the remaining challenges of synonym-rich, low-frequency vocabulary.
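The abstract's core recipe (a projection head aligned with InfoNCE, evaluated by P@1 and MRR retrieval) can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the two-layer projection, dimensions, temperature, and random "contextual" embeddings are all assumptions for demonstration, and matched Bahnaric-Vietnamese pairs are assumed to sit on the diagonal of the batch similarity matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, W1, b1, W2, b2):
    """Two-layer projection head (illustrative stand-in for the paper's aligner)."""
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2

def info_nce_loss(src, tgt, temperature=0.07):
    """InfoNCE over a batch: matched source/target pairs lie on the diagonal."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = (src @ tgt.T) / temperature          # (N, N) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # cross-entropy toward the diagonal

def p_at_1_and_mrr(src, tgt):
    """Retrieval metrics: rank all targets by cosine similarity for each source."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T
    ranks = (sims >= np.diag(sims)[:, None]).sum(axis=1)  # rank of the true pair
    return float(np.mean(ranks == 1)), float(np.mean(1.0 / ranks))

# Toy demo with random embeddings (all dimensions are made up for illustration).
d_in, d_hid, d_out, n = 16, 32, 8, 5
W1, b1 = rng.normal(size=(d_in, d_hid)) * 0.1, np.zeros(d_hid)
W2, b2 = rng.normal(size=(d_hid, d_out)) * 0.1, np.zeros(d_out)
tgt = rng.normal(size=(n, d_out))                 # target-side anchors
src = project(tgt @ (rng.normal(size=(d_out, d_in)) * 0.1), W1, b1, W2, b2)

loss = info_nce_loss(src, tgt)
p1, mrr = p_at_1_and_mrr(src, tgt)
```

Minimizing the InfoNCE loss pulls each source word toward its matched target and pushes it away from the other targets in the batch, which directly optimizes the ranking behavior that P@1 and MRR measure; this is why contrastive training tends to outperform pointwise MSE regression on retrieval-style evaluation.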
Submission Number: 27