Bridging the Gap: Adapting LLMs for Southeast Asian Low-Resource Machine Translation via Hierarchical Dynamic Retrieval and Matching
Abstract: Retrieval-Augmented Generation (RAG) has proven effective at enhancing the generation capabilities of large language models (LLMs) across a range of natural language processing tasks. However, its effectiveness drops sharply in low-resource machine translation, because semantic mismatch between the retrieved content and the translation requirements introduces noise. To alleviate this problem, we propose a novel hierarchical dynamic retrieval and matching approach for Southeast Asian low-resource machine translation. First, we construct a hierarchical index structure from an existing parallel corpus, using high-frequency words as index keys and associating each key with short and long bilingual sentence pairs. Second, we dynamically match the words of the source sentence against the hierarchical index structure to retrieve all associated short and long bilingual sentence pairs. We then rerank the candidate samples by computing the cross-lingual semantic similarity between the source sentence and each retrieved pair. Finally, the sample with the highest semantic similarity is integrated into the prompt to guide the LLM toward a more accurate translation. Experimental results show that, without fine-tuning any LLM parameters, our approach outperforms mainstream machine translation systems. Detailed analysis indicates that our method precisely matches fine-grained semantic information, thereby reducing noise interference and improving low-resource translation performance.
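The pipeline described in the abstract (index, match, rerank, prompt) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `top_k` and `short_max_len` parameters, the whitespace tokenization, the prompt wording, and the toy Indonesian-English pairs are all assumptions made for illustration. In particular, `similarity` is a simple token-overlap (Jaccard) stand-in compared against the source side only, whereas the paper reranks by cross-lingual semantic similarity, which would normally require a multilingual encoder.

```python
from collections import Counter, defaultdict

def build_index(pairs, top_k=1000, short_max_len=8):
    """Map each high-frequency source word to its short and long sentence pairs.
    top_k and short_max_len are hypothetical parameters, not from the paper."""
    counts = Counter(w for src, _ in pairs for w in src.lower().split())
    keys = {w for w, _ in counts.most_common(top_k)}
    index = defaultdict(lambda: {"short": [], "long": []})
    for src, tgt in pairs:
        words = src.lower().split()
        bucket = "short" if len(words) <= short_max_len else "long"
        for w in set(words) & keys:
            index[w][bucket].append((src, tgt))
    return index

def retrieve(index, source):
    """Collect every pair linked to a word of the source sentence, without duplicates."""
    seen, candidates = set(), []
    for w in source.lower().split():
        if w not in index:
            continue
        for bucket in ("short", "long"):
            for pair in index[w][bucket]:
                if pair not in seen:
                    seen.add(pair)
                    candidates.append(pair)
    return candidates

def similarity(a, b):
    """Stand-in score: Jaccard overlap of tokens (NOT a cross-lingual semantic model)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def best_example(index, source, sim=similarity):
    """Rerank the retrieved pairs and return the top-scoring one, or None."""
    candidates = retrieve(index, source)
    if not candidates:
        return None
    return max(candidates, key=lambda pair: sim(source, pair[0]))

def build_prompt(source, example):
    """Place the selected pair in the prompt as a single demonstration (wording assumed)."""
    demo = f"Example:\n{example[0]} => {example[1]}\n\n" if example else ""
    return f"{demo}Translate:\n{source} =>"

# Toy parallel corpus, invented for illustration only.
pairs = [
    ("saya suka kopi", "I like coffee"),
    ("saya suka teh", "I like tea"),
    ("dia minum kopi panas setiap pagi di kantor bersama teman",
     "He drinks hot coffee every morning at the office with friends"),
]
index = build_index(pairs, top_k=10, short_max_len=5)
source = "saya suka kopi hitam"
example = best_example(index, source)
prompt = build_prompt(source, example)
print(prompt)
```

Passing a different `sim` function to `best_example` is the point where an actual cross-lingual similarity model would be substituted for the token-overlap placeholder.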
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: few-shot/zero-shot MT; re-ranking; resources for less-resourced languages; multilingual MT
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Chinese, Vietnamese, Burmese, Indonesian, Malay
Keywords: few-shot/zero-shot MT; re-ranking; resources for less-resourced languages; multilingual MT
Submission Number: 5442