Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From

ACL ARR 2025 February Submission 8306 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Cross-lingual context retrieval, in which a model extracts context information in one language based on a request in another, is a fundamental aspect of the cross-lingual alignment of large language models (LLMs). Despite its importance in real-life applications, this ability has not been adequately investigated for state-of-the-art models. In this paper, we evaluate the cross-lingual context retrieval ability of over 40 LLMs across 12 languages, using cross-lingual machine reading comprehension (xMRC) as a representative scenario, to understand where this ability comes from. Our results show that several small, post-trained open LLMs exhibit strong cross-lingual context retrieval ability, comparable to that of closed-source LLMs such as GPT-4o. Our interpretability analysis shows that the cross-lingual context retrieval process can be divided into two main phases, question encoding and answer retrieval, which are formed during pre-training and post-training, respectively. Furthermore, the bottleneck of cross-lingual context retrieval lies in the last transformer layers during the second phase, where the effect of post-training is clearly observable. Our results also indicate that larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: cross-lingual transfer, multilingual evaluation, resources for less-resourced languages
Languages Studied: en, de, es, vi, zh, hi, ar, el, ro, ru, th, tr
Submission Number: 8306