LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models

Haolin Li; Haipeng Zhang; Mang Li; Yaohua Wang; Lijie Wen; zhang yu; Biqing Huang

16 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: Large Language Model, Representation Learning, Natural Language Processing, Information Retrieval, Cross-lingual

TL;DR: LiRA addresses LLM challenges in low-resource languages via representation learning and language-aware reasoning, achieving robust gains across multiple tasks. The project includes a new 7-language dataset and open-source code.

Abstract: As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaCR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca’s multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaCR. Code will be released on GitHub and the dataset on Hugging Face.
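
Illustrative sketch (not the authors' implementation): the abstract does not give LiRA's loss functions, so the following is only a minimal example of what "anchoring to an English semantic space" combined with "consistency regularization" could look like in general. It assumes paired embeddings (a low-resource sentence and its English anchor) and a scalar output from a reasoning head on clean versus perturbed input. All function names, the cosine-based alignment term, the squared-error consistency term, and the weight `lam` are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_alignment_loss(low_resource_vecs, english_anchor_vecs):
    """Mean (1 - cosine) between each low-resource embedding and its English anchor.

    Hypothetical stand-in for an anchor-based alignment term: the loss is 0 when
    every low-resource embedding points in the same direction as its anchor.
    """
    pairs = list(zip(low_resource_vecs, english_anchor_vecs))
    return sum(1.0 - cosine(x, a) for x, a in pairs) / len(pairs)


def consistency_penalty(scores_clean, scores_noisy):
    """Mean squared gap between head outputs on clean and perturbed inputs.

    Hypothetical stand-in for consistency regularization: it penalizes a head
    whose output changes when the input is perturbed (e.g. by translation noise).
    """
    n = len(scores_clean)
    return sum((c - p) ** 2 for c, p in zip(scores_clean, scores_noisy)) / n


def combined_objective(low, anchors, scores_clean, scores_noisy, lam=0.1):
    """Alignment term plus a weighted consistency term (lam is an assumed weight)."""
    return anchor_alignment_loss(low, anchors) + lam * consistency_penalty(
        scores_clean, scores_noisy
    )


if __name__ == "__main__":
    low = [[1.0, 0.0], [0.0, 1.0]]
    anchors = [[1.0, 0.0], [1.0, 0.0]]
    print(combined_objective(low, anchors, [0.5, 0.5], [0.5, 1.5]))
```

In this toy form, a perfectly aligned pair contributes zero alignment loss and an orthogonal pair contributes one, while the consistency term grows with the squared change in the head's output under perturbation.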

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 7125
