CrossMath: Towards Cross-lingual Math Information Retrieval

Published: 07 Jun 2024, Last Modified: 07 Jun 2024
Venue: ICTIR 2024
License: CC BY 4.0
Keywords: Math IR, Cross-lingual Information Retrieval, Technical Documents
TL;DR: This paper introduces the problem of cross-lingual math information retrieval (CLMIR). It presents a novel CLMIR test collection and retrieval system.
Abstract: Current math search engines and test collections are developed primarily for English, which limits their accessibility and inclusivity. This paper introduces cross-lingual math information retrieval (CLMIR), the task of retrieving mathematical information across languages, to address this limitation. It presents CrossMath, a novel CLMIR test collection comprising manually translated topics in four languages (Croatian, Czech, Persian, and Spanish). It also introduces a CLMIR system that combines state-of-the-art translation models (mBART and NLLB) with a formula masking approach for handling mathematical notation. Evaluation on the ARQMath test collections shows that, for all four languages, the proposed CLMIR system is competitive with retrieval using the original English topics.
Submission Number: 43
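
The formula masking idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not describe the masking scheme, so the `$...$` formula delimiters, the `[MATH0]`-style placeholder format, and the function names below are all assumptions made for illustration. The `translate` argument stands in for a real translation model such as mBART or NLLB. Formulas are replaced with placeholders before translation and restored afterwards, so the translation step never rewrites the mathematical notation.

```python
import re
from typing import Callable, Dict, Tuple

# Assumption: formulas are inline LaTeX delimited by single dollar signs.
FORMULA_RE = re.compile(r"\$[^$]+\$")


def mask_formulas(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each formula with a placeholder token and remember the mapping."""
    mapping: Dict[str, str] = {}

    def _substitute(match: "re.Match[str]") -> str:
        token = f"[MATH{len(mapping)}]"  # hypothetical placeholder format
        mapping[token] = match.group(0)
        return token

    return FORMULA_RE.sub(_substitute, text), mapping


def unmask_formulas(text: str, mapping: Dict[str, str]) -> str:
    """Put the original formulas back in place of their placeholder tokens."""
    for token, formula in mapping.items():
        text = text.replace(token, formula)
    return text


def translate_with_masking(text: str, translate: Callable[[str], str]) -> str:
    """Mask formulas, translate the remaining text, then restore the formulas."""
    masked, mapping = mask_formulas(text)
    return unmask_formulas(translate(masked), mapping)


def toy_translate(text: str) -> str:
    """Toy stand-in for a real Spanish-to-English translation model."""
    return text.replace("Calcular", "Compute").replace("paso a paso", "step by step")


topic = r"Calcular $\int_0^1 x^2 \, dx$ paso a paso"
english_topic = translate_with_masking(topic, toy_translate)
```

A real system would also need to handle the case where the translation model alters or drops a placeholder token, which this sketch does not cover.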