Evaluating Cross-Language Information Retrieval Models on Indonesian–Arabic Fiqh Texts: A Case Study
Keywords: Cross-Language Information Retrieval, Fiqh, Dense Retrieval, Sparse Retrieval, Reciprocal Rank Fusion, Domain Adaptation, Synthetic Data Generation
TL;DR: Evaluating lexical, dense, and hybrid retrieval pipelines to search classical Arabic Fiqh texts using Indonesian queries.
Abstract: Cross-Language Information Retrieval (CLIR) for highly specialized domains, such as querying classical Arabic jurisprudence (Fiqh) in Indonesian, faces two challenges: severe vocabulary mismatch and a zero-resource training setting. To address the vocabulary mismatch, we show that LLM-prompted, domain-aware translation captures strict legal terminology where standard machine translation fails. To address the absence of human relevance judgments, we use the JH-POLO framework to generate synthetic in-domain triplets for fine-tuning a multilingual dense retriever. We then combine these context-aware sparse signals with the semantic matching of the dense bi-encoder via Reciprocal Rank Fusion (RRF) to form a hybrid architecture. Empirical evaluation on an expert-curated test collection shows that the lexical baseline performs best on short queries, while the late-fusion pipeline achieves the highest overall accuracy and acts as a safety net that consistently maximizes recall for complex, verbose queries. Code is available at https://github.com/syifaurrr/MusIML26-Fiqh-CLIR
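For readers unfamiliar with the late-fusion step named in the abstract, the following is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It is not taken from the authors' repository: the function name `reciprocal_rank_fusion`, the document IDs, and the smoothing constant `k=60` (a commonly used default) are illustrative assumptions, and the paper's actual settings may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with RRF.

    Each document receives the sum of 1 / (k + rank) over every list it
    appears in, where rank starts at 1. k=60 is an assumed default, not a
    value reported by the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a sparse (lexical) run and a dense run.
sparse_run = ["doc_a", "doc_b", "doc_c"]
dense_run = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([sparse_run, dense_run])
print(fused)
```

Because RRF uses only rank positions, it needs no score normalization between the sparse and dense retrievers, and a document ranked well by either system can still surface in the fused list.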
Track: Track 1: ML Research Addressing Challenges Faced by Muslim Communities
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.
Submission Number: 20