Establishing a Foundation for Tetun Text Ad-Hoc Retrieval: Indexing, Stemming, Retrieval, and Ranking

Gabriel de Jesus, Sérgio Sobral Nunes

Published: 01 Jan 2024, Last Modified: 28 Apr 2025CoRR 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Searching for information on the internet and digital platforms to satisfy an information need requires effective retrieval solutions. However, such solutions are not yet available for Tetun, making it challenging to find relevant documents for text-based search queries in this language. To address these challenges, we investigate Tetun text retrieval with a focus on the ad-hoc retrieval task. The study begins by developing essential language resources -- including a list of stopwords, a stemmer, and a test collection -- which serve as foundational components for solutions tailored to Tetun text retrieval. Various strategies are investigated using both document titles and content to evaluate retrieval effectiveness. The results demonstrate that retrieving document titles, after removing hyphens and apostrophes without applying stemming, significantly improves retrieval performance compared to the baseline. Efficiency increases by 31.37%, while effectiveness achieves an average relative gain of +9.40% in MAP@10 and +30.35% in NDCG@10 with DFR BM25. Beyond the top-10 cutoff point, Hiemstra LM shows strong performance across various retrieval strategies and evaluation metrics. Contributions of this work include the development of Labadain-Stopwords (a list of 160 Tetun stopwords), Labadain-Stemmer (a Tetun stemmer with three variants), and Labadain-Avaliad\'or (a Tetun test collection containing 59 topics, 33,550 documents, and 5,900 qrels). We make all resources publicly accessible to facilitate future research in Tetun information retrieval.