Spoken document retrieval for an unwritten language

ACL ARR 2025 February Submission3663 Authors

15 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system and can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati (an unwritten language) that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving hope that it may be usable by speakers of unwritten languages worldwide.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: resources for less-resourced languages, minoritized languages
Contribution Types: Approaches to low-resource settings
Languages Studied: Gormati, Gujarati, Hindi, Marathi, Odia, Tamil, Telugu
Submission Number: 3663