Harnessing Unsupervised Word Translation to Address Resource Inequality for Peace and Health

Ashiqur R. KhudaBukhsh, Shriphani Palakodety, Tom M. Mitchell

Published: 2022, Last Modified: 13 Jun 2023 | SocInfo 2022 | Readers: Everyone

Abstract: Research geared toward human well-being in developing nations often concentrates on web content written in a world language (e.g., English) and ignores a significant share of content written in a poorly resourced yet highly prevalent first language of the region concerned (e.g., Hindi). Such omissions are common because of the sheer mismatch between the linguistic resources available for a world language and those available for its low-resource counterpart. However, during a global pandemic or an imminent war, the demand for linguistic resources may be recalibrated. In this work, we focus on the high-resource and low-resource language pair $\langle en, hi_e \rangle$ (English and Romanized Hindi) and present a cross-lingual sampling method that takes example documents in English and retrieves similar content written in Romanized Hindi, the most popular form of Hindi observed in social media. At the core of our technique is a novel finding that a surprisingly simple constrained nearest-neighbor sampling in polyglot Skip-gram word embedding space can retrieve substantial bilingual lexicons, even from harsh social media data sets. Our cross-lingual sampling method obtains substantial performance improvements in two important domains: detecting peace-seeking, hostility-diffusing hope speech in the context of the 2019 India-Pakistan conflict, and detecting comments that encourage compliance with COVID-19 guidelines.
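
The abstract does not spell out the constraint used in the nearest-neighbor sampling, so the following is only a minimal sketch under one plausible reading: every word in a shared (polyglot) embedding space carries a language label, and the neighbor search for an English word is restricted to words labeled as Romanized Hindi. The function names (`cosine`, `constrained_neighbors`), the language labels, and the three-dimensional toy vectors are all hypothetical and chosen for illustration; they are not taken from the paper or its data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def constrained_neighbors(word, embeddings, lang_of, target_lang, k=1):
    """Return the k most similar words to `word`, keeping only candidates
    whose language label equals `target_lang` (the assumed constraint)."""
    query = embeddings[word]
    scored = [
        (cosine(query, vec), other)
        for other, vec in embeddings.items()
        if other != word and lang_of[other] == target_lang
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Toy, hand-made vectors standing in for a polyglot Skip-gram space.
embeddings = {
    "peace": [0.9, 0.1, 0.0],
    "calm": [0.85, 0.15, 0.05],
    "war": [0.0, 0.9, 0.1],
    "shanti": [0.8, 0.2, 0.1],
    "jung": [0.1, 0.8, 0.2],
}
# Assumed per-word language labels (en = English, hi_e = Romanized Hindi).
lang_of = {
    "peace": "en",
    "calm": "en",
    "war": "en",
    "shanti": "hi_e",
    "jung": "hi_e",
}

peace_hi = constrained_neighbors("peace", embeddings, lang_of, "hi_e")
war_hi = constrained_neighbors("war", embeddings, lang_of, "hi_e")
peace_en = constrained_neighbors("peace", embeddings, lang_of, "en")
```

In this toy space the closest word to "peace" overall is the English word "calm", so the language restriction is what turns a plain neighbor lookup into a word translation. With real embeddings the language labels would have to come from some language identification step, which this sketch simply assumes is available.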
