Xrphonetic: Akshara-based Phonetic String Similarity

ACL ARR 2024 April Submission896 Authors

16 Apr 2024 (modified: 15 May 2024), ACL ARR 2024 April Submission, CC BY 4.0
Abstract: Phonetic string similarity is widely used in information retrieval systems to identify words that are spelled differently but sound alike. Another common application is computing a similarity score between two words from different sources that may be different spellings of the same word. A common and interesting special case is estimating the phonetic similarity of two words that have been transliterated into Roman script from another language. In this setting, knowledge of the writing system from which the words originated should make the comparison more effective, because people tend to carry the nuances of the underlying writing system over into their transliterations. We propose Xrphonetic, a novel phonetic similarity algorithm for words transliterated into Roman script from languages that use Abugida-based scripts. It treats aksharas as the fundamental atomic units of a word, with consonant and vowel phonemes as their sub-atomic units, and uses weighted phoneme mappings to obtain a more continuous spectrum of phonetic similarity.
Paper Type: Short
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: Phonology, Morphology, Word Segmentation
Languages Studied: Gujarati, Bangla, Hindi, Tamil, Telugu, Kannada, Sinhala, Marathi
Section 2 Permission To Publish Peer Reviewers Content Agreement: Authors grant permission for ACL to publish peer reviewers' content
Submission Number: 896
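
The abstract describes the approach only at a high level, so the following is a minimal illustrative sketch of the general idea and not the authors' implementation. It assumes a crude akshara segmentation (a run of consonants followed by a run of vowels), a weighted edit distance over those units, and small cost tables. The segmentation rule, the names `split_aksharas`, `akshara_cost` and `similarity`, the 0.5/0.5 weighting, and every entry in `CONSONANT_COST` and `VOWEL_COST` are hypothetical choices made for illustration.

```python
import re

VOWELS = "aeiou"
# Assumed akshara approximation: optional consonant run plus a vowel run,
# or a trailing consonant run with no vowel.
_AKSHARA_RE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+|[^{VOWELS}]+")

# Hypothetical weighted phoneme mappings: lower cost means "sounds closer".
CONSONANT_COST = {
    frozenset(("v", "w")): 0.1,
    frozenset(("k", "c")): 0.1,
    frozenset(("sh", "s")): 0.3,
}
VOWEL_COST = {
    frozenset(("ee", "i")): 0.1,
    frozenset(("oo", "u")): 0.1,
    frozenset(("aa", "a")): 0.2,
}


def split_aksharas(word):
    """Split a romanized word into akshara-like units."""
    letters = re.sub(r"[^a-z]", "", word.lower())
    return _AKSHARA_RE.findall(letters)


def split_phonemes(akshara):
    """Return the (consonant part, vowel part) of one akshara-like unit."""
    i = 0
    while i < len(akshara) and akshara[i] not in VOWELS:
        i += 1
    return akshara[:i], akshara[i:]


def part_cost(a, b, table):
    """Cost in [0, 1] of substituting one phoneme string for another."""
    if a == b:
        return 0.0
    return table.get(frozenset((a, b)), 1.0)


def akshara_cost(a, b):
    """Substitution cost between two units, averaged over both parts."""
    ca, va = split_phonemes(a)
    cb, vb = split_phonemes(b)
    return 0.5 * part_cost(ca, cb, CONSONANT_COST) + 0.5 * part_cost(va, vb, VOWEL_COST)


def similarity(w1, w2):
    """Similarity in [0, 1] from a weighted edit distance over aksharas."""
    s, t = split_aksharas(w1), split_aksharas(w2)
    if not s and not t:
        return 1.0
    prev = [float(j) for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [float(i)]
        for j in range(1, len(t) + 1):
            cur.append(min(
                prev[j] + 1.0,                                # delete a unit
                cur[j - 1] + 1.0,                             # insert a unit
                prev[j - 1] + akshara_cost(s[i - 1], t[j - 1]),  # substitute
            ))
        prev = cur
    return 1.0 - prev[-1] / max(len(s), len(t))


if __name__ == "__main__":
    print(split_aksharas("meera"))
    print(similarity("meera", "mira"), similarity("meera", "mara"))
```

Under these assumed weights, a pair such as "meera" and "mira" scores higher than "meera" and "mara", because the `ee`/`i` mapping carries a small cost while `ee`/`a` is treated as a full mismatch. This graded scoring is the kind of continuous spectrum the abstract refers to.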