Abstract: Phonetic string similarity is widely used in information retrieval systems to identify words that are spelled differently but sound alike. Another common application is computing a similarity score between two words from different sources that may be two spellings of the same word. An interesting and common instance of this problem is estimating the phonetic similarity of two words transliterated to Roman script from another language. In this setting, knowledge of the writing system in which the words originated can make the estimate more effective, because people tend to carry over the nuances of the underlying writing system during transliteration. We propose Xrphonetic, a novel phonetic similarity algorithm for words transliterated to Roman script from languages that use Abugida-based scripts. It treats aksharas as the fundamental atomic units of words, with consonant and vowel phonemes as their sub-atomic units, and uses weighted phoneme mappings to obtain a more continuous spectrum of phonetic similarity.
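The idea of akshara-level units with weighted phoneme mappings can be illustrated with a minimal sketch. This is not the Xrphonetic implementation: the splitting rule, the weight tables, the equal consonant/vowel weighting, and the names `split_units`, `unit_sim`, and `similarity` are all assumptions made here for illustration. The sketch splits a romanized word into rough (consonant cluster, vowel) units, scores pairs of units with hypothetical weights, and aligns the two unit sequences with a weighted edit distance.

```python
import re

# Hypothetical weights (0..1) for how alike two romanized spellings sound.
# These values are illustrative assumptions, not taken from the paper.
CONSONANT_WEIGHTS = {
    frozenset(("k", "kh")): 0.8,
    frozenset(("t", "th")): 0.8,
    frozenset(("d", "dh")): 0.8,
    frozenset(("s", "sh")): 0.7,
    frozenset(("v", "w")): 0.9,
}
VOWEL_WEIGHTS = {
    frozenset(("a", "aa")): 0.9,
    frozenset(("i", "ee")): 0.9,
    frozenset(("u", "oo")): 0.9,
    frozenset(("", "a")): 0.7,  # inherent vowel left unwritten
}

# A consonant cluster followed by a vowel run: a crude stand-in for an akshara.
UNIT_RE = re.compile(r"([^aeiou]*)([aeiou]*)")


def split_units(word):
    """Split a romanized word into (consonants, vowels) pairs."""
    return [m.groups() for m in UNIT_RE.finditer(word.lower()) if m.group(0)]


def phoneme_sim(a, b, weights):
    """Return 1.0 for identical spellings, else the table weight (default 0.0)."""
    if a == b:
        return 1.0
    return weights.get(frozenset((a, b)), 0.0)


def unit_sim(u, v):
    """Average the consonant and vowel similarity of two units (assumed 50/50)."""
    return 0.5 * phoneme_sim(u[0], v[0], CONSONANT_WEIGHTS) + \
           0.5 * phoneme_sim(u[1], v[1], VOWEL_WEIGHTS)


def similarity(w1, w2):
    """Weighted edit distance over units, normalized to a score in [0, 1]."""
    a, b = split_units(w1), split_units(w2)
    if not a and not b:
        return 1.0
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ua in enumerate(a, 1):
        cur = [float(i)]
        for j, ub in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1.0,                          # delete a unit
                cur[j - 1] + 1.0,                       # insert a unit
                prev[j - 1] + 1.0 - unit_sim(ua, ub),   # substitute, weighted
            ))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


if __name__ == "__main__":
    for pair in [("bharat", "bharat"), ("bharat", "bharath"), ("bharat", "xyz")]:
        print(pair, round(similarity(*pair), 3))
```

Because substitution cost is graded by the weight tables rather than being 0 or 1, near-equivalent spellings such as "bharat" and "bharath" score close to, but below, an exact match, which is the kind of continuous spectrum the abstract describes.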
Paper Type: Short
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: Phonology, Morphology, Word Segmentation
Languages Studied: Gujarati, Bangla, Hindi, Tamil, Telugu, Kannada, Sinhala, Marathi
Section 2 Permission To Publish Peer Reviewers Content Agreement: Authors grant permission for ACL to publish peer reviewers' content
Submission Number: 896