Amharic Text Normalization with Sequence-to-Sequence Models

Seifedin Shifaw Mohamed; Solomon Teferra Abate (PhD)

25 Sept 2019 (modified: 05 May 2023) | ICLR 2020 Conference Blind Submission | Readers: Everyone

Keywords: Text Normalization, Sequence-to-Sequence Model, Encoder-Decoder

Abstract: All areas of language and speech technology require, directly or indirectly, the handling of real text. In addition to ordinary words and names, real text contains non-standard words (NSWs), including numbers, abbreviations, dates, currency, amounts, and acronyms. Typically, NSWs cannot be found in a dictionary, nor can their pronunciation be derived by applying ordinary letter-to-sound rules. In several NLP applications, it is therefore desirable to normalize text by replacing such non-standard words with a consistently formatted and contextually appropriate variant. To address this challenge, we model the problem as character-level sequence-to-sequence learning, mapping a sequence of input characters to a sequence of output words. The model consists of two neural networks, an encoder and a decoder: the encoder maps the input characters to a fixed-dimensional vector, and the decoder generates the output words from that vector. We achieve an accuracy of 94.8%, which is promising given the resources we used.
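
The framing described in the abstract, characters on the input side and words on the output side, can be illustrated with a minimal data-preparation sketch. This is not the authors' code: the English stand-in pairs, the special tokens (`<sos>`, `<eos>`, `<pad>`), and the helper names `build_vocab`, `frame_pair`, and `encode` are all assumptions made for illustration, and the encoder and decoder networks themselves are not shown.

```python
# Hypothetical sketch of character-in, word-out framing for text normalization.
# The pairs below are English stand-ins, not data from the paper.
SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"  # assumed special tokens


def build_vocab(symbols):
    """Map each distinct symbol to an integer ID, reserving IDs 0-2 for special tokens."""
    vocab = {PAD: 0, SOS: 1, EOS: 2}
    for symbol in symbols:
        if symbol not in vocab:
            vocab[symbol] = len(vocab)
    return vocab


def frame_pair(source_text, target_text):
    """Split the source into characters and the target into delimited words."""
    src_chars = list(source_text)
    tgt_words = [SOS] + target_text.split() + [EOS]
    return src_chars, tgt_words


def encode(sequence, vocab):
    """Convert a sequence of symbols to a list of integer IDs."""
    return [vocab[symbol] for symbol in sequence]


# Illustrative (non-standard text, normalized text) pairs.
pairs = [("5 km", "five kilometers"), ("3 kg", "three kilograms")]
framed = [frame_pair(src, tgt) for src, tgt in pairs]

# Separate vocabularies: characters for the encoder, words for the decoder.
char_vocab = build_vocab(c for src, _ in framed for c in src)
word_vocab = build_vocab(w for _, tgt in framed for w in tgt)

src_ids = encode(framed[0][0], char_vocab)  # what an encoder would consume
tgt_ids = encode(framed[0][1], word_vocab)  # what a decoder would be trained to emit
```

Keeping two vocabularies reflects the asymmetry stated in the abstract: the encoder reads characters, so it can handle unseen digit strings or abbreviations, while the decoder emits whole words.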
