Keywords: Text Normalization, Sequence-to-Sequence Model, Encoder-Decoder
Abstract: All areas of language and speech technology, directly or indirectly, require handling of real text. In addition to ordinary words and names, the real text contains non-standard words (NSWs), including numbers, abbreviations, dates, currency, amounts, and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary letter-to-sound rules. It is desirable to normalize text by replacing such non-standard words with a consistently formatted and contextually appropriate variant in several NLP applications. To address this challenge, in this paper, we model the problem as character-level sequence-to-sequence learning where we map a sequence of input characters to a sequence of output words. It consists of two neural networks, the encoder network, and the decoder network. The encoder maps the input characters to a fixed dimensional vector and the decoder generates the output words. We have achieved an accuracy of 94.8 % which is promising given the resource we use.
Original Pdf: pdf
4 Replies
Loading