Abstract: Social media text is notoriously difficult for existing natural language processing
tools to handle, because of spelling errors, non-standard
words, shortenings, and non-standard capitalization and punctuation. One method to circumvent these issues is to normalize input data before processing. Most previous work has focused on a single language, usually
English. In this paper, we are the first to propose a model for cross-lingual normalization,
with which we participate in the WNUT 2021
shared task. To this end, we use MoNoise
as a starting point, and make a simple adaptation for cross-lingual application. Our proposed model outperforms the leave-as-is baseline provided by the organizers, which copies
the input. Furthermore, we explore a completely different model that converts the task
to a sequence labeling task. The performance of
this second system is low, as our implementation does not take
capitalization into account.