Abstract: National libraries, institutions, and (inter-)national projects have
increased their efforts to preserve textual content, including data that was not
born digital, for future generations. These activities have resulted in novel
initiatives for preserving cultural heritage through digitization. However, a
systematic approach toward Textual Data Denoising (TD2) is still in its infancy
and is commonly limited to a single dominant language (mostly English). Yet
digital preservation requires a universal approach. To this end, we introduce
a “Framework for Enabling Textual Data Denoising via robust contextual embeddings”
(FETD2). FETD2 improves data quality by training language-specific
denoising models on a small amount of language-specific training data. Our
approach employs bidirectional language modeling to produce noise-resilient
deep contextualized embeddings. In experiments, we show that FETD2 outperforms
the state of the art.