FETD2: A Framework for Enabling Textual Data Denoising via Robust Contextual Embeddings

28 Sept 2023 (modified: 28 Sept 2023), OpenReview Archive Direct Upload, Readers: Everyone
Abstract: National libraries, institutions, and (inter)national projects have increased their efforts to preserve textual content, including non-digitally born data, for future generations. These activities have led to novel initiatives for preserving cultural heritage through digitization. However, a systematic approach to Textual Data Denoising (TD2) is still in its infancy and is commonly limited to a single dominant language (mostly English), whereas digital preservation requires a universal approach. To this end, we introduce a “Framework for Enabling Textual Data Denoising via robust contextual embeddings” (FETD2). FETD2 improves data quality by training language-specific data denoising models on a small amount of language-specific training data. Our approach employs bidirectional language modeling to produce noise-resilient deep contextualized embeddings. In experiments, we show that FETD2 outperforms the state of the art.