Paper Link: https://openreview.net/forum?id=sYkhFm2kkzj
Paper Type: Long paper (up to eight pages of content + unlimited references and appendices)
Abstract: The aim of the paper is to apply, for historical texts, the methodology used commonly to solve various NLP tasks defined for contemporary data, i.e. pre-train and fine-tune large Transformer models. This paper introduces an ML challenge, named Challenging America (ChallAm), based on OCR-ed excerpts from historical newspapers collected from the Chronicling America portal. ChallAm provides a dataset of clippings, labeled with metadata on their origin, and paired with their textual contents retrieved by an OCR tool. Three, publicly available, ML tasks are defined in the challenge: to determine the article date, to detect the location of the issue, and to deduce a word in a text gap (cloze test). Strong baselines are provided for all three ChallAm tasks. In particular, we pre-trained a RoBERTa model from scratch from the historical texts. We also discuss the issues of discrimination and hate-speech present in the historical American texts.
Dataset: zip
Presentation Mode: This paper will be presented in person in Seattle
Copyright Consent Signature (type Name Or NA If Not Transferrable): Filip Graliński
Copyright Consent Name And Address: Adam Mickiewicz University, ul. Wieniawskiego 1, Poznań, Poland
0 Replies
Loading