The "ScribbleLens" Dutch Historical Handwriting Corpus

Hans J. G. A. Dolfing, Jerome R. Bellegarda, Jan Chorowski, Ricard Marxer, Antoine Laurent

Published: 2020, Last Modified: 10 Apr 2024 | ICFHR 2020 | Readers: Everyone

Abstract: Historical handwritten documents guard an important part of human knowledge that is within reach of only a few scholars and experts. Recent developments in machine learning have the potential to make this information accessible to a larger audience. Data-driven approaches to automatic manuscript recognition require large amounts of transcribed scans. To this end, we introduce a new handwritten corpus based on 400-year-old, cursive, early modern Dutch documents such as ship journals and daily logbooks. The collection comprises 1,000 pages, segmented into lines, to facilitate fully supervised, weakly supervised, and unsupervised research, with textual transcriptions for 20% of the pages. Other annotations, such as handwriting slant, year of origin, complexity, and writer identity, have been added manually. With over 80 writers, this corpus is significantly larger and more varied than other existing historical data sets such as the Spanish RODRIGO corpus. We provide train/test splits, experimental results from an automatic transcription baseline, and tools to facilitate its use in deep learning research. The manuscripts span over 150 years of significant journeys by captains and traders of the Vereenigde Oostindische Compagnie (VOC), such as Tasman, Brouwer, and Van Neck, making this resource valuable to historians and the paleography community as well.
