Automating data curation for the Finnish national bibliography Fennica

University of Eastern Finland DRDHum 2024 Conference Submission 12 Authors

Published: 03 Jun 2024, Last Modified: 15 Aug 2024. DRDHum 2024 BestPaper. License: CC BY 4.0
Keywords: digital humanities, workflow, bibliography
Abstract: The consortium Digital History for Literature in Finland (Research Council of Finland, 2022–26) differs from earlier research on Finnish literary history by making use of digital collections and new data science methods, which make it possible to study the collection as a whole. Here we present a bibliographic data science framework tailored to the consortium's research purposes. We examine how data science methods can improve our understanding of literary history and of how it is told, and how reliable the resulting information can be [1, 2, 3]. The Finnish National Bibliography, Fennica, consists of over one million records from 1488 to the present day and includes diverse data types such as books, newspapers, maps, and other documents. However, the source data contains ambiguous information as well as missing and erroneous entries, and any refinement effort involves context-specific choices that depend on the research use case. We have previously shown how selected subsets of the collection can be refined automatically to support large-scale statistical analyses of book printing during the years 1500–1800 [1, 3]. In the present version of the workflow, we have scaled up the previous analyses of 70 thousand records to cover all Fennica records and improved the data curation workflows. To serve the research objectives of the project, we have integrated signum data and created a focused subset covering the years 1809–1917 for each metadata category. Furthermore, we have added a genre subfield derived from the broader leader field to enable genre identification at the bibliographic level, focusing specifically on books. Our aim has been to replicate a curated list that adheres to predefined criteria such as UDC classification, language, genre, and signum data. This automated process mirrors and supports the manual list creation method used by the Literary History subproject within the consortium.
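The criteria-based selection described above can be illustrated with a minimal Python sketch. It is not the consortium's actual workflow: the `Record` fields, the `select_subset` function, the language codes, and the UDC prefix `"82"` (the UDC class for literature) are all assumptions made for illustration. The book test relies on the MARC 21 convention that leader position 06 holds the record type (`a`, language material) and position 07 the bibliographic level (`m`, monograph); how the project derives its genre subfield from the leader is not specified here. Signum criteria are omitted because they are not described in detail.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Record:
    """Hypothetical, already-harmonized view of one bibliographic record."""
    record_id: str
    year: Optional[int]       # publication year, None if missing
    language: Optional[str]   # MARC language code, e.g. "fin" or "swe"
    leader: str               # 24-character MARC leader
    udc: Optional[str] = None  # UDC class notation, None if missing


def is_book(leader: str) -> bool:
    """True when the leader marks language material (06 = 'a')
    at the monograph level (07 = 'm')."""
    return len(leader) > 7 and leader[6] == "a" and leader[7] == "m"


def select_subset(
    records: Iterable[Record],
    start: int = 1809,
    end: int = 1917,
    languages: Tuple[str, ...] = ("fin", "swe"),
    udc_prefix: str = "82",  # assumed prefix for literature
) -> List[Record]:
    """Keep records that satisfy every criterion; records with
    missing year or UDC values are excluded rather than guessed."""
    selected = []
    for rec in records:
        if rec.year is None or not (start <= rec.year <= end):
            continue
        if rec.language not in languages:
            continue
        if not is_book(rec.leader):
            continue
        if rec.udc is None or not rec.udc.startswith(udc_prefix):
            continue
        selected.append(rec)
    return selected
```

Excluding records with missing values is one possible choice; as noted above, such decisions are context-specific and depend on the research use case.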
Future efforts could include the integration of further complementary sources, such as the rich information on authors, publishers, and geographic places available in the public domain. These efforts contribute to the consortium's goal of using digital collections and methodologies to broaden the conventional understanding of Finnish literary history. Specifically, we aim to map Finnish- and Swedish-language fiction from the 19th century into an enriched format suited to large-scale statistical analyses and the development of reproducible data science workflows.
Submission Number: 12