Effective Crowdsourcing in the EDT Project with Probabilistic Indexes

Joan-Andreu Sánchez, Enrique Vidal, Vicente Bosch

Published: 2022, Last Modified: 06 Jul 2023 | DAS 2022 | Readers: Everyone

Abstract: Many massive collections of handwritten document images are available in archives and libraries worldwide, but their textual contents remain practically inaccessible. Automatic transcription results generally lack the accuracy needed for reliable text indexing and search unless the recognition systems are trained with enough data. Creating training data is expensive and time-consuming. The European Digital Treasures (EDT) project intended to explore crowdsourcing techniques for producing accurate training data. This paper explores crowdsourcing techniques based on Probabilistic Indexes. A crowdsourcing tool was developed in which volunteers could amend incorrectly transcribed words. Confidence measures were used to guide and help the users in the correction process. In further steps, the newly corrected data will be used to re-train the Probabilistic Indexing system.
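As a rough illustration of how confidence measures might guide volunteers, the sketch below queues word hypotheses for review when their confidence falls in an uncertain band. It is not the tool described in the paper: the `IndexEntry` record, the `select_for_review` function, the threshold values, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index entry: a word hypothesis with a confidence score."""
    page: str
    word: str
    confidence: float  # assumed to be a probability in [0, 1]


def select_for_review(entries, low=0.2, high=0.9):
    """Return entries with low <= confidence < high, least confident first.

    The thresholds are illustrative assumptions: hypotheses above `high`
    are treated as reliable, and those below `low` as likely noise.
    """
    uncertain = [e for e in entries if low <= e.confidence < high]
    return sorted(uncertain, key=lambda e: e.confidence)


# Made-up sample data, not taken from the EDT collections.
entries = [
    IndexEntry("p1", "treasure", 0.97),
    IndexEntry("p1", "treasnre", 0.41),
    IndexEntry("p2", "archive", 0.65),
    IndexEntry("p2", "zq", 0.05),
]

queue = select_for_review(entries)
for entry in queue:
    print(f"{entry.page}: {entry.word} ({entry.confidence:.2f})")
```

Showing the least confident hypotheses first is one possible design choice; a real tool could equally group the queue by page or by word.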
