Keywords: ocr, meta-learning, synthesizer
TL;DR: We introduce OmniPrint, a synthetic data generator of isolated printed characters, geared toward machine learning research.
Abstract: We introduce OmniPrint, a synthetic data generator of isolated printed characters, geared toward machine learning research. It draws inspiration from famous datasets such as MNIST, SVHN and Omniglot, but offers the capability of generating a wide variety of printed characters from various languages, fonts and styles, with customized distortions. We include 935 fonts from 27 scripts and many types of distortions. As a proof of concept, we show various use cases, including an example of a meta-learning dataset designed for the upcoming MetaDL NeurIPS 2021 competition. OmniPrint is available at https://github.com/SunHaozhe/OmniPrint.
Supplementary Material: zip
Contribution Process Agreement: Yes
Dataset Url: https://github.com/SunHaozhe/OmniPrint
License: The code of the OmniPrint data synthesizer is licensed under the MIT license (https://opensource.org/licenses/MIT). The datasets OmniPrint-meta[1-5] are licensed under the Creative Commons CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
Author Statement: Yes
Community Implementations: [12 code implementations](https://www.catalyzex.com/paper/arxiv:2201.06648/code)