CEDe: A collection of expert-curated datasets with atom-level entity annotations for Optical Chemical Structure Recognition
Keywords: Chemical structure recognition, Chemical image-to-structure translation, Atom-level instance annotations of molecular images.
TL;DR: A collection of datasets containing more than 700,000 atom-level entity annotations and their corresponding bounding boxes. These labels constitute all the necessary information for complete chemical graph reconstruction.
Abstract: Optical Chemical Structure Recognition (OCSR) deals with the translation of chemical images, the main way chemical compounds are depicted in scientific documents, into molecular structures. Traditionally, rule-based methods have followed a framework based on the detection of chemical entities, such as atoms and bonds, followed by a compound structure reconstruction step. Recently, neural architectures analogous to image captioning have been explored to solve this task, yet they remain data inefficient, requiring millions of examples just to reach performance comparable to traditional methods. To motivate and benchmark new approaches based on atom-level entity detection and graph reconstruction, we present CEDe, a unique collection of chemical entity bounding boxes manually curated by experts for scientific literature datasets. These annotations amount to more than 700,000 chemical entity bounding boxes with the necessary information for structure reconstruction. In addition, a large synthetic dataset containing one million molecular images and annotations is released in order to explore transfer-learning techniques that could help these architectures perform better under low-data regimes. Benchmarks show that detection-reconstruction based models can achieve performance on par with or better than image captioning-like models, even with 100x fewer training examples.
Supplementary Material: pdf
Dataset Url: We provide different options for downloading the CEDe dataset. Image data and annotations can be downloaded separately or as one compressed file. Also, different dataset sizes are provided (every smaller dataset is fully contained in bigger versions).

CEDe real data
  Full (135.7MB): https://storage.googleapis.com/lgcede/CEDe_dataset_v0.2.tar.gz
  Annotations (194MB): https://storage.googleapis.com/lgcede/CEDe_dataset_v0.2.json
  Train split annotations (38.5MB): https://storage.googleapis.com/lgcede/CEDe_dataset_finetune_split_v0.2.json
  Test split annotations (156MB): https://storage.googleapis.com/lgcede/CEDe_dataset_test_split_v0.2.json
  Images (53.6MB): https://storage.googleapis.com/lgcede/CEDe_dataset_images_v0.2.tar.gz

Synthetic data
  10K Images
    Full (334MB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_10k.tar.gz
    Annotations (177MB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_10k.json
    Images (320MB): https://storage.googleapis.com/lgcede/CEDe_synthetic_images_10k.tar.gz
  50K Images
    Full (1.6GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_50k.tar.gz
    Annotations (887MB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_50k.json
    Images (1.6GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_images_50k.tar.gz
  100K Images
    Full (3.3GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_100k.tar.gz
    Annotations (1.7GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_100k.json
    Images (3.1GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_images_100k.tar.gz
  1M Images
    Full (32.5GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_1M.tar.gz
    Annotations (17.3GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_1M.json
    Images (31.2GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_images_1M.tar.gz

Complex background image & annotations
  Full (9.7MB): https://storage.googleapis.com/lgcede/CEDe_complex_background_v0.1.tar.gz
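The synthetic download links above follow a regular naming pattern, so the file set for a given size can be assembled programmatically. The following is a minimal sketch, not an official client: the helper names `synthetic_urls` and `download` are hypothetical, and the URL pattern is inferred only from the links listed in this entry. Nothing is assumed here about the internal schema of the annotation files.

```python
from __future__ import annotations

from pathlib import Path
from urllib.request import urlretrieve

# Base location and size tags are taken from the links listed above.
BASE_URL = "https://storage.googleapis.com/lgcede"
SYNTHETIC_SIZES = ("10k", "50k", "100k", "1M")


def synthetic_urls(size: str) -> dict[str, str]:
    """Return the three download URLs for one synthetic dataset size.

    Hypothetical helper: the pattern is inferred from the listed links.
    """
    if size not in SYNTHETIC_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SYNTHETIC_SIZES}")
    return {
        "full": f"{BASE_URL}/CEDe_synthetic_data_{size}.tar.gz",
        "annotations": f"{BASE_URL}/CEDe_synthetic_data_{size}.json",
        "images": f"{BASE_URL}/CEDe_synthetic_images_{size}.tar.gz",
    }


def download(url: str, dest_dir: str = ".") -> Path:
    """Fetch a URL into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not dest.exists():
        urlretrieve(url, dest)  # blocking download; files can be several GB
    return dest


if __name__ == "__main__":
    # Only print the links; call download(url) explicitly to fetch a file.
    for kind, url in synthetic_urls("10k").items():
        print(kind, url)
```

Since every smaller dataset is contained in the larger versions, starting from the 10k set and moving up only when needed avoids fetching the larger archives unnecessarily.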
License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/legalcode
Author Statement: Yes
Contribution Process Agreement: Yes
In Person Attendance: Yes