CEDe: A collection of expert-curated datasets with atom-level entity annotations for Optical Chemical Structure Recognition
Keywords: Chemical structure recognition, Chemical image-to-structure translation, Atom-level instance annotations of molecular images.
TL;DR: A collection of datasets containing more than 700,000 atom-level entity annotations and their corresponding bounding boxes. These labels provide all the information needed for complete chemical graph reconstruction.
Abstract: Optical Chemical Structure Recognition (OCSR) deals with the translation of chemical images, the main way chemical compounds are depicted in scientific documents, into molecular structures. Traditionally, rule-based methods have followed a framework based on the detection of chemical entities, such as atoms and bonds, followed by a compound structure reconstruction step. Recently, neural architectures analogous to image captioning have been explored for this task, yet they remain data inefficient, requiring millions of examples just to reach performance comparable to traditional methods. To motivate and benchmark new approaches based on atom-level entity detection and graph reconstruction, we present CEDe, a unique collection of chemical entity bounding boxes manually curated by experts for scientific literature datasets. These annotations total more than 700,000 chemical entity bounding boxes with the necessary information for structure reconstruction. We also release a large synthetic dataset containing one million molecular images and annotations to explore transfer-learning techniques that could help these architectures perform better in low-data regimes. Benchmarks show that detection-reconstruction based models can achieve performance on par with or better than image captioning-like models, even with 100x fewer training examples.
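The detection-then-reconstruction framework mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Box` class, its field names, and the nearest-two-atoms heuristic are hypothetical and are neither the CEDe annotation schema nor the authors' reconstruction method. The sketch only shows how atom and bond bounding boxes can carry enough information to recover a molecular graph.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    """Axis-aligned bounding box with a class label (hypothetical schema)."""
    label: str   # e.g. "C" or "O" for atoms, "single" or "double" for bonds
    x: float     # left edge
    y: float     # top edge
    w: float     # width
    h: float     # height

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def reconstruct_graph(atoms, bonds):
    """Link each bond box to the two atoms whose centers lie closest to its center.

    Returns a list of (atom_index_a, atom_index_b, bond_label) edges.
    This is a simple heuristic and can fail on crowded drawings.
    """
    edges = []
    for bond in bonds:
        bx, by = bond.center
        ranked = sorted(
            range(len(atoms)),
            key=lambda i: hypot(atoms[i].center[0] - bx, atoms[i].center[1] - by),
        )
        if len(ranked) >= 2:
            a, b = sorted(ranked[:2])
            edges.append((a, b, bond.label))
    return edges


# Toy example: a three-atom chain C-C-O drawn left to right.
atoms = [Box("C", 0, 0, 2, 2), Box("C", 10, 0, 2, 2), Box("O", 20, 0, 2, 2)]
bonds = [Box("single", 2, 0, 8, 2), Box("single", 12, 0, 8, 2)]
edges = reconstruct_graph(atoms, bonds)
print(edges)
```

In a real pipeline the boxes would come from an object detector, and the edge list would then be converted to a molecular representation by a cheminformatics toolkit.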
Author Statement: Yes
URL: https://storage.googleapis.com/lgcede/CEDe_dataset_v0.2.tar.gz
Dataset Url: We provide several options for downloading the CEDe dataset. Image data and annotations can be downloaded
separately or as one compressed file. Several dataset sizes are also provided (each smaller dataset
is fully contained in the larger ones).
CEDe real data
Full (135.7MB): https://storage.googleapis.com/lgcede/CEDe_dataset_v0.2.tar.gz
Annotations (194MB): https://storage.googleapis.com/lgcede/CEDe_dataset_v0.2.json
Train split annotations (38.5MB): https://storage.googleapis.com/lgcede/CEDe_dataset_finetune_split_v0.2.json
Test split annotations (156MB): https://storage.googleapis.com/lgcede/CEDe_dataset_test_split_v0.2.json
Images (53.6MB): https://storage.googleapis.com/lgcede/CEDe_dataset_images_v0.2.tar.gz
Synthetic data
10K Images
Full (334MB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_10k.tar.gz
Annotations (177MB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_10k.json
Images (320MB): https://storage.googleapis.com/lgcede/CEDe_synthetic_images_10k.tar.gz
50K Images
Full (1.6GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_50k.tar.gz
Annotations (887MB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_50k.json
Images (1.6GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_images_50k.tar.gz
100K Images
Full (3.3GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_100k.tar.gz
Annotations (1.7GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_100k.json
Images (3.1GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_images_100k.tar.gz
1M Images
Full (32.5GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_1M.tar.gz
Annotations (17.3GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_data_1M.json
Images (31.2GB): https://storage.googleapis.com/lgcede/CEDe_synthetic_images_1M.tar.gz
Complex background image & annotations:
Full (9.7MB): https://storage.googleapis.com/lgcede/CEDe_complex_background_v0.1.tar.gz
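As a convenience, the synthetic-data links listed above follow a regular naming pattern, so they can be generated and fetched programmatically. The sketch below is only an illustration: the helper names (`synthetic_urls`, `download`, `describe_annotations`) are hypothetical, and because the internal layout of the annotation JSON is not described here, the last helper only reports the top-level structure rather than assuming any schema.

```python
import json
import urllib.request
from pathlib import Path

BASE_URL = "https://storage.googleapis.com/lgcede"
SYNTHETIC_SIZES = ("10k", "50k", "100k", "1M")


def synthetic_urls(size):
    """Return the three download URLs listed above for one synthetic subset."""
    if size not in SYNTHETIC_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SYNTHETIC_SIZES}")
    return {
        "full": f"{BASE_URL}/CEDe_synthetic_data_{size}.tar.gz",
        "annotations": f"{BASE_URL}/CEDe_synthetic_data_{size}.json",
        "images": f"{BASE_URL}/CEDe_synthetic_images_{size}.tar.gz",
    }


def download(url, dest_dir="."):
    """Download url into dest_dir unless the file already exists; return its path."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def describe_annotations(path):
    """Load an annotation file and report its top-level structure (no schema assumed)."""
    with open(path) as fh:
        data = json.load(fh)  # loads the whole file into memory
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    return {"<root>": type(data).__name__}


# Example usage (commented out because it triggers a download of about 177MB):
# path = download(synthetic_urls("10k")["annotations"])
# print(describe_annotations(path))
```

Since `json.load` reads the entire file at once, this approach suits the smaller subsets; the 1M annotation file (17.3GB) would likely need a streaming parser instead.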
License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/legalcode
Supplementary Material: pdf
Contribution Process Agreement: Yes
In Person Attendance: Yes