IMG2InChI: Extracting Molecular Big Data from Chemical Images Using Transformer Models

Published: 01 Jan 2023, Last Modified: 06 Feb 2025. Venue: GLOBECOM 2023. License: CC BY-SA 4.0
Abstract: Machine learning methods are extensively used to develop new drugs and materials, and molecular big data plays a vital role in this process. The large number of molecular structures depicted in documents should be put to good use in constructing molecular big data. Although optical character recognition technology has been widely applied to extract molecular data from scanned images, its recognition accuracy still needs improvement because of the complexity and sparsity of molecular structures and the fuzziness of scanned molecular images. In this paper, a novel Transformer-based model is used to automatically extract molecular features from images. The extracted molecular features are then translated into InChI descriptors by a second Transformer model. The experimental results suggest that the proposed method outperforms other deep-learning-based methods, including ResNet and LSTM (Long Short-Term Memory). Moreover, the data extraction process is visualized so that the model's behavior can be interpreted, which is significant for understanding the mechanism of molecule representation. The results also demonstrate that the proposed method can extract molecular features at a fine granularity, such as individual atoms and chemical bonds.
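To make the output side of the pipeline concrete, the following is a minimal sketch of how an InChI descriptor might be broken into layers and tokens for a sequence decoder. The abstract does not describe the vocabulary or tokenization actually used, so the function names, the `<sos>`/`<eos>` markers, and the token pattern are all assumptions made for illustration. The only fixed fact is the standard InChI of ethanol used as the example.

```python
import re
from typing import List

# Assumed token pattern (not taken from the paper): an element-like symbol,
# a run of digits, or any other single character.
TOKEN_RE = re.compile(r"[A-Z][a-z]?|\d+|.")

PREFIX = "InChI="


def split_layers(inchi: str) -> List[str]:
    """Split an InChI string into its slash-separated layers."""
    if not inchi.startswith(PREFIX):
        raise ValueError("not an InChI string")
    return inchi[len(PREFIX):].split("/")


def tokenize(inchi: str) -> List[str]:
    """Turn an InChI string into decoder tokens, keeping '/' as a layer separator."""
    tokens = ["<sos>"]  # hypothetical start-of-sequence marker
    for i, layer in enumerate(split_layers(inchi)):
        if i > 0:
            tokens.append("/")
        tokens.extend(TOKEN_RE.findall(layer))
    tokens.append("<eos>")  # hypothetical end-of-sequence marker
    return tokens


def detokenize(tokens: List[str]) -> str:
    """Rebuild the InChI string from tokens, dropping the sequence markers."""
    body = "".join(t for t in tokens if t not in ("<sos>", "<eos>"))
    return PREFIX + body


# Standard InChI for ethanol: version, formula, connectivity and hydrogen layers.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

if __name__ == "__main__":
    print(split_layers(ETHANOL))
    print(tokenize(ETHANOL))
```

Because the layers encode formula, connectivity, and hydrogens separately, a decoder that emits such tokens has to recover atoms and bonds from the image features, which is consistent with the fine-grained extraction the abstract reports.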