## Description

In a technology-forward world, sometimes the best and easiest tools are still pen and paper. Organic chemists frequently draw out molecular work using the skeletal formula, a structural notation used for centuries. Recent publications are also annotated with machine-readable chemical descriptions (InChI), but there are decades of scanned documents that can't be automatically searched for specific chemical depictions. Automated recognition of optical chemical structures, with the help of machine learning, could speed up research and development efforts.

Unfortunately, most public data sets are too small to support modern machine learning models. Existing tools achieve 90% accuracy, but only under optimal conditions. Historical sources often have some level of image corruption, which reduces performance to near zero. In these cases, time-consuming manual work is required to reliably convert scanned chemical structure images into a machine-readable format.

Bristol-Myers Squibb is a global biopharmaceutical company working to transform patients' lives through science. Their mission is to discover, develop, and deliver innovative medicines that help patients prevail over serious diseases.

In this competition, you'll interpret old chemical images. With access to a large set of synthetic image data generated by Bristol-Myers Squibb, you'll convert images back to the underlying chemical structure annotated as InChI text.

Tools to curate chemistry literature would be a significant benefit to researchers. If successful, you'll help chemists expand access to collective chemical research. In turn, this would speed up research and development efforts in many key fields by avoiding repetition of previously published chemistries and identifying novel trends via mining large data sets.


## Evaluation

Submissions are evaluated on the mean [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance) between the InChI strings you submit and the ground truth InChI values.
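
The metric can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring code: the function names `levenshtein` and `mean_levenshtein` are hypothetical, and it assumes the standard edit distance in which each insertion, deletion, and substitution costs 1. Lower scores are better, and a score of 0 means every prediction matches exactly.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    # Keep the shorter string in the inner loop so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def mean_levenshtein(predictions, targets):
    """Average edit distance over paired predictions and ground truth strings."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    total = sum(levenshtein(p, t) for p, t in zip(predictions, targets))
    return total / len(predictions)
```

For millions of rows, a compiled edit-distance library would likely be faster than this pure-Python loop, but the arithmetic is the same.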

## Submission File

For each `image_id` in the test set, you must predict the InChI string of the molecule in the corresponding image. The file should contain a header and have the following format:

```
image_id,InChI
00000d2a601c,InChI=1S/H2O/h1H2
00001f7fc849,InChI=1S/H2O/h1H2
000037687605,InChI=1S/H2O/h1H2
etc.
```
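
InChI strings for molecules larger than water usually contain commas (for example in the hydrogen layer), so those fields need to be quoted in a CSV file. The sketch below shows one way to write such a file with Python's standard `csv` module, which quotes fields containing commas by default. The helper name `write_submission` is hypothetical, and the ethanol InChI is only an illustrative value, not a label from the competition data.

```python
import csv
import io


def write_submission(rows, handle):
    """Write (image_id, InChI) pairs as CSV with the expected header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["image_id", "InChI"])
    for image_id, inchi in rows:
        # csv.writer quotes any field that contains a comma.
        writer.writerow([image_id, inchi])


# Illustrative predictions: water, then ethanol (whose InChI contains commas).
example_rows = [
    ("00000d2a601c", "InChI=1S/H2O/h1H2"),
    ("00001f7fc849", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
]

buffer = io.StringIO()
write_submission(example_rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as `handle`.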

## Timeline

Update May 28, 2021: The competition deadline has been extended 24 hours, from June 2, 2021 at 11:59 PM UTC to June 3, 2021 at 11:59 PM UTC. See [this forum post](https://www.kaggle.com/c/bms-molecular-translation/discussion/242403) for additional details.

- March 2, 2021 - Competition start date.

- May 26, 2021 - Entry deadline. You must accept the competition rules before this date in order to compete.

- May 26, 2021 - Team Merger deadline. This is the last day participants may join or merge teams.

- June 3, 2021 - Final submission deadline.

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

## Prizes

- 1st Place - $25,000
- 2nd Place - $15,000
- 3rd Place - $10,000

## Citation

Addison Howard, inversion, Jacob Albrecht, Yvette. (2021). Bristol-Myers Squibb -- Molecular Translation. Kaggle. https://kaggle.com/competitions/bms-molecular-translation

# Data

## Dataset Description

In this competition, you are provided with images of chemicals, with the objective of predicting the corresponding [International Chemical Identifier](https://en.wikipedia.org/wiki/International_Chemical_Identifier) (InChI) text string of the image. The images provided (both in the training data as well as the test data) may be rotated to different angles, be at various resolutions, and have different noise levels.

Note: There are about 4 million images in total in this dataset. Unzipping the downloaded data will take a considerable amount of time.

## Files

- **train/** - the training images, arranged in a 3-level folder structure by `image_id`
- **test/** - the test images, arranged in the same folder structure as `train/`
- **train_labels.csv** - ground truth InChI labels for the training images
- **sample_submission.csv** - a sample submission file in the correct format
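
A small helper can map an `image_id` to its file path. The sketch below rests on two assumptions that the file list does not spell out: that the three folder levels are named after the first three characters of the `image_id`, and that the images are stored as `.png` files. The name `image_path` and the `data` root directory are hypothetical; check the layout against the unzipped data before relying on it.

```python
from pathlib import Path


def image_path(root, split, image_id, extension=".png"):
    """Build the path to an image under the 3-level folder structure.

    Assumes (not confirmed by the file list) that the folder levels are the
    first three characters of image_id and that files use the given extension.
    """
    if len(image_id) < 3:
        raise ValueError("image_id must have at least three characters")
    return (
        Path(root)
        / split
        / image_id[0]
        / image_id[1]
        / image_id[2]
        / f"{image_id}{extension}"
    )


# Example with an id taken from the sample submission format.
print(image_path("data", "test", "00000d2a601c").as_posix())
```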
