# Molecular Translation

In this competition, you'll interpret old chemical images. Given a large set of synthetic images generated by Bristol-Myers Squibb, your task is to convert each image back to its underlying chemical structure, expressed as InChI text. Submissions are evaluated using the mean Levenshtein distance between your predicted InChI strings and the ground truth, and the final score is computed as:

```math
1 - \text{mean Levenshtein distance}
```

A higher score indicates better performance.

---

## Problem Overview

The challenge is to convert images of chemical structures into their corresponding InChI strings. Key details include:

- **Data Files:**  
  - Files: `train.csv`, `valid.csv`, and `test.csv`  
  - Contents: Each CSV contains three columns: `image_id`, `InChI`, and `SMILES`.

- **Image Repository:**  
  - All images are stored in the `images` folder, and filenames match the `image_id` values present in the CSV files.

- **Interface File:**  
  - The entry point for your solution is defined in `deepevolve_interface.py`.
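
As a rough illustration of the layout above, the sketch below reads one split CSV and looks up the image for a given `image_id`. The helper names `load_split` and `find_image` are hypothetical and not part of `deepevolve_interface.py`. The image file extension is not stated here, so the lookup assumes the file is named either exactly `image_id` or `image_id.<ext>`.

```python
import csv
from pathlib import Path

# Columns documented for train.csv, valid.csv, and test.csv.
EXPECTED_COLUMNS = ("image_id", "InChI", "SMILES")


def load_split(csv_path):
    """Read one split CSV into a list of dicts, checking the documented columns."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in EXPECTED_COLUMNS if col not in header]
        if missing:
            raise ValueError(f"{csv_path} is missing columns: {missing}")
        return list(reader)


def find_image(images_dir, image_id):
    """Return the path of the image matching image_id, or None if absent.

    Assumption: the file is named exactly `image_id` or `image_id.<ext>`.
    """
    images_dir = Path(images_dir)
    exact = images_dir / image_id
    if exact.is_file():
        return exact
    matches = sorted(images_dir.glob(f"{image_id}.*"))
    return matches[0] if matches else None
```

A typical call would be `rows = load_split("train.csv")` followed by `find_image("images", rows[0]["image_id"])`.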

---

## Evaluation Metric

The performance of your solution is measured by:

$$
\text{Score} = 1 - \text{mean Levenshtein distance}
$$

This metric rewards submissions that minimize the average edit distance between the predicted and actual InChI strings.
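
For local validation, the metric can be approximated with a few lines of pure Python. The sketch below applies the formula literally; the function names are illustrative, and whether the official evaluator normalizes the distance (for example by string length) is not stated here, so treat this as an assumption. With raw edit distances, the resulting score can be negative.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_levenshtein(predictions, references):
    """Average edit distance over paired prediction/reference strings."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one pair is required")
    total = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return total / len(predictions)


def score(predictions, references):
    """Score = 1 - mean Levenshtein distance, as defined above (unnormalized)."""
    return 1.0 - mean_levenshtein(predictions, references)
```

A perfect set of predictions yields a score of exactly 1.0, and every additional edit needed on average lowers the score by one.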

---

## Initial Method Proposal: ResNet+GRU

### Overview

The proposed method, **ResNet+GRU**, formulates molecular translation as an image-to-sequence problem: a deep convolutional network encodes the chemical image, and a recurrent decoder generates the corresponding InChI string.

### Method Details

1. **Visual Feature Extraction:**  
   - A deep convolutional backbone (e.g., ResNet) processes each image to extract a fixed-length feature vector.

2. **Recurrent Decoding:**  
   - This feature vector initializes a GRU (Gated Recurrent Unit) decoder.
   - The decoder generates the InChI string one character at a time.

3. **Character-Level Vocabulary:**  
   - The model constructs a character-level vocabulary that includes special tokens for start, end, and padding.

4. **Training Objective:**  
   - The network is trained end-to-end using cross-entropy loss, aligning the predicted sequence with the true InChI token sequence.
   - The loss function is defined as:

     $$
     \mathcal{L} = - \sum_{t=1}^{T} \log P(y_t \mid y_{<t}, \mathbf{x})
     $$

     where:
     - \( \mathbf{x} \) represents the image features,
     - \( y_t \) is the token at time step \( t \),
     - \( T \) is the total length of the sequence.

5. **Stabilization Techniques:**  
   - The training process incorporates decoder dropout, gradient clipping, and a cosine learning-rate schedule.
   - Model checkpoints are selected based on the validation edit distance between the generated and reference InChI strings.

6. **Decoding:**  
   - At test time, greedy decoding is used: the decoder emits the most probable character at each step until it produces the end marker, and the emitted characters form the predicted InChI string.
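
The vocabulary, loss, and decoding steps above can be sketched framework-free as follows. This is a minimal illustration, not the repository's implementation: the token spellings, the `max_len` cap on decoding, and `step_fn` (a stand-in for the GRU decoder that maps a token prefix to a probability distribution over the vocabulary) are all assumptions made for the example.

```python
import math

# Special tokens for padding, start, and end (spellings are an assumption).
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(inchi_strings):
    """Build a character-level vocabulary with the special tokens first."""
    chars = sorted({ch for text in inchi_strings for ch in text})
    itos = [PAD, SOS, EOS] + chars
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos


def encode(text, stoi, max_len):
    """Map a string to <sos> + characters + <eos>, right-padded to max_len."""
    ids = [stoi[SOS]] + [stoi[ch] for ch in text] + [stoi[EOS]]
    if len(ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    return ids + [stoi[PAD]] * (max_len - len(ids))


def sequence_nll(step_probs, target_ids, pad_id):
    """L = -sum_t log P(y_t | y_<t, x), skipping padded positions.

    step_probs[t] is the predicted distribution at step t.
    """
    return -sum(
        math.log(probs[target])
        for probs, target in zip(step_probs, target_ids)
        if target != pad_id
    )


def greedy_decode(step_fn, stoi, itos, max_len=300):
    """Pick the most probable token each step until <eos> or max_len steps.

    step_fn is a hypothetical stand-in for the trained decoder.
    """
    tokens = [stoi[SOS]]
    for _ in range(max_len):
        probs = step_fn(tokens)
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == stoi[EOS]:
            break
        tokens.append(next_id)
    special = {stoi[PAD], stoi[SOS], stoi[EOS]}
    return "".join(itos[i] for i in tokens[1:] if i not in special)
```

The length cap is a safeguard added for the sketch: a decoder that never emits the end marker would otherwise loop indefinitely.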

### Supplementary Material

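For an in-depth starter guide and additional implementation insights, refer to the following notebook. As its title indicates, it uses an LSTM decoder with attention rather than the GRU decoder described above: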

[InChI ResNet LSTM with Attention Starter Notebook](https://www.kaggle.com/code/yasufuminakama/inchi-resnet-lstm-with-attention-starter/notebook)

---

## Getting Started

1. **Clone the Repository:**  
   Clone the project repository to your local machine.

2. **Review the Interface:**  
   Examine the `deepevolve_interface.py` file to understand the required input and output formats for your solution.

3. **Dataset Familiarization:**  
   Understand the dataset structure by reviewing `train.csv`, `valid.csv`, and `test.csv`, and inspect the images in the `images` folder.

4. **Develop & Test:**  
   Implement your model in your preferred machine learning framework. Ensure your solution adheres to the contest guidelines and evaluation protocol.
