# Supplementary Dataset for MolLangBench

This repository contains the supplementary dataset for MolLangBench, extracted for review purposes and organized into three task categories: recognition, generation, and edit.

## Important Notes

1. **Recognition Tasks**: Because of the size limit on supplementary materials, molecule images are not included for recognition tasks. They can be generated from the SMILES strings with RDKit.

2. **Generation and Edit Tasks**: The complete datasets are provided. Molecule images, descriptions, and editing instructions are organized into per-sample folders.

## Directory Structure

```
supplementary_datasets/
├── recognition/
│   ├── aldehyde/
│   │   ├── test/test.csv
│   │   └── train/train.csv
│   ├── benzene/
│   │   ├── test/test.csv
│   │   └── train/train.csv
│   ├── ... (23 recognition subtasks total)
│   └── two_hop_neighbors/
│       ├── test/test.csv
│       └── train/train.csv
├── generation/
│   ├── core/
│   │   ├── core.csv
│   │   ├── sample_000/
│   │   │   ├── molecule_0.png
│   │   │   └── description.txt
│   │   ├── sample_001/
│   │   │   ├── molecule_1.png
│   │   │   └── description.txt
│   │   └── ... (200 samples total)
│   └── extended/
│       ├── extended.csv
│       ├── sample_000/
│       │   ├── molecule_0.png
│       │   └── description.txt
│       ├── sample_001/
│       │   ├── molecule_1.png
│       │   └── description.txt
│       └── ... (200 samples total)
└── edit/
    ├── core/
    │   ├── core.csv
    │   ├── sample_000/
    │   │   ├── original_molecule_0.png
    │   │   ├── edited_molecule_0.png
    │   │   └── edit_instruction.txt
    │   ├── sample_001/
    │   │   ├── original_molecule_1.png
    │   │   ├── edited_molecule_1.png
    │   │   └── edit_instruction.txt
    │   └── ... (200 samples total)
    └── extended/
        ├── extended.csv
        ├── sample_000/
        │   ├── original_molecule_0.png
        │   ├── edited_molecule_0.png
        │   └── edit_instruction.txt
        ├── sample_001/
        │   ├── original_molecule_1.png
        │   ├── edited_molecule_1.png
        │   └── edit_instruction.txt
        └── ... (200 samples total)
```
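To confirm that a local copy matches the tree above, the layout can be summarized with a short script. This is a minimal sketch: the function name `summarize_layout` is hypothetical, and the root path is whatever directory holds your copy of `supplementary_datasets/`.

```python
from pathlib import Path


def summarize_layout(root):
    """Return recognition subtask names and sample-folder counts found under `root`."""
    root = Path(root)
    # A recognition subtask is a directory holding test/test.csv and train/train.csv.
    recognition = sorted(
        p.name
        for p in (root / "recognition").iterdir()
        if (p / "test" / "test.csv").is_file() and (p / "train" / "train.csv").is_file()
    )
    # Generation and edit tasks each have a core and an extended split of sample_* folders.
    sample_counts = {}
    for task in ("generation", "edit"):
        for split in ("core", "extended"):
            split_dir = root / task / split
            sample_counts[f"{task}/{split}"] = sum(
                1 for p in split_dir.glob("sample_*") if p.is_dir()
            )
    return {"recognition_subtasks": recognition, "sample_counts": sample_counts}
```

Calling `summarize_layout("supplementary_datasets")` on a complete copy should report 23 recognition subtasks and 200 sample folders for each of the four generation and edit splits.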

## Dataset Description

### Recognition Tasks
- **23 subtasks** covering functional groups, ring systems, and structural features
- **CSV Format**: `smiles,target_atoms,result_1,result_2,note`
- **Content**: CSV files only (images can be generated from the SMILES strings with RDKit)
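
One possible way to regenerate the recognition images is sketched below. It assumes each CSV has a header row with a `smiles` column (as in the format above) and uses RDKit's `Chem.MolFromSmiles` and `Draw.MolToFile`. The function names, the 300×300 image size, and the `molecule_<row index>.png` output naming are illustrative choices, not necessarily the settings used to produce the benchmark's own images.

```python
import csv
from pathlib import Path


def smiles_to_png(smiles, out_path, size=(300, 300)):
    """Draw one SMILES string to a PNG file; return False if RDKit cannot parse it."""
    # RDKit is imported here so the CSV helper below can be used without it installed.
    from rdkit import Chem
    from rdkit.Chem import Draw

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # MolFromSmiles returns None for unparsable input
        return False
    Draw.MolToFile(mol, str(out_path), size=size)
    return True


def render_csv(csv_path, out_dir, render=smiles_to_png):
    """Render every row's `smiles` value to out_dir/molecule_<row index>.png.

    Assumes the CSV has a header row containing a `smiles` column.
    Returns the list of image paths for rows that were rendered.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for index, row in enumerate(csv.DictReader(handle)):
            target = out_dir / f"molecule_{index}.png"  # hypothetical naming scheme
            if render(row["smiles"], target):
                written.append(target)
    return written
```

For example, `render_csv("supplementary_datasets/recognition/aldehyde/test/test.csv", "images/aldehyde/test")` would write one PNG per parsable row. The `render` parameter exists only so a different drawing function can be swapped in.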

### Generation Tasks
- **Core Dataset**: 200 samples with molecule images and structure descriptions
- **Extended Dataset**: 200 additional samples
- **CSV Format**: `smiles,structure_description`
- **Sample folders**: Each contains a molecule image (`molecule_<i>.png`) and a description text file (`description.txt`)
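
The sketch below pairs each CSV row with its sample folder. It rests on an assumption that the README does not state explicitly: that row `i` of the CSV (after a header row) corresponds to `sample_<i, zero-padded to three digits>/molecule_<i>.png`, as the directory tree suggests. The function name `load_generation_split` is hypothetical.

```python
import csv
from pathlib import Path


def load_generation_split(split_dir):
    """Pair CSV rows with sample folders, assuming row i maps to sample_<i:03d>."""
    split_dir = Path(split_dir)
    csv_path = split_dir / f"{split_dir.name}.csv"  # core/core.csv or extended/extended.csv
    samples = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for index, row in enumerate(csv.DictReader(handle)):
            folder = split_dir / f"sample_{index:03d}"
            description_file = folder / "description.txt"
            samples.append({
                "smiles": row["smiles"],
                # Text is read from the sample folder rather than the CSV column.
                "description": description_file.read_text(encoding="utf-8").strip(),
                "image": folder / f"molecule_{index}.png",
            })
    return samples
```

If the row-to-folder assumption holds, `load_generation_split("supplementary_datasets/generation/core")` returns 200 entries. Comparing each `description` with the row's `structure_description` value is a quick way to check that assumption.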

### Edit Tasks
- **Core Dataset**: 200 samples with molecular editing tasks
- **Extended Dataset**: 200 additional samples
- **CSV Format**: `original_smiles,edit_instructions,edited_smiles`
- **Sample folders**: Each contains the original and edited molecule images and an instruction text file (`edit_instruction.txt`)
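
As a minimal sketch, an edit-task CSV can be parsed into typed records with the standard library alone. This assumes the file has a header row with the three columns listed above. `EditExample` and `read_edit_csv` are hypothetical names introduced for illustration.

```python
import csv
from dataclasses import dataclass


@dataclass(frozen=True)
class EditExample:
    original_smiles: str
    edit_instructions: str
    edited_smiles: str


def read_edit_csv(csv_path):
    """Parse an edit-task CSV into EditExample records (header row assumed)."""
    # csv.DictReader handles quoted fields, so instructions containing commas stay intact.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [
            EditExample(
                original_smiles=row["original_smiles"],
                edit_instructions=row["edit_instructions"],
                edited_smiles=row["edited_smiles"],
            )
            for row in csv.DictReader(handle)
        ]
```

For example, `read_edit_csv("supplementary_datasets/edit/core/core.csv")` would return one record per row. Before comparing a predicted molecule against `edited_smiles`, it is usually safer to canonicalize both strings (for instance with RDKit), because different SMILES strings can describe the same molecule.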