# PoGE: Ensuring Physicochemical Fidelity of Generated Polymers

**Anonymous Authors**  
*Submitted to ICLR 2026*

This repository contains the implementation of PoGE (Polymer Generation Engine), a GPT-2-based model for generating chemically valid polymer SMILES strings with high physicochemical fidelity.

## Overview

PoGE is a transformer-based model trained on large-scale polymer datasets. It uses a custom byte-pair encoding (BPE) tokenizer and polymer-specific validation to ensure that generated structures satisfy chemical constraints.

## Installation

### Using Conda (Recommended)

```bash
# Create and activate the environment
conda env create -f environment.yml
conda activate poge
```

### Using Pip

```bash
pip install -r requirements.txt
```

## Quick Start

### Generating Polymers

Generate polymer SMILES using the pre-trained model:

```bash
python generate.py -n 1000 --output generated_polymers.json
```

This generates 1,000 polymer SMILES strings and saves them to `generated_polymers.json`.

### Evaluating Generated Polymers

Use the comprehensive metrics suite to evaluate generated polymers:

```python
from src.metrics import get_all_metrics
import json

# Load generated polymers
with open('generated_polymers.json', 'r') as f:
    generated_smiles = json.load(f)

# Load test set (replace with your test data)
test_smiles = [...]  # Your test polymer SMILES

# Compute all metrics
metrics = get_all_metrics(
    gen=generated_smiles,
    test=test_smiles,
    k=[1000, 10000],  # unique@k values
    n_jobs=4,  # parallel processing
    device='cpu'
)

print("Evaluation Results:")
for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.4f}")
```

## Available Metrics

The `get_all_metrics` function computes the following metrics:

### Basic Quality Metrics
- **valid**: Fraction of valid SMILES strings
- **p-valid**: Fraction of valid polymer SMILES (with exactly 2 attachment points)
- **unique@k**: Fraction of unique molecules among the first k generated molecules
- **Novel**: Fraction of novel molecules not in training set
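As a rough illustration of what these definitions mean, the sketch below computes string-level versions of unique@k, novelty, and the two-attachment-point check. It is not the code in `src/metrics.py`: the function names are hypothetical, and it compares raw strings, whereas a real implementation would normally canonicalize SMILES with a cheminformatics toolkit first.

```python
def unique_at_k(smiles, k):
    """Fraction of distinct strings among the first k entries."""
    head = smiles[:k]
    return len(set(head)) / len(head) if head else 0.0


def novelty(generated, reference):
    """Fraction of distinct generated strings that are absent from the reference set."""
    gen_set = set(generated)
    if not gen_set:
        return 0.0
    return len(gen_set - set(reference)) / len(gen_set)


def has_two_attachment_points(smiles):
    """Crude string-level check (assumption): exactly two '*' characters."""
    return smiles.count("*") == 2


generated = ["*CC*", "*CC*", "*CC(C)*", "*c1ccc(*)cc1"]
print(unique_at_k(generated, 4))            # 3 distinct strings out of 4
print(novelty(generated, ["*CC*"]))         # 2 of the 3 distinct strings are new
print(has_two_attachment_points("*CC*"))
```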

### Diversity Metrics
- **SNN**: Similarity to nearest neighbor
- **IntDiv**: Internal diversity
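Both diversity metrics are usually built on a pairwise similarity between molecular fingerprints. The sketch below shows one common formulation using Tanimoto similarity. It is an illustration under assumptions, not the repository's implementation: the character n-gram "fingerprint" is a toy stand-in for a real molecular fingerprint, and all function names are hypothetical.

```python
from itertools import combinations


def ngram_fingerprint(smiles, n=2):
    """Toy stand-in for a molecular fingerprint: the set of character n-grams."""
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def snn(generated, reference):
    """Mean similarity of each generated item to its nearest reference item."""
    gen_fps = [ngram_fingerprint(s) for s in generated]
    ref_fps = [ngram_fingerprint(s) for s in reference]
    return sum(max(tanimoto(g, r) for r in ref_fps) for g in gen_fps) / len(gen_fps)


def internal_diversity(generated):
    """One minus the mean pairwise similarity within the generated set."""
    fps = [ngram_fingerprint(s) for s in generated]
    pairs = list(combinations(fps, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


print(snn(["*CC*", "*CCO*"], ["*CC*", "*CCN*"]))
print(internal_diversity(["*CC*", "*CCO*", "*c1ccc(*)cc1"]))
```

Higher SNN means the generated set sits closer to the reference set, while higher IntDiv means the generated molecules differ more from one another.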

### Physicochemical Property Distribution Comparison
- **molar_mass**: Molecular weight distribution comparison
- **aromatic_fraction**: Aromatic atom fraction distribution
- **rotatable_bond_fraction**: Rotatable bond fraction distribution
- **heteroatom_fraction**: Heteroatom fraction distribution
- **tpsa**: Topological polar surface area distribution
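This README does not state which statistic is used to compare the generated and test distributions. One common choice for one-dimensional property distributions is the Wasserstein-1 (earth mover's) distance, sketched below as an assumption; `wasserstein_1d` is a hypothetical helper and the property values are made up for illustration.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two empirical 1-D samples.

    Computed as the integral of |F_u(x) - F_v(x)| over x, where F_u and F_v
    are the empirical CDFs of the two samples.
    """
    su, sv = sorted(u), sorted(v)
    grid = sorted(su + sv)
    distance = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_u = bisect_right(su, left) / len(su)
        cdf_v = bisect_right(sv, left) / len(sv)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance


# Made-up property values for a generated set and a test set
generated_values = [0.10, 0.25, 0.40, 0.55]
test_values = [0.15, 0.30, 0.35, 0.60]
print(wasserstein_1d(generated_values, test_values))
```

A value near zero indicates that the generated property distribution closely matches the test distribution; the result is expressed in the units of the property itself.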

## Model Architecture

- **Base Model**: GPT-2 (6 layers, 8 heads, 256 embedding dimension)
- **Vocabulary Size**: 546 tokens
- **Tokenizer**: Custom BPE tokenizer trained on polymer SMILES
- **Training**: Two-stage training (pretraining + supervised fine-tuning)
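To give a sense of scale, the hyperparameters above can be turned into an approximate parameter count. The sketch below assumes the standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, tied input and output embeddings) and a context length of 1024 positions, which is the GPT-2 default and is not stated in this README. `gpt2_param_count` is a hypothetical helper, not part of the repository.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size, n_positions):
    """Approximate parameter count for a standard GPT-2 language model.

    The number of attention heads does not appear here because splitting the
    embedding across heads does not change the size of the projections.
    """
    d = n_embd
    token_embeddings = vocab_size * d          # shared with the output head
    position_embeddings = n_positions * d
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # expand to 4d, project back
    layer_norms = 2 * (2 * d)                  # two LayerNorms per block
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return token_embeddings + position_embeddings + n_layer * per_layer + final_layer_norm


# n_positions=1024 is an assumption (GPT-2 default), not taken from this README
print(gpt2_param_count(n_layer=6, n_embd=256, vocab_size=546, n_positions=1024))
```

Under these assumptions the count comes to roughly 5.1 million parameters. A different context length changes only the position-embedding term.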

## Data

The repository includes:
- Pre-trained model weights in `data/sft/`
- BPE tokenizer model in `data/smiles_bpe_tokenizer_543.model`
- Sample generated polymers in `poge_10m_1_shard.json.tar.gz` and `poge_10m_2_shard.json.tar.gz`

## File Structure

```
poge/
├── src/
│   ├── tokenizer.py     # Custom BPE tokenizer wrapper
│   ├── metrics.py       # Comprehensive evaluation metrics
│   ├── utils.py         # Polymer validation utilities
│   ├── pretrain.py      # Pretraining script
│   └── sft.py           # Supervised fine-tuning script
├── data/
│   ├── sft/             # Fine-tuned model weights
│   └── smiles_bpe_tokenizer_543.model  # Tokenizer
├── generate.py          # Polymer generation script
├── requirements.txt     # Python dependencies
├── environment.yml      # Conda environment
└── README.md            # This file
```

## Usage Examples

### Generate Polymers with Custom Parameters

```python
from src.tokenizer import TokenizerWrapper
from transformers import GPT2LMHeadModel
import torch
import json

# Load model and tokenizer
tokenizer = TokenizerWrapper('data/smiles_bpe_tokenizer_543.model')
model = GPT2LMHeadModel.from_pretrained('data/sft/').eval()

# Custom generation parameters
generation_params = {
    'top_p': 0.9,
    'top_k': 50,
    'temperature': 1.0,
    'max_new_tokens': 200
}

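# Generate polymers one at a time, starting each sequence from the BOS token
bos_id = tokenizer.tokenizer.bos_id()
prompt = torch.full((1, 1), bos_id, dtype=torch.int64)

generated_smiles = []
for _ in range(100):
    output_ids = model.generate(
        prompt,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        **generation_params
    ).cpu()
    # Keep the first (and only) decoded sequence of the batch
    generated_smiles.append(tokenizer.decode(output_ids)[0])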

# Save results
with open('custom_generated.json', 'w') as f:
    json.dump(generated_smiles, f)
```
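The `temperature`, `top_k`, and `top_p` settings jointly shape the distribution each token is sampled from. The following is a simplified, dependency-free sketch of that filtering for a single step, assuming the usual order (temperature scaling, then top-k, then nucleus truncation). `sampling_distribution` is a hypothetical helper for illustration and does not reproduce the exact internals of `model.generate`.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    scaled = [x / temperature for x in logits]

    # Top-k: keep the k highest-scoring tokens (top_k <= 0 disables the filter)
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k] if top_k > 0 else order

    # Softmax over the surviving tokens (subtract the max for numerical stability)
    peak = max(scaled[i] for i in kept)
    weights = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the remaining probabilities sum to one
    norm = sum(probs[i] for i in nucleus)
    return {i: probs[i] / norm for i in nucleus}


toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(sampling_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8))
```

Lowering `temperature`, `top_k`, or `top_p` concentrates sampling on the most likely tokens, which tends to trade diversity for validity.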

### Validate Polymer Structures

```python
from src.utils import is_polymer

# Check whether a SMILES string is a valid polymer repeat unit
smiles = "*CC*"  # Polyethylene repeat unit; each `*` marks an attachment point
is_valid = is_polymer(smiles)
print(f"Is valid polymer: {is_valid}")
```

## Citation

If you use this code in your research, please cite:

```bibtex
@article{anonymous2026poge,
  title={Ensuring Physicochemical Fidelity of Generated Polymers with PoGE},
  author={Anonymous Authors},
  journal={ICLR},
  year={2026}
}
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Contact

For questions about this implementation, please contact the anonymous authors through the ICLR submission system.
