# Conformal Nested Testing for Molecule Generation

This package implements sequential testing methods for evaluating molecule generation models.

## Installation

```bash
# After cloning the repository, enter the project directory
cd sequential_testing

# Install dependencies
pip install -r requirements.txt
```

## Usage

### Using Configuration Files

The recommended way to run sequential testing is with a configuration file:

```bash
# Run with a specific configuration file
python -m src.main --config config/default_config.json
```
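
As a rough illustration of what such a file might hold, the sketch below builds a configuration from the documented parameter names and writes it as JSON. It assumes that the JSON keys match the command-line flag names one-to-one and that `alpha` stays a comma-separated string; neither is confirmed here. The names `write_config` and `my_config.json` are hypothetical and not part of the package. The values are copied from the command-line example in this README.

```python
import json
from pathlib import Path

# Hypothetical configuration: keys mirror the documented parameter names
# (assumed to match the CLI flags), values come from the README's CLI example.
config = {
    "val_data_path": "data/validation_data.csv",
    "test_data_path": "data/test_data.csv",
    "output_dir": "results",
    "feature_model_type": "binary",
    "feature_model_checkpoint": "checkpoints/feature_model.pt",
    "kde_bandwidths": [0.01, 0.05, 0.1, 0.5, 1.0],
    "property_threshold": 0.9,
    "similarity_threshold": 0.2,
    "alpha": "0.5,0.1",  # assumption: kept as a comma-separated string
    "permutations": 1000,
    "seed": 42,
}


def write_config(path, settings):
    """Write settings as pretty-printed JSON and return the resulting Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return path
```

Calling `write_config("config/my_config.json", config)` would produce a file that could then be passed via `--config`, provided the key names above are the ones the package expects.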


### Command Line Arguments

You can also specify all options via command line arguments:

```bash
python -m src.main \
    --val_data_path data/validation_data.csv \
    --test_data_path data/test_data.csv \
    --output_dir results \
    --feature_model_type binary \
    --feature_model_checkpoint checkpoints/feature_model.pt \
    --kde_type negative_only \
    --kde_bandwidths 0.01 0.05 0.1 0.5 1.0 \
    --property_threshold 0.9 \
    --similarity_threshold 0.2 \
    --calibration_type negative \
    --alpha 0.5,0.1 \
    --permutations 1000 \
    --statistic min \
    --seed 42
```

## Configuration Parameters

The following parameters can be specified in the configuration file or as command line arguments:

### Data Paths
- `val_data_path`: Path to validation data CSV
- `test_data_path`: Path to test data CSV
- `output_dir`: Directory to save results

### Model Configuration
- `feature_model_type`: Type of model for feature extraction (`binary` or `multicomponent`)
- `feature_model_checkpoint`: Path to feature extraction model checkpoint
- `scoring_model_type`: Type of model for scoring (defaults to the value of `feature_model_type`)
- `scoring_model_checkpoint`: Path to scoring model checkpoint (only needed if it differs from the feature model checkpoint)

### KDE Parameters
- `kde_bandwidths`: List of candidate bandwidths to try for kernel density estimation (KDE)
- `density_threshold`: Percentile threshold for density filtering
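
To make these two parameters concrete, here is a minimal sketch of how a bandwidth might be picked from a candidate list and how a percentile cutoff on densities might be computed. It is an illustration under stated assumptions, not the package's implementation: it uses a 1-D Gaussian kernel (the real features are presumably higher-dimensional), it assumes bandwidths are compared by leave-one-out log-likelihood, and it uses a nearest-rank percentile. All function names are hypothetical.

```python
import math


def gaussian_kde_density(x, samples, bandwidth):
    """Density at x of a 1-D Gaussian KDE built from `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


def loo_log_likelihood(samples, bandwidth):
    """Leave-one-out log-likelihood of the samples under the KDE."""
    samples = list(samples)
    if len(samples) < 2:
        raise ValueError("need at least two samples for leave-one-out scoring")
    total = 0.0
    for i, x in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        # A tiny floor avoids log(0) when a point is far from all others.
        total += math.log(max(gaussian_kde_density(x, rest, bandwidth), 1e-300))
    return total


def select_bandwidth(samples, bandwidths):
    """Return the candidate bandwidth with the highest leave-one-out score."""
    return max(bandwidths, key=lambda h: loo_log_likelihood(samples, h))


def density_cutoff(densities, percentile):
    """Nearest-rank percentile of `densities`, with percentile in [0, 100]."""
    ordered = sorted(densities)
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

Whether samples below or above the resulting cutoff are kept is not specified in this README, so the sketch only computes the cutoff value.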

### Filtering Thresholds
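- `property_threshold`: Threshold on the property value of the optimized molecule (`PROPERTY_opt` in the test data)
- `similarity_threshold`: Threshold on the similarity between original and optimized molecules (`SIMILARITY_ori_opt` in the test data)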

### Testing Parameters
- `alpha`: Comma-separated list of significance levels
- `max_samples`: Maximum number of total samples to test
- `max_samples_per_group`: Maximum number of samples per unique `SMILES_ori`
- `permutations`: Number of permutations for testing
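
The sketch below shows one way the `alpha` string and the two sample caps could be interpreted. It assumes that the per-group cap is applied before the overall cap and that subsampling is uniform at random; the package may order or sample differently. `parse_alpha` and `cap_samples` are hypothetical helper names.

```python
import random
from collections import defaultdict


def parse_alpha(value):
    """Parse a string such as "0.5,0.1" into a list of floats in (0, 1)."""
    levels = [float(token) for token in value.split(",") if token.strip()]
    for level in levels:
        if not 0.0 < level < 1.0:
            raise ValueError(f"significance level out of range: {level}")
    return levels


def cap_samples(rows, max_samples=None, max_samples_per_group=None, seed=42):
    """Subsample rows per SMILES_ori group first, then cap the overall total."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row["SMILES_ori"]].append(row)

    kept = []
    for members in groups.values():
        if max_samples_per_group is not None and len(members) > max_samples_per_group:
            members = rng.sample(members, max_samples_per_group)
        kept.extend(members)

    if max_samples is not None and len(kept) > max_samples:
        kept = rng.sample(kept, max_samples)
    return kept
```

For example, `parse_alpha("0.5,0.1")` returns `[0.5, 0.1]`, and passing the same `seed` to `cap_samples` reproduces the same subsample.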


### Other Parameters
- `seed`: Random seed for reproducibility

## Example Configurations

The package includes an example configuration file in `config/`:

- `default_config.json`: Standard configuration using uniform sampling and automatic permutation method

## Output

Each run saves its results to the directory given by `output_dir`, including:
- Configuration settings
- Group-level results
- Aggregate metrics
- Density distributions
- P-value distributions

## Data Format

### Validation Data CSV
The file should contain the following columns:
- `smiles_low`: SMILES string for the first molecule
- `smiles_high`: SMILES string for the second molecule
- `label`: Binary label (0 for negative, 1 for positive)
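
Before a run, it can help to check that a validation file matches this layout. The following is a small sketch using only the standard library; `load_validation_rows` is a hypothetical helper, and it assumes labels are written literally as `0` or `1`.

```python
import csv

REQUIRED_VAL_COLUMNS = ("smiles_low", "smiles_high", "label")


def load_validation_rows(path):
    """Read the validation CSV and check its columns and binary labels."""
    rows = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in REQUIRED_VAL_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        for row in reader:
            label = row["label"].strip()
            # Assumption: labels are stored as the literal strings "0" or "1".
            if label not in ("0", "1"):
                raise ValueError(f"line {reader.line_num}: invalid label {label!r}")
            rows.append(
                {
                    "smiles_low": row["smiles_low"],
                    "smiles_high": row["smiles_high"],
                    "label": int(label),
                }
            )
    return rows
```

Extra columns are ignored, so the check stays usable if the file carries additional metadata.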

### Test Data CSV
The file should contain the following columns:
- `SMILES_ori`: Original SMILES string
- `SMILES_opt`: Optimized SMILES string
- `PROPERTY_opt`: Property value of optimized molecule
- `SIMILARITY_ori_opt`: Similarity between original and optimized molecules
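
As an illustration of how these columns relate to `property_threshold` and `similarity_threshold`, the sketch below reads a test file, keeps rows that meet both thresholds, and groups the optimized molecules by `SMILES_ori`. The comparison direction (`>=` for both) is an assumption, and `load_filtered_groups` is a hypothetical helper rather than a function of this package.

```python
import csv
from collections import defaultdict


def load_filtered_groups(path, property_threshold, similarity_threshold):
    """Group SMILES_opt by SMILES_ori for rows that meet both thresholds."""
    groups = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            prop = float(row["PROPERTY_opt"])
            sim = float(row["SIMILARITY_ori_opt"])
            # Assumption: a row passes when both values reach their thresholds.
            if prop >= property_threshold and sim >= similarity_threshold:
                groups[row["SMILES_ori"]].append(row["SMILES_opt"])
    return dict(groups)
```

With the thresholds from the command-line example, the call would be `load_filtered_groups("data/test_data.csv", 0.9, 0.2)`.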
