# Raindrop for CD Dataset

This directory contains the complete implementation of Raindrop adapted for the CD (Crohn's Disease) dataset.

## Overview

Raindrop is a graph-guided neural network for irregularly sampled multivariate time series. This implementation adapts it to the CD dataset, whose original format was designed for graph-based models.

## Files

### Core Implementation
- `CD_preprocessing.py` - Converts CD dataset to Raindrop format
- `Raindrop_CD.py` - Complete Raindrop training script for CD dataset
- `run_raindrop_cd.py` - Simple runner script for the complete pipeline
- `test_cd_preprocessing.py` - Test script to verify preprocessing

### Modified Utilities
- `utils_rd.py` - Updated utilities with CD dataset support

## Quick Start

### 1. Run Complete Pipeline
```bash
cd Raindrop/code
python3 run_raindrop_cd.py
```

This will:
1. Run preprocessing if CD data doesn't exist
2. Train Raindrop on CD dataset
3. Report results across 5 splits
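
The control flow of these steps can be sketched roughly as follows. This is an illustration only, not the contents of `run_raindrop_cd.py`: the helper `plan_pipeline` is hypothetical, and the `../CDdata/processed_data` location is an assumption based on the directory layout in File Structure, taken relative to `Raindrop/code`.

```python
from pathlib import Path

# Assumed location of preprocessed data, relative to Raindrop/code.
PROCESSED_DIR = Path("../CDdata/processed_data")

# Training command as documented in the Training step of this README.
TRAIN_CMD = [
    "python3", "Raindrop_CD.py",
    "--dataset", "CD",
    "--withmissingratio", "False",
    "--splittype", "random",
    "--feature_removal_level", "no_removal",
]


def plan_pipeline(processed_dir):
    """Return the commands to run, with preprocessing first if data is missing."""
    processed_dir = Path(processed_dir)
    # Treat a missing or empty directory as "no preprocessed data".
    has_data = processed_dir.is_dir() and any(processed_dir.iterdir())
    commands = []
    if not has_data:
        commands.append(["python3", "CD_preprocessing.py"])
    commands.append(TRAIN_CMD)
    return commands


if __name__ == "__main__":
    for cmd in plan_pipeline(PROCESSED_DIR):
        # Swap print for subprocess.run(cmd, check=True) to actually execute.
        print(" ".join(cmd))
```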

### 2. Step-by-Step Execution

#### Preprocessing
```bash
cd Raindrop/code
python3 CD_preprocessing.py
```

#### Training
```bash
cd Raindrop/code
python3 Raindrop_CD.py --dataset CD --withmissingratio False --splittype random --feature_removal_level no_removal
```

#### Testing
```bash
cd Raindrop/code
python3 test_cd_preprocessing.py
```

## Configuration

### CD Dataset Parameters
- **Biomarkers**: 17 biomarkers (npu02593, npu02902, etc.)
- **Static Features**: 2 (age_norm, sex)
- **Time Steps**: 50 (maximum per patient)
- **Classes**: 2 (patient vs control)
- **Splits**: 5 different train/val/test configurations

### Model Parameters
- **Learning Rate**: 0.0001
- **Epochs**: 20
- **Batch Size**: 128
- **Hidden Dimensions**: 2 * d_model
- **Layers**: 2
- **Heads**: 2
- **Dropout**: 0.2
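
For reference, the values above can be collected in one place. The dataclass below is a hypothetical convenience and is not part of `Raindrop_CD.py`; `d_model` is not specified in this README, so it is left as a required argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaindropCDConfig:
    """Hypothetical container for the hyperparameters listed in this README."""
    d_model: int                  # not specified here; supply your own value
    learning_rate: float = 1e-4
    epochs: int = 20
    batch_size: int = 128
    n_layers: int = 2
    n_heads: int = 2
    dropout: float = 0.2

    @property
    def hidden_dim(self) -> int:
        # The hidden dimension is defined as 2 * d_model.
        return 2 * self.d_model
```

With an arbitrary `d_model` of 32, `RaindropCDConfig(d_model=32).hidden_dim` evaluates to 64.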

## Data Format

### Input Format
Preprocessing converts the CD dataset from its graph-based format to the time series format that Raindrop expects:

**Original CD Format:**
```python
# Per patient: multiple biomarker-specific graphs
graph_dict = {
    'npu02593': (node_features, distances, label),
    'npu02902': (node_features, distances, label),
    # ... more biomarkers
}
```

**Raindrop CD Format:**
```python
# Per patient: unified time series
patient_dict = {
    'id': patient_id,
    'arr': time_series_matrix,  # shape: (50, 17)
    'time': timestamps,         # shape: (50, 1)
    'extended_static': static_features,  # shape: (2,)
    'label': binary_label       # 0=control, 1=patient
}
```
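
As an illustration of this layout, the sketch below packs a few sparse measurements into such a dictionary. It is not the code in `CD_preprocessing.py`: the function `build_record`, the `(timestamp, {column_index: value})` input layout, zero-padding of unused time rows, and keeping the earliest 50 visits are all assumptions made for the example.

```python
import numpy as np

MAX_TIME_STEPS = 50   # maximum time steps per patient
N_BIOMARKERS = 17     # number of biomarker columns


def build_record(patient_id, visits, static_features, label):
    """Pack per-visit measurements into the Raindrop CD dictionary layout.

    `visits` is a list of (timestamp, {column_index: value}) pairs;
    this input layout is an assumption made for the example.
    """
    # Unmeasured entries stay NaN; unused time rows stay 0 (assumed padding).
    arr = np.full((MAX_TIME_STEPS, N_BIOMARKERS), np.nan)
    time = np.zeros((MAX_TIME_STEPS, 1))
    # Sort visits chronologically and keep at most MAX_TIME_STEPS of them.
    ordered = sorted(visits, key=lambda v: v[0])[:MAX_TIME_STEPS]
    for row, (timestamp, values) in enumerate(ordered):
        time[row, 0] = timestamp
        for col, value in values.items():
            arr[row, col] = value
    return {
        "id": patient_id,
        "arr": arr,
        "time": time,
        "extended_static": np.asarray(static_features, dtype=float),
        "label": int(label),
    }


# Made-up values, for shape checking only.
record = build_record(
    patient_id="P001",
    visits=[(30.0, {0: 5.1}), (0.0, {0: 4.8, 3: 1.2})],
    static_features=[0.35, 1.0],   # age_norm, sex
    label=1,
)
print(record["arr"].shape, record["time"].shape, record["extended_static"].shape)
```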

### Biomarkers
The 17 biomarkers included are:
- **White blood cell subtypes**: npu02593, npu02902, npu02636, npu02840, npu01933, npu01349
- **Inflammation markers**: npu19748, npu19717
- **Platelets**: npu03568
- **Hemoglobin**: npu02319
- **Iron**: npu02508
- **Vitamins and folate**: npu02070, npu01700, npu10267
- **Liver function markers**: npu19651, npu01370, npu19673

## Training Process

### 1. Data Preprocessing
- Load individual patient CSV files
- Aggregate biomarker measurements into time series matrices
- Represent missing values as NaN
- Normalize age, keep sex as categorical
- Create 5 different train/val/test splits
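
The last step, creating five splits, could look roughly like the sketch below. The function `make_splits`, the 80/10/10 proportions, and the one-seed-per-split scheme are assumptions for illustration; check `CD_preprocessing.py` for the actual behavior.

```python
import random


def make_splits(patient_ids, n_splits=5, train_frac=0.8, val_frac=0.1, base_seed=0):
    """Create `n_splits` random train/val/test partitions of the patient IDs."""
    ids = sorted(patient_ids)          # fixed starting order for reproducibility
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    splits = []
    for i in range(n_splits):
        shuffled = ids[:]
        random.Random(base_seed + i).shuffle(shuffled)  # one seed per split
        splits.append({
            "train": shuffled[:n_train],
            "val": shuffled[n_train:n_train + n_val],
            "test": shuffled[n_train + n_val:],   # remainder goes to test
        })
    return splits


splits = make_splits([f"P{i:03d}" for i in range(100)])
print(len(splits), [len(splits[0][k]) for k in ("train", "val", "test")])
```

Splitting on patient IDs rather than on individual measurements keeps all of a patient's data in a single partition.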

### 2. Model Training
- **Strategy**: Balanced sampling (strategy=2)
- **Loss**: CrossEntropyLoss
- **Optimizer**: Adam with learning rate 0.0001
- **Scheduler**: ReduceLROnPlateau
- **Validation**: Every epoch
- **Early Stopping**: Based on validation AUC
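
Balanced sampling can be illustrated with a small, dependency-free sketch. It is not the sampler used in `Raindrop_CD.py`: the function `balanced_indices` is hypothetical, and oversampling the minority class with replacement is just one simple way to equalize the two classes.

```python
import random


def balanced_indices(labels, seed=0):
    """Return shuffled sample indices in which both classes appear equally often.

    The minority class is oversampled with replacement until it matches
    the majority class in size.
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    if not by_class[0] or not by_class[1]:
        raise ValueError("both classes must be present to balance them")
    target = max(len(v) for v in by_class.values())
    chosen = []
    for indices in by_class.values():
        # Draw extra samples only for the class that is short of `target`.
        extra = [rng.choice(indices) for _ in range(target - len(indices))]
        chosen.extend(indices + extra)
    rng.shuffle(chosen)
    return chosen


labels = [0] * 8 + [1] * 2            # toy labels: 8 controls, 2 patients
order = balanced_indices(labels)
print(len(order), sum(labels[i] for i in order))
```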

### 3. Evaluation
- **Metrics**: Accuracy, AUROC, AUPRC
- **Testing**: On best model from validation
- **Reporting**: Mean ± std across 5 splits
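
The mean ± std reporting can be sketched as below. The helper `summarize` is hypothetical, and the use of the population standard deviation (the default of `numpy.std`) is an assumption; the training script may use a different convention. The input numbers are dummy values, not results.

```python
import statistics


def summarize(per_split_scores):
    """Format each metric as 'mean ± std' over the splits."""
    summary = {}
    for metric, values in per_split_scores.items():
        mean = statistics.mean(values)
        std = statistics.pstdev(values)   # population standard deviation
        summary[metric] = f"{mean:.1f} ± {std:.1f}"
    return summary


# Dummy per-split values, one per split; not measured results.
dummy_scores = {"metric": [1.0, 2.0, 3.0, 4.0, 5.0]}
print(summarize(dummy_scores))
```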

## Expected Results

Expected performance is roughly:
- **Accuracy**: ~70-80%
- **AUROC**: ~75-85%
- **AUPRC**: ~70-80%

Exact results depend on data quality and model convergence.

## Comparison with GMAN

| Aspect | GMAN (Original) | Raindrop (CD) |
|--------|----------------|---------------|
| **Data Structure** | Graph-based | Time series |
| **Model Type** | Graph Neural Network | Graph-guided, attention-based |
| **Biomarker Handling** | Separate graphs | Unified matrix |
| **Missing Data** | Graph structure | NaN values |
| **Training** | Graph batches | Time series batches |
| **Evaluation** | Graph-level | Patient-level |

## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**
   - Reduce batch size in `Raindrop_CD.py`
   - Reduce `max_time_steps` in preprocessing

2. **Data Not Found**
   - Ensure CD data exists in `../../data/sequential_data/`
   - Run preprocessing first

3. **Import Errors**
   - Install requirements: `pip install -r ../requirements.txt`
   - Check Python version (3.6+)

4. **Model Convergence**
   - Adjust learning rate
   - Increase epochs
   - Check data quality

5. **CPU-only Systems**
   - The implementation supports CPU-only systems
   - Test CPU compatibility: `python3 test_cpu_compatibility.py`
   - Training will be slower on CPU but fully functional

### Debug Mode
```bash
# Test preprocessing only
python3 test_cd_preprocessing.py

# Run training and save all output (stdout and stderr) to a log file
python3 Raindrop_CD.py --dataset CD --withmissingratio False --splittype random --feature_removal_level no_removal 2>&1 | tee raindrop_cd.log
```

## File Structure

```
Raindrop/
├── code/
│   ├── CD_preprocessing.py      # Data preprocessing
│   ├── Raindrop_CD.py          # Main training script
│   ├── run_raindrop_cd.py      # Complete pipeline runner
│   ├── test_cd_preprocessing.py # Test script
│   ├── utils_rd.py             # Modified utilities
│   └── README_CD.md            # This file
├── CDdata/
│   ├── processed_data/          # Preprocessed data
│   └── splits/                  # Data splits
└── models/                      # Saved models
```

## Dependencies

- Python 3.6+
- PyTorch 1.9.0+
- NumPy, Pandas, Scikit-learn
- See `../requirements.txt` for full list

## Citation

If you use this implementation, please cite:
- Raindrop paper: Zhang et al. (2022), "Graph-Guided Network for Irregularly Sampled Multivariate Time Series", ICLR 2022
- CD dataset: add the citation for the original CD dataset here