# HANCOCK Outcome Prediction (LR / WDRO / WDRO_MRO_GAME)

## Overview
This repository contains code to train and evaluate baseline models for outcome prediction on the HANCOCK multimodal dataset. The models include Logistic Regression (LR), Wasserstein Distributionally Robust Optimization (WDRO), and a game-theoretic variant (WDRO_MRO_GAME). The main script reads precomputed clinical, pathological, ICD, blood, and TMA cell density features, simulates label noise at user-specified rates, and reports AUC metrics with repeated seeds. Results and figures are written to a timestamped output directory.
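
As a rough illustration of the label-noise step, the sketch below flips a fixed fraction of binary labels using a seeded generator. It assumes symmetric flipping of 0/1 labels; the function name `flip_labels` and this noise model are illustrative assumptions and may differ from what `main.py` does.

```python
import random

def flip_labels(labels, noise_rate, seed):
    """Return a copy of binary (0/1) labels with a fraction `noise_rate` flipped."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)  # dedicated generator so each seed is repeatable
    n_flip = round(noise_rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]

# Synthetic labels, not taken from HANCOCK.
clean = [0, 1] * 10
noisy = flip_labels(clean, noise_rate=0.2, seed=0)
n_changed = sum(a != b for a, b in zip(clean, noisy))
print(f"{n_changed} of {len(clean)} labels flipped")
```

Using one generator per seed keeps each repeated run independent of the others while remaining repeatable.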

## Repository layout
```
iclr_799/
├─ env.yml                 # Conda environment (Python 3.11, cvxpy, scikit-learn, mosek, etc.)
├─ main.py                 # Entry point for experiments
├─ sub.sh                  # Example run script (sets env vars and calls main.py)
└─ data-open.zip           # Dataset bundle; unzip to ./data-open/
```

After unzipping `data-open.zip`, the data directory will contain:
- `HANCOCK_MultimodalDataset-main/features/*.csv` (clinical.csv, pathological.csv, blood.csv, icd_codes.csv, tma_cell_density.csv, targets.csv)
- `Hancock_Dataset/DataSplits_DataDictionaries/*` (data dictionaries and splits)

## Environment setup
A Conda environment file (`env.yml`) is provided.
```bash
conda env create -f env.yml
conda activate wdro_mro
```

### MOSEK license (required for the MOSEK solver used through CVXPY)
Set the `MOSEKLM_LICENSE_FILE` environment variable to a valid license path before running. For example:
```bash
export MOSEKLM_LICENSE_FILE=/path/to/mosek.lic
```
The script prints the current value at start-up to help with debugging.
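
If you want to verify the setting before launching a long run, a small pre-flight check along these lines may help. It is a sketch that assumes the variable holds a local file path, as in the example above; `check_mosek_license` is a hypothetical helper and is not part of `main.py`.

```python
import os
from pathlib import Path

def check_mosek_license(env=None):
    """Return (ok, message) describing the MOSEKLM_LICENSE_FILE setting."""
    env = os.environ if env is None else env
    value = env.get("MOSEKLM_LICENSE_FILE")
    if not value:
        return False, "MOSEKLM_LICENSE_FILE is not set"
    if not Path(value).is_file():
        return False, f"MOSEKLM_LICENSE_FILE points to a missing file: {value}"
    return True, f"MOSEKLM_LICENSE_FILE={value}"

ok, message = check_mosek_license()
print(message)
```

The check only confirms that the file exists. It does not validate the license itself.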

## Data preparation
Unzip the provided data bundle at the repository root so that `./data-open/` exists:
```bash
unzip data-open.zip -d .
# This creates: ./data-open/HANCOCK_MultimodalDataset-main/features/*.csv, etc.
```

The main script expects the CSVs at:
```
data-open/HANCOCK_MultimodalDataset-main/features/clinical.csv
data-open/HANCOCK_MultimodalDataset-main/features/pathological.csv
data-open/HANCOCK_MultimodalDataset-main/features/blood.csv
data-open/HANCOCK_MultimodalDataset-main/features/icd_codes.csv
data-open/HANCOCK_MultimodalDataset-main/features/tma_cell_density.csv
data-open/HANCOCK_MultimodalDataset-main/features/targets.csv
```
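
To confirm the layout after unzipping, a short check like the following can be run from the repository root. It is a sketch: `missing_feature_files` is a hypothetical helper, and it only tests that the six files listed above exist, not that their contents are valid.

```python
from pathlib import Path

FEATURE_DIR = Path("data-open/HANCOCK_MultimodalDataset-main/features")
EXPECTED = [
    "clinical.csv",
    "pathological.csv",
    "blood.csv",
    "icd_codes.csv",
    "tma_cell_density.csv",
    "targets.csv",
]

def missing_feature_files(feature_dir=FEATURE_DIR):
    """Return the expected CSV names that are absent from feature_dir."""
    return [name for name in EXPECTED if not (Path(feature_dir) / name).is_file()]

missing = missing_feature_files()
if missing:
    print("Missing CSVs:", ", ".join(missing))
else:
    print("All expected feature CSVs found.")
```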

## Quick start
You can use the provided shell script:
```bash
bash sub.sh
```

`sub.sh` sets the following environment variables before calling `python main.py`:
```bash
export MODEL_KEYS="LR,WDRO,WDRO_MRO_GAME"
export WDRO_MRO_T_VALUE=9
export WDRO_MRO_GAMMA_VALUE=0.5
export REPEATED_SEED_ITERATION=5
export WDRO_EPS=0.02
export NOISE_RATES="0.0,0.1,0.2,0.3,0.4,0.5"
python main.py
```

You can also run `main.py` directly and override variables on the command line, for example:
```bash
MOSEKLM_LICENSE_FILE=/path/to/mosek.lic \
MODEL_KEYS="LR,WDRO" \
WDRO_EPS=0.05 \
REPEATED_SEED_ITERATION=3 \
python main.py
```

## Configuration (environment variables)
- `MODEL_KEYS` (default: `"LR,WDRO,WDRO_MRO_GAME"`): Comma-separated model list to run.
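- `WDRO_EPS` (default: `0.05`; `sub.sh` sets `0.02`): Wasserstein ambiguity radius.
- `WDRO_MRO_T_VALUE` (default: `20`; `sub.sh` sets `9`): Iteration count or horizon parameter `T` for WDRO_MRO_GAME.
- `WDRO_MRO_GAMMA_VALUE` (`sub.sh` sets `0.5`): The `gamma` parameter for WDRO_MRO_GAME.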
- `REPEATED_SEED_ITERATION` (default: `5`): Number of repeated runs with different seeds.
- `NOISE_RATES` (default: `"0.0,0.1,0.2,0.3,0.4,0.5"`): Label noise rates for sensitivity analysis.
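
For reference, the variables with documented defaults could be parsed roughly as follows. This is a sketch of the pattern only: `read_config` and the dictionary keys are hypothetical names, and the actual parsing in `main.py` may differ.

```python
import os

def _split_csv(value):
    """Split a comma-separated string into stripped, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]

def read_config(env=None):
    """Read experiment settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    return {
        "model_keys": _split_csv(env.get("MODEL_KEYS", "LR,WDRO,WDRO_MRO_GAME")),
        "wdro_eps": float(env.get("WDRO_EPS", "0.05")),
        "wdro_mro_t": int(env.get("WDRO_MRO_T_VALUE", "20")),
        "n_repeats": int(env.get("REPEATED_SEED_ITERATION", "5")),
        "noise_rates": [
            float(r)
            for r in _split_csv(env.get("NOISE_RATES", "0.0,0.1,0.2,0.3,0.4,0.5"))
        ],
    }

config = read_config()
print(config)
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to test without touching the shell environment.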

## Models
- **LR**: `sklearn.linear_model.LogisticRegression` with the `liblinear` solver, `max_iter=2000`, and `class_weight="balanced"`.
- **WDRO**: Wasserstein DRO classifier implemented on top of CVXPY and MOSEK.
- **WDRO_MRO_GAME**: A game-theoretic extension of WDRO with parameters `T` (`WDRO_MRO_T_VALUE`) and `gamma` (`WDRO_MRO_GAMMA_VALUE`).
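
The LR baseline settings listed above correspond to a scikit-learn call like the one below. The toy data is synthetic and exists only to show the call pattern and the AUC computation. It is not HANCOCK data, and the surrounding preprocessing in `main.py` is not reproduced here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Settings as listed for the LR baseline.
clf = LogisticRegression(solver="liblinear", max_iter=2000, class_weight="balanced")

# Synthetic, imbalanced toy data (one feature, six samples).
X = [[0.0], [0.2], [0.4], [0.6], [2.0], [2.2]]
y = [0, 0, 0, 0, 1, 1]

clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # probability of the positive class
auc = roc_auc_score(y, scores)
print(f"AUC on toy data: {auc:.3f}")
```

`class_weight="balanced"` reweights samples inversely to class frequency, which matters when one outcome is much rarer than the other.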


## Outputs
Results are written under:
```
outputs3/YYYYMMDD_HHMMSS/
```
Saved files include:
- AUC summaries (CSV).
- Visualizations (for example, heatmaps) saved as image files.
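
The `YYYYMMDD_HHMMSS` directory name matches the `strftime` pattern `%Y%m%d_%H%M%S`. A minimal sketch of creating such a run directory is shown below; `make_run_dir` is a hypothetical helper, and the demo writes into a temporary directory rather than `outputs3/`.

```python
import tempfile
from datetime import datetime
from pathlib import Path

def make_run_dir(root="outputs3", now=None):
    """Create and return <root>/YYYYMMDD_HHMMSS for the given (or current) time."""
    now = datetime.now() if now is None else now
    run_dir = Path(root) / now.strftime("%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir

# Demo with a fixed timestamp inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = make_run_dir(tmp, now=datetime(2024, 1, 2, 3, 4, 5))
    run_dir_name = run_dir.name
    run_dir_created = run_dir.is_dir()

print(run_dir_name)
```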

## Reproducibility
- Seeds: each experiment is repeated with different seeds; the number of repetitions is set via `REPEATED_SEED_ITERATION`.
- Data types: patient IDs are loaded as strings for stable joins across feature tables.
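
The reason for loading IDs as strings can be seen in a small pandas example. The column names (`patient_id`, `age`, `outcome`) and the rows are invented for illustration and do not come from the HANCOCK files.

```python
import io
import pandas as pd

# Synthetic CSV snippets with zero-padded IDs.
clinical_csv = "patient_id,age\n001,61\n002,58\n"
targets_csv = "patient_id,outcome\n001,1\n002,0\n"

# With the default type inference, the padding is lost: "001" becomes the integer 1.
inferred_ids = pd.read_csv(io.StringIO(clinical_csv))["patient_id"].tolist()

# Forcing the ID column to str keeps the original keys for joining.
clinical = pd.read_csv(io.StringIO(clinical_csv), dtype={"patient_id": str})
targets = pd.read_csv(io.StringIO(targets_csv), dtype={"patient_id": str})
merged = clinical.merge(targets, on="patient_id", how="inner")

print(inferred_ids)
print(merged["patient_id"].tolist())
```

Reading every table with the same explicit `dtype` avoids mismatches where one file is inferred as integers and another as strings.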

## Troubleshooting
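- **MOSEK not found or license error**: check that `MOSEKLM_LICENSE_FILE` points to a valid license file and that MOSEK is installed in the active Conda environment.
- **Missing CSVs**: ensure `data-open.zip` is unzipped at the repository root so that `./data-open/HANCOCK_MultimodalDataset-main/features/` contains the expected CSV files.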

